Data Engineering with AWS: Build and implement complex data pipelines using AWS

Start your AWS data engineering journey with this easy-to-follow, hands-on guide and get to grips with foundational concepts through to building data engineering pipelines using AWS.

Key Features:
  • Learn about common data architectures and modern approaches to generating value from big data
  • Explore AWS...

Bibliographic Details
Other Authors: Eagar, Gareth (author)
Format: Electronic book
Language: English
Published: Birmingham; Mumbai: Packt Publishing, 2021.
Edition: 1st edition
Subjects:
View at the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009645675606719
Table of Contents:
  • Cover
  • Title page
  • Copyright and Credits
  • Contributors
  • Table of Contents
  • Preface
  • Section 1: AWS Data Engineering Concepts and Trends
  • Chapter 1: An Introduction to Data Engineering
  • Technical requirements
  • The rise of big data as a corporate asset
  • The challenges of ever-growing datasets
  • Data engineers - the big data enablers
  • Understanding the role of the data engineer
  • Understanding the role of the data scientist
  • Understanding the role of the data analyst
  • Understanding other common data-related roles
  • The benefits of the cloud when building big data analytic solutions
  • Hands-on - creating and accessing your AWS account
  • Creating a new AWS account
  • Accessing your AWS account
  • Summary
  • Chapter 2: Data Management Architectures for Analytics
  • Technical requirements
  • The evolution of data management for analytics
  • Databases and data warehouses
  • Dealing with big, unstructured data
  • A lake on the cloud and a house on that lake
  • Understanding data warehouses and data marts - fountains of truth
  • Distributed storage and massively parallel processing
  • Columnar data storage and efficient data compression
  • Dimensional modeling in data warehouses
  • Understanding the role of data marts
  • Feeding data into the warehouse - ETL and ELT pipelines
  • Building data lakes to tame the variety and volume of big data
  • Data lake logical architecture
  • Bringing together the best of both worlds with the lake house architecture
  • Data lakehouse implementations
  • Building a data lakehouse on AWS
  • Hands-on - configuring the AWS Command Line Interface tool and creating an S3 bucket
  • Installing and configuring the AWS CLI
  • Creating a new Amazon S3 bucket
  • Summary
  • Chapter 3: The AWS Data Engineer's Toolkit
  • Technical requirements
  • AWS services for ingesting data
  • Overview of AWS Database Migration Service (DMS)
  • Overview of Amazon Kinesis for streaming data ingestion
  • Overview of Amazon MSK for streaming data ingestion
  • Overview of Amazon AppFlow for ingesting data from SaaS services
  • Overview of AWS Transfer Family for ingestion using FTP/SFTP protocols
  • Overview of AWS DataSync for ingesting from on-premises storage
  • Overview of the AWS Snow family of devices for large data transfers
  • AWS services for transforming data
  • Overview of AWS Lambda for light transformations
  • Overview of AWS Glue for serverless Spark processing
  • Overview of Amazon EMR for Hadoop ecosystem processing
  • AWS services for orchestrating big data pipelines
  • Overview of AWS Glue workflows for orchestrating Glue components
  • Overview of AWS Step Functions for complex workflows
  • Overview of Amazon Managed Workflows for Apache Airflow
  • AWS services for consuming data
  • Overview of Amazon Athena for SQL queries in the data lake
  • Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
  • Overview of Amazon QuickSight for visualizing data
  • Hands-on - triggering an AWS Lambda function when a new file arrives in an S3 bucket
  • Creating a Lambda layer containing the AWS Data Wrangler library
  • Creating new Amazon S3 buckets
  • Creating an IAM policy and role for your Lambda function
  • Creating a Lambda function
  • Configuring our Lambda function to be triggered by an S3 upload
  • Summary
  • Chapter 4: Data Cataloging, Security, and Governance
  • Technical requirements
  • Getting data security and governance right
  • Common data regulatory requirements
  • Core data protection concepts
  • Personal data
  • Encryption
  • Anonymized data
  • Pseudonymized data/tokenization
  • Authentication
  • Authorization
  • Putting these concepts together
  • Cataloging your data to avoid the data swamp
  • How to avoid the data swamp
  • The AWS Glue/Lake Formation data catalog
  • AWS services for data encryption and security monitoring
  • AWS Key Management Service (KMS)
  • Amazon Macie
  • Amazon GuardDuty
  • AWS services for managing identity and permissions
  • AWS Identity and Access Management (IAM) service
  • Using AWS Lake Formation to manage data lake access
  • Hands-on - configuring Lake Formation permissions
  • Creating a new user with IAM permissions
  • Transitioning to managing fine-grained permissions with AWS Lake Formation
  • Summary
  • Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
  • Chapter 5: Architecting Data Engineering Pipelines
  • Technical requirements
  • Approaching the data pipeline architecture
  • Architecting houses and architecting pipelines
  • Whiteboarding as an information-gathering tool
  • Conducting a whiteboarding session
  • Identifying data consumers and understanding their requirements
  • Identifying data sources and ingesting data
  • Identifying data transformations and optimizations
  • File format optimizations
  • Data standardization
  • Data quality checks
  • Data partitioning
  • Data denormalization
  • Data cataloging
  • Whiteboarding data transformation
  • Loading data into data marts
  • Wrapping up the whiteboarding session
  • Hands-on - architecting a sample pipeline
  • Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
  • Summary
  • Chapter 6: Ingesting Batch and Streaming Data
  • Technical requirements
  • Understanding data sources
  • Data variety
  • Data volume
  • Data velocity
  • Data veracity
  • Data value
  • Questions to ask
  • Ingesting data from a relational database
  • AWS Database Migration Service (DMS)
  • AWS Glue
  • Other ways to ingest data from a database
  • Deciding on the best approach for ingesting from a database
  • Ingesting streaming data
  • Amazon Kinesis versus Amazon Managed Streaming for Apache Kafka (MSK)
  • Hands-on - ingesting data with AWS DMS
  • Creating a new MySQL database instance
  • Loading the demo data using an Amazon EC2 instance
  • Creating an IAM policy and role for DMS
  • Configuring DMS settings and performing a full load from MySQL to S3
  • Querying data with Amazon Athena
  • Hands-on - ingesting streaming data
  • Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
  • Configuring Amazon Kinesis Data Generator (KDG)
  • Adding newly ingested data to the Glue Data Catalog
  • Querying the data with Amazon Athena
  • Summary
  • Chapter 7: Transforming Data to Optimize for Analytics
  • Technical requirements
  • Transformations - making raw data more valuable
  • Cooking, baking, and data transformations
  • Transformations as part of a pipeline
  • Types of data transformation tools
  • Apache Spark
  • Hadoop and MapReduce
  • SQL
  • GUI-based tools
  • Data preparation transformations
  • Protecting PII data
  • Optimizing the file format
  • Optimizing with data partitioning
  • Data cleansing
  • Business use case transforms
  • Data denormalization
  • Enriching data
  • Pre-aggregating data
  • Extracting metadata from unstructured data
  • Working with change data capture (CDC) data
  • Traditional approaches - data upserts and SQL views
  • Modern approaches - the transactional data lake
  • Hands-on - joining datasets with AWS Glue Studio
  • Creating a new data lake zone - the curated zone
  • Creating a new IAM role for the Glue job
  • Configuring a denormalization transform using AWS Glue Studio
  • Finalizing the denormalization transform job to write to S3
  • Creating a transform job to join streaming and film data using AWS Glue Studio
  • Summary
  • Chapter 8: Identifying and Enabling Data Consumers
  • Technical requirements
  • Understanding the impact of data democratization
  • A growing variety of data consumers
  • Meeting the needs of business users with data visualization
  • AWS tools for business users
  • Meeting the needs of data analysts with structured reporting
  • AWS tools for data analysts
  • Meeting the needs of data scientists and ML models
  • AWS tools used by data scientists to work with data
  • Hands-on - creating data transformations with AWS Glue DataBrew
  • Configuring new datasets for AWS Glue DataBrew
  • Creating a new Glue DataBrew project
  • Building your Glue DataBrew recipe
  • Creating a Glue DataBrew job
  • Summary
  • Chapter 9: Loading Data into a Data Mart
  • Technical requirements
  • Extending analytics with data warehouses/data marts
  • Cold data
  • Warm data
  • Hot data
  • What not to do - anti-patterns for a data warehouse
  • Using a data warehouse as a transactional datastore
  • Using a data warehouse as a data lake
  • Using data warehouses for real-time, record-level use cases
  • Storing unstructured data
  • Redshift architecture review and storage deep dive
  • Data distribution across slices
  • Redshift Zone Maps and sorting data
  • Designing a high-performance data warehouse
  • Selecting the optimal Redshift node type
  • Selecting the optimal table distribution style and sort key
  • Selecting the right data type for columns
  • Selecting the optimal table type
  • Moving data between a data lake and Redshift
  • Optimizing data ingestion in Redshift
  • Exporting data from Redshift to the data lake
  • Hands-on - loading data into an Amazon Redshift cluster and running queries
  • Uploading our sample data to Amazon S3
  • IAM roles for Redshift
  • Creating a Redshift cluster
  • Creating external tables for querying data in S3
  • Creating a schema for a local Redshift table