Data Engineering with AWS: Acquire the Skills to Design and Build AWS-Based Data Transformation Pipelines Like a Pro

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at d...

Bibliographic Details
Other Authors: Eagar, Gareth (author)
Format: Electronic book
Language: English
Published: Birmingham, England : Packt Publishing Ltd, [2023]
Edition: Second edition
Series: Expert insight.
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009781237506719
Table of Contents:
  • Cover
  • Copyright
  • Contributors
  • Table of Contents
  • Preface
  • Section 1: AWS Data Engineering Concepts and Trends
  • Chapter 1: An Introduction to Data Engineering
  • Technical requirements
  • The rise of big data as a corporate asset
  • The challenges of ever-growing datasets
  • The role of the data engineer as a big data enabler
  • Understanding the role of the data engineer
  • Understanding the role of the data scientist
  • Understanding the role of the data analyst
  • Understanding other common data-related roles
  • The benefits of the cloud when building big data analytic solutions
  • Hands-on - creating and accessing your AWS account
  • Creating a new AWS account
  • Accessing your AWS account
  • Summary
  • Chapter 2: Data Management Architectures for Analytics
  • Technical requirements
  • The evolution of data management for analytics
  • Databases and data warehouses
  • Dealing with big, unstructured data
  • Cloud-based solutions for big data analytics
  • A deeper dive into data warehouse concepts and architecture
  • Dimensional modeling in data warehouses
  • Understanding the role of data marts
  • Distributed storage and massively parallel processing
  • Columnar data storage and efficient data compression
  • Feeding data into the warehouse - ETL and ELT pipelines
  • An overview of data lake architecture and concepts
  • Data lake logical architecture
  • The storage layer and storage zones
  • Catalog and search layers
  • Ingestion layer
  • The processing layer
  • The consumption layer
  • Data lake architecture summary
  • Bringing together the best of data warehouses and data lakes
  • The data lakehouse approach
  • New data lake table formats
  • Federated queries across database engines
  • Hands-on - using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets (a boto3 sketch follows this contents list)
  • Accessing the AWS CLI
  • Using AWS CloudShell to access the CLI
  • Creating new Amazon S3 buckets
  • Summary
  • Chapter 3: The AWS Data Engineer's Toolkit
  • Technical requirements
  • An overview of AWS services for ingesting data
  • AWS Database Migration Service (DMS)
  • Amazon Kinesis for streaming data ingestion
  • Amazon Kinesis Agent
  • Amazon Kinesis Data Firehose
  • Amazon Kinesis Data Streams
  • Amazon Kinesis Data Analytics
  • Amazon Kinesis Video Streams
  • Amazon MSK for streaming data ingestion
  • Amazon AppFlow for ingesting data from SaaS services
  • AWS Transfer Family for ingestion using FTP/SFTP protocols
  • AWS DataSync for ingesting from on-premises and multi-cloud storage services
  • The AWS Snow family of devices for large data transfers
  • AWS Glue for data ingestion
  • An overview of AWS services for transforming data
  • AWS Lambda for light transformations
  • AWS Glue for serverless data processing
  • Serverless ETL processing
  • AWS Glue DataBrew
  • AWS Glue Data Catalog
  • AWS Glue crawlers
  • Amazon EMR for Hadoop ecosystem processing
  • An overview of AWS services for orchestrating big data pipelines
  • AWS Glue workflows for orchestrating Glue components
  • AWS Step Functions for complex workflows
  • Amazon Managed Workflows for Apache Airflow (MWAA)
  • An overview of AWS services for consuming data
  • Amazon Athena for SQL queries in the data lake
  • Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
  • Overview of Amazon QuickSight for visualizing data
  • Hands-on - triggering an AWS Lambda function when a new file arrives in an S3 bucket (a minimal handler sketch follows this contents list)
  • Creating a Lambda layer containing the AWS SDK for pandas library
  • Creating an IAM policy and role for your Lambda function
  • Creating a Lambda function
  • Configuring our Lambda function to be triggered by an S3 upload
  • Summary
  • Chapter 4: Data Governance, Security, and Cataloging
  • Technical requirements
  • The many different aspects of data governance
  • Data security, access, and privacy
  • Common data regulatory requirements
  • Core data protection concepts
  • Personally identifiable information (PII)
  • Personal data
  • Encryption
  • Anonymized data
  • Pseudonymized data/tokenization
  • Authentication
  • Authorization
  • Putting these concepts together
  • Data quality, data profiling, and data lineage
  • Data quality
  • Data profiling
  • Data lineage
  • Business and technical data catalogs
  • Implementing a data catalog to avoid creating a data swamp
  • Business data catalogs
  • Technical data catalogs
  • AWS services that help with data governance
  • The AWS Glue/Lake Formation technical data catalog
  • AWS Glue DataBrew for profiling datasets
  • AWS Glue Data Quality
  • AWS Key Management Service (KMS) for data encryption
  • Amazon Macie for detecting PII data in Amazon S3 objects
  • The AWS Glue Studio Detect PII transform for detecting PII data in datasets
  • Amazon GuardDuty for detecting threats in an AWS account
  • AWS Identity and Access Management (IAM) service
  • Using AWS Lake Formation to manage data lake access
  • Permissions management before Lake Formation
  • Permissions management using AWS Lake Formation
  • Hands-on - configuring Lake Formation permissions (a grant_permissions sketch follows this contents list)
  • Creating a new user with IAM permissions
  • Transitioning to managing fine-grained permissions with AWS Lake Formation
  • Activating Lake Formation permissions for a database and table
  • Granting Lake Formation permissions
  • Summary
  • Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations
  • Chapter 5: Architecting Data Engineering Pipelines
  • Technical requirements
  • Approaching the data pipeline architecture
  • Architecting houses and pipelines
  • Whiteboarding as an information-gathering tool
  • Conducting a whiteboarding session
  • Identifying data consumers and understanding their requirements
  • Identifying data sources and ingesting data
  • Identifying data transformations and optimizations
  • File format optimizations
  • Data standardization
  • Data quality checks
  • Data partitioning
  • Data denormalization
  • Data cataloging
  • Whiteboarding data transformation
  • Loading data into data marts
  • Wrapping up the whiteboarding session
  • Hands-on - architecting a sample pipeline
  • Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
  • Meeting notes
  • Summary
  • Chapter 6: Ingesting Batch and Streaming Data
  • Technical requirements
  • Understanding data sources
  • Data variety
  • Structured data
  • Semi-structured data
  • Unstructured data
  • Data volume
  • Data velocity
  • Data veracity
  • Data value
  • Questions to ask
  • Ingesting data from a relational database
  • AWS DMS
  • AWS Glue
  • Full one-off loads from one or more tables
  • Initial full loads from a table, and subsequent loads of new records
  • Creating AWS Glue jobs with AWS Lake Formation
  • Other ways to ingest data from a database
  • Deciding on the best approach to ingesting from a database
  • The size of the database
  • Database load
  • Data ingestion frequency
  • Technical requirements and compatibility
  • Ingesting streaming data
  • Amazon Kinesis versus Amazon Managed Streaming for Apache Kafka (MSK)
  • Serverless services versus managed services
  • Open-source flexibility versus proprietary software with strong AWS integration
  • At-least-once messaging versus exactly-once messaging
  • A single processing engine versus niche tools
  • Deciding on a streaming ingestion tool
  • Hands-on - ingesting data with AWS DMS (a replication-task sketch follows this contents list)
  • Deploying MySQL and an EC2 data loader via CloudFormation
  • Creating an IAM policy and role for DMS
  • Configuring DMS settings and performing a full load from MySQL to S3
  • Querying data with Amazon Athena
  • Hands-on - ingesting streaming data (a put_record sketch follows this contents list)
  • Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
  • Configuring Amazon Kinesis Data Generator (KDG)
  • Adding newly ingested data to the Glue Data Catalog
  • Querying the data with Amazon Athena
  • Summary
  • Chapter 7: Transforming Data to Optimize for Analytics
  • Technical requirements
  • Overview of how transformations can create value
  • Cooking, baking, and data transformations
  • Transformations as part of a pipeline
  • Types of data transformation tools
  • Apache Spark
  • Hadoop and MapReduce
  • SQL
  • GUI-based tools
  • Common data preparation transformations
  • Protecting PII data
  • Optimizing the file format
  • Optimizing with data partitioning
  • Data cleansing
  • Common business use case transformations
  • Data denormalization
  • Enriching data
  • Pre-aggregating data
  • Extracting metadata from unstructured data
  • Working with Change Data Capture (CDC) data
  • Traditional approaches - data upserts and SQL views
  • Modern approaches - Open Table Formats (OTFs)
  • Apache Iceberg
  • Apache Hudi
  • Databricks Delta Lake
  • Hands-on - joining datasets with AWS Glue Studio (a PySpark sketch follows this contents list)
  • Creating a new data lake zone - the curated zone
  • Creating a new IAM role for the Glue job
  • Configuring a denormalization transform using AWS Glue Studio
  • Finalizing the denormalization transform job to write to S3
  • Creating a transform job to join streaming and film data using AWS Glue Studio
  • Summary
  • Chapter 8: Identifying and Enabling Data Consumers
  • Technical requirements
  • Understanding the impact of data democratization
  • A growing variety of data consumers
  • How a data mesh helps data consumers
  • Meeting the needs of business users with data visualization
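
The hands-on items flagged above lend themselves to short code sketches. None of the following examples are taken from the book; every bucket name, ARN, stream name, and identifier in them is a hypothetical placeholder, and boto3 (the AWS SDK for Python) stands in where the book works in the console or the CLI.

The Chapter 2 exercise creates S3 buckets with the AWS CLI (for example, `aws s3 mb s3://bucket-name`). A minimal boto3 equivalent:

```python
import boto3

# Hypothetical bucket name and region -- not values from the book's exercise.
BUCKET = "dataeng-landing-zone-example"
REGION = "eu-west-1"

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1, S3 requires an explicit LocationConstraint.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)
print(f"Created s3://{BUCKET}")
```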
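
The Chapter 3 exercise builds a Lambda function (with an AWS SDK for pandas layer) that fires when a file lands in an S3 bucket. This stripped-down sketch shows only the event-handling shape: it assumes the standard S3 event notification payload and logs each uploaded object instead of transforming it.

```python
import urllib.parse


def lambda_handler(event, context):
    """Log every S3 object whose upload triggered this invocation."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New object uploaded: s3://{bucket}/{key}")
    return {"processed": len(event["Records"])}
```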
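
The Chapter 4 exercise grants Lake Formation permissions through the console; the same grant can be issued programmatically. A minimal sketch, assuming a hypothetical IAM principal and hypothetical Glue database and table names:

```python
import boto3

lf = boto3.client("lakeformation")

# Principal, database, and table are placeholders, not the book's names.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake-user"
    },
    Resource={"Table": {"DatabaseName": "curated_db", "Name": "streaming_films"}},
    Permissions=["SELECT"],
)
```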
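
The Chapter 6 DMS exercise configures a full load from MySQL to S3 in the console. The equivalent task definition in boto3 looks roughly like this, assuming the replication instance and the source and target endpoints already exist (all ARNs and the schema name are placeholders):

```python
import json

import boto3

dms = boto3.client("dms")

# Select every table in one (placeholder) schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {"schema-name": "example_db", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-s3-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INST",
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)
# The task is only defined here; it still has to be started, e.g. with
# dms.start_replication_task(..., StartReplicationTaskType="start-replication").
```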
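
In the streaming half of Chapter 6, the Kinesis Data Generator pushes test records into a Kinesis Data Firehose delivery stream. Pushing a single record yourself is a one-call sketch (the stream name and record fields are made up):

```python
import json

import boto3

firehose = boto3.client("firehose")

# Hypothetical stream; the exercise creates the delivery stream beforehand.
record = {"user_id": 42, "event": "play", "film_id": 101}
firehose.put_record(
    DeliveryStreamName="streaming-data-firehose",
    # Firehose concatenates records, so a trailing newline keeps them separable.
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```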
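
The Chapter 7 exercise builds its join visually in AWS Glue Studio, which generates Spark code behind the scenes. A plain PySpark sketch of the same denormalization idea, with hypothetical S3 paths and a hypothetical film_id join key:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-streaming-and-films").getOrCreate()

# Placeholder paths standing in for the exercise's clean-zone datasets.
streaming = spark.read.parquet("s3://dataeng-clean-zone-example/streaming/")
films = spark.read.parquet("s3://dataeng-clean-zone-example/films/")

# A left join keeps every streaming event and enriches it with film attributes.
joined = streaming.join(films, on="film_id", how="left")

joined.write.mode("overwrite").parquet(
    "s3://dataeng-curated-zone-example/streaming_films/"
)
```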