Data Engineering with AWS: Acquire the Skills to Design and Build AWS-Based Data Transformation Pipelines Like a Pro

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at d...

Bibliographic Details
Other Authors: Eagar, Gareth (author)
Format: Electronic book
Language: English
Published: Birmingham, England : Packt Publishing Ltd, [2023]
Edition: Second edition
Series: Expert insight.
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009781237506719
Table of Contents:
  • Cover
  • Copyright
  • Contributors
  • Table of Contents
  • Preface
  • Section 1: AWS Data Engineering Concepts and Trends
  • Chapter 1: An Introduction to Data Engineering
  • Technical requirements
  • The rise of big data as a corporate asset
  • The challenges of ever-growing datasets
  • The role of the data engineer as a big data enabler
  • Understanding the role of the data engineer
  • Understanding the role of the data scientist
  • Understanding the role of the data analyst
  • Understanding other common data-related roles
  • The benefits of the cloud when building big data analytic solutions
  • Hands-on - creating and accessing your AWS account
  • Creating a new AWS account
  • Accessing your AWS account
  • Summary
  • Chapter 2: Data Management Architectures for Analytics
  • Technical requirements
  • The evolution of data management for analytics
  • Databases and data warehouses
  • Dealing with big, unstructured data
  • Cloud-based solutions for big data analytics
  • A deeper dive into data warehouse concepts and architecture
  • Dimensional modeling in data warehouses
  • Understanding the role of data marts
  • Distributed storage and massively parallel processing
  • Columnar data storage and efficient data compression
  • Feeding data into the warehouse - ETL and ELT pipelines
  • An overview of data lake architecture and concepts
  • Data lake logical architecture
  • The storage layer and storage zones
  • Catalog and search layers
  • Ingestion layer
  • The processing layer
  • The consumption layer
  • Data lake architecture summary
  • Bringing together the best of data warehouses and data lakes
  • The data lakehouse approach
  • New data lake table formats
  • Federated queries across database engines
  • Hands-on - using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets (a boto3 sketch follows this contents list)
  • Accessing the AWS CLI
  • Using AWS CloudShell to access the CLI
  • Creating new Amazon S3 buckets
  • Summary
  • Chapter 3: The AWS Data Engineer's Toolkit
  • Technical requirements
  • An overview of AWS services for ingesting data
  • AWS Database Migration Service (DMS)
  • Amazon Kinesis for streaming data ingestion
  • Amazon Kinesis Agent
  • Amazon Kinesis Data Firehose
  • Amazon Kinesis Data Streams
  • Amazon Kinesis Data Analytics
  • Amazon Kinesis Video Streams
  • Amazon MSK for streaming data ingestion
  • Amazon AppFlow for ingesting data from SaaS services
  • AWS Transfer Family for ingestion using FTP/SFTP protocols
  • AWS DataSync for ingesting from on-premises and multi-cloud storage services
  • The AWS Snow family of devices for large data transfers
  • AWS Glue for data ingestion
  • An overview of AWS services for transforming data
  • AWS Lambda for light transformations
  • AWS Glue for serverless data processing
  • Serverless ETL processing
  • AWS Glue DataBrew
  • AWS Glue Data Catalog
  • AWS Glue crawlers
  • Amazon EMR for Hadoop ecosystem processing
  • An overview of AWS services for orchestrating big data pipelines
  • AWS Glue workflows for orchestrating Glue components
  • AWS Step Functions for complex workflows
  • Amazon Managed Workflows for Apache Airflow (MWAA)
  • An overview of AWS services for consuming data
  • Amazon Athena for SQL queries in the data lake
  • Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
  • Overview of Amazon QuickSight for visualizing data
  • Hands-on - triggering an AWS Lambda function when a new file arrives in an S3 bucket (a minimal handler sketch follows this contents list)
  • Creating a Lambda layer containing the AWS SDK for pandas library
  • Creating an IAM policy and role for your Lambda function
  • Creating a Lambda function
  • Configuring our Lambda function to be triggered by an S3 upload
  • Summary
  • Chapter 4: Data Governance, Security, and Cataloging
  • Technical requirements
  • The many different aspects of data governance
  • Data security, access, and privacy
  • Common data regulatory requirements
  • Core data protection concepts
  • Personally identifiable information (PII)
  • Personal data
  • Encryption
  • Anonymized data
  • Pseudonymized data/tokenization
  • Authentication
  • Authorization
  • Putting these concepts together
  • Data quality, data profiling, and data lineage
  • Data quality
  • Data profiling
  • Data lineage
  • Business and technical data catalogs
  • Implementing a data catalog to avoid creating a data swamp
  • Business data catalogs
  • Technical data catalogs
  • AWS services that help with data governance
  • The AWS Glue/Lake Formation technical data catalog
  • AWS Glue DataBrew for profiling datasets
  • AWS Glue Data Quality
  • AWS Key Management Service (KMS) for data encryption
  • Amazon Macie for detecting PII data in Amazon S3 objects
  • The AWS Glue Studio Detect PII transform for detecting PII data in datasets
  • Amazon GuardDuty for detecting threats in an AWS account
  • AWS Identity and Access Management (IAM) service
  • Using AWS Lake Formation to manage data lake access
  • Permissions management before Lake Formation
  • Permissions management using AWS Lake Formation
  • Hands-on - configuring Lake Formation permissions (a grant_permissions sketch follows this contents list)
  • Creating a new user with IAM permissions
  • Transitioning to managing fine-grained permissions with AWS Lake Formation
  • Activating Lake Formation permissions for a database and table
  • Granting Lake Formation permissions
  • Summary
  • Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations
  • Chapter 5: Architecting Data Engineering Pipelines
  • Technical requirements
  • Approaching the data pipeline architecture
  • Architecting houses and pipelines
  • Whiteboarding as an information-gathering tool
  • Conducting a whiteboarding session
  • Identifying data consumers and understanding their requirements
  • Identifying data sources and ingesting data
  • Identifying data transformations and optimizations
  • File format optimizations
  • Data standardization
  • Data quality checks
  • Data partitioning
  • Data denormalization
  • Data cataloging
  • Whiteboarding data transformation
  • Loading data into data marts
  • Wrapping up the whiteboarding session
  • Hands-on - architecting a sample pipeline
  • Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
  • Meeting notes
  • Summary
  • Chapter 6: Ingesting Batch and Streaming Data
  • Technical requirements
  • Understanding data sources
  • Data variety
  • Structured data
  • Semi-structured data
  • Unstructured data
  • Data volume
  • Data velocity
  • Data veracity
  • Data value
  • Questions to ask
  • Ingesting data from a relational database
  • AWS DMS
  • AWS Glue
  • Full one-off loads from one or more tables
  • Initial full loads from a table, and subsequent loads of new records
  • Creating AWS Glue jobs with AWS Lake Formation
  • Other ways to ingest data from a database
  • Deciding on the best approach to ingesting from a database
  • The size of the database
  • Database load
  • Data ingestion frequency
  • Technical requirements and compatibility
  • Ingesting streaming data
  • Amazon Kinesis versus Amazon Managed Streaming for Apache Kafka (MSK)
  • Serverless services versus managed services
  • Open-source flexibility versus proprietary software with strong AWS integration
  • At-least-once messaging versus exactly-once messaging
  • A single processing engine versus niche tools
  • Deciding on a streaming ingestion tool
  • Hands-on - ingesting data with AWS DMS (a replication-task sketch follows this contents list)
  • Deploying MySQL and an EC2 data loader via CloudFormation
  • Creating an IAM policy and role for DMS
  • Configuring DMS settings and performing a full load from MySQL to S3
  • Querying data with Amazon Athena
  • Hands-on - ingesting streaming data (a put_record sketch follows this contents list)
  • Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
  • Configuring Amazon Kinesis Data Generator (KDG)
  • Adding newly ingested data to the Glue Data Catalog
  • Querying the data with Amazon Athena
  • Summary
  • Chapter 7: Transforming Data to Optimize for Analytics
  • Technical requirements
  • Overview of how transformations can create value
  • Cooking, baking, and data transformations
  • Transformations as part of a pipeline
  • Types of data transformation tools
  • Apache Spark
  • Hadoop and MapReduce
  • SQL
  • GUI-based tools
  • Common data preparation transformations
  • Protecting PII data
  • Optimizing the file format
  • Optimizing with data partitioning
  • Data cleansing
  • Common business use case transformations
  • Data denormalization
  • Enriching data
  • Pre-aggregating data
  • Extracting metadata from unstructured data
  • Working with Change Data Capture (CDC) data
  • Traditional approaches - data upserts and SQL views
  • Modern approaches - Open Table Formats (OTFs)
  • Apache Iceberg
  • Apache Hudi
  • Databricks Delta Lake
  • Hands-on - joining datasets with AWS Glue Studio (a PySpark sketch follows this contents list)
  • Creating a new data lake zone - the curated zone
  • Creating a new IAM role for the Glue job
  • Configuring a denormalization transform using AWS Glue Studio
  • Finalizing the denormalization transform job to write to S3
  • Creating a transform job to join streaming and film data using AWS Glue Studio
  • Summary
  • Chapter 8: Identifying and Enabling Data Consumers
  • Technical requirements
  • Understanding the impact of data democratization
  • A growing variety of data consumers
  • How a data mesh helps data consumers
  • Meeting the needs of business users with data visualization
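
The hands-on items flagged above lend themselves to short code sketches. None of the following examples are taken from the book; every bucket name, ARN, stream name, and identifier in them is a hypothetical placeholder, and boto3 (the AWS SDK for Python) stands in where the book works in the console or the CLI.

The Chapter 2 exercise creates S3 buckets with the AWS CLI (for example, `aws s3 mb s3://bucket-name`). A minimal boto3 equivalent:

```python
import boto3

# Hypothetical bucket name and region -- not values from the book's exercise.
BUCKET = "dataeng-landing-zone-example"
REGION = "eu-west-1"

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1, S3 requires an explicit LocationConstraint.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)
print(f"Created s3://{BUCKET}")
```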
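
The Chapter 3 exercise builds a Lambda function (with an AWS SDK for pandas layer) that fires when a file lands in an S3 bucket. This stripped-down sketch shows only the event-handling shape: it assumes the standard S3 event notification payload and logs each uploaded object instead of transforming it.

```python
import urllib.parse


def lambda_handler(event, context):
    """Log every S3 object whose upload triggered this invocation."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New object uploaded: s3://{bucket}/{key}")
    return {"processed": len(event["Records"])}
```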
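
The Chapter 4 exercise grants Lake Formation permissions through the console; the same grant can be issued programmatically. A minimal sketch, assuming a hypothetical IAM principal and hypothetical Glue database and table names:

```python
import boto3

lf = boto3.client("lakeformation")

# Principal, database, and table are placeholders, not the book's names.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake-user"
    },
    Resource={"Table": {"DatabaseName": "curated_db", "Name": "streaming_films"}},
    Permissions=["SELECT"],
)
```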
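
The Chapter 6 DMS exercise configures a full load from MySQL to S3 in the console. The equivalent task definition in boto3 looks roughly like this, assuming the replication instance and the source and target endpoints already exist (all ARNs and the schema name are placeholders):

```python
import json

import boto3

dms = boto3.client("dms")

# Select every table in one (placeholder) schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {"schema-name": "example_db", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-s3-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INST",
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)
# The task is only defined here; it still has to be started, e.g. with
# dms.start_replication_task(..., StartReplicationTaskType="start-replication").
```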
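
In the streaming half of Chapter 6, the Kinesis Data Generator pushes test records into a Kinesis Data Firehose delivery stream. Pushing a single record yourself is a one-call sketch (the stream name and record fields are made up):

```python
import json

import boto3

firehose = boto3.client("firehose")

# Hypothetical stream; the exercise creates the delivery stream beforehand.
record = {"user_id": 42, "event": "play", "film_id": 101}
firehose.put_record(
    DeliveryStreamName="streaming-data-firehose",
    # Firehose concatenates records, so a trailing newline keeps them separable.
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```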
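
The Chapter 7 exercise builds its join visually in AWS Glue Studio, which generates Spark code behind the scenes. A plain PySpark sketch of the same denormalization idea, with hypothetical S3 paths and a hypothetical film_id join key:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-streaming-and-films").getOrCreate()

# Placeholder paths standing in for the exercise's clean-zone datasets.
streaming = spark.read.parquet("s3://dataeng-clean-zone-example/streaming/")
films = spark.read.parquet("s3://dataeng-clean-zone-example/films/")

# A left join keeps every streaming event and enriches it with film attributes.
joined = streaming.join(films, on="film_id", how="left")

joined.write.mode("overwrite").parquet(
    "s3://dataeng-curated-zone-example/streaming_films/"
)
```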