Building ETL Pipelines with Python: Create and Deploy Enterprise-Ready ETL Pipelines by Employing Modern Methods

Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases.

Key Features:
  • Understand how to set up a Python virtual environment with PyCharm
  • Learn functional and object-oriented approaches to create ETL pipelines
  • Create robust CI/CD processes for ET...


Bibliographic Details
Other Authors: Pandey, Brij Kishore (author); Schoof, Emily Ro (author)
Format: Electronic book
Language: English
Published: Birmingham, England: Packt Publishing Ltd, [2023]
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009769033206719
Table of Contents:
  • Cover
  • Title Page
  • Copyright
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Part 1: Introduction to ETL, Data Pipelines, and Design Principles
  • Chapter 1: A Primer on Python and the Development Environment
  • Introducing Python fundamentals
  • An overview of Python data structures
  • Python if…else conditions or conditional statements
  • Python looping techniques
  • Python functions
  • Object-oriented programming with Python
  • Working with files in Python
  • Establishing a development environment
  • Version control with Git tracking
  • Documenting environment dependencies with requirements.txt
  • Utilizing module management systems (MMSs)
  • Configuring a Pipenv environment in PyCharm
  • Summary
  • Chapter 2: Understanding the ETL Process and Data Pipelines
  • What is a data pipeline?
  • How do we create a robust pipeline?
  • Pre-work - understanding your data
  • Design planning - planning your workflow
  • Architecture development - developing your resources
  • Putting it all together - project diagrams
  • What is an ETL data pipeline?
  • Batch processing
  • Streaming method
  • Cloud-native
  • Automating ETL pipelines
  • Exploring use cases for ETL pipelines
  • Summary
  • References
  • Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
  • Technical requirements
  • Understanding the design patterns for ETL
  • Basic ETL design pattern
  • ETL-P design pattern
  • ETL-VP design pattern
  • ELT two-phase pattern
  • Preparing your local environment for installations
  • Open source Python libraries for ETL pipelines
  • Pandas
  • NumPy
  • Scaling for big data packages
  • Dask
  • Numba
  • Summary
  • References
  • Part 2: Designing ETL Pipelines with Python
  • Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
  • Technical requirements
  • What is data sourcing?
  • Accessibility to data
  • Types of data sources
  • Getting started with data extraction
  • CSV and Excel data files
  • Parquet data files
  • API connections
  • Databases
  • Data from web pages
  • Creating a data extraction pipeline using Python
  • Data extraction
  • Logging
  • Summary
  • References
  • Chapter 5: Data Cleansing and Transformation
  • Technical requirements
  • Scrubbing your data
  • Data transformation
  • Data cleansing and transformation in ETL pipelines
  • Understanding the downstream applications of your data
  • Strategies for data cleansing and transformation in Python
  • Preliminary tasks - the importance of staging data
  • Transformation activities in Python
  • Creating data pipeline activity in Python
  • Summary
  • Chapter 6: Loading Transformed Data
  • Technical requirements
  • Introduction to data loading
  • Choosing the load destination
  • Types of load destinations
  • Best practices for data loading
  • Optimizing data loading activities by controlling the data import method
  • Creating demo data
  • Full data loads
  • Incremental data loads
  • Precautions to consider
  • Tutorial - preparing your local environment for data loading activities
  • Downloading and installing PostgreSQL
  • Creating data schemas in PostgreSQL
  • Summary
  • Chapter 7: Tutorial - Building an End-to-End ETL Pipeline in Python
  • Technical requirements
  • Introducing the project
  • The approach
  • The data
  • Creating tables in PostgreSQL
  • Sourcing and extracting the data
  • Transformation and data cleansing
  • Loading data into PostgreSQL tables
  • Making it deployable
  • Summary
  • Chapter 8: Powerful ETL Libraries and Tools in Python
  • Technical requirements
  • Architecture of Python files
  • Configuring your local environment
  • config.ini
  • config.yaml
  • Part 1 - ETL tools in Python
  • Bonobo
  • Odo
  • Mito ETL
  • Riko
  • pETL
  • Luigi
  • Part 2 - pipeline workflow management platforms in Python
  • Airflow
  • Summary
  • Part 3: Creating ETL Pipelines in AWS
  • Chapter 9: A Primer on AWS Tools for ETL Processes
  • Common data storage tools in AWS
  • Amazon RDS
  • Amazon Redshift
  • Amazon S3
  • Amazon EC2
  • Discussion - Building flexible applications in AWS
  • Leveraging S3 and EC2
  • Computing and automation with AWS
  • AWS Glue
  • AWS Lambda
  • AWS Step Functions
  • AWS big data tools for ETL pipelines
  • AWS Data Pipeline
  • Amazon Kinesis
  • Amazon EMR
  • Walk-through - creating a Free Tier AWS account
  • Prerequisites for running AWS from your device
  • AWS CLI
  • Docker
  • LocalStack
  • AWS SAM CLI
  • Summary
  • Chapter 10: Tutorial - Creating an ETL Pipeline in AWS
  • Technical requirements
  • Creating a Python pipeline with Amazon S3, Lambda, and Step Functions
  • Setting the stage with the AWS CLI
  • Creating a "proof of concept" data pipeline in Python
  • Using Boto3 and Amazon S3 to read data
  • AWS Lambda functions
  • AWS Step Functions
  • An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS
  • Configuring your AWS environment with EC2 and RDS
  • Creating an RDS instance
  • Creating an EC2 instance
  • Creating a data pipeline locally with Bonobo
  • Adding the pipeline to AWS
  • Summary
  • Chapter 11: Building Robust Deployment Pipelines in AWS
  • Technical requirements
  • What is CI/CD and why is it important?
  • The six key elements of CI/CD
  • Essential steps for CI/CD adoption
  • CI/CD is a continual process
  • Creating a robust CI/CD process for ETL pipelines in AWS
  • Creating a CI/CD pipeline
  • Building an ETL pipeline using various AWS services
  • Setting up a CodeCommit repository
  • Orchestrating with AWS CodePipeline
  • Testing the pipeline
  • Summary
  • Part 4: Automating and Scaling ETL Pipelines
  • Chapter 12: Orchestration and Scaling in ETL Pipelines
  • Technical requirements
  • Performance bottlenecks
  • Inflexibility
  • Limited scalability
  • Operational overheads
  • Exploring the types of scaling
  • Vertical scaling
  • Horizontal scaling
  • Choose your scaling strategy
  • Processing requirements
  • Data volume
  • Cost
  • Complexity and skills
  • Reliability and availability
  • Data pipeline orchestration
  • Task scheduling
  • Error handling and recovery
  • Resource management
  • Monitoring and logging
  • Putting it together with a practical example
  • Summary
  • Chapter 13: Testing Strategies for ETL Pipelines
  • Technical requirements
  • Benefits of testing data pipeline code
  • How to choose the right testing strategies for your ETL pipeline
  • How often should you test your ETL pipeline?
  • Creating tests for a simple ETL pipeline
  • Unit testing
  • Validation testing
  • Integration testing
  • End-to-end testing
  • Performance testing
  • Resilience testing
  • Best practices for a testing environment for ETL pipelines
  • Defining testing objectives
  • Establishing a testing framework
  • Automating ETL tests
  • Monitoring ETL pipelines
  • ETL testing challenges
  • Data privacy and security
  • Environment parity
  • Top ETL testing tools
  • Summary
  • Chapter 14: Best Practices for ETL Pipelines
  • Technical requirements
  • Data quality
  • Poor scalability
  • Lack of error-handling and recovery methods
  • ETL logging in Python
  • Debugging and issue resolution
  • Auditing and compliance
  • Performance monitoring
  • Including contextual information
  • Handling exceptions and errors
  • The Goldilocks principle
  • Implementing logging in Python
  • Checkpoint for recovery
  • Avoiding SPOFs
  • Modularity and auditing
  • Modularity
  • Auditing
  • Summary
  • Chapter 15: Use Cases and Further Reading
  • Technical requirements
  • New York Yellow Taxi data, ETL pipeline, and deployment
  • Step 1 - configuration
  • Step 2 - ETL pipeline script
  • Step 3 - unit tests
  • Building a robust ETL pipeline with US construction data in AWS
  • Prerequisites
  • Step 1 - data extraction
  • Step 2 - data transformation
  • Step 3 - data loading
  • Running the ETL pipeline
  • Bonus - deploying your ETL pipeline
  • Summary
  • Further reading
  • Index
  • About Packt
  • Other Books You May Enjoy