Building ETL Pipelines with Python: Create and Deploy Enterprise-Ready ETL Pipelines by Employing Modern Methods
Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases.

Key Features:
- Understand how to set up a Python virtual environment with PyCharm
- Learn functional and object-oriented approaches to create ETL pipelines
- Create robust CI/CD processes for ETL pipelines
Other authors:
Format: eBook
Language: English
Published: Birmingham, England : Packt Publishing Ltd, [2023]
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009769033206719
Table of Contents:
- Cover
- Title Page
- Copyright
- Dedication
- Contributors
- Table of Contents
- Preface
- Part 1: Introduction to ETL, Data Pipelines, and Design Principles
- Chapter 1: A Primer on Python and the Development Environment
- Introducing Python fundamentals
- An overview of Python data structures
- Python if…else conditions or conditional statements
- Python looping techniques
- Python functions
- Object-oriented programming with Python
- Working with files in Python
- Establishing a development environment
- Version control with Git tracking
- Documenting environment dependencies with requirements.txt
- Utilizing module management systems (MMSs)
- Configuring a Pipenv environment in PyCharm
- Summary
- Chapter 2: Understanding the ETL Process and Data Pipelines
- What is a data pipeline?
- How do we create a robust pipeline?
- Pre-work - understanding your data
- Design planning - planning your workflow
- Architecture development - developing your resources
- Putting it all together - project diagrams
- What is an ETL data pipeline?
- Batch processing
- Streaming method
- Cloud-native
- Automating ETL pipelines
- Exploring use cases for ETL pipelines
- Summary
- References
- Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
- Technical requirements
- Understanding the design patterns for ETL
- Basic ETL design pattern
- ETL-P design pattern
- ETL-VP design pattern
- ELT two-phase pattern
- Preparing your local environment for installations
- Open source Python libraries for ETL pipelines
- Pandas
- NumPy
- Scaling for big data packages
- Dask
- Numba
- Summary
- References
- Part 2: Designing ETL Pipelines with Python
- Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
- Technical requirements
- What is data sourcing?
- Accessibility to data
- Types of data sources
- Getting started with data extraction
- CSV and Excel data files
- Parquet data files
- API connections
- Databases
- Data from web pages
- Creating a data extraction pipeline using Python
- Data extraction
- Logging
- Summary
- References
- Chapter 5: Data Cleansing and Transformation
- Technical requirements
- Scrubbing your data
- Data transformation
- Data cleansing and transformation in ETL pipelines
- Understanding the downstream applications of your data
- Strategies for data cleansing and transformation in Python
- Preliminary tasks - the importance of staging data
- Transformation activities in Python
- Creating data pipeline activity in Python
- Summary
- Chapter 6: Loading Transformed Data
- Technical requirements
- Introduction to data loading
- Choosing the load destination
- Types of load destinations
- Best practices for data loading
- Optimizing data loading activities by controlling the data import method
- Creating demo data
- Full data loads
- Incremental data loads
- Precautions to consider
- Tutorial - preparing your local environment for data loading activities
- Downloading and installing PostgreSQL
- Creating data schemas in PostgreSQL
- Summary
- Chapter 7: Tutorial - Building an End-to-End ETL Pipeline in Python
- Technical requirements
- Introducing the project
- The approach
- The data
- Creating tables in PostgreSQL
- Sourcing and extracting the data
- Transformation and data cleansing
- Loading data into PostgreSQL tables
- Making it deployable
- Summary
- Chapter 8: Powerful ETL Libraries and Tools in Python
- Technical requirements
- Architecture of Python files
- Configuring your local environment
- config.ini
- config.yaml
- Part 1 - ETL tools in Python
- Bonobo
- Odo
- Mito ETL
- Riko
- pETL
- Luigi
- Part 2 - pipeline workflow management platforms in Python
- Airflow
- Summary
- Part 3: Creating ETL Pipelines in AWS
- Chapter 9: A Primer on AWS Tools for ETL Processes
- Common data storage tools in AWS
- Amazon RDS
- Amazon Redshift
- Amazon S3
- Amazon EC2
- Discussion - building flexible applications in AWS
- Leveraging S3 and EC2
- Computing and automation with AWS
- AWS Glue
- AWS Lambda
- AWS Step Functions
- AWS big data tools for ETL pipelines
- AWS Data Pipeline
- Amazon Kinesis
- Amazon EMR
- Walk-through - creating a Free Tier AWS account
- Prerequisites for running AWS from your device
- AWS CLI
- Docker
- LocalStack
- AWS SAM CLI
- Summary
- Chapter 10: Tutorial - Creating an ETL Pipeline in AWS
- Technical requirements
- Creating a Python pipeline with Amazon S3, Lambda, and Step Functions
- Setting the stage with the AWS CLI
- Creating a "proof of concept" data pipeline in Python
- Using Boto3 and Amazon S3 to read data
- AWS Lambda functions
- AWS Step Functions
- An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS
- Configuring your AWS environment with EC2 and RDS
- Creating an RDS instance
- Creating an EC2 instance
- Creating a data pipeline locally with Bonobo
- Adding the pipeline to AWS
- Summary
- Chapter 11: Building Robust Deployment Pipelines in AWS
- Technical requirements
- What is CI/CD and why is it important?
- The six key elements of CI/CD
- Essential steps for CI/CD adoption
- CI/CD is a continual process
- Creating a robust CI/CD process for ETL pipelines in AWS
- Creating a CI/CD pipeline
- Building an ETL pipeline using various AWS services
- Setting up a CodeCommit repository
- Orchestrating with AWS CodePipeline
- Testing the pipeline
- Summary
- Part 4: Automating and Scaling ETL Pipelines
- Chapter 12: Orchestration and Scaling in ETL Pipelines
- Technical requirements
- Performance bottlenecks
- Inflexibility
- Limited scalability
- Operational overheads
- Exploring the types of scaling
- Vertical scaling
- Horizontal scaling
- Choose your scaling strategy
- Processing requirements
- Data volume
- Cost
- Complexity and skills
- Reliability and availability
- Data pipeline orchestration
- Task scheduling
- Error handling and recovery
- Resource management
- Monitoring and logging
- Putting it together with a practical example
- Summary
- Chapter 13: Testing Strategies for ETL Pipelines
- Technical requirements
- Benefits of testing data pipeline code
- How to choose the right testing strategies for your ETL pipeline
- How often should you test your ETL pipeline?
- Creating tests for a simple ETL pipeline
- Unit testing
- Validation testing
- Integration testing
- End-to-end testing
- Performance testing
- Resilience testing
- Best practices for a testing environment for ETL pipelines
- Defining testing objectives
- Establishing a testing framework
- Automating ETL tests
- Monitoring ETL pipelines
- ETL testing challenges
- Data privacy and security
- Environment parity
- Top ETL testing tools
- Summary
- Chapter 14: Best Practices for ETL Pipelines
- Technical requirements
- Data quality
- Poor scalability
- Lack of error-handling and recovery methods
- ETL logging in Python
- Debugging and issue resolution
- Auditing and compliance
- Performance monitoring
- Including contextual information
- Handling exceptions and errors
- The Goldilocks principle
- Implementing logging in Python
- Checkpoint for recovery
- Avoiding SPOFs
- Modularity and auditing
- Modularity
- Auditing
- Summary
- Chapter 15: Use Cases and Further Reading
- Technical requirements
- New York Yellow Taxi data, ETL pipeline, and deployment
- Step 1 - configuration
- Step 2 - ETL pipeline script
- Step 3 - unit tests
- Building a robust ETL pipeline with US construction data in AWS
- Prerequisites
- Step 1 - data extraction
- Step 2 - data transformation
- Step 3 - data loading
- Running the ETL pipeline
- Bonus - deploying your ETL pipeline
- Summary
- Further reading
- Index
- About Packt
- Other Books You May Enjoy