Data pipelines with Apache Airflow
"An Airflow bible. Useful for all kinds of users, from novice to expert." (Rambabu Posa, Sai Aashika Consultancy)

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable en...
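As a flavor of the pipelines-as-code approach the description refers to, the sketch below shows what a minimal Airflow DAG of two dependent tasks might look like. It is illustrative only and not taken from the book; the DAG id, schedule, and shell commands are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical two-step pipeline: fetch data, then process it.
with DAG(
    dag_id="example_pipeline",       # illustrative name, not from the book
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # run once per day
) as dag:
    fetch = BashOperator(
        task_id="fetch_data",
        bash_command="echo 'fetching data'",    # placeholder command
    )
    process = BashOperator(
        task_id="process_data",
        bash_command="echo 'processing data'",  # placeholder command
    )

    # Declare the dependency: process_data runs only after fetch_data succeeds.
    fetch >> process
```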
Other Authors:
Format: E-book
Language: English
Published: Shelter Island, New York : Manning, [2021]
Edition: 1st edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631842806719
Table of Contents:
- Intro
- inside front cover
- Data Pipelines with Apache Airflow
- Copyright
- brief contents
- contents
- front matter
- preface
- acknowledgments
- Bas Harenslak
- Julian de Ruiter
- about this book
- Who should read this book
- How this book is organized: A road map
- About the code
- LiveBook discussion forum
- about the authors
- about the cover illustration
- Part 1. Getting started
- 1 Meet Apache Airflow
- 1.1 Introducing data pipelines
- 1.1.1 Data pipelines as graphs
- 1.1.2 Executing a pipeline graph
- 1.1.3 Pipeline graphs vs. sequential scripts
- 1.1.4 Running pipelines using workflow managers
- 1.2 Introducing Airflow
- 1.2.1 Defining pipelines flexibly in (Python) code
- 1.2.2 Scheduling and executing pipelines
- 1.2.3 Monitoring and handling failures
- 1.2.4 Incremental loading and backfilling
- 1.3 When to use Airflow
- 1.3.1 Reasons to choose Airflow
- 1.3.2 Reasons not to choose Airflow
- 1.4 The rest of this book
- Summary
- 2 Anatomy of an Airflow DAG
- 2.1 Collecting data from numerous sources
- 2.1.1 Exploring the data
- 2.2 Writing your first Airflow DAG
- 2.2.1 Tasks vs. operators
- 2.2.2 Running arbitrary Python code
- 2.3 Running a DAG in Airflow
- 2.3.1 Running Airflow in a Python environment
- 2.3.2 Running Airflow in Docker containers
- 2.3.3 Inspecting the Airflow UI
- 2.4 Running at regular intervals
- 2.5 Handling failing tasks
- Summary
- 3 Scheduling in Airflow
- 3.1 An example: Processing user events
- 3.2 Running at regular intervals
- 3.2.1 Defining scheduling intervals
- 3.2.2 Cron-based intervals
- 3.2.3 Frequency-based intervals
- 3.3 Processing data incrementally
- 3.3.1 Fetching events incrementally
- 3.3.2 Dynamic time references using execution dates
- 3.3.3 Partitioning your data
- 3.4 Understanding Airflow's execution dates
- 3.4.1 Executing work in fixed-length intervals
- 3.5 Using backfilling to fill in past gaps
- 3.5.1 Executing work back in time
- 3.6 Best practices for designing tasks
- 3.6.1 Atomicity
- 3.6.2 Idempotency
- Summary
- 4 Templating tasks using the Airflow context
- 4.1 Inspecting data for processing with Airflow
- 4.1.1 Determining how to load incremental data
- 4.2 Task context and Jinja templating
- 4.2.1 Templating operator arguments
- 4.2.2 What is available for templating?
- 4.2.3 Templating the PythonOperator
- 4.2.4 Providing variables to the PythonOperator
- 4.2.5 Inspecting templated arguments
- 4.3 Hooking up other systems
- Summary
- 5 Defining dependencies between tasks
- 5.1 Basic dependencies
- 5.1.1 Linear dependencies
- 5.1.2 Fan-in/-out dependencies
- 5.2 Branching
- 5.2.1 Branching within tasks
- 5.2.2 Branching within the DAG
- 5.3 Conditional tasks
- 5.3.1 Conditions within tasks
- 5.3.2 Making tasks conditional
- 5.3.3 Using built-in operators
- 5.4 More about trigger rules
- 5.4.1 What is a trigger rule?
- 5.4.2 The effect of failures
- 5.4.3 Other trigger rules
- 5.5 Sharing data between tasks
- 5.5.1 Sharing data using XComs
- 5.5.2 When (not) to use XComs
- 5.5.3 Using custom XCom backends
- 5.6 Chaining Python tasks with the Taskflow API
- 5.6.1 Simplifying Python tasks with the Taskflow API
- 5.6.2 When (not) to use the Taskflow API
- Summary
- Part 2. Beyond the basics
- 6 Triggering workflows
- 6.1 Polling conditions with sensors
- 6.1.1 Polling custom conditions
- 6.1.2 Sensors outside the happy flow
- 6.2 Triggering other DAGs
- 6.2.1 Backfilling with the TriggerDagRunOperator
- 6.2.2 Polling the state of other DAGs
- 6.3 Starting workflows with REST/CLI
- Summary
- 7 Communicating with external systems
- 7.1 Connecting to cloud services
- 7.1.1 Installing extra dependencies
- 7.1.2 Developing a machine learning model
- 7.1.3 Developing locally with external systems
- 7.2 Moving data between systems
- 7.2.1 Implementing a PostgresToS3Operator
- 7.2.2 Outsourcing the heavy work
- Summary
- 8 Building custom components
- 8.1 Starting with a PythonOperator
- 8.1.1 Simulating a movie rating API
- 8.1.2 Fetching ratings from the API
- 8.1.3 Building the actual DAG
- 8.2 Building a custom hook
- 8.2.1 Designing a custom hook
- 8.2.2 Building our DAG with the MovielensHook
- 8.3 Building a custom operator
- 8.3.1 Defining a custom operator
- 8.3.2 Building an operator for fetching ratings
- 8.4 Building custom sensors
- 8.5 Packaging your components
- 8.5.1 Bootstrapping a Python package
- 8.5.2 Installing your package
- Summary
- 9 Testing
- 9.1 Getting started with testing
- 9.1.1 Integrity testing all DAGs
- 9.1.2 Setting up a CI/CD pipeline
- 9.1.3 Writing unit tests
- 9.1.4 Pytest project structure
- 9.1.5 Testing with files on disk
- 9.2 Working with DAGs and task context in tests
- 9.2.1 Working with external systems
- 9.3 Using tests for development
- 9.3.1 Testing complete DAGs
- 9.4 Emulate production environments with Whirl
- 9.5 Create DTAP environments
- Summary
- 10 Running tasks in containers
- 10.1 Challenges of many different operators
- 10.1.1 Operator interfaces and implementations
- 10.1.2 Complex and conflicting dependencies
- 10.1.3 Moving toward a generic operator
- 10.2 Introducing containers
- 10.2.1 What are containers?
- 10.2.2 Running our first Docker container
- 10.2.3 Creating a Docker image
- 10.2.4 Persisting data using volumes
- 10.3 Containers and Airflow
- 10.3.1 Tasks in containers
- 10.3.2 Why use containers?
- 10.4 Running tasks in Docker
- 10.4.1 Introducing the DockerOperator
- 10.4.2 Creating container images for tasks
- 10.4.3 Building a DAG with Docker tasks
- 10.4.4 Docker-based workflow
- 10.5 Running tasks in Kubernetes
- 10.5.1 Introducing Kubernetes
- 10.5.2 Setting up Kubernetes
- 10.5.3 Using the KubernetesPodOperator
- 10.5.4 Diagnosing Kubernetes-related issues
- 10.5.5 Differences with Docker-based workflows
- Summary
- Part 3. Airflow in practice
- 11 Best practices
- 11.1 Writing clean DAGs
- 11.1.1 Use style conventions
- 11.1.2 Manage credentials centrally
- 11.1.3 Specify configuration details consistently
- 11.1.4 Avoid doing any computation in your DAG definition
- 11.1.5 Use factories to generate common patterns
- 11.1.6 Group related tasks using task groups
- 11.1.7 Create new DAGs for big changes
- 11.2 Designing reproducible tasks
- 11.2.1 Always require tasks to be idempotent
- 11.2.2 Task results should be deterministic
- 11.2.3 Design tasks using functional paradigms
- 11.3 Handling data efficiently
- 11.3.1 Limit the amount of data being processed
- 11.3.2 Incremental loading/processing
- 11.3.3 Cache intermediate data
- 11.3.4 Don't store data on local file systems
- 11.3.5 Offload work to external/source systems
- 11.4 Managing your resources
- 11.4.1 Managing concurrency using pools
- 11.4.2 Detecting long-running tasks using SLAs and alerts
- Summary
- 12 Operating Airflow in production
- 12.1 Airflow architectures
- 12.1.1 Which executor is right for me?
- 12.1.2 Configuring a metastore for Airflow
- 12.1.3 A closer look at the scheduler
- 12.2 Installing each executor
- 12.2.1 Setting up the SequentialExecutor
- 12.2.2 Setting up the LocalExecutor
- 12.2.3 Setting up the CeleryExecutor
- 12.2.4 Setting up the KubernetesExecutor
- 12.3 Capturing logs of all Airflow processes
- 12.3.1 Capturing the webserver output
- 12.3.2 Capturing the scheduler output
- 12.3.3 Capturing task logs
- 12.3.4 Sending logs to remote storage
- 12.4 Visualizing and monitoring Airflow metrics
- 12.4.1 Collecting metrics from Airflow
- 12.4.2 Configuring Airflow to send metrics
- 12.4.3 Configuring Prometheus to collect metrics
- 12.4.4 Creating dashboards with Grafana
- 12.4.5 What should you monitor?
- 12.5 How to get notified of a failing task
- 12.5.1 Alerting within DAGs and operators
- 12.5.2 Defining service-level agreements
- 12.6 Scalability and performance
- 12.6.1 Controlling the maximum number of running tasks
- 12.6.2 System performance configurations
- 12.6.3 Running multiple schedulers
- Summary
- 13 Securing Airflow
- 13.1 Securing the Airflow web interface
- 13.1.1 Adding users to the RBAC interface
- 13.1.2 Configuring the RBAC interface
- 13.2 Encrypting data at rest
- 13.2.1 Creating a Fernet key
- 13.3 Connecting with an LDAP service
- 13.3.1 Understanding LDAP
- 13.3.2 Fetching users from an LDAP service
- 13.4 Encrypting traffic to the webserver
- 13.4.1 Understanding HTTPS
- 13.4.2 Configuring a certificate for HTTPS
- 13.5 Fetching credentials from secret management systems
- Summary
- 14 Project: Finding the fastest way to get around NYC
- 14.1 Understanding the data
- 14.1.1 Yellow Cab file share
- 14.1.2 Citi Bike REST API
- 14.1.3 Deciding on a plan of approach
- 14.2 Extracting the data
- 14.2.1 Downloading Citi Bike data
- 14.2.2 Downloading Yellow Cab data
- 14.3 Applying similar transformations to data
- 14.4 Structuring a data pipeline
- 14.5 Developing idempotent data pipelines
- Summary
- Part 4. In the clouds
- 15 Airflow in the clouds
- 15.1 Designing (cloud) deployment strategies
- 15.2 Cloud-specific operators and hooks
- 15.3 Managed services
- 15.3.1 Astronomer.io
- 15.3.2 Google Cloud Composer
- 15.3.3 Amazon Managed Workflows for Apache Airflow