Data pipelines with Apache Airflow

"An Airflow bible. Useful for all kinds of users, from novice to expert." Rambabu Posa, Sai Aashika Consultancy

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable en...
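
As a small taste of the material in chapters 1 and 2 listed below (defining pipelines in Python code and writing a first DAG), here is a minimal sketch of an Airflow DAG. It assumes Airflow 2.x; the DAG id, schedule, and task callables are illustrative placeholders, not code from the book.

    # Minimal Airflow DAG sketch (assumes Airflow 2.x is installed).
    # The DAG id, schedule, and callables below are illustrative only.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def _fetch_data():
        print("fetching data")  # placeholder: pull data from a source system


    def _clean_data():
        print("cleaning data")  # placeholder: transform the fetched data


    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",  # run once per day
        catchup=False,               # do not backfill past intervals
    ) as dag:
        fetch = PythonOperator(task_id="fetch_data", python_callable=_fetch_data)
        clean = PythonOperator(task_id="clean_data", python_callable=_clean_data)

        fetch >> clean  # fetch_data must finish before clean_data starts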


Bibliographic Details
Other Authors: Harenslak, Bas (author); Ruiter, Julian de (author)
Format: Electronic book
Language: English
Published: Shelter Island, New York: Manning, [2021]
Edition: 1st edition
Subjects:
View at the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631842806719
Table of Contents:
  • Intro
  • inside front cover
  • Data Pipelines with Apache Airflow
  • Copyright
  • brief contents
  • contents
  • front matter
  • preface
  • acknowledgments
  • Bas Harenslak
  • Julian de Ruiter
  • about this book
  • Who should read this book
  • How this book is organized: A road map
  • About the code
  • LiveBook discussion forum
  • about the authors
  • about the cover illustration
  • Part 1. Getting started
  • 1 Meet Apache Airflow
  • 1.1 Introducing data pipelines
  • 1.1.1 Data pipelines as graphs
  • 1.1.2 Executing a pipeline graph
  • 1.1.3 Pipeline graphs vs. sequential scripts
  • 1.1.4 Running pipelines using workflow managers
  • 1.2 Introducing Airflow
  • 1.2.1 Defining pipelines flexibly in (Python) code
  • 1.2.2 Scheduling and executing pipelines
  • 1.2.3 Monitoring and handling failures
  • 1.2.4 Incremental loading and backfilling
  • 1.3 When to use Airflow
  • 1.3.1 Reasons to choose Airflow
  • 1.3.2 Reasons not to choose Airflow
  • 1.4 The rest of this book
  • Summary
  • 2 Anatomy of an Airflow DAG
  • 2.1 Collecting data from numerous sources
  • 2.1.1 Exploring the data
  • 2.2 Writing your first Airflow DAG
  • 2.2.1 Tasks vs. operators
  • 2.2.2 Running arbitrary Python code
  • 2.3 Running a DAG in Airflow
  • 2.3.1 Running Airflow in a Python environment
  • 2.3.2 Running Airflow in Docker containers
  • 2.3.3 Inspecting the Airflow UI
  • 2.4 Running at regular intervals
  • 2.5 Handling failing tasks
  • Summary
  • 3 Scheduling in Airflow
  • 3.1 An example: Processing user events
  • 3.2 Running at regular intervals
  • 3.2.1 Defining scheduling intervals
  • 3.2.2 Cron-based intervals
  • 3.2.3 Frequency-based intervals
  • 3.3 Processing data incrementally
  • 3.3.1 Fetching events incrementally
  • 3.3.2 Dynamic time references using execution dates
  • 3.3.3 Partitioning your data
  • 3.4 Understanding Airflow's execution dates
  • 3.4.1 Executing work in fixed-length intervals
  • 3.5 Using backfilling to fill in past gaps
  • 3.5.1 Executing work back in time
  • 3.6 Best practices for designing tasks
  • 3.6.1 Atomicity
  • 3.6.2 Idempotency
  • Summary
  • 4 Templating tasks using the Airflow context
  • 4.1 Inspecting data for processing with Airflow
  • 4.1.1 Determining how to load incremental data
  • 4.2 Task context and Jinja templating
  • 4.2.1 Templating operator arguments
  • 4.2.2 What is available for templating?
  • 4.2.3 Templating the PythonOperator
  • 4.2.4 Providing variables to the PythonOperator
  • 4.2.5 Inspecting templated arguments
  • 4.3 Hooking up other systems
  • Summary
  • 5 Defining dependencies between tasks
  • 5.1 Basic dependencies
  • 5.1.1 Linear dependencies
  • 5.1.2 Fan-in/-out dependencies
  • 5.2 Branching
  • 5.2.1 Branching within tasks
  • 5.2.2 Branching within the DAG
  • 5.3 Conditional tasks
  • 5.3.1 Conditions within tasks
  • 5.3.2 Making tasks conditional
  • 5.3.3 Using built-in operators
  • 5.4 More about trigger rules
  • 5.4.1 What is a trigger rule?
  • 5.4.2 The effect of failures
  • 5.4.3 Other trigger rules
  • 5.5 Sharing data between tasks
  • 5.5.1 Sharing data using XComs
  • 5.5.2 When (not) to use XComs
  • 5.5.3 Using custom XCom backends
  • 5.6 Chaining Python tasks with the Taskflow API
  • 5.6.1 Simplifying Python tasks with the Taskflow API
  • 5.6.2 When (not) to use the Taskflow API
  • Summary
  • Part 2. Beyond the basics
  • 6 Triggering workflows
  • 6.1 Polling conditions with sensors
  • 6.1.1 Polling custom conditions
  • 6.1.2 Sensors outside the happy flow
  • 6.2 Triggering other DAGs
  • 6.2.1 Backfilling with the TriggerDagRunOperator
  • 6.2.2 Polling the state of other DAGs
  • 6.3 Starting workflows with REST/CLI
  • Summary
  • 7 Communicating with external systems
  • 7.1 Connecting to cloud services
  • 7.1.1 Installing extra dependencies
  • 7.1.2 Developing a machine learning model
  • 7.1.3 Developing locally with external systems
  • 7.2 Moving data between systems
  • 7.2.1 Implementing a PostgresToS3Operator
  • 7.2.2 Outsourcing the heavy work
  • Summary
  • 8 Building custom components
  • 8.1 Starting with a PythonOperator
  • 8.1.1 Simulating a movie rating API
  • 8.1.2 Fetching ratings from the API
  • 8.1.3 Building the actual DAG
  • 8.2 Building a custom hook
  • 8.2.1 Designing a custom hook
  • 8.2.2 Building our DAG with the MovielensHook
  • 8.3 Building a custom operator
  • 8.3.1 Defining a custom operator
  • 8.3.2 Building an operator for fetching ratings
  • 8.4 Building custom sensors
  • 8.5 Packaging your components
  • 8.5.1 Bootstrapping a Python package
  • 8.5.2 Installing your package
  • Summary
  • 9 Testing
  • 9.1 Getting started with testing
  • 9.1.1 Integrity testing all DAGs
  • 9.1.2 Setting up a CI/CD pipeline
  • 9.1.3 Writing unit tests
  • 9.1.4 Pytest project structure
  • 9.1.5 Testing with files on disk
  • 9.2 Working with DAGs and task context in tests
  • 9.2.1 Working with external systems
  • 9.3 Using tests for development
  • 9.3.1 Testing complete DAGs
  • 9.4 Emulate production environments with Whirl
  • 9.5 Create DTAP environments
  • Summary
  • 10 Running tasks in containers
  • 10.1 Challenges of many different operators
  • 10.1.1 Operator interfaces and implementations
  • 10.1.2 Complex and conflicting dependencies
  • 10.1.3 Moving toward a generic operator
  • 10.2 Introducing containers
  • 10.2.1 What are containers?
  • 10.2.2 Running our first Docker container
  • 10.2.3 Creating a Docker image
  • 10.2.4 Persisting data using volumes
  • 10.3 Containers and Airflow
  • 10.3.1 Tasks in containers
  • 10.3.2 Why use containers?
  • 10.4 Running tasks in Docker
  • 10.4.1 Introducing the DockerOperator
  • 10.4.2 Creating container images for tasks
  • 10.4.3 Building a DAG with Docker tasks
  • 10.4.4 Docker-based workflow
  • 10.5 Running tasks in Kubernetes
  • 10.5.1 Introducing Kubernetes
  • 10.5.2 Setting up Kubernetes
  • 10.5.3 Using the KubernetesPodOperator
  • 10.5.4 Diagnosing Kubernetes-related issues
  • 10.5.5 Differences with Docker-based workflows
  • Summary
  • Part 3. Airflow in practice
  • 11 Best practices
  • 11.1 Writing clean DAGs
  • 11.1.1 Use style conventions
  • 11.1.2 Manage credentials centrally
  • 11.1.3 Specify configuration details consistently
  • 11.1.4 Avoid doing any computation in your DAG definition
  • 11.1.5 Use factories to generate common patterns
  • 11.1.6 Group related tasks using task groups
  • 11.1.7 Create new DAGs for big changes
  • 11.2 Designing reproducible tasks
  • 11.2.1 Always require tasks to be idempotent
  • 11.2.2 Task results should be deterministic
  • 11.2.3 Design tasks using functional paradigms
  • 11.3 Handling data efficiently
  • 11.3.1 Limit the amount of data being processed
  • 11.3.2 Incremental loading/processing
  • 11.3.3 Cache intermediate data
  • 11.3.4 Don't store data on local file systems
  • 11.3.5 Offload work to external/source systems
  • 11.4 Managing your resources
  • 11.4.1 Managing concurrency using pools
  • 11.4.2 Detecting long-running tasks using SLAs and alerts
  • Summary
  • 12 Operating Airflow in production
  • 12.1 Airflow architectures
  • 12.1.1 Which executor is right for me?
  • 12.1.2 Configuring a metastore for Airflow
  • 12.1.3 A closer look at the scheduler
  • 12.2 Installing each executor
  • 12.2.1 Setting up the SequentialExecutor
  • 12.2.2 Setting up the LocalExecutor
  • 12.2.3 Setting up the CeleryExecutor
  • 12.2.4 Setting up the KubernetesExecutor
  • 12.3 Capturing logs of all Airflow processes
  • 12.3.1 Capturing the webserver output
  • 12.3.2 Capturing the scheduler output
  • 12.3.3 Capturing task logs
  • 12.3.4 Sending logs to remote storage
  • 12.4 Visualizing and monitoring Airflow metrics
  • 12.4.1 Collecting metrics from Airflow
  • 12.4.2 Configuring Airflow to send metrics
  • 12.4.3 Configuring Prometheus to collect metrics
  • 12.4.4 Creating dashboards with Grafana
  • 12.4.5 What should you monitor?
  • 12.5 How to get notified of a failing task
  • 12.5.1 Alerting within DAGs and operators
  • 12.5.2 Defining service-level agreements
  • 12.6 Scalability and performance
  • 12.6.1 Controlling the maximum number of running tasks
  • 12.6.2 System performance configurations
  • 12.6.3 Running multiple schedulers
  • Summary
  • 13 Securing Airflow
  • 13.1 Securing the Airflow web interface
  • 13.1.1 Adding users to the RBAC interface
  • 13.1.2 Configuring the RBAC interface
  • 13.2 Encrypting data at rest
  • 13.2.1 Creating a Fernet key
  • 13.3 Connecting with an LDAP service
  • 13.3.1 Understanding LDAP
  • 13.3.2 Fetching users from an LDAP service
  • 13.4 Encrypting traffic to the webserver
  • 13.4.1 Understanding HTTPS
  • 13.4.2 Configuring a certificate for HTTPS
  • 13.5 Fetching credentials from secret management systems
  • Summary
  • 14 Project: Finding the fastest way to get around NYC
  • 14.1 Understanding the data
  • 14.1.1 Yellow Cab file share
  • 14.1.2 Citi Bike REST API
  • 14.1.3 Deciding on a plan of approach
  • 14.2 Extracting the data
  • 14.2.1 Downloading Citi Bike data
  • 14.2.2 Downloading Yellow Cab data
  • 14.3 Applying similar transformations to data
  • 14.4 Structuring a data pipeline
  • 14.5 Developing idempotent data pipelines
  • Summary
  • Part 4. In the clouds
  • 15 Airflow in the clouds
  • 15.1 Designing (cloud) deployment strategies
  • 15.2 Cloud-specific operators and hooks
  • 15.3 Managed services
  • 15.3.1 Astronomer.io
  • 15.3.2 Google Cloud Composer
  • 15.3.3 Amazon Managed Workflows for Apache Airflow