Data Engineering with Databricks Cookbook: Build Effective Data and AI Solutions Using Apache Spark, Databricks, and Delta Lake

Data Engineering with Databricks Cookbook will guide you through recipes to effectively use Apache Spark, Delta Lake, and Databricks for data engineering, beginning with an introduction to data ingestion and loading with Apache Spark. As you progress, you’ll be introduced to various data manipulatio...


Bibliographic Details
Other Authors: Chadha, Pulkit (author)
Format: eBook
Language: English
Published: Birmingham, England: Packt Publishing, [2024]
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009825852706719
Table of Contents:
  • Intro
  • Title Page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Part 1 - Working with Apache Spark and Delta Lake
  • Chapter 1: Data Ingestion and Data Extraction with Apache Spark
  • Technical requirements
  • Reading CSV data with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Reading JSON data with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Reading Parquet data with Apache Spark
  • How to do it...
  • See also
  • Parsing XML data with Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Working with nested data structures in Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Processing text data in Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Writing data with Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Chapter 2: Data Transformation and Data Manipulation with Apache Spark
  • Technical requirements
  • Applying basic transformations to data with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Filtering data with Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Performing joins with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Performing aggregations with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Using window functions with Apache Spark
  • How to do it...
  • There's more…
  • Writing custom UDFs in Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Handling null values with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Chapter 3: Data Management with Delta Lake
  • Technical requirements
  • Creating a Delta Lake table
  • How to do it...
  • There's more…
  • See also
  • Reading a Delta Lake table
  • How to do it...
  • There's more...
  • See also
  • Updating data in a Delta Lake table
  • How to do it...
  • See also
  • Merging data into Delta tables
  • How to do it...
  • There's more…
  • See also
  • Change data capture in Delta Lake
  • How to do it...
  • See also
  • Optimizing Delta Lake tables
  • How to do it...
  • There's more...
  • See also
  • Versioning and time travel for Delta Lake tables
  • How to do it...
  • There's more...
  • See also
  • Managing Delta Lake tables
  • How to do it...
  • See also
  • Chapter 4: Ingesting Streaming Data
  • Technical requirements
  • Configuring Spark Structured Streaming for real-time data processing
  • Getting ready
  • How to do it…
  • How it works…
  • There's more…
  • See also
  • Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • How it works…
  • There's more…
  • See also
  • Defining transformations and filters on a Streaming DataFrame
  • Getting ready
  • How to do it…
  • See also
  • Configuring checkpoints for Structured Streaming in Apache Spark
  • Getting ready
  • How to do it…
  • How it works…
  • There's more…
  • See also
  • Configuring triggers for Structured Streaming in Apache Spark
  • Getting ready
  • How to do it…
  • How it works…
  • See also
  • Applying window aggregations to streaming data with Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Chapter 5: Processing Streaming Data
  • Technical requirements
  • Writing the output of Apache Spark Structured Streaming to a sink such as Delta Lake
  • Getting ready
  • How to do it…
  • How it works…
  • See also
  • Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • See also
  • Merging or applying Change Data Capture on Apache Spark Structured Streaming and Delta Lake
  • Getting ready
  • How to do it…
  • There's more…
  • Joining streaming data with static data in Apache Spark Structured Streaming and Delta Lake
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Monitoring real-time data processing with Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Chapter 6: Performance Tuning with Apache Spark
  • Technical requirements
  • Monitoring Spark jobs in the Spark UI
  • How to do it…
  • See also
  • Using broadcast variables
  • How to do it…
  • How it works…
  • There's more…
  • Optimizing Spark jobs by minimizing data shuffling
  • How to do it…
  • See also
  • Avoiding data skew
  • How to do it…
  • There's more...
  • Caching and persistence
  • How to do it…
  • There's more…
  • Partitioning and repartitioning
  • How to do it…
  • There's more…
  • Optimizing join strategies
  • How to do it…
  • See also
  • Chapter 7: Performance Tuning in Delta Lake
  • Technical requirements
  • Optimizing Delta Lake table partitioning for query performance
  • How to do it…
  • There's more…
  • See also
  • Organizing data with Z-ordering for efficient query execution
  • How to do it…
  • How it works…
  • See also
  • Skipping data for faster query execution
  • How to do it…
  • See also
  • Reducing Delta Lake table size and I/O cost with compression
  • How to do it…
  • How it works…
  • See also
  • Part 2 - Data Engineering Capabilities within Databricks
  • Chapter 8: Orchestration and Scheduling Data Pipelines with Databricks Workflows
  • Technical requirements
  • Building Databricks workflows
  • How to do it…
  • See also
  • Running and managing Databricks Workflows
  • How to do it...
  • See also
  • Passing task and job parameters within a Databricks Workflow
  • How to do it...
  • See also
  • Conditional branching in Databricks Workflows
  • How to do it...
  • See also
  • Triggering jobs based on file arrival
  • Getting ready
  • How to do it…
  • See also
  • Setting up workflow alerts and notifications
  • How to do it…
  • There's more…
  • See also
  • Troubleshooting and repairing failures in Databricks Workflows
  • How to do it...
  • See also
  • Chapter 9: Building Data Pipelines with Delta Live Tables
  • Technical requirements
  • Creating a multi-hop medallion architecture data pipeline with Delta Live Tables in Databricks
  • How to do it…
  • How it works…
  • See also
  • Building a data pipeline with Delta Live Tables on Databricks
  • How to do it…
  • See also
  • Implementing data quality and validation rules with Delta Live Tables in Databricks
  • How to do it…
  • How it works…
  • See also
  • Quarantining bad data with Delta Live Tables in Databricks
  • How to do it…
  • See also
  • Monitoring Delta Live Tables pipelines
  • How to do it…
  • See also
  • Deploying Delta Live Tables pipelines with Databricks Asset Bundles
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Applying changes (CDC) to Delta tables with Delta Live Tables
  • How to do it…
  • See also
  • Chapter 10: Data Governance with Unity Catalog
  • Technical requirements
  • Connecting to cloud object storage using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Creating and managing catalogs, schemas, volumes, and tables using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Defining and applying fine-grained access control policies using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Tagging, commenting, and capturing metadata about data and AI assets using Databricks Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Filtering sensitive data with Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Using Unity Catalog's lineage data for debugging, root cause analysis, and impact assessment
  • Getting ready
  • How to do it…
  • See also
  • Accessing and querying system tables using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Chapter 11: Implementing DataOps and DevOps on Databricks
  • Technical requirements
  • Using Databricks Repos to store code in Git
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Automating tasks by using the Databricks CLI
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Using the Databricks VSCode extension for local development and testing
  • Getting ready
  • How to do it…
  • See also
  • Using Databricks Asset Bundles (DABs)
  • Getting ready
  • How to do it…
  • See also
  • Leveraging GitHub Actions with Databricks Asset Bundles (DABs)
  • Getting ready
  • How to do it…
  • See also
  • Index
  • About Packt
  • Other Books You May Enjoy