Data Engineering with Databricks Cookbook: Build Effective Data and AI Solutions Using Apache Spark, Databricks, and Delta Lake

Data Engineering with Databricks Cookbook will guide you through recipes to effectively use Apache Spark, Delta Lake, and Databricks for data engineering, beginning with an introduction to data ingestion and loading with Apache Spark. As you progress, you’ll be introduced to various data manipulatio...


Bibliographic Details
Other Authors: Chadha, Pulkit (author)
Format: eBook
Language: English
Published: Birmingham, England: Packt Publishing, [2024]
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009825852706719
Table of Contents:
  • Intro
  • Title Page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Part 1 - Working with Apache Spark and Delta Lake
  • Chapter 1: Data Ingestion and Data Extraction with Apache Spark
  • Technical requirements
  • Reading CSV data with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Reading JSON data with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Reading Parquet data with Apache Spark
  • How to do it...
  • See also
  • Parsing XML data with Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Working with nested data structures in Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Processing text data in Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Writing data with Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Chapter 2: Data Transformation and Data Manipulation with Apache Spark
  • Technical requirements
  • Applying basic transformations to data with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Filtering data with Apache Spark
  • How to do it…
  • There's more…
  • See also
  • Performing joins with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Performing aggregations with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Using window functions with Apache Spark
  • How to do it...
  • There's more…
  • Writing custom UDFs in Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Handling null values with Apache Spark
  • How to do it...
  • There's more…
  • See also
  • Chapter 3: Data Management with Delta Lake
  • Technical requirements
  • Creating a Delta Lake table
  • How to do it...
  • There's more…
  • See also
  • Reading a Delta Lake table
  • How to do it...
  • There's more...
  • See also
  • Updating data in a Delta Lake table
  • How to do it...
  • See also
  • Merging data into Delta tables
  • How to do it...
  • There's more…
  • See also
  • Change data capture in Delta Lake
  • How to do it...
  • See also
  • Optimizing Delta Lake tables
  • How to do it...
  • There's more...
  • See also
  • Versioning and time travel for Delta Lake tables
  • How to do it...
  • There's more...
  • See also
  • Managing Delta Lake tables
  • How to do it...
  • See also
  • Chapter 4: Ingesting Streaming Data
  • Technical requirements
  • Configuring Spark Structured Streaming for real-time data processing
  • Getting ready
  • How to do it…
  • How it works…
  • There's more…
  • See also
  • Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • How it works…
  • There's more…
  • See also
  • Defining transformations and filters on a Streaming DataFrame
  • Getting ready
  • How to do it…
  • See also
  • Configuring checkpoints for Structured Streaming in Apache Spark
  • Getting ready
  • How to do it…
  • How it works…
  • There's more…
  • See also
  • Configuring triggers for Structured Streaming in Apache Spark
  • Getting ready
  • How to do it…
  • How it works…
  • See also
  • Applying window aggregations to streaming data with Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Chapter 5: Processing Streaming Data
  • Technical requirements
  • Writing the output of Apache Spark Structured Streaming to a sink such as Delta Lake
  • Getting ready
  • How to do it…
  • How it works…
  • See also
  • Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • See also
  • Merging or applying Change Data Capture on Apache Spark Structured Streaming and Delta Lake
  • Getting ready
  • How to do it…
  • There's more…
  • Joining streaming data with static data in Apache Spark Structured Streaming and Delta Lake
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Monitoring real-time data processing with Apache Spark Structured Streaming
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Chapter 6: Performance Tuning with Apache Spark
  • Technical requirements
  • Monitoring Spark jobs in the Spark UI
  • How to do it…
  • See also
  • Using broadcast variables
  • How to do it…
  • How it works…
  • There's more…
  • Optimizing Spark jobs by minimizing data shuffling
  • How to do it…
  • See also
  • Avoiding data skew
  • How to do it…
  • There's more...
  • Caching and persistence
  • How to do it…
  • There's more…
  • Partitioning and repartitioning
  • How to do it…
  • There's more…
  • Optimizing join strategies
  • How to do it…
  • See also
  • Chapter 7: Performance Tuning in Delta Lake
  • Technical requirements
  • Optimizing Delta Lake table partitioning for query performance
  • How to do it…
  • There's more…
  • See also
  • Organizing data with Z-ordering for efficient query execution
  • How to do it…
  • How it works…
  • See also
  • Skipping data for faster query execution
  • How to do it…
  • See also
  • Reducing Delta Lake table size and I/O cost with compression
  • How to do it…
  • How it works…
  • See also
  • Part 2 - Data Engineering Capabilities within Databricks
  • Chapter 8: Orchestration and Scheduling Data Pipelines with Databricks Workflows
  • Technical requirements
  • Building Databricks workflows
  • How to do it…
  • See also
  • Running and managing Databricks Workflows
  • How to do it...
  • See also
  • Passing task and job parameters within a Databricks Workflow
  • How to do it...
  • See also
  • Conditional branching in Databricks Workflows
  • How to do it...
  • See also
  • Triggering jobs based on file arrival
  • Getting ready
  • How to do it…
  • See also
  • Setting up workflow alerts and notifications
  • How to do it…
  • There's more…
  • See also
  • Troubleshooting and repairing failures in Databricks Workflows
  • How to do it...
  • See also
  • Chapter 9: Building Data Pipelines with Delta Live Tables
  • Technical requirements
  • Creating a multi-hop medallion architecture data pipeline with Delta Live Tables in Databricks
  • How to do it…
  • How it works…
  • See also
  • Building a data pipeline with Delta Live Tables on Databricks
  • How to do it…
  • See also
  • Implementing data quality and validation rules with Delta Live Tables in Databricks
  • How to do it…
  • How it works…
  • See also
  • Quarantining bad data with Delta Live Tables in Databricks
  • How to do it…
  • See also
  • Monitoring Delta Live Tables pipelines
  • How to do it…
  • See also
  • Deploying Delta Live Tables pipelines with Databricks Asset Bundles
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Applying changes (CDC) to Delta tables with Delta Live Tables
  • How to do it…
  • See also
  • Chapter 10: Data Governance with Unity Catalog
  • Technical requirements
  • Connecting to cloud object storage using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Creating and managing catalogs, schemas, volumes, and tables using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Defining and applying fine-grained access control policies using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Tagging, commenting, and capturing metadata about data and AI assets using Databricks Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Filtering sensitive data with Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Using Unity Catalog's lineage data for debugging, root cause analysis, and impact assessment
  • Getting ready
  • How to do it…
  • See also
  • Accessing and querying system tables using Unity Catalog
  • Getting ready
  • How to do it…
  • See also
  • Chapter 11: Implementing DataOps and DevOps on Databricks
  • Technical requirements
  • Using Databricks Repos to store code in Git
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Automating tasks by using the Databricks CLI
  • Getting ready
  • How to do it…
  • There's more…
  • See also
  • Using the Databricks VSCode extension for local development and testing
  • Getting ready
  • How to do it…
  • See also
  • Using Databricks Asset Bundles (DABs)
  • Getting ready
  • How to do it…
  • See also
  • Leveraging GitHub Actions with Databricks Asset Bundles (DABs)
  • Getting ready
  • How to do it…
  • See also
  • Index
  • About Packt
  • Other Books You May Enjoy