Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications
Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide teaches you how to write fully functional applications, follow industry best practices, and understand the rationale behind those decisions. With Apache Spark as the foundation, you will follow a step-by-step journey...
| Other Authors: | |
|---|---|
| Format: | E-book |
| Language: | English |
| Published: | [Place of publication not identified] : Apress, [2022] |
| Subjects: | |
| View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009652824706719 |
Table of Contents:
- Intro
- Table of Contents
- About the Author
- About the Technical Reviewer
- Acknowledgments
- Introduction
- Part I: The Fundamentals of Data Engineering with Spark
- Chapter 1: Introduction to Modern Data Engineering
- The Emergence of Data Engineering
- Before the Cloud
- Automation as a Catalyst
- The Cloud Age
- The Public Cloud
- The Origins of the Data Engineer
- The Many Flavors of Databases
- OLTP and the OLAP Database
- The Trouble with Transactions
- Analytical Queries
- No Schema. No Problem. The NoSQL Database
- The NewSQL Database
- Thinking about Tradeoffs
- Cloud Storage
- Data Warehouses and the Data Lake
- The Data Warehouse
- The ETL Job
- The Data Lake
- The Data Pipeline Architecture
- The Data Pipeline
- Workflow Orchestration
- The Data Catalog
- Data Lineage
- Stream Processing
- Interprocess Communication
- Network Queues
- From Distributed Queues to Replayable Message Queues
- Fault-Tolerance and Reliability
- Kafka's Distributed Architecture
- Kafka Records
- Brokers
- Why Stream Processing Matters
- Summary
- Chapter 2: Getting Started with Apache Spark
- The Apache Spark Architecture
- The MapReduce Paradigm
- Mappers
- Durable and Safe Acyclic Execution
- Reducers
- From Data Isolation to Distributed Datasets
- The Spark Programming Model
- Did You Never Learn to Share?
- The Resilient Distributed Data Model
- The Spark Application Architecture
- The Role of the Driver Program
- The Role of the Cluster Manager
- Bring Your Own Cluster
- The Role of the Spark Executors
- The Modular Spark Ecosystem
- The Core Spark Modules
- From RDDs to DataFrames and Datasets
- Getting Up and Running with Spark
- Installing Spark
- Downloading Java JDK
- Downloading Scala
- Downloading Spark
- Taking Spark for a Test Ride
- The Spark Shell
- Exercise 2-1: Revisiting the Business Intelligence Use Case
- Defining the Problem
- Solving the Problem
- Problem 1: Find the Daily Active Users for a Given Day
- Problem 2: Calculate the Daily Average Number of Items Across All User Carts
- Problem 3: Generate the Top Ten Most Added Items Across All User Carts
- Exercise 2-1: Summary
- Summary
- Chapter 3: Working with Data
- Docker
- Containers
- Docker Desktop
- Configuring Docker
- Apache Zeppelin
- Interpreters
- Notebooks
- Preparing Your Zeppelin Environment
- Running Apache Zeppelin with Docker
- Docker Network
- Docker Compose
- Volumes
- Environment
- Ports
- Using Apache Zeppelin
- Binding Interpreters
- Exercise 3-1: Reading Plain Text Files and Transforming DataFrames
- Converting Plain Text Files into DataFrames
- Peeking at the Contents of a DataFrame
- DataFrame Transformation with Pattern Matching
- Exercise 3-1: Summary
- Working with Structured Data
- Exercise 3-2: DataFrames and Semi-Structured Data
- Schema Inference
- Using Inferred Schemas
- Using Declared Schemas
- Steal the Schema Pattern
- Building a Data Definition
- All About the StructType
- StructField
- Spark Data Types
- Adding Metadata to Your Structured Schemas
- Exercise 3-2: Summary
- Using Interpreted Spark SQL
- Exercise 3-3: A Quick Introduction to SparkSQL
- Creating SQL Views
- Using the Spark SQL Zeppelin Interpreter
- Computing Averages
- Exercise 3-3: Summary
- Your First Spark ETL
- Exercise 3-4: An End-to-End Spark ETL
- Writing Structured Data
- Parquet Data
- Reading Parquet Data
- Exercise 3-4: Summary
- Summary
- Chapter 4: Transforming Data with Spark SQL and the DataFrame API
- Data Transformations
- Basic Data Transformations
- Exercise 4-1: Selections and Projections
- Data Generation
- Selection
- Filtering
- Projection
- Exercise 4-1: Summary
- Joins
- Exercise 4-2: Expanding Data Through Joins
- Inner Join
- Right Join
- Left Join
- Semi-Join
- Anti-Join
- Semi-Join and Anti-Join Aliases
- Using the IN Operator
- Negating the IN Operator
- Full Join
- Exercise 4-2: Summary
- Putting It All Together
- Exercise 4-3: Problem Solving with SQL Expressions and Conditional Queries
- Expressions as Columns
- Using an Inner Query
- Using Conditional Select Expressions
- Exercise 4-3: Summary
- Summary
- Chapter 5: Bridging Spark SQL with JDBC
- Overview
- MySQL on Docker Crash Course
- Starting Up the Docker Environment
- Docker MySQL Config
- Exercise 5-1: Exploring MySQL 8 on Docker
- Working with Tables
- Connecting to the MySQL Docker Container
- Using the MySQL Shell
- The Default Database
- Creating the Customers Table
- Inserting Customer Records
- Viewing the Customers Table
- Exercise 5-1: Summary
- Using RDBMS with Spark SQL and JDBC
- Managing Dependencies
- Exercise 5-2: Config-Driven Development with the Spark Shell and JDBC
- Configuration, Dependency Management, and Runtime File Interpretation in the Spark Shell
- Runtime Configuration
- Local Dependency Management
- Runtime Package Management
- Dynamic Class Compilation and Loading
- Spark Config: Access Patterns and Runtime Mutation
- Viewing the SparkConf
- Accessing the Runtime Configuration
- Iterative Development with the Spark Shell
- Describing Views and Tables
- Writing DataFrames to External MySQL Tables
- Generate Some New Customers
- Using JDBC DataFrameWriter
- SaveMode
- Exercise 5-2: Summary
- Continued Explorations
- Good Schemas Lead to Better Designs
- Write Customer Records with Minimal Schema
- Deduplicate, Reorder, and Truncate Your Table
- Drop Duplicates
- Sorting with Order By
- Truncating SQL Tables
- Stash and Replace
- Summary
- Chapter 6: Data Discovery and the Spark SQL Catalog
- Data Discovery and Data Catalogs
- Why Data Catalogs Matter
- Data Wishful Thinking
- Data Catalogs to the Rescue
- The Apache Hive Metastore
- Metadata with a Modern Twist
- Exercise 6-1: Enhancing Spark SQL with the Hive Metastore
- Configuring the Hive Metastore
- Create the Metastore Database
- Connect to the MySQL Docker Container
- Authenticate as the root MySQL User
- Create the Hive Metastore Database
- Grant Access to the Metastore
- Create the Metastore Tables
- Authenticate as the dataeng User
- Switch Databases to the Metastore
- Import the Hive Metastore Tables
- Configuring Spark to Use the Hive Metastore
- Configure the Hive Site XML
- Configure Apache Spark to Connect to Your External Hive Metastore
- Using the Hive Metastore for Schema Enforcement
- Production Hive Metastore Considerations
- Exercise 6-1: Summary
- The Spark SQL Catalog
- Exercise 6-2: Using the Spark SQL Catalog
- Creating the Spark Session
- Spark SQL Databases
- Listing Available Databases
- Finding the Current Database
- Creating a Database
- Loading External Tables Using JDBC
- Listing Tables
- Creating Persistent Tables
- Finding the Existence of a Table
- Databases and Tables in the Hive Metastore
- View Hive Metastore Databases
- View Hive Metastore Tables
- Hive Table Parameters
- Working with Tables from the Spark SQL Catalog
- Data Discovery Through Table and Column-Level Annotations
- Adding Table-Level Descriptions and Listing Tables
- Adding Column Descriptions and Listing Columns
- Caching Tables
- Cache a Table in Spark Memory
- The Storage View of the Spark UI
- Force Spark to Cache
- Uncache Tables
- Clear All Table Caches
- Refresh a Table
- Testing Automatic Cache Refresh with Spark Managed Tables
- Removing Tables
- Drop Table
- Conditionally Drop a Table
- Using the Spark SQL Catalog to Remove a Table
- Exercise 6-2: Summary
- The Spark Catalyst Optimizer
- Introspecting Spark's Catalyst Optimizer with Explain
- Logical Plan Parsing
- Logical Plan Analysis
- Unresolvable Errors
- Logical Plan Optimization
- Physical Planning
- Java Bytecode Generation
- Datasets
- Exercise 6-3: Converting DataFrames to Datasets
- Create the Customers Case Class
- Dataset Aliasing
- Mixing Catalyst and Scala Functionality
- Using Typed Catalyst Expressions
- Exercise 6-3: Summary
- Summary
- Chapter 7: Data Pipelines and Structured Spark Applications
- Data Pipelines
- Pipeline Foundations
- Spark Applications: Form and Function
- Interactive Applications
- Spark Shell
- Notebook Environments
- Batch Applications
- Stateless Batch Applications
- Stateful Batch Applications
- From Stateful Batch to Streaming Applications
- Streaming Applications
- Micro-Batch Processing
- Continuous Processing
- Designing Spark Applications
- Use Case: CoffeeCo and the Ritual of Coffee
- Thinking about Data
- Data Storytelling and Modeling Data
- Exercise 7-1: Data Modeling
- The Story
- Breaking Down the Story
- Extracting the Data Models
- Customer
- Store
- Product, Goods and Items
- Vendor
- Location
- Rating
- Exercise 7-1: Summary
- From Data Model to Data Application
- Every Application Begins with an Idea
- The Idea
- Exercise 7-2: Spark Application Blueprint
- Default Application Layout
- README.md
- build.sbt
- conf
- project
- src
- Common Spark Application Components
- Application Configuration
- Application Default Config
- Runtime Config Overrides
- Common Spark Application Initialization
- Dependable Batch Applications
- Exercise 7-2: Summary
- Connecting the Dots
- Application Goals
- Exercise 7-3: The SparkEventExtractor Application.