Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide teaches you how to write fully functional applications, follow industry best practices, and understand the rationale behind those decisions. With Apache Spark as the foundation, you will follow a step-by-step journey...

Bibliographic Details
Other Authors: Haines, Scott (author)
Format: eBook
Language: English
Published: [Place of publication not identified] : Apress, [2022]
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009652824706719
Table of Contents:
  • Intro
  • Table of Contents
  • About the Author
  • About the Technical Reviewer
  • Acknowledgments
  • Introduction
  • Part I: The Fundamentals of Data Engineering with Spark
  • Chapter 1: Introduction to Modern Data Engineering
  • The Emergence of Data Engineering
  • Before the Cloud
  • Automation as a Catalyst
  • The Cloud Age
  • The Public Cloud
  • The Origins of the Data Engineer
  • The Many Flavors of Databases
  • OLTP and the OLAP Database
  • The Trouble with Transactions
  • Analytical Queries
  • No Schema. No Problem. The NoSQL Database
  • The NewSQL Database
  • Thinking about Tradeoffs
  • Cloud Storage
  • Data Warehouses and the Data Lake
  • The Data Warehouse
  • The ETL Job
  • The Data Lake
  • The Data Pipeline Architecture
  • The Data Pipeline
  • Workflow Orchestration
  • The Data Catalog
  • Data Lineage
  • Stream Processing
  • Interprocess Communication
  • Network Queues
  • From Distributed Queues to Replayable Message Queues
  • Fault-Tolerance and Reliability
  • Kafka's Distributed Architecture
  • Kafka Records
  • Brokers
  • Why Stream Processing Matters
  • Summary
  • Chapter 2: Getting Started with Apache Spark
  • The Apache Spark Architecture
  • The MapReduce Paradigm
  • Mappers
  • Durable and Safe Acyclic Execution
  • Reducers
  • From Data Isolation to Distributed Datasets
  • The Spark Programming Model
  • Did You Never Learn to Share?
  • The Resilient Distributed Data Model
  • The Spark Application Architecture
  • The Role of the Driver Program
  • The Role of the Cluster Manager
  • Bring Your Own Cluster
  • The Role of the Spark Executors
  • The Modular Spark Ecosystem
  • The Core Spark Modules
  • From RDDs to DataFrames and Datasets
  • Getting Up and Running with Spark
  • Installing Spark
  • Downloading Java JDK
  • Downloading Scala
  • Downloading Spark
  • Taking Spark for a Test Ride
  • The Spark Shell
  • Exercise 2-1: Revisiting the Business Intelligence Use Case
  • Defining the Problem
  • Solving the Problem
  • Problem 1: Find the Daily Active Users for a Given Day
  • Problem 2: Calculate the Daily Average Number of Items Across All User Carts
  • Problem 3: Generate the Top Ten Most Added Items Across All User Carts
  • Exercise 2-1: Summary
  • Summary
  • Chapter 3: Working with Data
  • Docker
  • Containers
  • Docker Desktop
  • Configuring Docker
  • Apache Zeppelin
  • Interpreters
  • Notebooks
  • Preparing Your Zeppelin Environment
  • Running Apache Zeppelin with Docker
  • Docker Network
  • Docker Compose
  • Volumes
  • Environment
  • Ports
  • Using Apache Zeppelin
  • Binding Interpreters
  • Exercise 3-1: Reading Plain Text Files and Transforming DataFrames
  • Converting Plain Text Files into DataFrames
  • Peeking at the Contents of a DataFrame
  • DataFrame Transformation with Pattern Matching
  • Exercise 3-1: Summary
  • Working with Structured Data
  • Exercise 3-2: DataFrames and Semi-Structured Data
  • Schema Inference
  • Using Inferred Schemas
  • Using Declared Schemas
  • Steal the Schema Pattern
  • Building a Data Definition
  • All About the StructType
  • StructField
  • Spark Data Types
  • Adding Metadata to Your Structured Schemas
  • Exercise 3-2: Summary
  • Using Interpreted Spark SQL
  • Exercise 3-3: A Quick Introduction to Spark SQL
  • Creating SQL Views
  • Using the Spark SQL Zeppelin Interpreter
  • Computing Averages
  • Exercise 3-3: Summary
  • Your First Spark ETL
  • Exercise 3-4: An End-to-End Spark ETL
  • Writing Structured Data
  • Parquet Data
  • Reading Parquet Data
  • Exercise 3-4: Summary
  • Summary
  • Chapter 4: Transforming Data with Spark SQL and the DataFrame API
  • Data Transformations
  • Basic Data Transformations
  • Exercise 4-1: Selections and Projections
  • Data Generation
  • Selection
  • Filtering
  • Projection
  • Exercise 4-1: Summary
  • Joins
  • Exercise 4-2: Expanding Data Through Joins
  • Inner Join
  • Right Join
  • Left Join
  • Semi-Join
  • Anti-Join
  • Semi-Join and Anti-Join Aliases
  • Using the IN Operator
  • Negating the IN Operator
  • Full Join
  • Exercise 4-2: Summary
  • Putting It All Together
  • Exercise 4-3: Problem Solving with SQL Expressions and Conditional Queries
  • Expressions as Columns
  • Using an Inner Query
  • Using Conditional Select Expressions
  • Exercise 4-3: Summary
  • Summary
  • Chapter 5: Bridging Spark SQL with JDBC
  • Overview
  • MySQL on Docker Crash Course
  • Starting Up the Docker Environment
  • Docker MySQL Config
  • Exercise 5-1: Exploring MySQL 8 on Docker
  • Working with Tables
  • Connecting to the MySQL Docker Container
  • Using the MySQL Shell
  • The Default Database
  • Creating the Customers Table
  • Inserting Customer Records
  • Viewing the Customers Table
  • Exercise 5-1: Summary
  • Using RDBMS with Spark SQL and JDBC
  • Managing Dependencies
  • Exercise 5-2: Config-Driven Development with the Spark Shell and JDBC
  • Configuration, Dependency Management, and Runtime File Interpretation in the Spark Shell
  • Runtime Configuration
  • Local Dependency Management
  • Runtime Package Management
  • Dynamic Class Compilation and Loading
  • Spark Config: Access Patterns and Runtime Mutation
  • Viewing the SparkConf
  • Accessing the Runtime Configuration
  • Iterative Development with the Spark Shell
  • Describing Views and Tables
  • Writing DataFrames to External MySQL Tables
  • Generate Some New Customers
  • Using JDBC DataFrameWriter
  • SaveMode
  • Exercise 5-2: Summary
  • Continued Explorations
  • Good Schemas Lead to Better Designs
  • Write Customer Records with Minimal Schema
  • Deduplicate, Reorder, and Truncate Your Table
  • Drop Duplicates
  • Sorting with Order By
  • Truncating SQL Tables
  • Stash and Replace
  • Summary
  • Chapter 6: Data Discovery and the Spark SQL Catalog
  • Data Discovery and Data Catalogs
  • Why Data Catalogs Matter
  • Data Wishful Thinking
  • Data Catalogs to the Rescue
  • The Apache Hive Metastore
  • Metadata with a Modern Twist
  • Exercise 6-1: Enhancing Spark SQL with the Hive Metastore
  • Configuring the Hive Metastore
  • Create the Metastore Database
  • Connect to the MySQL Docker Container
  • Authenticate as the root MySQL User
  • Create the Hive Metastore Database
  • Grant Access to the Metastore
  • Create the Metastore Tables
  • Authenticate as the dataeng User
  • Switch Databases to the Metastore
  • Import the Hive Metastore Tables
  • Configuring Spark to Use the Hive Metastore
  • Configure the Hive Site XML
  • Configure Apache Spark to Connect to Your External Hive Metastore
  • Using the Hive Metastore for Schema Enforcement
  • Production Hive Metastore Considerations
  • Exercise 6-1: Summary
  • The Spark SQL Catalog
  • Exercise 6-2: Using the Spark SQL Catalog
  • Creating the Spark Session
  • Spark SQL Databases
  • Listing Available Databases
  • Finding the Current Database
  • Creating a Database
  • Loading External Tables Using JDBC
  • Listing Tables
  • Creating Persistent Tables
  • Finding the Existence of a Table
  • Databases and Tables in the Hive Metastore
  • View Hive Metastore Databases
  • View Hive Metastore Tables
  • Hive Table Parameters
  • Working with Tables from the Spark SQL Catalog
  • Data Discovery Through Table and Column-Level Annotations
  • Adding Table-Level Descriptions and Listing Tables
  • Adding Column Descriptions and Listing Columns
  • Caching Tables
  • Cache a Table in Spark Memory
  • The Storage View of the Spark UI
  • Force Spark to Cache
  • Uncache Tables
  • Clear All Table Caches
  • Refresh a Table
  • Testing Automatic Cache Refresh with Spark Managed Tables
  • Removing Tables
  • Drop Table
  • Conditionally Drop a Table
  • Using the Spark SQL Catalog to Remove a Table
  • Exercise 6-2: Summary
  • The Spark Catalyst Optimizer
  • Introspecting Spark's Catalyst Optimizer with Explain
  • Logical Plan Parsing
  • Logical Plan Analysis
  • Unresolvable Errors
  • Logical Plan Optimization
  • Physical Planning
  • Java Bytecode Generation
  • Datasets
  • Exercise 6-3: Converting DataFrames to Datasets
  • Create the Customers Case Class
  • Dataset Aliasing
  • Mixing Catalyst and Scala Functionality
  • Using Typed Catalyst Expressions
  • Exercise 6-3: Summary
  • Summary
  • Chapter 7: Data Pipelines and Structured Spark Applications
  • Data Pipelines
  • Pipeline Foundations
  • Spark Applications: Form and Function
  • Interactive Applications
  • Spark Shell
  • Notebook Environments
  • Batch Applications
  • Stateless Batch Applications
  • Stateful Batch Applications
  • From Stateful Batch to Streaming Applications
  • Streaming Applications
  • Micro-Batch Processing
  • Continuous Processing
  • Designing Spark Applications
  • Use Case: CoffeeCo and the Ritual of Coffee
  • Thinking about Data
  • Data Storytelling and Modeling Data
  • Exercise 7-1: Data Modeling
  • The Story
  • Breaking Down the Story
  • Extracting the Data Models
  • Customer
  • Store
  • Product, Goods and Items
  • Vendor
  • Location
  • Rating
  • Exercise 7-1: Summary
  • From Data Model to Data Application
  • Every Application Begins with an Idea
  • The Idea
  • Exercise 7-2: Spark Application Blueprint
  • Default Application Layout
  • README.md
  • build.sbt
  • conf
  • project
  • src
  • Common Spark Application Components
  • Application Configuration
  • Application Default Config
  • Runtime Config Overrides
  • Common Spark Application Initialization
  • Dependable Batch Applications
  • Exercise 7-2: Summary
  • Connecting the Dots
  • Application Goals
  • Exercise 7-3: The SparkEventExtractor Application