Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library

Take a journey toward discovering, learning, and using Apache Spark 3.0. In this book, you will gain expertise in the powerful and efficient distributed data processing engine inside Apache Spark; its user-friendly, comprehensive, and flexible programming model for processing data in batch and st...

Bibliographic Details
Other Authors: Hien Luu (author)
Format: E-book
Language: English
Published: New York, New York : Apress [2021]
Edition: Second edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009634658006719
Table of Contents:
  • Intro
  • Table of Contents
  • About the Author
  • About the Technical Reviewers
  • Acknowledgments
  • Introduction
  • Chapter 1: Introduction to Apache Spark
  • Overview
  • History
  • Spark Core Concepts and Architecture
  • Spark Cluster and Resource Management System
  • Spark Applications
  • Spark Drivers and Executors
  • Spark Unified Stack
  • Spark Core
  • Spark SQL
  • Spark Structured Streaming
  • Spark MLlib
  • Spark GraphX
  • SparkR
  • Apache Spark 3.0
  • Adaptive Query Execution Framework
  • Dynamic Partition Pruning (DPP)
  • Accelerator-aware Scheduler
  • Apache Spark Applications
  • Spark Example Applications
  • Apache Spark Ecosystem
  • Delta Lake
  • Koalas
  • MLflow
  • Summary
  • Chapter 2: Working with Apache Spark
  • Downloading and Installation
  • Downloading Spark
  • Installing Spark
  • Spark Scala Shell
  • Spark Python Shell
  • Having Fun with the Spark Scala Shell
  • Useful Spark Scala Shell Command and Tips
  • Basic Interactions with Scala and Spark
  • Basic Interactions with Scala
  • Spark UI and Basic Interactions with Spark
  • Spark UI
  • Basic Interactions with Spark
  • Introduction to Collaborative Notebooks
  • Create a Cluster
  • Create a Folder
  • Create a Notebook
  • Setting up Spark Source Code
  • Summary
  • Chapter 3: Spark SQL: Foundation
  • Understanding RDD
  • Introduction to the DataFrame API
  • Creating a DataFrame
  • Creating a DataFrame from RDD
  • Creating a DataFrame from a Range of Numbers
  • Creating a DataFrame from Data Sources
  • Creating a DataFrame by Reading Text Files
  • Creating a DataFrame by Reading CSV Files
  • Creating a DataFrame by Reading JSON Files
  • Creating a DataFrame by Reading Parquet Files
  • Creating a DataFrame by Reading ORC Files
  • Creating a DataFrame from JDBC
  • Working with Structured Operations
  • Working with Columns
  • Working with Structured Transformations
  • select(columns)
  • selectExpr(expressions)
  • filter(condition), where(condition)
  • distinct, dropDuplicates
  • sort(columns), orderBy(columns)
  • limit(n)
  • union(otherDataFrame)
  • withColumn(colName, column)
  • withColumnRenamed(existingColName, newColName)
  • drop(columnName1, columnName2)
  • sample(fraction), sample(fraction, seed), sample(fraction, seed, withReplacement)
  • randomSplit(weights)
  • Working with Missing or Bad Data
  • Working with Structured Actions
  • describe(columnNames)
  • Introduction to Datasets
  • Creating Datasets
  • Working with Datasets
  • Using SQL in Spark SQL
  • Running SQL in Spark
  • Writing Data Out to Storage Systems
  • The Trio: DataFrame, Dataset, and SQL
  • DataFrame Persistence
  • Summary
  • Chapter 4: Spark SQL: Advanced
  • Aggregations
  • Aggregation Functions
  • Common Aggregation Functions
  • count(col)
  • countDistinct(col)
  • min(col), max(col)
  • sum(col)
  • sumDistinct(col)
  • avg(col)
  • skewness(col), kurtosis(col)
  • variance(col), stddev(col)
  • Aggregation with Grouping
  • Multiple Aggregations per Group
  • Collection Group Values
  • Aggregation with Pivoting
  • Joins
  • Join Expression and Join Types
  • Working with Joins
  • Inner Joins
  • Left Outer Joins
  • Right Outer Joins
  • Outer Joins (a.k.a. Full Outer Joins)
  • Left Anti-Joins
  • Left Semi-Joins
  • Cross (a.k.a. Cartesian)
  • Dealing with Duplicate Column Names
  • Use Original DataFrame
  • Renaming Column Before Joining
  • Using Joined Column Name
  • Overview of Join Implementation
  • Shuffle Hash Join
  • Broadcast Hash Join
  • Functions
  • Working with Built-in Functions
  • Working with Date Time Functions
  • Working with String Functions
  • Working with Math Functions
  • Working with Collection Functions
  • Working with Miscellaneous Functions
  • Working with User-Defined Functions (UDFs)
  • Advanced Analytics Functions
  • Aggregation with Rollups and Cubes
  • Rollups
  • Cubes
  • Aggregation with Time Windows
  • Window Functions
  • Exploring Catalyst Optimizer
  • Logical Plan
  • Physical Plan
  • Catalyst in Action
  • Project Tungsten
  • Summary
  • Chapter 5: Optimizing Spark Applications
  • Common Performance Issues
  • Spark Configurations
  • Different Ways of Setting Properties
  • Different Kinds of Properties
  • Viewing Spark Properties
  • Spark Memory Management
  • Spark Driver
  • Spark Executor
  • Leverage In-Memory Computation
  • When to Persist and Cache Data
  • Persistence and Caching APIs
  • Persistence and Caching Example
  • Understanding Spark Joins
  • Broadcast Hash Join
  • Shuffle Sort Merge Join
  • Adaptive Query Execution
  • Dynamically Coalescing Shuffle Partitions
  • Dynamically Switching Join Strategies
  • Dynamically Optimizing Skew Joins
  • Summary
  • Chapter 6: Spark Streaming
  • Stream Processing
  • Concepts
  • Data Delivery Semantics
  • Notion of Time
  • Windowing
  • Stream Processing Engine Landscape
  • Spark Streaming Overview
  • Spark DStream
  • Spark Structured Streaming
  • Overview
  • Core Concepts
  • Data Sources
  • Output Modes
  • Trigger Types
  • Data Sinks
  • Watermarking
  • Structured Streaming Applications
  • Streaming DataFrame Operations
  • Selection, Project, Aggregation Operations
  • Join Operations
  • Working with Data Sources
  • Working with a Socket Data Source
  • Working with a Rate Data Source
  • Working with a File Data Source
  • Working with a Kafka Data Source
  • Working with a Custom Data Source
  • Working with Data Sinks
  • Working with a File Data Sink
  • Working with a Kafka Data Sink
  • Working with a foreach Data Sink
  • Working with a Console Data Sink
  • Working with a Memory Data Sink
  • Output Modes
  • Triggers
  • Summary
  • Chapter 7: Advanced Spark Streaming
  • Event Time
  • Fixed Window Aggregation over an Event Time
  • Sliding Window Aggregation over Event Time
  • Aggregation State
  • Watermarking: Limit State and Handle Late Data
  • Arbitrary Stateful Processing
  • Arbitrary Stateful Processing with Structured Streaming
  • Handling State Timeouts
  • Arbitrary State Processing in Action
  • Extracting Patterns with mapGroupsWithState
  • User Sessionization with flatMapGroupsWithState
  • Handling Duplicate Data
  • Fault Tolerance
  • Streaming Application Code Change
  • Spark Runtime Change
  • Streaming Query Metrics and Monitoring
  • Streaming Query Metrics
  • Monitoring Streaming Queries via Callback
  • Monitoring Streaming Queries via Visualization UI
  • Streaming Query Summary Information
  • Streaming Query Detailed Statistics Information
  • Troubleshooting Streaming Query
  • Summary
  • Chapter 8: Machine Learning with Spark
  • Machine Learning Overview
  • Machine Learning Terminologies
  • Machine Learning Types
  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning
  • Machine Learning Development Process
  • Spark Machine Learning Library
  • Machine Learning Pipelines
  • Transformers
  • Estimators
  • Pipeline
  • Pipeline Persistence: Saving and Loading
  • Model Tuning
  • Speeding Up Model Tuning
  • Model Evaluators
  • Machine Learning Tasks in Action
  • Classification
  • Model Hyperparameters
  • Example
  • Regression
  • Model Hyperparameters
  • Example
  • Recommendation
  • Model Hyperparameters
  • Example
  • Deep Learning Pipeline
  • Summary
  • Chapter 9: Managing the Machine Learning Life Cycle
  • The Rise of MLOps
  • MLOps Overview
  • MLflow Overview
  • MLflow Components
  • MLflow in Action
  • MLflow Tracking
  • MLflow Projects
  • MLflow Models
  • MLflow Model Registry
  • Model Deployment and Prediction
  • Summary
  • Index