Databricks Certified Associate Developer for Apache Spark Using Python: The Ultimate Guide to Getting Certified in Apache Spark Using Practical Examples with Python

Learn the concepts and exercises needed to get certified as a Databricks Associate Developer for Apache Spark 3.0 and validate your skills as a Spark expert with an industry-recognized credential. Key Features: Understand the fundamentals of Apache Spark to help you design robust and fast Spark applic...

Bibliographic Details
Other Authors:	Shah, Saba (author), Waltermann, Rod (author)
Format:	E-book
Language:	English
Published:	Birmingham, England : Packt Publishing [2024]
Edition:	First edition
Subjects:	Big data. Cloud computing.
View at Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009835419206719

Table of Contents:

Cover
Title Page
Copyright and Credits
Foreword
Contributors
Table of Contents
Preface
Part 1: Exam Overview
Chapter 1: Overview of the Certification Guide and Exam
Overview of the certification exam
Distribution of questions
Resources to prepare for the exam
Resources available during the exam
Registering for your exam
Prerequisites for the exam
Online proctored exam
Types of questions
Theoretical questions
Code-based questions
Summary
Part 2: Introducing Spark
Chapter 2: Understanding Apache Spark and Its Applications
What is Apache Spark?
The history of Apache Spark
Understanding Spark differentiators
The components of Spark
Why choose Apache Spark?
Speed
Reusability
In-memory computation
A unified platform
What are the Spark use cases?
Big data processing
Machine learning applications
Real-time streaming
Graph analytics
Who are the Spark users?
Data analysts
Data engineers
Data scientists
Machine learning engineers
Summary
Sample questions
Chapter 3: Spark Architecture and Transformations
Spark architecture
Execution hierarchy
Spark components
Spark driver
SparkSession
Cluster manager
Spark executors
Partitioning in Spark
Deployment modes
RDDs
Lazy computation
Transformations
Summary
Sample questions
Answers
Part 3: Spark Operations
Chapter 4: Spark DataFrames and their Operations
Getting Started in PySpark
Installing Spark
Creating a Spark session
Dataset API
DataFrame API
Creating DataFrame operations
Using a list of rows
Using a list of rows with schema
Using Pandas DataFrames
Using tuples
How to view the DataFrames
Viewing DataFrames
Viewing top n rows
Viewing DataFrame schema
Viewing data vertically
Viewing columns of data
Viewing summary statistics
Collecting the data
Using take
Using tail
Using head
Counting the number of rows of data
Converting a PySpark DataFrame to a Pandas DataFrame
How to manipulate data on rows and columns
Selecting columns
Creating columns
Dropping columns
Updating columns
Renaming columns
Finding unique values in a column
Changing the case of a column
Filtering a DataFrame
Logical operators in a DataFrame
Using isin()
Datatype conversions
Dropping null values from a DataFrame
Dropping duplicates from a DataFrame
Using aggregates in a DataFrame
Summary
Sample question
Answer
Chapter 5: Advanced Operations and Optimizations in Spark
Grouping data in Spark and different Spark joins
Using groupBy in a DataFrame
A complex groupBy statement
Joining DataFrames in Spark
Reading and writing data
Reading and writing CSV files
Reading and writing Parquet files
Reading and writing ORC files
Reading and writing Delta files
Using SQL in Spark
UDFs in Apache Spark
What are UDFs?
Creating and registering UDFs
Use cases for UDFs
Best practices for using UDFs
Optimizations in Apache Spark
Understanding optimization in Spark
Catalyst optimizer
Adaptive Query Execution (AQE)
Data-based optimizations in Apache Spark
Addressing the small file problem in Apache Spark
Tackling data skew in Apache Spark
Managing data spills in Apache Spark
Managing data shuffle in Apache Spark
Shuffle joins
Shuffle sort-merge joins
Broadcast joins
Broadcast hash joins
Narrow and wide transformations in Apache Spark
Narrow transformations
Wide transformations
Choosing between narrow and wide transformations
Optimizing wide transformations
Persisting and caching in Apache Spark
Understanding data persistence
Caching data
Unpersisting data
Best practices
Repartitioning and coalescing in Apache Spark
Understanding data partitioning
Repartitioning data
Coalescing data
Use cases for repartitioning and coalescing
Best practices
Summary
Sample questions
Answers
Chapter 6: SQL Queries in Spark
What is Spark SQL?
Advantages of Spark SQL
Integration with Apache Spark
Key concepts - DataFrames and datasets
Getting started with Spark SQL
Loading and saving data
Utilizing Spark SQL to filter and select data based on specific criteria
Exploring sorting and aggregation operations using Spark SQL
Grouping and aggregating data - grouping data based on specific columns and performing aggregate functions
Advanced Spark SQL operations
Leveraging window functions to perform advanced analytical operations on DataFrames
User-defined functions
Working with complex data types - pivot and unpivot
Summary
Sample questions
Answers
Part 4: Spark Applications
Chapter 7: Structured Streaming in Spark
Real-time data processing
What is streaming?
Streaming architectures
Introducing Spark Streaming
Exploring the architecture of Spark Streaming
Key concepts
Advantages
Challenges
Introducing Structured Streaming
Key features and advantages
Structured Streaming versus Spark Streaming
Limitations and considerations
Streaming fundamentals
Stateless streaming - processing one event at a time
Stateful streaming - maintaining stateful information
The differences between stateless and stateful streaming
Structured Streaming concepts
Event time and processing time
Watermarking and late data handling
Triggers and output modes
Windowing operations
Joins and aggregations
Streaming sources and sinks
Built-in streaming sources
Custom streaming sources
Built-in streaming sinks
Custom streaming sinks
Advanced techniques in Structured Streaming
Handling fault tolerance
Handling schema evolution
Different joins in Structured Streaming
Stream-stream joins
Stream-static joins
Final thoughts and future developments
Summary
Chapter 8: Machine Learning with Spark ML
Introduction to ML
The key concepts of ML
Types of ML
Types of supervised learning
ML with Spark
Advantages of Apache Spark for large-scale ML
Spark MLlib versus Spark ML
ML life cycle
Problem statement
Data preparation and feature engineering
Model training and evaluation
Model deployment
Model monitoring and management
Model iteration and improvement
Case studies and real-world examples
Customer churn prediction
Fraud detection
Future trends in Spark ML and distributed ML
Summary
Part 5: Mock Papers
Chapter 9: Mock Test 1
Questions
Answers
Chapter 10: Mock Test 2
Questions
Answers
Index
Other Books You May Enjoy