Databricks Certified Associate Developer for Apache Spark Using Python: The Ultimate Guide to Getting Certified in Apache Spark Using Practical Examples with Python

Learn the concepts and exercises needed to get certified as a Databricks Associate Developer for Apache Spark 3.0 and validate your skills as a Spark expert with an industry-recognized credential. Key Features: Understand the fundamentals of Apache Spark to help you design robust and fast Spark applic...

Bibliographic Details
Other Authors: Shah, Saba (author); Waltermann, Rod (author)
Format: E-book
Language: English
Published: Birmingham, England : Packt Publishing [2024]
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009835419206719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • Foreword
  • Contributors
  • Table of Contents
  • Preface
  • Part 1: Exam Overview
  • Chapter 1: Overview of the Certification Guide and Exam
  • Overview of the certification exam
  • Distribution of questions
  • Resources to prepare for the exam
  • Resources available during the exam
  • Registering for your exam
  • Prerequisites for the exam
  • Online proctored exam
  • Types of questions
  • Theoretical questions
  • Code-based questions
  • Summary
  • Part 2: Introducing Spark
  • Chapter 2: Understanding Apache Spark and Its Applications
  • What is Apache Spark?
  • The history of Apache Spark
  • Understanding Spark differentiators
  • The components of Spark
  • Why choose Apache Spark?
  • Speed
  • Reusability
  • In-memory computation
  • A unified platform
  • What are the Spark use cases?
  • Big data processing
  • Machine learning applications
  • Real-time streaming
  • Graph analytics
  • Who are the Spark users?
  • Data analysts
  • Data engineers
  • Data scientists
  • Machine learning engineers
  • Summary
  • Sample questions
  • Chapter 3: Spark Architecture and Transformations
  • Spark architecture
  • Execution hierarchy
  • Spark components
  • Spark driver
  • SparkSession
  • Cluster manager
  • Spark executors
  • Partitioning in Spark
  • Deployment modes
  • RDDs
  • Lazy computation
  • Transformations
  • Summary
  • Sample questions
  • Answers
  • Part 3: Spark Operations
  • Chapter 4: Spark DataFrames and their Operations
  • Getting Started in PySpark
  • Installing Spark
  • Creating a Spark session
  • Dataset API
  • DataFrame API
  • Creating DataFrame operations
  • Using a list of rows
  • Using a list of rows with schema
  • Using Pandas DataFrames
  • Using tuples
  • How to view the DataFrames
  • Viewing DataFrames
  • Viewing top n rows
  • Viewing DataFrame schema
  • Viewing data vertically
  • Viewing columns of data
  • Viewing summary statistics
  • Collecting the data
  • Using take
  • Using tail
  • Using head
  • Counting the number of rows of data
  • Converting a PySpark DataFrame to a Pandas DataFrame
  • How to manipulate data on rows and columns
  • Selecting columns
  • Creating columns
  • Dropping columns
  • Updating columns
  • Renaming columns
  • Finding unique values in a column
  • Changing the case of a column
  • Filtering a DataFrame
  • Logical operators in a DataFrame
  • Using isin()
  • Datatype conversions
  • Dropping null values from a DataFrame
  • Dropping duplicates from a DataFrame
  • Using aggregates in a DataFrame
  • Summary
  • Sample question
  • Answer
  • Chapter 5: Advanced Operations and Optimizations in Spark
  • Grouping data in Spark and different Spark joins
  • Using groupBy in a DataFrame
  • A complex groupBy statement
  • Joining DataFrames in Spark
  • Reading and writing data
  • Reading and writing CSV files
  • Reading and writing Parquet files
  • Reading and writing ORC files
  • Reading and writing Delta files
  • Using SQL in Spark
  • UDFs in Apache Spark
  • What are UDFs?
  • Creating and registering UDFs
  • Use cases for UDFs
  • Best practices for using UDFs
  • Optimizations in Apache Spark
  • Understanding optimization in Spark
  • Catalyst optimizer
  • Adaptive Query Execution (AQE)
  • Data-based optimizations in Apache Spark
  • Addressing the small file problem in Apache Spark
  • Tackling data skew in Apache Spark
  • Managing data spills in Apache Spark
  • Managing data shuffle in Apache Spark
  • Shuffle joins
  • Shuffle sort-merge joins
  • Broadcast joins
  • Broadcast hash joins
  • Narrow and wide transformations in Apache Spark
  • Narrow transformations
  • Wide transformations
  • Choosing between narrow and wide transformations
  • Optimizing wide transformations
  • Persisting and caching in Apache Spark
  • Understanding data persistence
  • Caching data
  • Unpersisting data
  • Best practices
  • Repartitioning and coalescing in Apache Spark
  • Understanding data partitioning
  • Repartitioning data
  • Coalescing data
  • Use cases for repartitioning and coalescing
  • Best practices
  • Summary
  • Sample questions
  • Answers
  • Chapter 6: SQL Queries in Spark
  • What is Spark SQL?
  • Advantages of Spark SQL
  • Integration with Apache Spark
  • Key concepts - DataFrames and datasets
  • Getting started with Spark SQL
  • Loading and saving data
  • Utilizing Spark SQL to filter and select data based on specific criteria
  • Exploring sorting and aggregation operations using Spark SQL
  • Grouping and aggregating data - grouping data based on specific columns and performing aggregate functions
  • Advanced Spark SQL operations
  • Leveraging window functions to perform advanced analytical operations on DataFrames
  • User-defined functions
  • Working with complex data types - pivot and unpivot
  • Summary
  • Sample questions
  • Answers
  • Part 4: Spark Applications
  • Chapter 7: Structured Streaming in Spark
  • Real-time data processing
  • What is streaming?
  • Streaming architectures
  • Introducing Spark Streaming
  • Exploring the architecture of Spark Streaming
  • Key concepts
  • Advantages
  • Challenges
  • Introducing Structured Streaming
  • Key features and advantages
  • Structured Streaming versus Spark Streaming
  • Limitations and considerations
  • Streaming fundamentals
  • Stateless streaming - processing one event at a time
  • Stateful streaming - maintaining stateful information
  • The differences between stateless and stateful streaming
  • Structured Streaming concepts
  • Event time and processing time
  • Watermarking and late data handling
  • Triggers and output modes
  • Windowing operations
  • Joins and aggregations
  • Streaming sources and sinks
  • Built-in streaming sources
  • Custom streaming sources
  • Built-in streaming sinks
  • Custom streaming sinks
  • Advanced techniques in Structured Streaming
  • Handling fault tolerance
  • Handling schema evolution
  • Different joins in Structured Streaming
  • Stream-stream joins
  • Stream-static joins
  • Final thoughts and future developments
  • Summary
  • Chapter 8: Machine Learning with Spark ML
  • Introduction to ML
  • The key concepts of ML
  • Types of ML
  • Types of supervised learning
  • ML with Spark
  • Advantages of Apache Spark for large-scale ML
  • Spark MLlib versus Spark ML
  • ML life cycle
  • Problem statement
  • Data preparation and feature engineering
  • Model training and evaluation
  • Model deployment
  • Model monitoring and management
  • Model iteration and improvement
  • Case studies and real-world examples
  • Customer churn prediction
  • Fraud detection
  • Future trends in Spark ML and distributed ML
  • Summary
  • Part 5: Mock Papers
  • Chapter 9: Mock Test 1
  • Questions
  • Answers
  • Chapter 10: Mock Test 2
  • Questions
  • Answers
  • Index
  • Other Books You May Enjoy