Data Engineering with Scala and Spark: Build Streaming and Batch Pipelines That Process Massive Amounts of Data Using Scala

Take your data engineering skills to the next level by learning how to use Scala and functional programming to create continuous and scheduled pipelines that ingest, transform, and aggregate data.

Key Features: Transform data into a clean and trusted source of information for your organization usi...


Bibliographic Details
Main Author: Tome, Eric
Other Authors: Bhattacharjee, Rupam; Radford, David
Format: eBook
Language: English
Published: Birmingham: Packt Publishing, Limited, 2024.
Edition: 1st ed.
View at the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009799143906719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • Contributors
  • Table of Contents
  • Preface
  • Part 1 - Introduction to Data Engineering, Scala, and an Environment Setup
  • Chapter 1: Scala Essentials for Data Engineers
  • Technical requirements
  • Understanding functional programming
  • Understanding objects, classes, and traits
  • Classes
  • Object
  • Trait
  • Working with higher-order functions (HOFs)
  • Examples of HOFs from the Scala collection library
  • Understanding polymorphic functions
  • Variance
  • Option type
  • Collections
  • Understanding pattern matching
  • Wildcard patterns
  • Constant patterns
  • Variable patterns
  • Constructor patterns
  • Sequence patterns
  • Tuple patterns
  • Typed patterns
  • Implicits in Scala
  • Summary
  • Further reading
  • Chapter 2: Environment Setup
  • Technical requirements
  • Setting up a cloud environment
  • Leveraging cloud object storage
  • Using Databricks
  • Local environment setup
  • The build tool
  • Summary
  • Further reading
  • Part 2 - Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark
  • Chapter 3: An Introduction to Apache Spark and Its APIs - DataFrame, Dataset, and Spark SQL
  • Technical requirements
  • Working with Apache Spark
  • How do Spark applications work?
  • What happens on executors?
  • Creating a Spark application using Scala
  • Spark stages
  • Shuffling
  • Understanding the Spark Dataset API
  • Understanding the Spark DataFrame API
  • Spark SQL
  • The select function
  • Creating temporary views
  • Summary
  • Chapter 4: Working with Databases
  • Technical requirements
  • Understanding the Spark JDBC API
  • Working with the Spark JDBC API
  • Loading the database configuration
  • Creating a database interface
  • Creating a factory method for SparkSession
  • Performing various database operations
  • Working with databases
  • Updating the Database API with Spark read and write
  • Summary
  • Chapter 5: Object Stores and Data Lakes
  • Understanding distributed file systems
  • Data lakes
  • Object stores
  • Streaming data
  • Working with streaming sources
  • Processing and sinks
  • Aggregating streams
  • Summary
  • Chapter 6: Understanding Data Transformation
  • Technical requirements
  • Understanding the difference between transformations and actions
  • Using Select and SelectExpr
  • Filtering and sorting
  • Learning how to aggregate, group, and join data
  • Leveraging advanced window functions
  • Working with complex dataset types
  • Summary
  • Chapter 7: Data Profiling and Data Quality
  • Technical requirements
  • Understanding components of Deequ
  • Performing data analysis
  • Leveraging automatic constraint suggestion
  • Defining constraints
  • Storing metrics using MetricsRepository
  • Detecting anomalies
  • Summary
  • Part 3 - Software Engineering Best Practices for Data Engineering in Scala
  • Chapter 8: Test-Driven Development, Code Health, and Maintainability
  • Technical requirements
  • Introducing TDD
  • Creating unit tests
  • Performing integration testing
  • Checking code coverage
  • Running static code analysis
  • Installing SonarQube locally
  • Creating a project
  • Running SonarScanner
  • Understanding linting and code style
  • Linting code with WartRemover
  • Formatting code using scalafmt
  • Summary
  • Chapter 9: CI/CD with GitHub
  • Technical requirements
  • Introducing CI/CD and GitHub
  • Understanding Continuous Integration (CI)
  • Understanding Continuous Delivery (CD)
  • Understanding the big picture of CI/CD
  • Working with GitHub
  • Cloning a repository
  • Understanding branches
  • Writing, committing, and pushing code
  • Creating pull requests
  • Reviewing and merging pull requests
  • Understanding GitHub Actions
  • Workflows
  • Jobs
  • Steps
  • Summary
  • Part 4 - Productionalizing Data Engineering Pipelines - Orchestration and Tuning
  • Chapter 10: Data Pipeline Orchestration
  • Technical requirements
  • Understanding the basics of orchestration
  • Understanding core features of Apache Airflow
  • Apache Airflow's extensibility
  • Extending beyond operators
  • Monitoring and UI
  • Hosting and deployment options
  • Designing data pipelines with Airflow
  • Working with Argo Workflows
  • Installing Argo Workflows
  • Understanding the core components of Argo Workflows
  • Taking a short detour
  • Creating an Argo workflow
  • Using Databricks Workflows
  • Leveraging Azure Data Factory
  • Primary components of ADF
  • Summary
  • Chapter 11: Performance Tuning
  • Introducing the Spark UI
  • Navigating the Spark UI
  • The Jobs tab - overview of job execution
  • Leveraging the Spark UI for performance tuning
  • Identifying performance bottlenecks
  • Optimizing data shuffling
  • Memory management and garbage collection
  • Scaling resources
  • Analyzing SQL query performance
  • Right-sizing compute resources
  • Understanding the basics
  • Understanding data skewing, indexing, and partitioning
  • Data skew
  • Indexing and partitioning
  • Summary
  • Part 5 - End-to-End Data Pipelines
  • Chapter 12: Building Batch Pipelines Using Spark and Scala
  • Understanding our business use case
  • What's our marketing use case?
  • Understanding the data
  • Understanding the medallion architecture
  • The end-to-end pipeline
  • Ingesting the data
  • Transforming the data
  • Checking data quality
  • Creating a serving layer
  • Orchestrating our batch process
  • Summary
  • Chapter 13: Building Streaming Pipelines Using Spark and Scala
  • Understanding our business use case
  • What's our IoT use case?
  • Understanding the data
  • The end-to-end pipeline
  • Ingesting the data
  • Transforming the data
  • Creating a serving layer
  • Orchestrating our streaming process
  • Summary
  • Index
  • About Packt
  • Other Books You May Enjoy