Data Engineering with Scala and Spark: Build Streaming and Batch Pipelines That Process Massive Amounts of Data Using Scala

Take your data engineering skills to the next level by learning how to use Scala and functional programming to create continuous and scheduled pipelines that ingest, transform, and aggregate data.

Key Features: Transform data into a clean and trusted source of information for your organization usi...


Bibliographic Details
Main Author: Tome, Eric
Other Authors: Bhattacharjee, Rupam; Radford, David
Format: eBook
Language: English
Published: Birmingham: Packt Publishing, Limited, 2024.
Edition: 1st ed.
View at the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009799143906719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • Contributors
  • Table of Contents
  • Preface
  • Part 1 - Introduction to Data Engineering, Scala, and an Environment Setup
  • Chapter 1: Scala Essentials for Data Engineers
  • Technical requirements
  • Understanding functional programming
  • Understanding objects, classes, and traits
  • Classes
  • Object
  • Trait
  • Working with higher-order functions (HOFs)
  • Examples of HOFs from the Scala collection library
  • Understanding polymorphic functions
  • Variance
  • Option type
  • Collections
  • Understanding pattern matching
  • Wildcard patterns
  • Constant patterns
  • Variable patterns
  • Constructor patterns
  • Sequence patterns
  • Tuple patterns
  • Typed patterns
  • Implicits in Scala
  • Summary
  • Further reading
  • Chapter 2: Environment Setup
  • Technical requirements
  • Setting up a cloud environment
  • Leveraging cloud object storage
  • Using Databricks
  • Local environment setup
  • The build tool
  • Summary
  • Further reading
  • Part 2 - Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark
  • Chapter 3: An Introduction to Apache Spark and Its APIs - DataFrame, Dataset, and Spark SQL
  • Technical requirements
  • Working with Apache Spark
  • How do Spark applications work?
  • What happens on executors?
  • Creating a Spark application using Scala
  • Spark stages
  • Shuffling
  • Understanding the Spark Dataset API
  • Understanding the Spark DataFrame API
  • Spark SQL
  • The select function
  • Creating temporary views
  • Summary
  • Chapter 4: Working with Databases
  • Technical requirements
  • Understanding the Spark JDBC API
  • Working with the Spark JDBC API
  • Loading the database configuration
  • Creating a database interface
  • Creating a factory method for SparkSession
  • Performing various database operations
  • Working with databases
  • Updating the Database API with Spark read and write
  • Summary
  • Chapter 5: Object Stores and Data Lakes
  • Understanding distributed file systems
  • Data lakes
  • Object stores
  • Streaming data
  • Working with streaming sources
  • Processing and sinks
  • Aggregating streams
  • Summary
  • Chapter 6: Understanding Data Transformation
  • Technical requirements
  • Understanding the difference between transformations and actions
  • Using Select and SelectExpr
  • Filtering and sorting
  • Learning how to aggregate, group, and join data
  • Leveraging advanced window functions
  • Working with complex dataset types
  • Summary
  • Chapter 7: Data Profiling and Data Quality
  • Technical requirements
  • Understanding components of Deequ
  • Performing data analysis
  • Leveraging automatic constraint suggestion
  • Defining constraints
  • Storing metrics using MetricsRepository
  • Detecting anomalies
  • Summary
  • Part 3 - Software Engineering Best Practices for Data Engineering in Scala
  • Chapter 8: Test-Driven Development, Code Health, and Maintainability
  • Technical requirements
  • Introducing TDD
  • Creating unit tests
  • Performing integration testing
  • Checking code coverage
  • Running static code analysis
  • Installing SonarQube locally
  • Creating a project
  • Running SonarScanner
  • Understanding linting and code style
  • Linting code with WartRemover
  • Formatting code using scalafmt
  • Summary
  • Chapter 9: CI/CD with GitHub
  • Technical requirements
  • Introducing CI/CD and GitHub
  • Understanding Continuous Integration (CI)
  • Understanding Continuous Delivery (CD)
  • Understanding the big picture of CI/CD
  • Working with GitHub
  • Cloning a repository
  • Understanding branches
  • Writing, committing, and pushing code
  • Creating pull requests
  • Reviewing and merging pull requests
  • Understanding GitHub Actions
  • Workflows
  • Jobs
  • Steps
  • Summary
  • Part 4 - Productionalizing Data Engineering Pipelines - Orchestration and Tuning
  • Chapter 10: Data Pipeline Orchestration
  • Technical requirements
  • Understanding the basics of orchestration
  • Understanding core features of Apache Airflow
  • Apache Airflow's extensibility
  • Extending beyond operators
  • Monitoring and UI
  • Hosting and deployment options
  • Designing data pipelines with Airflow
  • Working with Argo Workflows
  • Installing Argo Workflows
  • Understanding the core components of Argo Workflows
  • Taking a short detour
  • Creating an Argo workflow
  • Using Databricks Workflows
  • Leveraging Azure Data Factory
  • Primary components of ADF
  • Summary
  • Chapter 11: Performance Tuning
  • Introducing the Spark UI
  • Navigating the Spark UI
  • The Jobs tab - overview of job execution
  • Leveraging the Spark UI for performance tuning
  • Identifying performance bottlenecks
  • Optimizing data shuffling
  • Memory management and garbage collection
  • Scaling resources
  • Analyzing SQL query performance
  • Right-sizing compute resources
  • Understanding the basics
  • Understanding data skewing, indexing, and partitioning
  • Data skew
  • Indexing and partitioning
  • Summary
  • Part 5 - End-to-End Data Pipelines
  • Chapter 12: Building Batch Pipelines Using Spark and Scala
  • Understanding our business use case
  • What's our marketing use case?
  • Understanding the data
  • Understanding the medallion architecture
  • The end-to-end pipeline
  • Ingesting the data
  • Transforming the data
  • Checking data quality
  • Creating a serving layer
  • Orchestrating our batch process
  • Summary
  • Chapter 13: Building Streaming Pipelines Using Spark and Scala
  • Understanding our business use case
  • What's our IoT use case?
  • Understanding the data
  • The end-to-end pipeline
  • Ingesting the data
  • Transforming the data
  • Creating a serving layer
  • Orchestrating our streaming process
  • Summary
  • Index
  • About Packt
  • Other Books You May Enjoy