Modern Data Architectures with Python: a practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python
Modern Data Architectures with Python will teach you how to seamlessly incorporate your machine learning and data science work streams into your open data platforms. You'll learn how to take your data and create open lakehouses that work with any technology using tried-and-true techniques, incl...
Other Authors: | |
---|---|
Format: | Electronic book |
Language: | English |
Published: | Birmingham, England : Packt Publishing Ltd, [2023] |
Edition: | First edition |
Subjects: | |
View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009769035606719 |
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Dedications
- Contributors
- Table of Contents
- Preface
- Part 1: Fundamental Data Knowledge
- Chapter 1: Modern Data Processing Architecture
- Technical requirements
- Databases, data warehouses, and data lakes
- OLTP
- OLAP
- Data lakes
- Event stores
- File formats
- Data platform architecture at a high level
- Comparing the Lambda and Kappa architectures
- Lambda architecture
- Kappa architecture
- Lakehouse and Delta architectures
- Lakehouses
- The seven central tenets
- The medallion data pattern and the Delta architecture
- Data mesh theory and practice
- Defining terms
- The four principles of data mesh
- Summary
- Practical lab
- Solution
- Chapter 2: Understanding Data Analytics
- Technical requirements
- Setting up your environment
- Python
- venv
- Graphviz
- Workflow initialization
- Cleaning and preparing your data
- Duplicate values
- Working with nulls
- Using RegEx
- Outlier identification
- Casting columns
- Fixing column names
- Complex data types
- Data documentation
- diagrams
- Data lineage graphs
- Data modeling patterns
- Relational
- Dimensional modeling
- Key terms
- OBT
- Practical lab
- Loading the problem data
- Solution
- Summary
- Part 2: Data Engineering Toolset
- Chapter 3: Apache Spark Deep Dive
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Cloud data storage
- Object storage
- Relational
- NoSQL
- Spark architecture
- Introduction to Apache Spark
- Key components
- Working with partitions
- Shuffling partitions
- Caching
- Broadcasting
- Job creation pipeline
- Delta Lake
- Transaction log
- Grouping tables with databases
- Table
- Adding speed with Z-ordering
- Bloom filters
- Practical lab
- Problem 1
- Problem 2
- Problem 3
- Solution
- Summary
- Chapter 4: Batch and Stream Data Processing Using PySpark
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Batch processing
- Partitioning
- Data skew
- Reading data
- Spark schemas
- Making decisions
- Removing unwanted columns
- Working with data in groups
- The UDF
- Stream processing
- Reading from disk
- Debugging
- Writing to disk
- Batch stream hybrid
- Delta streaming
- Batch processing in a stream
- Practical lab
- Setup
- Creating fake data
- Problem 1
- Problem 2
- Problem 3
- Solution
- Solution 1
- Solution 2
- Solution 3
- Summary
- Chapter 5: Streaming Data with Kafka
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Confluent Kafka
- Signing up
- Kafka architecture
- Topics
- Partitions
- Brokers
- Producers
- Consumers
- Schema Registry
- Kafka Connect
- Spark and Kafka
- Practical lab
- Solution
- Summary
- Part 3: Modernizing the Data Platform
- Chapter 6: MLOps
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- Databricks CLI
- Introduction to machine learning
- Understanding data
- The basics of feature engineering
- Splitting up your data
- Fitting your data
- Cross-validation
- Understanding hyperparameters and parameters
- Training our model
- Working together
- AutoML
- MLflow
- MLOps benefits
- Feature stores
- Hyperopt
- Practical lab
- Create an MLflow project
- Summary
- Chapter 7: Data and Information Visualization
- Technical requirements
- Setting up your environment
- Principles of data visualization
- Understanding your user
- Validating your data
- Data visualization using notebooks
- Line charts
- Bar charts
- Histograms
- Scatter plots
- Pie charts
- Bubble charts
- A single line chart
- A multiple line chart
- A bar chart
- A scatter plot
- A histogram
- A bubble chart
- GUI data visualizations
- Tips and tricks with Databricks notebooks
- Magic
- Markdown
- Other languages
- Terminal
- Filesystem
- Running other notebooks
- Widgets
- Databricks SQL analytics
- Accessing SQL analytics
- SQL Warehouses
- SQL editors
- Queries
- Dashboards
- Alerts
- Query history
- Connecting BI tools
- Practical lab
- Loading problem data
- Problem 1
- Solution
- Problem 2
- Solution
- Summary
- Chapter 8: Integrating Continuous Integration into Your Workflow
- Technical requirements
- Setting up your environment
- Databricks
- Databricks CLI
- The DBX CLI
- Docker
- Git
- GitHub
- Pre-commit
- Terraform
- Docker
- Install Jenkins, container setup, and compose
- CI tooling
- Git and GitHub
- Pre-commit
- Python wheels and packages
- Anatomy of a package
- DBX
- Important commands
- Testing code
- Terraform - IaC
- IaC
- The CLI
- HCL
- Jenkins
- Jenkinsfile
- Practical lab
- Problem 1
- Problem 2
- Summary
- Chapter 9: Orchestrating Your Data Workflows
- Technical requirements
- Setting up your environment
- Databricks
- Databricks CLI
- The DBX CLI
- Orchestrating data workloads
- Making life easier with Autoloader
- Reading
- Writing
- Two modes
- Useful options
- Databricks Workflows
- Terraform
- Failed runs
- REST APIs
- The Databricks API
- Python code
- Logging
- Practical lab
- Solution
- Lambda code
- Notebook code
- Summary
- Part 4: Hands-on Project
- Chapter 10: Data Governance
- Technical requirements
- Setting up your environment
- Python, AWS, and Databricks
- The Databricks CLI
- What is data governance?
- Data standards
- Data catalogs
- Data lineage
- Data security and privacy
- Data quality
- Great Expectations
- Creating test data
- Data context
- Data source
- Batch request
- Validator
- Adding tests
- Saving the suite
- Creating a checkpoint
- Datadocs
- Testing new data
- Profiler
- Databricks Unity
- Practical lab
- Summary
- Chapter 11: Building out the Groundwork
- Technical requirements
- Setting up your environment
- The Databricks CLI
- Git
- GitHub
- pre-commit
- Terraform
- PyPI
- Creating GitHub repos
- Terraform setup
- Initial file setup
- Schema repository
- ML repository
- Infrastructure repository
- Summary
- Chapter 12: Completing Our Project
- Technical requirements
- Documentation
- Schema diagram
- C4 System Context diagram
- Faking data with Mockaroo
- Managing our schemas with code
- Building our data pipeline application
- Creating our machine learning application
- Displaying our data with dashboards
- Summary
- Index
- Other Books You May Enjoy