Data engineering on Azure
Other Authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Shelter Island, New York : Manning [2021] |
Subjects: | |
View at Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009634690206719 |
Table of Contents:
- Intro
- inside front cover
- Data Platform Architecture
- Data Engineering on Azure
- Copyright
- dedication
- brief contents
- contents
- front matter
- preface
- acknowledgments
- about this book
- about the author
- about the cover illustration
- 1 Introduction
- 1.1 What is data engineering?
- 1.2 Who this book is for
- 1.3 What is a data platform?
- 1.3.1 Anatomy of a data platform
- 1.3.2 Infrastructure as code, codeless infrastructure
- 1.4 Building in the cloud
- 1.4.1 IaaS, PaaS, SaaS
- 1.4.2 Network, storage, compute
- 1.4.3 Getting started with Azure
- 1.4.4 Interacting with Azure
- 1.5 Implementing an Azure data platform
- Summary
- Part 1 Infrastructure
- 2 Storage
- 2.1 Storing data in a data platform
- 2.1.1 Storing data across multiple data fabrics
- 2.1.2 Having a single source of truth
- 2.2 Introducing Azure Data Explorer
- 2.2.1 Deploying an Azure Data Explorer cluster
- 2.2.2 Using Azure Data Explorer
- 2.2.3 Working around query limits
- 2.3 Introducing Azure Data Lake Storage
- 2.3.1 Creating an Azure Data Lake Storage account
- 2.3.2 Using Azure Data Lake Storage
- 2.3.3 Integrating with Azure Data Explorer
- 2.4 Ingesting data
- 2.4.1 Ingestion frequency
- 2.4.2 Load type
- 2.4.3 Restatements and reloads
- Summary
- 3 DevOps
- 3.1 What is DevOps?
- 3.1.1 DevOps in data engineering
- 3.2 Introducing Azure DevOps
- 3.2.1 Using the az azure-devops extension
- 3.3 Deploying infrastructure
- 3.3.1 Exporting an Azure Resource Manager template
- 3.3.2 Creating Azure DevOps service connections
- 3.3.3 Deploying Azure Resource Manager templates
- 3.3.4 Understanding Azure Pipelines
- 3.4 Deploying analytics
- 3.4.1 Using Azure DevOps marketplace extensions
- 3.4.2 Storing everything in Git; deploying everything automatically
- Summary
- 4 Orchestration
- 4.1 Ingesting the Bing COVID-19 open dataset
- 4.2 Introducing Azure Data Factory
- 4.2.1 Setting up the data source
- 4.2.2 Setting up the data sink
- 4.2.3 Setting up the pipeline
- 4.2.4 Setting up a trigger
- 4.2.5 Orchestrating with Azure Data Factory
- 4.3 DevOps for Azure Data Factory
- 4.3.1 Deploying Azure Data Factory from Git
- 4.3.2 Setting up access control
- 4.3.3 Deploying the production data factory
- 4.3.4 DevOps for the Azure Data Factory recap
- 4.4 Monitoring with Azure Monitor
- Summary
- Part 2 Workloads
- 5 Processing
- 5.1 Data modeling techniques
- 5.1.1 Normalization and denormalization
- 5.1.2 Data warehousing
- 5.1.3 Semistructured data
- 5.1.4 Data modeling recap
- 5.2 Identity keyrings
- 5.2.1 Building an identity keyring
- 5.2.2 Understanding keyrings
- 5.3 Timelines
- 5.3.1 Building a timeline view
- 5.3.2 Using timelines
- 5.4 Continuous data processing
- 5.4.1 Tracking processing functions in Git
- 5.4.2 Keyring building in Azure Data Factory
- 5.4.3 Scaling out
- Summary
- 6 Analytics
- 6.1 Structuring storage
- 6.1.1 Providing development data
- 6.1.2 Replicating production data
- 6.1.3 Providing read-only access to the production data
- 6.1.4 Storage structure recap
- 6.2 Analytics workflow
- 6.2.1 Prototyping
- 6.2.2 Development and user acceptance testing
- 6.2.3 Production
- 6.2.4 Analytics workflow recap
- 6.3 Self-serve data movement
- 6.3.1 Support model
- 6.3.2 Data contracts
- 6.3.3 Pipeline validation
- 6.3.4 Postmortems
- 6.3.5 Self-serve data movement recap
- Summary
- 7 Machine learning
- 7.1 Training a machine learning model
- 7.1.1 Training a model using scikit-learn
- 7.1.2 High spender model implementation
- 7.2 Introducing Azure Machine Learning
- 7.2.1 Creating a workspace
- 7.2.2 Creating an Azure Machine Learning compute target
- 7.2.3 Setting up Azure Machine Learning storage
- 7.2.4 Running ML in the cloud
- 7.2.5 Azure Machine Learning recap
- 7.3 MLOps
- 7.3.1 Deploying from Git
- 7.3.2 Storing pipeline IDs
- 7.3.3 DevOps for Azure Machine Learning recap
- 7.4 Orchestrating machine learning
- 7.4.1 Connecting Azure Data Factory with Azure Machine Learning
- 7.4.2 Machine learning orchestration
- 7.4.3 Orchestrating recap
- Summary
- Part 3 Governance
- 8 Metadata
- 8.1 Making sense of the data
- 8.2 Introducing Azure Purview
- 8.3 Maintaining a data inventory
- 8.3.1 Setting up a scan
- 8.3.2 Browsing the data dictionary
- 8.3.3 Data dictionary recap
- 8.4 Managing a data glossary
- 8.4.1 Adding a new glossary term
- 8.4.2 Curating terms
- 8.4.3 Custom templates and bulk import
- 8.4.4 Data glossary recap
- 8.5 Understanding Azure Purview's advanced features
- 8.5.1 Tracking lineage
- 8.5.2 Classification rules
- 8.5.3 REST API
- 8.5.4 Advanced features recap
- Summary
- 9 Data quality
- 9.1 Testing data
- 9.1.1 Availability tests
- 9.1.2 Correctness tests
- 9.1.3 Completeness tests
- 9.1.4 Detecting anomalies
- 9.1.5 Testing data recap
- 9.2 Running data quality checks
- 9.2.1 Testing using Azure Data Factory
- 9.2.2 Executing tests
- 9.2.3 Creating and using a template
- 9.2.4 Running data quality checks recap
- 9.3 Scaling out data testing
- 9.3.1 Supporting multiple data fabrics
- 9.3.2 Testing at rest and during movement
- 9.3.3 Authoring tests
- 9.3.4 Storing tests and results
- Summary
- 10 Compliance
- 10.1 Data classification
- 10.1.1 Feature data
- 10.1.2 Telemetry
- 10.1.3 User data
- 10.1.4 User-owned data
- 10.1.5 Business data
- 10.1.6 Data classification recap
- 10.2 Changing classification through processing
- 10.2.1 Aggregation
- 10.2.2 Anonymization
- 10.2.3 Pseudonymization
- 10.2.4 Masking
- 10.2.5 Processing classification changes recap
- 10.3 Implementing an access model
- 10.3.1 Security groups
- 10.3.2 Securing Azure Data Explorer
- 10.3.3 Access model recap
- 10.4 Complying with GDPR and other considerations
- 10.4.1 Data handling
- 10.4.2 Data subject requests
- 10.4.3 Other considerations
- Summary
- 11 Distributing data
- 11.1 Data distribution overview
- 11.2 Building a data API
- 11.2.1 Introducing Azure Cosmos DB
- 11.2.2 Populating the Cosmos DB collection
- 11.2.3 Retrieving data
- 11.2.4 Data API recap
- 11.3 Serving machine learning
- 11.4 Sharing data for bulk copy
- 11.4.1 Separating compute resources
- 11.4.2 Introducing Azure Data Share
- 11.4.3 Sharing data for bulk copy recap
- 11.5 Data sharing best practices
- Summary
- Appendix A. Azure services
- Azure Storage
- Azure SQL
- Azure Synapse Analytics
- Azure Data Explorer
- Azure Databricks
- Azure Cosmos DB
- Appendix B. KQL quick reference
- Common query reference
- SQL to KQL
- Appendix C. Running code samples
- index
- inside back cover