Data engineering on Azure

Bibliographic Details
Other Authors: Riscutia, Vlad (author)
Format: E-book
Language: English
Published: Shelter Island, New York : Manning [2021]
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009634690206719
Table of Contents:
  • Intro
  • inside front cover
  • Data Platform Architecture
  • Data Engineering on Azure
  • Copyright
  • dedication
  • brief contents
  • contents
  • front matter
  • preface
  • acknowledgments
  • about this book
  • about the author
  • about the cover illustration
  • 1 Introduction
  • 1.1 What is data engineering?
  • 1.2 Who this book is for
  • 1.3 What is a data platform?
  • 1.3.1 Anatomy of a data platform
  • 1.3.2 Infrastructure as code, codeless infrastructure
  • 1.4 Building in the cloud
  • 1.4.1 IaaS, PaaS, SaaS
  • 1.4.2 Network, storage, compute
  • 1.4.3 Getting started with Azure
  • 1.4.4 Interacting with Azure
  • 1.5 Implementing an Azure data platform
  • Summary
  • Part 1 Infrastructure
  • 2 Storage
  • 2.1 Storing data in a data platform
  • 2.1.1 Storing data across multiple data fabrics
  • 2.1.2 Having a single source of truth
  • 2.2 Introducing Azure Data Explorer
  • 2.2.1 Deploying an Azure Data Explorer cluster
  • 2.2.2 Using Azure Data Explorer
  • 2.2.3 Working around query limits
  • 2.3 Introducing Azure Data Lake Storage
  • 2.3.1 Creating an Azure Data Lake Storage account
  • 2.3.2 Using Azure Data Lake Storage
  • 2.3.3 Integrating with Azure Data Explorer
  • 2.4 Ingesting data
  • 2.4.1 Ingestion frequency
  • 2.4.2 Load type
  • 2.4.3 Restatements and reloads
  • Summary
  • 3 DevOps
  • 3.1 What is DevOps?
  • 3.1.1 DevOps in data engineering
  • 3.2 Introducing Azure DevOps
  • 3.2.1 Using the az azure-devops extension
  • 3.3 Deploying infrastructure
  • 3.3.1 Exporting an Azure Resource Manager template
  • 3.3.2 Creating Azure DevOps service connections
  • 3.3.3 Deploying Azure Resource Manager templates
  • 3.3.4 Understanding Azure Pipelines
  • 3.4 Deploying analytics
  • 3.4.1 Using Azure DevOps marketplace extensions
  • 3.4.2 Storing everything in Git; deploying everything automatically
  • Summary
  • 4 Orchestration
  • 4.1 Ingesting the Bing COVID-19 open dataset
  • 4.2 Introducing Azure Data Factory
  • 4.2.1 Setting up the data source
  • 4.2.2 Setting up the data sink
  • 4.2.3 Setting up the pipeline
  • 4.2.4 Setting up a trigger
  • 4.2.5 Orchestrating with Azure Data Factory
  • 4.3 DevOps for Azure Data Factory
  • 4.3.1 Deploying Azure Data Factory from Git
  • 4.3.2 Setting up access control
  • 4.3.3 Deploying the production data factory
  • 4.3.4 DevOps for the Azure Data Factory recap
  • 4.4 Monitoring with Azure Monitor
  • Summary
  • Part 2 Workloads
  • 5 Processing
  • 5.1 Data modeling techniques
  • 5.1.1 Normalization and denormalization
  • 5.1.2 Data warehousing
  • 5.1.3 Semistructured data
  • 5.1.4 Data modeling recap
  • 5.2 Identity keyrings
  • 5.2.1 Building an identity keyring
  • 5.2.2 Understanding keyrings
  • 5.3 Timelines
  • 5.3.1 Building a timeline view
  • 5.3.2 Using timelines
  • 5.4 Continuous data processing
  • 5.4.1 Tracking processing functions in Git
  • 5.4.2 Keyring building in Azure Data Factory
  • 5.4.3 Scaling out
  • Summary
  • 6 Analytics
  • 6.1 Structuring storage
  • 6.1.1 Providing development data
  • 6.1.2 Replicating production data
  • 6.1.3 Providing read-only access to the production data
  • 6.1.4 Storage structure recap
  • 6.2 Analytics workflow
  • 6.2.1 Prototyping
  • 6.2.2 Development and user acceptance testing
  • 6.2.3 Production
  • 6.2.4 Analytics workflow recap
  • 6.3 Self-serve data movement
  • 6.3.1 Support model
  • 6.3.2 Data contracts
  • 6.3.3 Pipeline validation
  • 6.3.4 Postmortems
  • 6.3.5 Self-serve data movement recap
  • Summary
  • 7 Machine learning
  • 7.1 Training a machine learning model
  • 7.1.1 Training a model using scikit-learn
  • 7.1.2 High spender model implementation
  • 7.2 Introducing Azure Machine Learning
  • 7.2.1 Creating a workspace
  • 7.2.2 Creating an Azure Machine Learning compute target
  • 7.2.3 Setting up Azure Machine Learning storage
  • 7.2.4 Running ML in the cloud
  • 7.2.5 Azure Machine Learning recap
  • 7.3 MLOps
  • 7.3.1 Deploying from Git
  • 7.3.2 Storing pipeline IDs
  • 7.3.3 DevOps for Azure Machine Learning recap
  • 7.4 Orchestrating machine learning
  • 7.4.1 Connecting Azure Data Factory with Azure Machine Learning
  • 7.4.2 Machine learning orchestration
  • 7.4.3 Orchestrating recap
  • Summary
  • Part 3 Governance
  • 8 Metadata
  • 8.1 Making sense of the data
  • 8.2 Introducing Azure Purview
  • 8.3 Maintaining a data inventory
  • 8.3.1 Setting up a scan
  • 8.3.2 Browsing the data dictionary
  • 8.3.3 Data dictionary recap
  • 8.4 Managing a data glossary
  • 8.4.1 Adding a new glossary term
  • 8.4.2 Curating terms
  • 8.4.3 Custom templates and bulk import
  • 8.4.4 Data glossary recap
  • 8.5 Understanding Azure Purview's advanced features
  • 8.5.1 Tracking lineage
  • 8.5.2 Classification rules
  • 8.5.3 REST API
  • 8.5.4 Advanced features recap
  • Summary
  • 9 Data quality
  • 9.1 Testing data
  • 9.1.1 Availability tests
  • 9.1.2 Correctness tests
  • 9.1.3 Completeness tests
  • 9.1.4 Detecting anomalies
  • 9.1.5 Testing data recap
  • 9.2 Running data quality checks
  • 9.2.1 Testing using Azure Data Factory
  • 9.2.2 Executing tests
  • 9.2.3 Creating and using a template
  • 9.2.4 Running data quality checks recap
  • 9.3 Scaling out data testing
  • 9.3.1 Supporting multiple data fabrics
  • 9.3.2 Testing at rest and during movement
  • 9.3.3 Authoring tests
  • 9.3.4 Storing tests and results
  • Summary
  • 10 Compliance
  • 10.1 Data classification
  • 10.1.1 Feature data
  • 10.1.2 Telemetry
  • 10.1.3 User data
  • 10.1.4 User-owned data
  • 10.1.5 Business data
  • 10.1.6 Data classification recap
  • 10.2 Changing classification through processing
  • 10.2.1 Aggregation
  • 10.2.2 Anonymization
  • 10.2.3 Pseudonymization
  • 10.2.4 Masking
  • 10.2.5 Processing classification changes recap
  • 10.3 Implementing an access model
  • 10.3.1 Security groups
  • 10.3.2 Securing Azure Data Explorer
  • 10.3.3 Access model recap
  • 10.4 Complying with GDPR and other considerations
  • 10.4.1 Data handling
  • 10.4.2 Data subject requests
  • 10.4.3 Other considerations
  • Summary
  • 11 Distributing data
  • 11.1 Data distribution overview
  • 11.2 Building a data API
  • 11.2.1 Introducing Azure Cosmos DB
  • 11.2.2 Populating the Cosmos DB collection
  • 11.2.3 Retrieving data
  • 11.2.4 Data API recap
  • 11.3 Serving machine learning
  • 11.4 Sharing data for bulk copy
  • 11.4.1 Separating compute resources
  • 11.4.2 Introducing Azure Data Share
  • 11.4.3 Sharing data for bulk copy recap
  • 11.5 Data sharing best practices
  • Summary
  • Appendix A. Azure services
  • Azure Storage
  • Azure SQL
  • Azure Synapse Analytics
  • Azure Data Explorer
  • Azure Databricks
  • Azure Cosmos DB
  • Appendix B. KQL quick reference
  • Common query reference
  • SQL to KQL
  • Appendix C. Running code samples
  • index
  • inside back cover