Data Engineering with Google Cloud Platform: A Guide to Leveling Up as a Data Engineer by Building a Scalable Data Platform with Google Cloud

The second edition of Data Engineering with Google Cloud builds upon the success of the first edition by offering enhanced clarity and depth to data professionals navigating the intricate landscape of data engineering. Beyond its foundational lessons, this new edition delves into the essential realm...


Bibliographic Details
Other Authors: Wijaya, Adi (author); Vilares, António (writer of foreword)
Format: Electronic book
Language: English
Published: Birmingham, England: Packt Publishing Ltd, [2024]
Edition: Second edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009816678106719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • Dedication
  • Foreword
  • Contributors
  • Table of Contents
  • Preface
  • Part 1: Getting Started with Data Engineering with GCP
  • Chapter 1: Fundamentals of Data Engineering
  • Understanding the data life cycle
  • Understanding the need for a data warehouse
  • Start with knowing the roles of a data engineer
  • A data engineer versus a data scientist
  • The focus of data engineers
  • Going through the foundational concepts for data engineering
  • ETL concept in data engineering
  • The difference between ETL and ELT
  • What is not big data?
  • A quick look at how big data technologies store data
  • A quick look at how to process multiple files using MapReduce
  • Summary
  • Exercise
  • Further Reading
  • Chapter 2: Big Data Capabilities on GCP
  • Technical requirements
  • Understanding what the cloud is
  • The difference between the cloud and non-cloud era
  • The on-demand nature of the cloud
  • Getting started with GCP
  • Introduction to the GCP console
  • Practicing pinning services
  • A quick overview of GCP services for data engineering
  • Understanding the GCP serverless service
  • Service mapping and prioritization
  • The concept of quotas on GCP services
  • User account versus service account
  • Summary
  • Part 2: Build Solutions with GCP Components
  • Chapter 3: Building a Data Warehouse in BigQuery
  • Technical requirements
  • Introduction to GCS and BigQuery
  • BigQuery data location
  • Introduction to the BigQuery console
  • Creating a dataset in BigQuery using the console
  • Loading the local CSV file into the BigQuery table
  • Using public data in BigQuery
  • Data types in BigQuery compared to other databases
  • Timestamp data in BigQuery compared to other databases
  • Preparing the prerequisites before developing our data warehouse
  • Step 1 - Accessing Cloud Shell
  • Step 2 - Checking the current setup using the command line
  • Step 3 - Initializing the gcloud init command
  • Step 4 - Downloading example data from Git
  • Step 5 - Uploading data to GCS from Git
  • Practicing developing a data warehouse
  • Data warehouse in BigQuery - Requirements for scenario 1
  • Steps and planning for handling scenario 1
  • Data warehouse in BigQuery - Requirements for scenario 2
  • Using the GCP console versus the code-based approach
  • Steps and planning for handling scenario 2
  • BigQuery's useful features
  • BigQuery console sub-menu options
  • BigQuery partitioned table
  • Summary
  • Exercise - Scenario 3
  • See also
  • Chapter 4: Building Workflows for Batch Data Loading Using Cloud Composer
  • Technical requirements
  • Introduction to Cloud Composer
  • Understanding how Airflow works
  • Cloud Composer 1 vs Cloud Composer 2
  • Provisioning Cloud Composer in a GCP project
  • Introducing the Airflow web UI
  • Cloud Composer bucket directories
  • Exercise - building data pipeline orchestration using Cloud Composer
  • Level 1 DAG - creating dummy workflows
  • Deploying the DAG file into Cloud Composer
  • Level 2 DAG - scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
  • Level 3 DAG - parameterized variables
  • Level 4 DAG - Guaranteeing task idempotency in Cloud Composer
  • Level 5 DAG - handling DAG dependency using an Airflow dataset
  • Summary
  • Chapter 5: Building a Data Lake Using Dataproc
  • Technical requirements
  • Introduction to Dataproc
  • A brief history of the data lake and Hadoop ecosystem
  • A deeper look into Hadoop components
  • How much Hadoop-related knowledge do you need on GCP?
  • Introducing the Spark RDD and DataFrame concepts
  • Introducing the data lake concept
  • Hadoop and Dataproc positioning on GCP
  • Introduction to Dataproc Serverless
  • Exercise - Building a data lake on a Dataproc cluster
  • Creating a Dataproc cluster on GCP
  • Using GCS as an underlying Dataproc filesystem
  • Exercise - Creating and running jobs on a Dataproc cluster
  • Preparing log data in GCS and HDFS
  • Developing a Spark ETL job from HDFS to HDFS
  • Developing a Spark ETL job from GCS to GCS
  • Developing a Spark ETL job from GCS to BigQuery
  • Understanding the concept of an ephemeral cluster
  • Practicing using a workflow template on Dataproc
  • Building an ephemeral cluster using Dataproc and Cloud Composer
  • Submitting a Spark ETL job from GCS to BigQuery using Dataproc Serverless
  • Summary
  • Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
  • Technical requirements
  • Processing streaming data
  • Introduction to Pub/Sub
  • Introduction to Dataflow
  • Exercise - publishing event streams to Pub/Sub
  • Creating a Pub/Sub topic
  • Creating and running a Pub/Sub publisher using Python
  • Creating a Pub/Sub subscription
  • Exercise - using Dataflow to stream data from Pub/Sub to GCS
  • Creating a HelloWorld application using Apache Beam
  • Creating a Dataflow streaming job without aggregation
  • Creating a streaming job with aggregation
  • Introduction to CDC and Datastream
  • What is Datastream?
  • Exercise - Datastream ETL streaming to BigQuery
  • Step 1 - create a Cloud SQL MySQL table
  • Step 2 - create a GCS bucket
  • Step 3 - create a GCS notification to the Pub/Sub topic and subscription
  • Step 4 - create a BigQuery dataset
  • Step 5 - configure a Datastream job
  • Step 6 - run a Dataflow job from the Dataflow template
  • Step 7 - insert a value in MySQL and check the result in BigQuery
  • Summary
  • Chapter 7: Visualizing Data to Make Data-Driven Decisions with Looker Studio
  • Technical requirements
  • Unlocking the power of your data with Looker Studio
  • Don't confuse Looker Studio with Looker
  • From data to metrics in minutes with an illustrative use case
  • Understanding what BigQuery INFORMATION_SCHEMA is
  • Exercise - accessing the BigQuery INFORMATION_SCHEMA table using Looker Studio
  • Exercise - creating a Looker Studio report using data from a bike-sharing data warehouse
  • Understanding how Looker Studio can impact the cost of BigQuery
  • What kind of table could be 1 TB in size?
  • How can a table be accessed 10,000 times in a month?
  • Creating Materialized Views and understanding how BI Engine works
  • Understanding BI Engine
  • Summary
  • Chapter 8: Building Machine Learning Solutions on GCP
  • Technical requirements
  • A quick look at ML
  • Exercise - practicing ML code using Python
  • Preparing the ML dataset by using a table from the BigQuery public dataset
  • Training the ML model using Random Forest in Python
  • Creating a batch prediction using the training dataset's output
  • The MLOps landscape in GCP
  • Understanding the basic principles of MLOps
  • Introducing GCP services related to MLOps
  • Exercise - leveraging pre-built GCP models as a service
  • Uploading the image to a GCS bucket
  • Creating a detect text function in Python
  • Exercise - using GCP in AutoML to train an ML model
  • Exercise - deploying a dummy workflow with Vertex AI Pipelines
  • Creating a dedicated regional GCS bucket
  • Developing the pipeline on Python
  • Monitoring the pipeline on the Vertex AI Pipelines console
  • Exercise - deploying a scikit-learn model pipeline with Vertex AI
  • Creating the first pipeline, which will result in an ML model file in GCS
  • Running the first pipeline in Vertex AI Pipelines
  • Creating the second pipeline, which will use the model file and store the prediction results as a CSV file in GCS
  • Running the second pipeline in Vertex AI Pipelines
  • Summary
  • Part 3: Key Strategies for Architecting Top-Notch Solutions
  • Chapter 9: User and Project Management in GCP
  • Technical requirements
  • Understanding IAM in GCP
  • Planning a GCP project structure
  • Understanding the GCP organization, folder, and project hierarchy
  • Deciding how many projects we should have in a GCP organization
  • Controlling user access to our data warehouse
  • Use-case scenario - planning BigQuery ACLs on an eCommerce organization
  • Practicing the concept of IaC using Terraform
  • Exercise - creating and running basic Terraform scripts
  • Self-exercise - managing a GCP project and resources using Terraform
  • Summary
  • Chapter 10: Data Governance in GCP
  • Technical requirements
  • Introduction to data governance
  • A deeper understanding of data usability
  • Exercise - implementing metadata tagging using Dataplex
  • A deeper understanding of data security
  • Example - BigQuery data masking
  • Exercise - finding PII using SDP
  • A deeper understanding of being accountable
  • Clear traceability
  • Clear data ownership
  • Data lineage
  • Clear data quality process
  • Exercise - practicing data quality using Dataform
  • Summary
  • Chapter 11: Cost Strategy in GCP
  • Technical requirements
  • Estimating the cost of your end-to-end data solution in GCP
  • Comparing BigQuery on-demand and editions
  • An example - estimating a data engineering use case
  • Tips to optimize BigQuery using partitioned and clustered tables
  • Partitioned tables
  • Clustered tables
  • An exercise - optimizing BigQuery on-demand cost
  • Summary
  • Chapter 12: CI/CD on GCP for Data Engineers
  • Technical requirements
  • An introduction to CI/CD
  • Understanding the data engineer's relationship with CI/CD practices
  • Understanding CI/CD components with GCP services
  • Exercise - implementing CI using Cloud Build
  • Creating a GitHub repository using a Cloud Source Repository