Big Data on Kubernetes: A Practical Guide to Building Efficient and Scalable Data Solutions

Gain hands-on experience in building efficient and scalable big data architecture on Kubernetes, utilizing leading technologies such as Spark, Airflow, Kafka, and Trino. Key Features: Leverage Kubernetes in a cloud environment to integrate seamlessly with a variety of tools. Explore best practices for...

Bibliographic Details
Other Authors:	Crepalde, Neylson (author)
Format:	E-book
Language:	English
Published:	Birmingham, England : Packt Publishing [2024]
Edition:	First edition
Subjects:	Kubernetes. Application software > Development. Application program interfaces (Computer software). Big data.
View at the Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009841205806719

Table of Contents:

Cover
Title page
Copyright and credits
Dedication
Contributors
Table of Contents
Preface
Part 1: Docker and Kubernetes
Chapter 1: Getting Started with Containers
Technical requirements
Container architecture
Installing Docker
Windows
macOS
Linux
Getting started with Docker images
hello-world
NGINX
Julia
Building your own image
Batch processing job
API service
Summary
Chapter 2: Kubernetes Architecture
Technical requirements
Kubernetes architecture
Control plane
Node components
Pods
Deployments
StatefulSets
Jobs
Services
ClusterIP Service
NodePort Service
LoadBalancer Service
Ingress and Ingress Controller
Gateway
Persistent Volumes
StorageClasses
ConfigMaps and Secrets
ConfigMaps
Secrets
Summary
Chapter 3: Getting Hands-On with Kubernetes
Technical requirements
Installing kubectl
Deploying a local cluster using Kind
Installing kind
Deploying the cluster
Deploying an AWS EKS cluster
Deploying a Google Cloud GKE cluster
Deploying an Azure AKS cluster
Running your API on Kubernetes
Creating the deployment
Creating a service
Using an ingress to access the API
Running a data processing job in Kubernetes
Summary
Part 2: Big Data Stack
Chapter 4: The Modern Data Stack
Data architectures
The Lambda architecture
The Kappa architecture
Comparing Lambda and Kappa
Data lake design for big data
Data warehouses
The rise of big data and data lakes
The rise of the data lakehouse
Implementing the lakehouse architecture
Batch ingestion
Storage
Batch processing
Orchestration
Batch serving
Data visualization
Real-time ingestion
Real-time processing
Real-time serving
Real-time data visualization
Summary
Chapter 5: Big Data Processing with Apache Spark
Technical requirements
Getting started with Spark
Installing Spark locally
Spark architecture
Spark executors
Components of execution
Starting a Spark program
The DataFrame API and the Spark SQL API
Transformations
Actions
Lazy evaluation
Data partitioning
Narrow versus wide transformations
Analyzing the Titanic dataset
Working with real data
How Spark performs joins
Joining IMDb tables
Summary
Chapter 6: Building Pipelines with Apache Airflow
Technical requirements
Getting started with Airflow
Installing Airflow with Astro
Airflow architecture
Airflow's distributed architecture
Building a data pipeline
Airflow integration with other tools
Summary
Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion
Technical requirements
Getting started with Kafka
Exploring the Kafka architecture
The PubSub design
How Kafka delivers exactly-once semantics
First producer and consumer
Streaming from a database with Kafka Connect
Real-time data processing with Kafka and Spark
Summary
Part 3: Connecting It All Together
Chapter 8: Deploying the Big Data Stack on Kubernetes
Technical requirements
Deploying Spark on Kubernetes
Deploying Airflow on Kubernetes
Deploying Kafka on Kubernetes
Summary
Chapter 9: Data Consumption Layer
Technical requirements
Getting started with SQL query engines
The limitations of traditional data warehouses
The rise of SQL query engines
The architecture of SQL query engines
Deploying Trino in Kubernetes
Connecting DBeaver with Trino
Deploying Elasticsearch in Kubernetes
How Elasticsearch stores, indexes and manages data
Elasticsearch deployment
Summary
Chapter 10: Building a Big Data Pipeline on Kubernetes
Technical requirements
Checking the deployed tools
Building a batch pipeline
Building the Airflow DAG
Creating SparkApplication jobs
Creating a Glue crawler
Building a real-time pipeline
Deploying Kafka Connect and Elasticsearch
Real-time processing with Spark
Deploying the Elasticsearch sink connector
Summary
Chapter 11: Generative AI on Kubernetes
Technical requirements
What generative AI is and what it is not
The power of large neural networks
Challenges and limitations
Using Amazon Bedrock to work with foundational models
Building a generative AI application on Kubernetes
Deploying the Streamlit app
Building RAG with Knowledge Bases for Amazon Bedrock
Adjusting the code for RAG retrieval
Building action models with agents
Creating a DynamoDB table
Configuring the agent
Deploying the application on Kubernetes
Summary
Chapter 12: Where to Go from Here
Important topics for big data in Kubernetes
Kubernetes monitoring and application monitoring
Building a service mesh
Security considerations
Automated scalability
GitOps and CI/CD for Kubernetes
Kubernetes cost control
What about team skills?
Key skills for monitoring
Building a service mesh
Security considerations
Automated scalability
Skills for GitOps and CI/CD
Cost control skills
Summary
Index
Other Books You May Enjoy