Big Data on Kubernetes: A Practical Guide to Building Efficient and Scalable Data Solutions
Gain hands-on experience in building efficient and scalable big data architecture on Kubernetes, utilizing leading technologies such as Spark, Airflow, Kafka, and Trino. Key Features: Leverage Kubernetes in a cloud environment to integrate seamlessly with a variety of tools. Explore best practices for...
Other Authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham, England : Packt Publishing [2024] |
Edition: | First edition |
Subjects: | |
View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009841205806719 |
Table of Contents:
- Cover
- Title page
- Copyright and credits
- Dedication
- Contributors
- Table of Contents
- Preface
- Part 1: Docker and Kubernetes
- Chapter 1: Getting Started with Containers
- Technical requirements
- Container architecture
- Installing Docker
- Windows
- macOS
- Linux
- Getting started with Docker images
- hello-world
- NGINX
- Julia
- Building your own image
- Batch processing job
- API service
- Summary
- Chapter 2: Kubernetes Architecture
- Technical requirements
- Kubernetes architecture
- Control plane
- Node components
- Pods
- Deployments
- StatefulSets
- Jobs
- Services
- ClusterIP Service
- NodePort Service
- LoadBalancer Service
- Ingress and Ingress Controller
- Gateway
- Persistent Volumes
- StorageClasses
- ConfigMaps and Secrets
- ConfigMaps
- Secrets
- Summary
- Chapter 3: Getting Hands-On with Kubernetes
- Technical requirements
- Installing kubectl
- Deploying a local cluster using Kind
- Installing kind
- Deploying the cluster
- Deploying an AWS EKS cluster
- Deploying a Google Cloud GKE cluster
- Deploying an Azure AKS cluster
- Running your API on Kubernetes
- Creating the deployment
- Creating a service
- Using an ingress to access the API
- Running a data processing job in Kubernetes
- Summary
- Part 2: Big Data Stack
- Chapter 4: The Modern Data Stack
- Data architectures
- The Lambda architecture
- The Kappa architecture
- Comparing Lambda and Kappa
- Data lake design for big data
- Data warehouses
- The rise of big data and data lakes
- The rise of the data lakehouse
- Implementing the lakehouse architecture
- Batch ingestion
- Storage
- Batch processing
- Orchestration
- Batch serving
- Data visualization
- Real-time ingestion
- Real-time processing
- Real-time serving
- Real-time data visualization
- Summary
- Chapter 5: Big Data Processing with Apache Spark
- Technical requirements
- Getting started with Spark
- Installing Spark locally
- Spark architecture
- Spark executors
- Components of execution
- Starting a Spark program
- The DataFrame API and the Spark SQL API
- Transformations
- Actions
- Lazy evaluation
- Data partitioning
- Narrow versus wide transformations
- Analyzing the Titanic dataset
- Working with real data
- How Spark performs joins
- Joining IMDb tables
- Summary
- Chapter 6: Building Pipelines with Apache Airflow
- Technical requirements
- Getting started with Airflow
- Installing Airflow with Astro
- Airflow architecture
- Airflow's distributed architecture
- Building a data pipeline
- Airflow integration with other tools
- Summary
- Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion
- Technical requirements
- Getting started with Kafka
- Exploring the Kafka architecture
- The PubSub design
- How Kafka delivers exactly-once semantics
- First producer and consumer
- Streaming from a database with Kafka Connect
- Real-time data processing with Kafka and Spark
- Summary
- Part 3: Connecting It All Together
- Chapter 8: Deploying the Big Data Stack on Kubernetes
- Technical requirements
- Deploying Spark on Kubernetes
- Deploying Airflow on Kubernetes
- Deploying Kafka on Kubernetes
- Summary
- Chapter 9: Data Consumption Layer
- Technical requirements
- Getting started with SQL query engines
- The limitations of traditional data warehouses
- The rise of SQL query engines
- The architecture of SQL query engines
- Deploying Trino in Kubernetes
- Connecting DBeaver with Trino
- Deploying Elasticsearch in Kubernetes
- How Elasticsearch stores, indexes and manages data
- Elasticsearch deployment
- Summary
- Chapter 10: Building a Big Data Pipeline on Kubernetes
- Technical requirements
- Checking the deployed tools
- Building a batch pipeline
- Building the Airflow DAG
- Creating SparkApplication jobs
- Creating a Glue crawler
- Building a real-time pipeline
- Deploying Kafka Connect and Elasticsearch
- Real-time processing with Spark
- Deploying the Elasticsearch sink connector
- Summary
- Chapter 11: Generative AI on Kubernetes
- Technical requirements
- What generative AI is and what it is not
- The power of large neural networks
- Challenges and limitations
- Using Amazon Bedrock to work with foundational models
- Building a generative AI application on Kubernetes
- Deploying the Streamlit app
- Building RAG with Knowledge Bases for Amazon Bedrock
- Adjusting the code for RAG retrieval
- Building action models with agents
- Creating a DynamoDB table
- Configuring the agent
- Deploying the application on Kubernetes
- Summary
- Chapter 12: Where to Go from Here
- Important topics for big data in Kubernetes
- Kubernetes monitoring and application monitoring
- Building a service mesh
- Security considerations
- Automated scalability
- GitOps and CI/CD for Kubernetes
- Kubernetes cost control
- What about team skills?
- Key skills for monitoring
- Building a service mesh
- Security considerations
- Automated scalability
- Skills for GitOps and CI/CD
- Cost control skills
- Summary
- Index
- Other Books You May Enjoy