Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-Premises

This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for those eng...

Bibliographic Details
Main author:	Eldor, Elad (-)
Format:	E-book
Language:	English
Published:	Berkeley, CA : Apress L. P. 2023.
Edition:	1st ed
Subjects:	Kafka (Electronic resource); Big data; Cloud computing.
View at Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009786706506719

Table of Contents:

Intro
Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Storage Usage in Kafka: Challenges, Strategies, and Best Practices
How Kafka Runs Out of Disk Space
A Retention Policy Can Cause Data Loss
Configuring a Retention Policy for Kafka Topics
Managing Consumer Lag and Preventing Data Loss
Handling Bursty Data Influx from Producers
Balancing Consumer Throttling and Avoiding Unintended Lag
Understanding Daily Traffic Variations and Their Impact on Data Retention
Ensuring Batch Duration Compliance with Topic Retention to Avoid Data Loss
Adding Storage to Kafka Clusters
When the Cluster Is On-Prem
Scaling Up
Scaling Out
When the Cluster Is in the Cloud
EBS Disks
NVMe Disks (Ephemeral Disks)
Strategies and Considerations for Extended Retention in Kafka Clusters
Calculating Storage Capacity Based on Time-Based Retention
Retention Monitoring
Data Skew in Partitions
Message Rate Into Topic
Don't Write to the / Mount Point
Summary
Chapter 2: Strategies for Aggregation, Data Cardinality, and Batching
Balancing Message Distribution and Aggregation for Optimal Kafka Performance
Tuning Parameters to Increase Throughput and Reduce Latency
Optimizing Producer and Broker Performance: The Impact of Tuning linger.ms and batch.size in Kafka
Understanding Compression Rate
The Effect of Data Cardinality on Producers, Consumers, and Brokers
Defining Data Cardinality
Effects of High Data Cardinality
Reducing Cardinality Level and Distribution
Duplicating Data to Reduce Latency
Summary
Chapter 3: Understanding and Addressing Partition Skew in Kafka
Skew of Partition Leaders vs. Skew of Partition Followers
Potential Problems with Brokers That Host Many Partition Leaders
Message Rate (or Incoming Bytes Rate)
Number of Consumers Consuming the Topic
Number of Producers Producing to the Topic
Follower (Replica) Skew in the Broker
Number of In-Sync Replicas in the Broker
Checking for an Imbalance of Partition Leaders
Reassigning Partitions to Achieve an Even Distribution
Data Distribution Among Disks
Summary
Chapter 4: Dealing with Skewed and Lost Leaders
When Partitions Lose Their Leadership
ZooKeeper
The Network Interface Card (NIC)
Should Leader Skew Always Be Solved?
When There Is High Traffic
When There Is a Large Number of Consumers/Producers
Understanding Leader Skew
Summary
Chapter 5: CPU Saturation in Kafka: Causes, Consequences, and Solutions
CPU Saturation
CPU Usage Types
Causes of High CPU User Times
Causes of High CPU System Times
Example of Kafka Brokers with High CPU %us and %sy
Causes of High CPU Wait Times
Causes of High CPU System Interrupt Times
The Effect of Compacted Topics with High Retention on Disk and CPU Use
What Is Log Compaction?
Real Production Issues Due to Log Compaction
The Number of Consumers per Topic vs. CPU Use
Summary
Chapter 6: RAM Allocation in Kafka Clusters: Performance, Stability, and Optimization Strategies
Adding RAM to a Kafka Cluster
The Strategic Role of RAM Over CPU and Disks
Cloud vs. On-Prem RAM Expansion: Considerations and Constraints
Adding RAM to the Cloud
Adding RAM to On-Prem Kafka Clusters
Enhancing Kafka's Performance: The Benefits of Increasing Broker RAM
Performance Boost
Disk I/O Reduction
Throughput Enhancement
Latency Reduction
Understanding the Linux Page Cache
Page Cache in Kafka: Accelerating Writes and Reads
Balancing Performance and Reliability: Kafka's Page Cache Utilization
Monitoring Page Cache Usage Using the Cachestat Tool
Lack of RAM and Its Effect on Disks
Optimize Kafka Disks When the Cluster Lacks RAM
Use SSDs Instead of HDDs
Distribute Logs Across Disks
Tune OS Disk Scheduling Algorithm
Adjust Kafka's Disk Flush Policies
Enable Log Compression
Enable OS Page Cache
Monitor Disk Usage and I/O
A Lack of RAM Can Cause Disks to Reach IOPS Saturation
Optimize Kafka in Terms of RAM Allocation
Set vm.swappiness to the Minimum Possible Value
Increase the File Descriptor Limits
Increase the Limit of Memory-Mapped Files
Give at Least 32GB RAM to Your Kafka Brokers
Monitor Garbage Collection Times Closely
Tuning JVM Options
Using Appropriate Instance Types When Deploying on a Cloud Platform
Balancing Topics and Partitions Across Brokers
Dealing with Garbage Collection (GC) and Out-Of-Memory (OOM)
Latency Spikes
Resource Utilization
System Stability
Impact on ZooKeeper Heartbeat
Measuring Kafka Memory Usage
The Crucial Role of RAM: Lessons from a Non-Kafka Cluster
Summary
Chapter 7: Disk I/O Overload in Kafka: Diagnosing and Overcoming Challenges
Disk Performance Metrics
Detecting Whether Disks Cause Latency in Kafka Brokers, Consumers, or Producers
How Kafka Reads and Writes to the Disks
Writes
Reads
Disk Performance Detection
Data Skew in the Scope of a Single Broker
Data Skew in the Scope of a Kafka Cluster
Consumer Lag from a Specific Broker
Slow (Faulty) Disk
Real Production Issue: Detecting a Faulty Broker Using Disk Performance Metrics
Discussion
The Effect of Too Many disk.io Threads
Discussion
Looking at Disk Performance the Whole Time vs. During Peak Time Only
Discussion
The Effect of disk.io Threads on Broker, Producer, and Consumer Performance
Request Queue Size
Produce Latency
Number of JVM Threads
Number of Context Switches
CPU User Time, System Time, and Normalized Load Average
Discussion
Summary
Chapter 8: Disk Configuration: RAID 10 vs. JBOD
RAID 10 and JBOD Terminology
RAID 0 (aka Stripe Set)
RAID 1 (aka Mirror Set)
RAID 1+0 (aka RAID 10)
Comparing RAID 10 and JBOD
Disk Failure
Data Skew
Storage Use
Pros and Cons of RAID 10 and JBOD
Performance of Write Operations
Storage Usage
Disk Failure Tolerance
Considering the Maintenance Burden of Disk Failure in On-Premises Clusters
Disk Health Monitoring
Frequency of Replacing Disks
Kafka Availability During Disk Replacement
Balancing the Data Between the Disks in the Broker
JBOD
RAID 10
Managing Disk Health in Kafka Clusters with JBOD Configuration
Summary
Chapter 9: A Deep Dive Into Producer Monitoring
Producer Metrics
Network I/O Rate Metric
When Network I/O Rate Is High
When Network I/O Rate Is Low
Importance of the Network I/O Rate Metric
Record Queue Time Metric
When Record Queue Time Is High
When Record Queue Time Is Low
Mitigating a High Record Queue Time
Importance of the Record Queue Time Metric
Output Bytes Metric
When Output Bytes Is High
When Output Bytes Is Low
Mitigating the Output Bytes Value
Importance of the Output Bytes Metric
Input Bytes Metric
The Difference Between Output Bytes and Input Bytes
Average Batch Size Metric
When Average Batch Size Is High
When Average Batch Size Is Low
Mitigating the Average Batch Size Metric
Importance of the Average Batch Size Metric
Buffer Available Bytes Metric
When Buffer Available Bytes Is High
When Buffer Available Bytes Is Low
Mitigating the Buffer Available Bytes Metric
Importance of the Buffer Available Bytes Metric
Request Latency (Avg/Max) Metrics
When Request Latency Is High
When Request Latency Is Low
Mitigating the Request Latency Metric
Importance of the Request Latency Metric
Understanding the Impact of Multiple Producers and Consumers on the Kafka Cluster
Compression Rate: A Special Kind of Producer Metric
Configuring Compression on the Producer and Broker Levels
Compression Rate
When Compression Rate Is High
When Compression Rate Is Low
Mitigating Compression Rate
Importance of the Compression Rate Metric
Summary
Chapter 10: A Deep Dive Into Consumer Monitoring
Consumer Metrics
Consumer Lag Metrics
When the Consumer Lag Metric Is High
When the Consumer Lag Metric Is Low
Mitigating the Consumer Lag Metric
Importance of the Consumer Lag Metric
Fetch Request Rate Metric
When the Fetch Request Rate Is High
When the Fetch Request Rate Is Low
Mitigating the Fetch Request Rate
Importance of the Fetch Request Rate
Fetch Request Size (Avg/Max) Metrics
When the Fetch Request Size Is High
When the Fetch Request Size Is Low
Mitigating the Fetch Request Size
Importance of the Fetch Request Size Metrics
Consumer I/O Wait Ratio Metric
When the Consumer I/O Wait Ratio Is High
When the Consumer I/O Wait Ratio Is Low
Mitigating the Consumer I/O Wait Ratio
Importance of the Consumer I/O Wait Ratio
Records per Request Avg Metric
When the Records per Request Metric Is High
When the Records per Request Metric Is Low
Mitigating the Records per Request Metric
Importance of the Records per Request Metric
Fetch Latency Avg/Max Metrics
When the Fetch Latency Metrics Are High
When the Fetch Latency Metrics Are Low
Mitigating the Fetch Latency Metrics
Importance of the Fetch Latency Metrics
Consumer Request Rate Metric
When the Consumer Request Rate Metric Is High