Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-Premises
This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for those eng...
Main author: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Berkeley, CA: Apress L. P., 2023. |
Edition: | 1st ed. |
Subjects: | |
View at Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009786706506719 |
Table of Contents:
- Intro
- Table of Contents
- About the Author
- About the Technical Reviewer
- Acknowledgments
- Introduction
- Chapter 1: Storage Usage in Kafka: Challenges, Strategies, and Best Practices
- How Kafka Runs Out of Disk Space
- A Retention Policy Can Cause Data Loss
- Configuring a Retention Policy for Kafka Topics
- Managing Consumer Lag and Preventing Data Loss
- Handling Bursty Data Influx from Producers
- Balancing Consumer Throttling and Avoiding Unintended Lag
- Understanding Daily Traffic Variations and Their Impact on Data Retention
- Ensuring Batch Duration Compliance with Topic Retention to Avoid Data Loss
- Adding Storage to Kafka Clusters
- When the Cluster Is On-Prem
- Scaling Up
- Scaling Out
- When the Cluster Is in the Cloud
- EBS Disks
- NVME Disks (Ephemeral Disks)
- Strategies and Considerations for Extended Retention in Kafka Clusters
- Calculating Storage Capacity Based on Time-Based Retention
- Retention Monitoring
- Data Skew in Partitions
- Message Rate Into Topic
- Don't Write to the / Mount Point
- Summary
- Chapter 2: Strategies for Aggregation, Data Cardinality, and Batching
- Balancing Message Distribution and Aggregation for Optimal Kafka Performance
- Tuning Parameters to Increase Throughput and Reduce Latency
- Optimizing Producer and Broker Performance: The Impact of Tuning linger.ms and batch.size in Kafka
- Understanding Compression Rate
- The Effect of Data Cardinality on Producers, Consumers, and Brokers
- Defining Data Cardinality
- Effects of High Data Cardinality
- Reducing Cardinality Level and Distribution
- Duplicating Data to Reduce Latency
- Summary
- Chapter 3: Understanding and Addressing Partition Skew in Kafka
- Skew of Partition Leaders vs. Skew of Partition Followers
- Potential Problems with Brokers That Host Many Partition Leaders
- Message Rate (or Incoming Bytes Rate)
- Number of Consumers Consuming the Topic
- Number of Producers Producing to the Topic
- Follower (Replica) Skew in the Broker
- Number of In-Sync Replicas in the Broker
- Checking for an Imbalance of Partition Leaders
- Reassigning Partitions to Achieve an Even Distribution
- Data Distribution Among Disks
- Summary
- Chapter 4: Dealing with Skewed and Lost Leaders
- When Partitions Lose Their Leadership
- ZooKeeper
- The Network Interface Card (NIC)
- Should Leader Skew Always Be Solved?
- When There Is High Traffic
- When There Is a Large Number of Consumers/Producers
- Understanding Leader Skew
- Summary
- Chapter 5: CPU Saturation in Kafka: Causes, Consequences, and Solutions
- CPU Saturation
- CPU Usage Types
- Causes of High CPU User Times
- Causes of High CPU System Times
- Example of Kafka Brokers with High CPU %us and %sy
- Causes of High CPU Wait Times
- Causes of High CPU System Interrupt Times
- The Effect of Compacted Topics with High Retention on Disk and CPU Use
- What Is Log Compaction?
- Real Production Issues Due to Log Compaction
- The Number of Consumers per Topic vs. CPU Use
- Summary
- Chapter 6: RAM Allocation in Kafka Clusters: Performance, Stability, and Optimization Strategies
- Adding RAM to a Kafka Cluster
- The Strategic Role of RAM Over CPU and Disks
- Cloud vs. On-Prem RAM Expansion: Considerations and Constraints
- Adding RAM to the Cloud
- Adding RAM to On-Prem Kafka Clusters
- Enhancing Kafka's Performance: The Benefits of Increasing Broker RAM
- Performance Boost
- Disk I/O Reduction
- Throughput Enhancement
- Latency Reduction
- Understanding the Linux Page Cache
- Page Cache in Kafka: Accelerating Writes and Reads
- Balancing Performance and Reliability: Kafka's Page Cache Utilization
- Monitoring Page Cache Usage Using the Cachestat Tool
- Lack of RAM and its Effect on Disks
- Optimize Kafka Disks When the Cluster Lacks RAM
- Use SSDs Instead of HDDs
- Distribute Logs Across Disks
- Tune OS Disk Scheduling Algorithm
- Adjust Kafka's Disk Flush Policies
- Enable Log Compression
- Enable OS Page Cache
- Monitor Disk Usage and I/O
- A Lack of RAM Can Cause Disks to Reach IOPS Saturation
- Optimize Kafka in Terms of RAM Allocation
- Set vm.swappiness to the Minimum Possible Value
- Increase the File Descriptor Limits
- Increase the Limit of Memory-Mapped Files
- Give at Least 32GB RAM to Your Kafka Brokers
- Monitor Garbage Collection Times Closely
- Tuning JVM Options
- Using Appropriate Instance Types When Deploying on a Cloud Platform
- Balancing Topics and Partitions Across Brokers
- Dealing with Garbage Collection (GC) and Out-Of-Memory (OOM)
- Latency Spikes
- Resource Utilization
- System Stability
- Impact on ZooKeeper Heartbeat
- Measuring Kafka Memory Usage
- The Crucial Role of RAM: Lessons from a Non-Kafka Cluster
- Summary
- Chapter 7: Disk I/O Overload in Kafka: Diagnosing and Overcoming Challenges
- Disk Performance Metrics
- Detecting Whether Disks Cause Latency in Kafka Brokers, Consumers, or Producers
- How Kafka Reads and Writes to the Disks
- Writes
- Reads
- Disk Performance Detection
- Data Skew in the Scope of a Single Broker
- Data Skew in the Scope of a Kafka Cluster
- Consumer Lag from a Specific Broker
- Slow (Faulty) Disk
- Real Production Issue: Detecting a Faulty Broker Using Disk Performance Metrics
- Discussion
- The Effect of Too Many disk.io Threads
- Discussion
- Looking at Disk Performance the Whole Time vs. During Peak Time Only
- Discussion
- The Effect of disk.io Threads on Broker, Producer, and Consumer Performance
- Request Queue Size
- Produce Latency
- Number of JVM Threads
- Number of Context Switches
- CPU User Time, System Time, and Normalized Load Average
- Discussion
- Summary
- Chapter 8: Disk Configuration: RAID 10 vs. JBOD
- RAID 10 and JBOD Terminology
- RAID 0 (aka Stripe Set)
- RAID 1 (aka Mirror Set)
- RAID 1+0 (aka RAID 10)
- Comparing RAID 10 and JBOD
- Disk Failure
- Data Skew
- Storage Use
- Pros and Cons of RAID 10 and JBOD
- Performance of Write Operations
- Storage Usage
- Disk Failure Tolerance
- Considering the Maintenance Burden of Disk Failure in On-Premises Clusters
- Disk Health Monitoring
- Frequency of Replacing Disks
- Kafka Availability During Disk Replacement
- Balancing the Data Between the Disks in the Broker
- JBOD
- RAID 10
- Managing Disk Health in Kafka Clusters with JBOD Configuration
- Summary
- Chapter 9: A Deep Dive Into Producer Monitoring
- Producer Metrics
- Network I/O Rate Metric
- When Network I/O Rate Is High
- When Network I/O Rate Is Low
- Importance of the Network I/O Rate Metric
- Record Queue Time Metric
- When Record Queue Time Is High
- When Record Queue Time Is Low
- Mitigating a High Record Queue Time
- Importance of the Record Queue Time Metric
- Output Bytes Metric
- When Output Bytes Is High
- When Output Bytes Is Low
- Mitigating the Output Bytes Value
- Importance of the Output Bytes Metric
- Input Bytes Metric
- The Difference Between Output Bytes and Input Bytes
- Average Batch Size Metric
- When Average Batch Size Is High
- When Average Batch Size Is Low
- Mitigating the Average Batch Size Metric
- Importance of the Average Batch Size Metric
- Buffer Available Bytes Metric
- When Buffer Available Bytes Is High
- When Buffer Available Bytes Is Low
- Mitigating the Buffer Available Bytes Metric
- Importance of the Buffer Available Bytes Metric
- Request Latency (Avg/Max) Metrics
- When Request Latency Is High
- When Request Latency Is Low
- Mitigating the Request Latency Metric
- Importance of the Request Latency Metric
- Understanding the Impact of Multiple Producers and Consumers on the Kafka Cluster
- Compression Rate: A Special Kind of Producer Metric
- Configuring Compression on the Producer and Broker Levels
- Compression Rate
- When Compression Rate Is High
- When Compression Rate Is Low
- Mitigating Compression Rate
- Importance of the Compression Rate Metric
- Summary
- Chapter 10: A Deep Dive Into Consumer Monitoring
- Consumer Metrics
- Consumer Lag Metrics
- When the Consumer Lag Metric Is High
- When the Consumer Lag Metric Is Low
- Mitigating the Consumer Lag Metric
- Importance of the Consumer Lag Metric
- Fetch Request Rate Metric
- When the Fetch Request Rate Is High
- When the Fetch Request Rate Is Low
- Mitigating the Fetch Request Rate
- Importance of the Fetch Request Rate
- Fetch Request Size (Avg/Max) Metrics
- When the Fetch Request Size Is High
- When the Fetch Request Size Is Low
- Mitigating the Fetch Request Size
- Importance of the Fetch Request Size Metrics
- Consumer I/O Wait Ratio Metric
- When the Consumer I/O Wait Ratio Is High
- When the Consumer I/O Wait Ratio Is Low
- Mitigating the Consumer I/O Wait Ratio
- Importance of the Consumer I/O Wait Ratio
- Records per Request Avg Metric
- When the Records per Request Metric Is High
- When the Records per Request Metric Is Low
- Mitigating the Records per Request Metric
- Importance of the Records per Request Metric
- Fetch Latency Avg/Max Metrics
- When the Fetch Latency Metrics Are High
- When the Fetch Latency Metrics Are Low
- Mitigating the Fetch Latency Metrics
- Importance of the Fetch Latency Metrics
- Consumer Request Rate Metric
- When the Consumer Request Rate Metric Is High