Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-Premises

This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for those eng...

Bibliographic Details
Main author: Eldor, Elad (-)
Format: E-book
Language: English
Published: Berkeley, CA : Apress L. P. 2023.
Edition: 1st ed
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009786706506719
Table of Contents:
  • Intro
  • Table of Contents
  • About the Author
  • About the Technical Reviewer
  • Acknowledgments
  • Introduction
  • Chapter 1: Storage Usage in Kafka: Challenges, Strategies, and Best Practices
  • How Kafka Runs Out of Disk Space
  • A Retention Policy Can Cause Data Loss
  • Configuring a Retention Policy for Kafka Topics
  • Managing Consumer Lag and Preventing Data Loss
  • Handling Bursty Data Influx from Producers
  • Balancing Consumer Throttling and Avoiding Unintended Lag
  • Understanding Daily Traffic Variations and Their Impact on Data Retention
  • Ensuring Batch Duration Compliance with Topic Retention to Avoid Data Loss
  • Adding Storage to Kafka Clusters
  • When the Cluster Is On-Prem
  • Scaling Up
  • Scaling Out
  • When the Cluster Is in the Cloud
  • EBS Disks
  • NVMe Disks (Ephemeral Disks)
  • Strategies and Considerations for Extended Retention in Kafka Clusters
  • Calculating Storage Capacity Based on Time-Based Retention
  • Retention Monitoring
  • Data Skew in Partitions
  • Message Rate Into Topic
  • Don't Write to the / Mount Point
  • Summary
  • Chapter 2: Strategies for Aggregation, Data Cardinality, and Batching
  • Balancing Message Distribution and Aggregation for Optimal Kafka Performance
  • Tuning Parameters to Increase Throughput and Reduce Latency
  • Optimizing Producer and Broker Performance: The Impact of Tuning linger.ms and batch.size in Kafka
  • Understanding Compression Rate
  • The Effect of Data Cardinality on Producers, Consumers, and Brokers
  • Defining Data Cardinality
  • Effects of High Data Cardinality
  • Reducing Cardinality Level and Distribution
  • Duplicating Data to Reduce Latency
  • Summary
  • Chapter 3: Understanding and Addressing Partition Skew in Kafka
  • Skew of Partition Leaders vs. Skew of Partition Followers
  • Potential Problems with Brokers That Host Many Partition Leaders
  • Message Rate (or Incoming Bytes Rate)
  • Number of Consumers Consuming the Topic
  • Number of Producers Producing to the Topic
  • Follower (Replica) Skew in the Broker
  • Number of In-Sync Replicas in the Broker
  • Checking for an Imbalance of Partition Leaders
  • Reassigning Partitions to Achieve an Even Distribution
  • Data Distribution Among Disks
  • Summary
  • Chapter 4: Dealing with Skewed and Lost Leaders
  • When Partitions Lose Their Leadership
  • ZooKeeper
  • The Network Interface Card (NIC)
  • Should Leader Skew Always Be Solved?
  • When There Is High Traffic
  • When There Is a Large Number of Consumers/Producers
  • Understanding Leader Skew
  • Summary
  • Chapter 5: CPU Saturation in Kafka: Causes, Consequences, and Solutions
  • CPU Saturation
  • CPU Usage Types
  • Causes of High CPU User Times
  • Causes of High CPU System Times
  • Example of Kafka Brokers with High CPU %us and %sy
  • Causes of High CPU Wait Times
  • Causes of High CPU System Interrupt Times
  • The Effect of Compacted Topics with High Retention on Disk and CPU Use
  • What Is Log Compaction?
  • Real Production Issues Due to Log Compaction
  • The Number of Consumers per Topic vs. CPU Use
  • Summary
  • Chapter 6: RAM Allocation in Kafka Clusters: Performance, Stability, and Optimization Strategies
  • Adding RAM to a Kafka Cluster
  • The Strategic Role of RAM Over CPU and Disks
  • Cloud vs. On-Prem RAM Expansion: Considerations and Constraints
  • Adding RAM to the Cloud
  • Adding RAM to On-Prem Kafka Clusters
  • Enhancing Kafka's Performance: The Benefits of Increasing Broker RAM
  • Performance Boost
  • Disk I/O Reduction
  • Throughput Enhancement
  • Latency Reduction
  • Understanding the Linux Page Cache
  • Page Cache in Kafka: Accelerating Writes and Reads
  • Balancing Performance and Reliability: Kafka's Page Cache Utilization
  • Monitoring Page Cache Usage Using the Cachestat Tool
  • Lack of RAM and Its Effect on Disks
  • Optimize Kafka Disks When the Cluster Lacks RAM
  • Use SSDs Instead of HDDs
  • Distribute Logs Across Disks
  • Tune OS Disk Scheduling Algorithm
  • Adjust Kafka's Disk Flush Policies
  • Enable Log Compression
  • Enable OS Page Cache
  • Monitor Disk Usage and I/O
  • A Lack of RAM Can Cause Disks to Reach IOPS Saturation
  • Optimize Kafka in Terms of RAM Allocation
  • Set vm.swappiness to the Minimum Possible Value
  • Increase the File Descriptor Limits
  • Increase the Limit of Memory-Mapped Files
  • Give at Least 32GB RAM to Your Kafka Brokers
  • Monitor Garbage Collection Times Closely
  • Tuning JVM Options
  • Using Appropriate Instance Types When Deploying on a Cloud Platform
  • Balancing Topics and Partitions Across Brokers
  • Dealing with Garbage Collection (GC) and Out-Of-Memory (OOM)
  • Latency Spikes
  • Resource Utilization
  • System Stability
  • Impact on ZooKeeper Heartbeat
  • Measuring Kafka Memory Usage
  • The Crucial Role of RAM: Lessons from a Non-Kafka Cluster
  • Summary
  • Chapter 7: Disk I/O Overload in Kafka: Diagnosing and Overcoming Challenges
  • Disk Performance Metrics
  • Detecting Whether Disks Cause Latency in Kafka Brokers, Consumers, or Producers
  • How Kafka Reads and Writes to the Disks
  • Writes
  • Reads
  • Disk Performance Detection
  • Data Skew in the Scope of a Single Broker
  • Data Skew in the Scope of a Kafka Cluster
  • Consumer Lag from a Specific Broker
  • Slow (Faulty) Disk
  • Real Production Issue: Detecting a Faulty Broker Using Disk Performance Metrics
  • Discussion
  • The Effect of Too Many disk.io Threads
  • Discussion
  • Looking at Disk Performance the Whole Time vs. During Peak Time Only
  • Discussion
  • The Effect of disk.io Threads on Broker, Producer, and Consumer Performance
  • Request Queue Size
  • Produce Latency
  • Number of JVM Threads
  • Number of Context Switches
  • CPU User Time, System Time, and Normalized Load Average
  • Discussion
  • Summary
  • Chapter 8: Disk Configuration: RAID 10 vs. JBOD
  • RAID 10 and JBOD Terminology
  • RAID 0 (aka Stripe Set)
  • RAID 1 (aka Mirror Set)
  • RAID 1+0 (aka RAID 10)
  • Comparing RAID 10 and JBOD
  • Disk Failure
  • Data Skew
  • Storage Use
  • Pros and Cons of RAID 10 and JBOD
  • Performance of Write Operations
  • Storage Usage
  • Disk Failure Tolerance
  • Considering the Maintenance Burden of Disk Failure in On-Premises Clusters
  • Disk Health Monitoring
  • Frequency of Replacing Disks
  • Kafka Availability During Disk Replacement
  • Balancing the Data Between the Disks in the Broker
  • JBOD
  • RAID 10
  • Managing Disk Health in Kafka Clusters with JBOD Configuration
  • Summary
  • Chapter 9: A Deep Dive Into Producer Monitoring
  • Producer Metrics
  • Network I/O Rate Metric
  • When Network I/O Rate Is High
  • When Network I/O Rate Is Low
  • Importance of the Network I/O Rate Metric
  • Record Queue Time Metric
  • When Record Queue Time Is High
  • When Record Queue Time Is Low
  • Mitigating a High Record Queue Time
  • Importance of the Record Queue Time Metric
  • Output Bytes Metric
  • When Output Bytes Is High
  • When Output Bytes Is Low
  • Mitigating the Output Bytes Value
  • Importance of the Output Bytes Metric
  • Input Bytes Metric
  • The Difference Between Output Bytes and Input Bytes
  • Average Batch Size Metric
  • When Average Batch Size Is High
  • When Average Batch Size Is Low
  • Mitigating the Average Batch Size Metric
  • Importance of the Average Batch Size Metric
  • Buffer Available Bytes Metric
  • When Buffer Available Bytes Is High
  • When Buffer Available Bytes Is Low
  • Mitigating the Buffer Available Bytes Metric
  • Importance of the Buffer Available Bytes Metric
  • Request Latency (Avg/Max) Metrics
  • When Request Latency Is High
  • When Request Latency Is Low
  • Mitigating the Request Latency Metric
  • Importance of the Request Latency Metric
  • Understanding the Impact of Multiple Producers and Consumers on the Kafka Cluster
  • Compression Rate: A Special Kind of Producer Metric
  • Configuring Compression on the Producer and Broker Levels
  • Compression Rate
  • When Compression Rate Is High
  • When Compression Rate Is Low
  • Mitigating Compression Rate
  • Importance of the Compression Rate Metric
  • Summary
  • Chapter 10: A Deep Dive Into Consumer Monitoring
  • Consumer Metrics
  • Consumer Lag Metrics
  • When the Consumer Lag Metric Is High
  • When the Consumer Lag Metric Is Low
  • Mitigating the Consumer Lag Metric
  • Importance of the Consumer Lag Metric
  • Fetch Request Rate Metric
  • When the Fetch Request Rate Is High
  • When the Fetch Request Rate Is Low
  • Mitigating the Fetch Request Rate
  • Importance of the Fetch Request Rate
  • Fetch Request Size (Avg/Max) Metrics
  • When the Fetch Request Size Is High
  • When the Fetch Request Size Is Low
  • Mitigating the Fetch Request Size
  • Importance of the Fetch Request Size Metrics
  • Consumer I/O Wait Ratio Metric
  • When the Consumer I/O Wait Ratio Is High
  • When the Consumer I/O Wait Ratio Is Low
  • Mitigating the Consumer I/O Wait Ratio
  • Importance of the Consumer I/O Wait Ratio
  • Records per Request Avg Metric
  • When the Records per Request Metric Is High
  • When the Records per Request Metric Is Low
  • Mitigating the Records per Request Metric
  • Importance of the Records per Request Metric
  • Fetch Latency Avg/Max Metrics
  • When the Fetch Latency Metrics Are High
  • When the Fetch Latency Metrics Are Low
  • Mitigating the Fetch Latency Metrics
  • Importance of the Fetch Latency Metrics
  • Consumer Request Rate Metric
  • When the Consumer Request Rate Metric Is High