CRE-2025-0070
Kafka Under-Replicated Partitions Crisis
Critical | Impact: 10/10 | Mitigation: 6/10
Description
Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure,
resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing
partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics.
Cause
- Broker process crash or unplanned shutdown
- Host-level failure (hardware, OS, or container failure)
- Network partition isolating broker from cluster
- Resource exhaustion (disk full, memory pressure, CPU saturation)
- JVM issues (OutOfMemoryError, garbage collection storms)
- Infrastructure issues (storage failures, network connectivity)
Mitigation
IMMEDIATE ACTIONS:
- Check status of all brokers: `kafka-broker-api-versions --bootstrap-server <broker>`
- Identify failed broker and investigate root cause (logs, system resources)
- Monitor under-replicated partitions: `kafka-topics --bootstrap-server <broker> --describe --under-replicated-partitions`
- Verify log directory health and replica sizes on each broker: `kafka-log-dirs --bootstrap-server <broker> --describe`
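
To narrow down which broker has failed, the under-replicated partition listing can be summarised per broker. The following is a minimal sketch, assuming the output of `kafka-topics --describe` has been captured as text and that each partition line carries `Topic:`, `Partition:`, `Leader:`, `Replicas:` and `Isr:` fields. The topic names and broker ids in `SAMPLE` are made up for illustration, and `missing_replicas` is a hypothetical helper, not part of any Kafka tooling.

```python
import re
from collections import Counter

# Matches one partition line of captured `kafka-topics --describe` output.
# Assumption: the fields appear in this order, separated by whitespace.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_ids(field):
    """Turn a comma-separated id list such as '1,2,3' into [1, 2, 3]."""
    return [int(x) for x in field.split(",") if x]


def missing_replicas(describe_output):
    """Count, per broker id, the partitions that list it in Replicas but not in Isr."""
    counts = Counter()
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip topic summary lines and anything unrecognised
        isr = set(parse_ids(match["isr"]))
        for broker_id in parse_ids(match["replicas"]):
            if broker_id not in isr:
                counts[broker_id] += 1
    return counts


# Illustrative sample only: three partitions, broker 2 missing from every ISR.
SAMPLE = """\
Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,3
Topic: orders\tPartition: 1\tLeader: 3\tReplicas: 2,3,1\tIsr: 3,1
Topic: payments\tPartition: 0\tLeader: 3\tReplicas: 3,2\tIsr: 3
"""

if __name__ == "__main__":
    for broker_id, count in missing_replicas(SAMPLE).most_common():
        print(f"broker {broker_id} is out of sync for {count} partition(s)")
```

A single broker id dominating the counts is consistent with the single-broker failure described above. Counts spread across several brokers suggest a wider problem such as a network partition.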
RECOVERY STEPS:
- Restore failed broker if possible (restart service, fix resource issues)
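- If the broker cannot be restored, replace it with a new broker configured with the same `broker.id`, so that it takes over the failed broker's replica assignments and re-replicates their data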
- Monitor partition reassignment and ISR recovery
- Consider manual partition reassignment if automatic recovery is slow
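
For the manual reassignment case, `kafka-reassign-partitions` accepts a JSON plan of the form `{"version": 1, "partitions": [...]}`. The sketch below builds such a plan by swapping a failed broker id for a healthy one in every affected replica list. It assumes the current assignments have already been collected into a dictionary. `build_reassignment` and the broker ids used are hypothetical and for illustration only.

```python
import json


def build_reassignment(assignments, failed_id, replacement_id):
    """Build a reassignment plan that swaps failed_id for replacement_id.

    assignments maps (topic, partition) -> current replica id list.
    Partitions that do not include failed_id are left out of the plan.
    """
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if failed_id not in replicas:
            continue
        if replacement_id in replicas:
            # A replica list must not contain the same broker twice.
            raise ValueError(
                f"{topic}-{partition} already has a replica on broker {replacement_id}"
            )
        new_replicas = [replacement_id if b == failed_id else b for b in replicas]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical cluster state: broker 2 has failed, broker 4 is healthy.
    current = {("orders", 0): [1, 2, 3], ("orders", 1): [1, 3]}
    print(json.dumps(build_reassignment(current, failed_id=2, replacement_id=4), indent=2))
```

The resulting JSON can be saved to a file and passed to `kafka-reassign-partitions --bootstrap-server <broker> --reassignment-json-file <file> --execute`, then checked with `--verify`. Keeping each replica's position in the list preserves the preferred-leader order for partitions the failed broker did not lead.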
PREVENTION:
- Implement comprehensive monitoring for UnderReplicatedPartitions metric
- Set up alerting for broker availability and resource utilization
- Consider increasing replication factor for critical topics
- Implement proper capacity planning and resource allocation
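
The UnderReplicatedPartitions metric is exposed by each broker over JMX as `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`. As a rough sketch of an alert rule, the function below fires only when the value stays above zero for several consecutive samples, which avoids paging on the brief spikes that can occur during rolling restarts. The sample collection itself is out of scope here. `should_alert` and the default of three samples are assumptions, not recommended values.

```python
URP_MBEAN = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"


def should_alert(samples, min_consecutive=3):
    """Return True if the latest min_consecutive samples are all above zero.

    samples is an ordered sequence of metric values, oldest first.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    recent = list(samples)[-min_consecutive:]
    return len(recent) == min_consecutive and all(value > 0 for value in recent)


if __name__ == "__main__":
    # Hypothetical readings scraped once per minute from one broker.
    readings = [0, 0, 12, 12, 12]
    if should_alert(readings):
        print(f"ALERT: {URP_MBEAN} has been above zero for 3 consecutive samples")
```

In practice the same condition is usually expressed in the monitoring system's own rule language. The point of the sketch is the sustained-duration check rather than the specific threshold.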
References
- https://kafka.apache.org/documentation/#design_replicatedlog
- https://docs.confluent.io/platform/current/kafka/monitoring.html#replica-management
- https://kafka.apache.org/documentation/#basic_ops_cluster_expansion
- https://cwiki.apache.org/confluence/display/KAFKA/KIP-91%3A+Provide+Broker+Metadata+in+LeaderAndIsrRequest