CRE-2025-0070

Kafka Under-Replicated Partitions Crisis
Severity: Critical
Impact: 10/10
Mitigation: 6/10

Description

Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics.

Mitigation

IMMEDIATE ACTIONS:
- Check the status of all brokers: `kafka-broker-api-versions --bootstrap-server <broker>`
- Identify the failed broker and investigate the root cause (logs, system resources)
- List under-replicated partitions: `kafka-topics --bootstrap-server <broker> --describe --under-replicated-partitions`
- Verify log directory health on each broker: `kafka-log-dirs --bootstrap-server <broker> --describe`

RECOVERY STEPS:
1. Restore the failed broker if possible (restart the service, fix resource issues)
2. If the broker cannot be restored, replace it with a new broker using the same broker.id
3. Monitor partition reassignment and ISR recovery
4. Consider manual partition reassignment if automatic recovery is slow

PREVENTION:
- Implement comprehensive monitoring for the UnderReplicatedPartitions metric
- Set up alerting for broker availability and resource utilization
- Consider increasing the replication factor for critical topics
- Implement proper capacity planning and resource allocation
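
To help identify the failed broker, the output of `kafka-topics --describe` can be parsed to see which broker IDs are missing from the ISR most often. The following is a minimal sketch, not part of this rule: the function names are hypothetical, and the line layout (`Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...`) is an assumption that may differ between Kafka versions.

```python
import re
from collections import Counter

# Assumed shape of a partition line in `kafka-topics --describe` output.
# The exact field layout varies between Kafka versions; adjust as needed.
_PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\w+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def find_under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR is smaller than its assigned replica set."""
    results = []
    for line in describe_output.splitlines():
        match = _PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary lines, blank lines, etc.
        replicas = {int(b) for b in match["replicas"].split(",") if b}
        isr = {int(b) for b in match["isr"].split(",") if b}
        missing = sorted(replicas - isr)
        if missing:
            results.append((match["topic"], int(match["partition"]), missing))
    return results


def suspect_brokers(under_replicated):
    """Count how often each broker ID is missing from an ISR,
    most frequently missing first."""
    counts = Counter(
        broker for _, _, missing in under_replicated for broker in missing
    )
    return counts.most_common()
```

Passing the captured command output to `find_under_replicated` and then the result to `suspect_brokers` gives a ranked list of broker IDs. A single broker missing from many ISRs is consistent with the broker-failure pattern described above, but it should be confirmed against broker logs and system metrics.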

References