CRE-2025-0102
Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted
Severity: High
Description
- The Redpanda streaming data platform is experiencing a severe, cascading failure.
- This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down.
- Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability.
Cause
- Persistent hardware failures on a broker node (e.g., disk I/O errors, failing disk, NIC issues).
- A Redpanda broker process crashing repeatedly due to software bugs, resource exhaustion (OOM), or critical misconfiguration (a quick OOM check is sketched after this list).
- Severe network partitioning that isolates nodes or groups of nodes from each other, preventing Raft consensus.
- Failure of critical system resources (e.g., full disk preventing writes, insufficient memory).
- Cascading effects where an initial failure on one node triggers instability across other nodes.
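For the resource-exhaustion cause above, a quick check for kernel OOM kills on a suspect node might look like the following. This is a minimal sketch that assumes a Linux host with journald available; the exact log wording can vary by kernel version.

```bash
# Look for the kernel OOM killer terminating the Redpanda (or any other) process
dmesg -T | grep -i 'out of memory'
journalctl -k --since "24 hours ago" | grep -iE 'oom|killed process'
```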
Mitigation
- Initial Triage & Isolation:
  - Identify the specific node(s) reporting critical errors (e.g., I/O errors, shutdown messages) in the Redpanda logs.
  - Check basic system health on the affected nodes: kernel messages (`dmesg`), disk space (`df -h`), memory usage (`free -m`), CPU load (`top` or `htop`), and network connectivity (`ping`, `ip addr`); see the example commands below.
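A minimal triage sketch for a single affected node, assuming a systemd-managed install with the service named `redpanda` (adjust the service name, log access, and the placeholder peer address for your environment, e.g., Kubernetes pods):

```bash
# Kernel-level errors (disk I/O, NVMe/SATA resets, NIC problems)
dmesg -T | grep -iE 'error|i/o|nvme|ata|sd[a-z]' | tail -50

# Disk space and memory headroom
df -h
free -m

# CPU load and hot processes
top -b -n 1 | head -20

# Network interfaces and reachability of a peer broker
ip addr
ping -c 3 <peer-broker-address>   # placeholder: substitute a real broker host

# Recent Redpanda service logs (systemd-managed installs)
journalctl -u redpanda --since "1 hour ago" --no-pager | tail -100
```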
- Address Node-Specific Failures:
  - Disk issues: if I/O errors or "No space left on device" messages appear, check disk health (e.g., with `smartctl`), free up space, or prepare to replace the disk.
  - Node shutdowns: investigate the Redpanda logs on the failed node to find the root cause of the shutdown.
  - Restart the Redpanda service on the affected node once the underlying issue is resolved (example below).
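A sketch of the disk checks and a service restart, assuming a bare-metal or VM install managed by systemd, a data disk at `/dev/sdb` (placeholder device), and the default data directory `/var/lib/redpanda/data` (adjust if changed):

```bash
# SMART health summary for the suspect data disk (placeholder device)
sudo smartctl -H /dev/sdb
sudo smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrectable'

# Find what is consuming space in the Redpanda data directory (default path)
sudo du -sh /var/lib/redpanda/data/* | sort -rh | head -20

# Restart the broker only after the underlying cause is fixed
sudo systemctl restart redpanda
sudo systemctl status redpanda --no-pager
journalctl -u redpanda -f   # watch startup logs for recurring errors
```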
- Cluster Stability:
  - Controller quorum: if controller quorum is lost, prioritize bringing the controller nodes back online; this may require resolving issues on those specific nodes (see the health checks below).
  - Network issues: verify robust, low-latency network connectivity between all Redpanda brokers. Check switches, firewalls, and MTU settings.
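A quick way to gauge quorum and broker reachability, assuming `rpk` is configured to reach the cluster and the default internal RPC port (33145) and admin API port (9644) are in use:

```bash
# Cluster-level view: controller ID, nodes down, leaderless/under-replicated partitions
rpk cluster health

# Broker membership and liveness as seen by the admin API
rpk redpanda admin brokers list

# Basic reachability of a peer broker's internal RPC and admin API ports (placeholder host)
nc -vz <peer-broker-address> 33145
nc -vz <peer-broker-address> 9644

# Check for MTU mismatches between brokers (1472 = 1500 MTU minus ICMP/IP headers)
ping -M do -s 1472 -c 3 <peer-broker-address>
```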
- Recovery Procedures (Consult Redpanda Documentation):
  - If a node is permanently lost, follow the official Redpanda procedures for removing the dead node from the cluster and replacing it; this will trigger data re-replication.
  - Monitor partition health and re-replication progress (`rpk cluster partitions list`, `rpk cluster health`); see the sketch below.
  - If leadership elections are failing, ensure that a majority of replicas for the affected partitions are online and healthy.
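A hedged sketch of the removal-and-monitoring flow, assuming the dead broker's node ID is known (shown as the placeholder `<node-id>`) and that your Redpanda/rpk version supports these subcommands; always cross-check the official decommissioning documentation for your version before acting:

```bash
# Identify the node ID of the dead broker
rpk redpanda admin brokers list

# Remove the permanently lost broker; its replicas will be re-created on healthy nodes
rpk redpanda admin brokers decommission <node-id>

# Track the decommission and overall cluster health until it reports healthy
rpk redpanda admin brokers decommission-status <node-id>
rpk cluster health

# Inspect partition state (leaders, replicas) while re-replication proceeds
rpk cluster partitions list --all
```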
- Preventative Measures:
  - Implement comprehensive monitoring of Redpanda metrics and system-level metrics (disk, CPU, memory, network); a sample scrape setup follows this list.
  - Regularly review Redpanda logs for warnings and errors.
  - Ensure sufficient disk space and I/O capacity.
  - Maintain up-to-date Redpanda versions.
  - Test disaster recovery procedures.
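As a starting point for monitoring, Redpanda exposes Prometheus-format metrics on its admin API (default port 9644). One option, sketched below, assumes `rpk` can reach the brokers and that Prometheus is the monitoring stack in use; the output file name and placeholder broker address are illustrative:

```bash
# Generate a Prometheus scrape config for the cluster's metrics endpoints
rpk generate prometheus-config > redpanda-scrape.yaml

# Spot-check the public metrics endpoint manually on one broker (default admin API port)
curl -s http://<broker-address>:9644/public_metrics | head -20
```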