CRE-2025-0102
Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted
Severity: High
Impact: 0/10
Mitigation: 0/10
Description
- The Redpanda streaming data platform is experiencing a severe, cascading failure.
- This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down.
- Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability.
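
The quorum condition in the description can be made concrete. Redpanda replicates the controller and each partition with Raft, which needs a strict majority of replicas to be reachable before a leader can be elected or writes committed. The following is a minimal Python sketch of that arithmetic only; the names `majority` and `has_quorum` are illustrative and are not part of any Redpanda API.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def has_quorum(replicas: int, healthy: int) -> bool:
    """True if enough replicas are healthy to elect a leader and commit writes."""
    if not 0 <= healthy <= replicas:
        raise ValueError("healthy must be between 0 and replicas")
    return healthy >= majority(replicas)


if __name__ == "__main__":
    # Hypothetical replica counts, chosen only to illustrate the rule.
    for replicas, healthy in [(3, 2), (3, 1), (5, 3), (5, 2)]:
        state = "quorum held" if has_quorum(replicas, healthy) else "quorum lost"
        print(f"{healthy}/{replicas} replicas healthy: {state}")
```

Under this rule a three-replica group tolerates one failed node, while a second failure leaves it without a majority, which is the point at which controller operations and partition leadership elections stall.
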
Mitigation
- **Initial Triage & Isolation:**
  - Identify the specific node(s) reporting critical errors (e.g., I/O errors, shutdown messages) from Redpanda logs.
  - Check basic system health on affected nodes: `dmesg`, disk space (`df -h`), memory usage (`free -m`), CPU load (`top` or `htop`), network connectivity (`ping`, `ip addr`).
- **Address Node-Specific Failures:**
  - **Disk Issues:** If I/O errors or "No space left on device" occur, check disk health (e.g., `smartctl`), free up space, or prepare for disk replacement.
  - **Node Shutdowns:** Investigate Redpanda logs on the failed node for the root cause of the shutdown.
  - Attempt to restart the Redpanda service on the affected node if the underlying issue is resolved.
- **Cluster Stability:**
  - **Controller Quorum:** If controller quorum is lost, prioritize bringing controller nodes back online. This may require resolving issues on those specific nodes.
  - **Network Issues:** Verify robust network connectivity and low latency between all Redpanda brokers. Check switches, firewalls, and MTU settings.
- **Recovery Procedures (Consult Redpanda Documentation):**
  - If a node is permanently lost, follow official Redpanda procedures for removing the dead node from the cluster and replacing it. This will trigger data re-replication.
  - Monitor partition health and re-replication progress (`rpk cluster partitions list`, `rpk cluster health`).
  - If leadership elections are failing, ensure a majority of replicas for those partitions are online and healthy.
- **Preventative Measures:**
  - Implement comprehensive monitoring for Redpanda metrics and system-level metrics (disk, CPU, memory, network).
  - Regularly review Redpanda logs for warnings or errors.
  - Ensure sufficient disk space and I/O capacity.
  - Maintain up-to-date Redpanda versions.
  - Test disaster recovery procedures.
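
The disk-space check from the triage and prevention steps above can be scripted so it runs on every broker instead of relying on a manual `df -h`. The following is a rough Python sketch using only the standard library. The 10% free-space threshold and the helper names are assumptions made for illustration, and the `/var/lib/redpanda/data` path is assumed to be the data directory; adjust it to match the node's actual configuration.

```python
import shutil


def free_ratio(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the filesystem that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def is_low_on_space(total_bytes: int, free_bytes: int,
                    min_free_ratio: float = 0.10) -> bool:
    """True if the free fraction is below the (assumed) alert threshold."""
    return free_ratio(total_bytes, free_bytes) < min_free_ratio


def check_path(path: str, min_free_ratio: float = 0.10) -> bool:
    """Check the filesystem that holds `path`; True means it is low on space."""
    usage = shutil.disk_usage(path)
    return is_low_on_space(usage.total, usage.free, min_free_ratio)


if __name__ == "__main__":
    # Assumed data directory; replace with the path configured on the node.
    data_dir = "/var/lib/redpanda/data"
    try:
        low = check_path(data_dir)
    except OSError as exc:
        print(f"could not inspect {data_dir}: {exc}")
    else:
        print(f"{data_dir}: {'LOW on space' if low else 'ok'}")
```

A check like this only covers the "No space left on device" case. I/O errors from failing hardware still need `dmesg` and `smartctl`, as listed above.
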