CRE-2025-0102
Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted
Severity: High
Description
- The Redpanda streaming data platform is experiencing a severe, cascading failure.
- This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down.
- Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability.
Cause
- Persistent hardware failures on a broker node (e.g., disk I/O errors, failing disk, NIC issues).
- A Redpanda broker process crashing repeatedly due to software bugs, resource exhaustion (OOM), or critical misconfiguration (a quick OOM check is sketched after this list).
- Severe network partitioning that isolates nodes or groups of nodes from each other, preventing Raft consensus.
- Failure of critical system resources (e.g., full disk preventing writes, insufficient memory).
- Cascading effects where an initial failure on one node triggers instability across other nodes.
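For the resource-exhaustion cause above, a quick check for kernel OOM kills on a suspect node might look like the following. This is a minimal sketch that assumes a Linux host with journald available; the exact log wording can vary by kernel version.

```bash
# Look for the kernel OOM killer terminating the Redpanda (or any other) process
dmesg -T | grep -i 'out of memory'
journalctl -k --since "24 hours ago" | grep -iE 'oom|killed process'
```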
Mitigation
- Initial Triage & Isolation:
  - Identify the specific node(s) reporting critical errors (e.g., I/O errors, shutdown messages) in the Redpanda logs.
  - Check basic system health on the affected nodes: kernel messages (`dmesg`), disk space (`df -h`), memory usage (`free -m`), CPU load (`top` or `htop`), and network connectivity (`ping`, `ip addr`); see the example commands below.
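A minimal triage sketch for a single affected node, assuming a systemd-managed install with the service named `redpanda` (adjust the service name, log access, and the placeholder peer address for your environment, e.g., Kubernetes pods):

```bash
# Kernel-level errors (disk I/O, NVMe/SATA resets, NIC problems)
dmesg -T | grep -iE 'error|i/o|nvme|ata|sd[a-z]' | tail -50

# Disk space and memory headroom
df -h
free -m

# CPU load and hot processes
top -b -n 1 | head -20

# Network interfaces and reachability of a peer broker
ip addr
ping -c 3 <peer-broker-address>   # placeholder: substitute a real broker host

# Recent Redpanda service logs (systemd-managed installs)
journalctl -u redpanda --since "1 hour ago" --no-pager | tail -100
```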
- Address Node-Specific Failures:
  - Disk issues: if I/O errors or "No space left on device" messages appear, check disk health (e.g., with `smartctl`), free up space, or prepare to replace the disk.
  - Node shutdowns: investigate the Redpanda logs on the failed node to find the root cause of the shutdown.
  - Restart the Redpanda service on the affected node once the underlying issue is resolved (example below).
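A sketch of the disk checks and a service restart, assuming a bare-metal or VM install managed by systemd, a data disk at `/dev/sdb` (placeholder device), and the default data directory `/var/lib/redpanda/data` (adjust if changed):

```bash
# SMART health summary for the suspect data disk (placeholder device)
sudo smartctl -H /dev/sdb
sudo smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrectable'

# Find what is consuming space in the Redpanda data directory (default path)
sudo du -sh /var/lib/redpanda/data/* | sort -rh | head -20

# Restart the broker only after the underlying cause is fixed
sudo systemctl restart redpanda
sudo systemctl status redpanda --no-pager
journalctl -u redpanda -f   # watch startup logs for recurring errors
```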
- Cluster Stability:
  - Controller quorum: if controller quorum is lost, prioritize bringing the controller nodes back online; this may require resolving issues on those specific nodes (see the health checks below).
  - Network issues: verify robust, low-latency network connectivity between all Redpanda brokers. Check switches, firewalls, and MTU settings.
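A quick way to gauge quorum and broker reachability, assuming `rpk` is configured to reach the cluster and the default internal RPC port (33145) and admin API port (9644) are in use:

```bash
# Cluster-level view: controller ID, nodes down, leaderless/under-replicated partitions
rpk cluster health

# Broker membership and liveness as seen by the admin API
rpk redpanda admin brokers list

# Basic reachability of a peer broker's internal RPC and admin API ports (placeholder host)
nc -vz <peer-broker-address> 33145
nc -vz <peer-broker-address> 9644

# Check for MTU mismatches between brokers (1472 = 1500 MTU minus ICMP/IP headers)
ping -M do -s 1472 -c 3 <peer-broker-address>
```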
- Recovery Procedures (Consult Redpanda Documentation):
  - If a node is permanently lost, follow the official Redpanda procedures for removing the dead node from the cluster and replacing it; this will trigger data re-replication.
  - Monitor partition health and re-replication progress (`rpk cluster partitions list`, `rpk cluster health`); see the sketch below.
  - If leadership elections are failing, ensure that a majority of replicas for the affected partitions are online and healthy.
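A hedged sketch of the removal-and-monitoring flow, assuming the dead broker's node ID is known (shown as the placeholder `<node-id>`) and that your Redpanda/rpk version supports these subcommands; always cross-check the official decommissioning documentation for your version before acting:

```bash
# Identify the node ID of the dead broker
rpk redpanda admin brokers list

# Remove the permanently lost broker; its replicas will be re-created on healthy nodes
rpk redpanda admin brokers decommission <node-id>

# Track the decommission and overall cluster health until it reports healthy
rpk redpanda admin brokers decommission-status <node-id>
rpk cluster health

# Inspect partition state (leaders, replicas) while re-replication proceeds
rpk cluster partitions list --all
```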
- Preventative Measures:
  - Implement comprehensive monitoring of Redpanda metrics and system-level metrics (disk, CPU, memory, network); a sample scrape setup follows this list.
  - Regularly review Redpanda logs for warnings and errors.
  - Ensure sufficient disk space and I/O capacity.
  - Maintain up-to-date Redpanda versions.
  - Test disaster recovery procedures.
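As a starting point for monitoring, Redpanda exposes Prometheus-format metrics on its admin API (default port 9644). One option, sketched below, assumes `rpk` can reach the brokers and that Prometheus is the monitoring stack in use; the output file name and placeholder broker address are illustrative:

```bash
# Generate a Prometheus scrape config for the cluster's metrics endpoints
rpk generate prometheus-config > redpanda-scrape.yaml

# Spot-check the public metrics endpoint manually on one broker (default admin API port)
curl -s http://<broker-address>:9644/public_metrics | head -20
```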