CRE-2025-0091
Redpanda Consumer Mass Disconnect → Coordinator Failure
Severity: Critical
Impact: 10/10
Mitigation: 7/10
Description
Detects a high-severity failure in which mass consumer disconnections overwhelm Redpanda's group coordinator.
- Multiple consumers leave their consumer groups simultaneously
- The coordinator becomes unresponsive (NodeNotReadyError)
- MemberIdRequiredError indicates coordinator state corruption
- Can lead to a complete halt in message processing
Mitigation
IMMEDIATE:
- Restart affected consumer applications in small batches
- Monitor Redpanda coordinator health: `rpk cluster info`
- Check for coordinator election: `rpk group describe <group-name>`
- Reduce consumer concurrency during recovery

RECOVERY ACTIONS:
- Restart Redpanda if the coordinator remains unresponsive
- Clear consumer group state: `rpk group delete <group-name>` (this also discards the group's committed offsets)
- Implement graceful consumer shutdown with staggered delays
- Monitor system resources during consumer scaling

PREVENTION STRATEGIES:
- Implement a circuit breaker pattern for consumer creation
- Add jitter to consumer startup/shutdown timing
- Monitor consumer group rebalance frequency
- Set appropriate values for `session.timeout.ms` and `heartbeat.interval.ms`
- Set consumer group instance IDs (`group.instance.id`) for better tracking
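
The batching, jitter, and circuit-breaker items above can be combined in a small restart helper. The following is a minimal Python sketch, not part of this rule or of any Redpanda tooling. The names `staggered_restart`, `CircuitBreaker`, `jittered_delay`, and the `restart_one` callback are hypothetical, and the batch size, delays, and failure thresholds are placeholder values that would need tuning for a real deployment.

```python
import random
import time
from typing import Callable, Iterator, List, Optional, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def jittered_delay(base_s: float, jitter_s: float, rng: random.Random) -> float:
    """Return base_s plus a uniform random offset in [0, jitter_s]."""
    return base_s + rng.uniform(0.0, jitter_s)


class CircuitBreaker:
    """Blocks attempts after `max_failures` consecutive failures until
    `cooldown_s` seconds have elapsed (hypothetical helper)."""

    def __init__(self, max_failures: int, cooldown_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, permit a trial attempt (half-open).
        return self._clock() - self._opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def staggered_restart(consumers: Sequence[str],
                      restart_one: Callable[[str], bool],
                      batch_size: int = 2,
                      base_s: float = 5.0,
                      jitter_s: float = 3.0,
                      breaker: Optional[CircuitBreaker] = None,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Restart consumers in small batches with a jittered pause between
    batches; return the names that restarted successfully."""
    rng = rng or random.Random()
    breaker = breaker or CircuitBreaker(max_failures=3, cooldown_s=60.0)
    restarted: List[str] = []
    for batch in batches(consumers, batch_size):
        for name in batch:
            if not breaker.allow():
                # Too many consecutive failures: stop and let the coordinator recover.
                return restarted
            ok = restart_one(name)  # assumed to return True on success
            breaker.record(ok)
            if ok:
                restarted.append(name)
        sleep(jittered_delay(base_s, jitter_s, rng))
    return restarted
```

In practice `restart_one` would wrap whatever actually restarts a consumer (a process supervisor call, an orchestrator API, and so on). Injecting `sleep`, `clock`, and `rng` keeps the helper testable without real waits.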