CRE-2025-0091

Redpanda Consumer Mass Disconnect → Coordinator Failure
Severity: Critical
Impact: 10/10
Mitigation: 7/10

Description

Detects a high-severity failure in which mass consumer disconnections overwhelm Redpanda's group coordinator.

- Multiple consumers simultaneously leave consumer groups
- Coordinator becomes unresponsive (NodeNotReadyError)
- MemberIdRequiredError indicates coordinator state corruption
- Can lead to complete message processing halt
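The symptoms above can be checked with a simple scan of consumer logs. The following is a minimal sketch, not the rule's actual detection logic: it only assumes that the two error names from the description appear verbatim in log lines. The function names and the thresholds (`min_node_not_ready`, `min_member_id_required`) are hypothetical and would need tuning for a real deployment.

```python
from collections import Counter

# Error names are taken from the description above; nothing else about the
# log format is assumed.
COORDINATOR_ERRORS = ("NodeNotReadyError", "MemberIdRequiredError")


def count_coordinator_errors(log_lines):
    """Count how many log lines mention each coordinator-related error name."""
    counts = Counter()
    for line in log_lines:
        for name in COORDINATOR_ERRORS:
            if name in line:
                counts[name] += 1
    return counts


def looks_like_mass_disconnect(log_lines, min_node_not_ready=3,
                               min_member_id_required=1):
    """Flag a window of log lines that shows both symptoms together.

    The thresholds are illustrative assumptions, not values from the rule.
    """
    counts = count_coordinator_errors(log_lines)
    return (counts["NodeNotReadyError"] >= min_node_not_ready
            and counts["MemberIdRequiredError"] >= min_member_id_required)
```

Requiring both errors in the same window reflects the description's pairing of an unresponsive coordinator with `MemberIdRequiredError`; either error on its own can also occur during ordinary rebalances.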

Mitigation

IMMEDIATE:
- Restart affected consumer applications in small batches
- Monitor Redpanda coordinator health: `rpk cluster info`
- Check for coordinator election: `rpk group describe <group-name>`
- Reduce consumer concurrency during recovery

RECOVERY ACTIONS:
- Restart Redpanda if coordinator remains unresponsive
- Clear consumer group state: `rpk group delete <group-name>`
- Implement graceful consumer shutdown with staggered delays
- Monitor system resources during consumer scaling

PREVENTION STRATEGIES:
- Implement circuit breaker pattern for consumer creation
- Add jitter to consumer startup/shutdown timing
- Monitor consumer group rebalance frequency
- Set appropriate `session.timeout.ms` and `heartbeat.interval.ms`
- Use consumer group instance IDs for better tracking
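Two of the items above, restarting in small batches and adding jitter to startup timing, can be combined into a single restart plan. The sketch below is one possible approach under stated assumptions: the helper names, batch size, and delay values are hypothetical, and how each consumer is actually restarted is left to the deployment. The last helper encodes the Kafka client guidance that `heartbeat.interval.ms` should be no higher than one third of `session.timeout.ms`.

```python
import random


def batched(items, batch_size):
    """Split items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def restart_schedule(consumers, batch_size=2, base_delay_s=10.0,
                     jitter_s=5.0, rng=None):
    """Return (start_offset_seconds, batch) pairs for a staggered restart.

    Each batch starts base_delay_s plus a random jitter in [0, jitter_s]
    after the previous one, so consumers do not rejoin the group at once.
    All default values are illustrative assumptions.
    """
    rng = rng or random.Random()
    schedule = []
    offset = 0.0
    for batch in batched(list(consumers), batch_size):
        schedule.append((offset, batch))
        offset += base_delay_s + rng.uniform(0.0, jitter_s)
    return schedule


def timeouts_are_consistent(session_timeout_ms, heartbeat_interval_ms):
    """Check that the heartbeat interval is at most 1/3 of the session timeout."""
    return 0 < heartbeat_interval_ms * 3 <= session_timeout_ms
```

A caller would iterate over the schedule, wait until each offset, and restart that batch while watching `rpk group describe <group-name>` for the group to settle before moving on.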

References