CRE-2025-0082

NATS JetStream HA failures: monitor goroutine, consumer stalls, and unsynced replicas
High
Mitigation: 8/10


NATS JetStream Raft Ack Deadlock Unsynced Replica

Description

Detects high-availability failures in NATS JetStream clusters caused by:

1. **Monitor goroutine failure**: after node restarts, the Raft group fails to elect a leader.
2. **Consumer deadlock**: consumers using DeliverPolicy=LastPerSubject and AckPolicy=Explicit with a low MaxAckPending stall.
3. **Unsynced replicas**: object store replication appears healthy, but data is lost or inconsistent between nodes.

These issues lead to invisible data loss, stalled consumers, or stream unavailability.
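The unsynced-replica case is hard to spot because each node reports itself healthy. One way to picture the check is to compare what each replica reports for the same stream. The Go sketch below is illustrative only: `ReplicaState` and `FindUnsynced` are hypothetical names, not part of NATS, and the fields are assumed to have been collected from each server's monitoring output while the stream is quiescent (replicas on a busy stream can lag briefly without being broken).

```go
package main

// ReplicaState is a hypothetical snapshot of one server's view of a stream.
// The field set is an assumption chosen for illustration.
type ReplicaState struct {
	Server   string // server name
	Messages uint64 // number of messages the replica reports
	Bytes    uint64 // stored bytes the replica reports
	LastSeq  uint64 // last sequence number the replica reports
}

// FindUnsynced treats the first entry as the reference (for example the
// stream leader) and returns the names of replicas whose reported state
// differs from it in any field.
func FindUnsynced(replicas []ReplicaState) []string {
	if len(replicas) < 2 {
		return nil // nothing to compare against
	}
	ref := replicas[0]
	var unsynced []string
	for _, r := range replicas[1:] {
		if r.Messages != ref.Messages || r.Bytes != ref.Bytes || r.LastSeq != ref.LastSeq {
			unsynced = append(unsynced, r.Server)
		}
	}
	return unsynced
}
```

Called with three snapshots where only the third disagrees with the first, `FindUnsynced` returns a one-element slice holding that server's name. A real check would also need to tolerate in-flight replication, which this sketch does not attempt.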

Mitigation

- Always enable JetStream before ReadyForConnections.
- Use ProcessConfigString instead of on-the-fly JetStream enablement.
- Avoid MaxAckPending < 100 with DeliverPolicy=LastPerSubject.
- Run regular `nats stream-check --unsynced` checks.
- To recover the object store:
  - Scale the stream to replicas=1 and back.
  - Or remove the faulty replica via `nats stream cluster ... peer-remove`.
- Monitor for raftz and jsz inconsistencies in tooling.
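The MaxAckPending guidance can be turned into a simple configuration lint. The Go sketch below is a minimal illustration under stated assumptions: `ConsumerSettings` and `RiskyConsumer` are hypothetical names rather than NATS client types, the policy strings follow the JetStream API spellings `last_per_subject` and `explicit`, and non-positive MaxAckPending values are assumed to mean "server default" or "unlimited" and are therefore not flagged.

```go
package main

// ConsumerSettings is a hypothetical, simplified mirror of the consumer
// options named in the mitigation list. It is not a NATS client type.
type ConsumerSettings struct {
	DeliverPolicy string // e.g. "last_per_subject"
	AckPolicy     string // e.g. "explicit"
	MaxAckPending int    // non-positive assumed to mean default/unlimited
}

// minSafeAckPending is the threshold taken from the mitigation guidance.
const minSafeAckPending = 100

// RiskyConsumer reports whether the settings match the stall-prone
// combination: LastPerSubject delivery, explicit acks, and an explicitly
// configured MaxAckPending below the threshold.
func RiskyConsumer(c ConsumerSettings) bool {
	return c.DeliverPolicy == "last_per_subject" &&
		c.AckPolicy == "explicit" &&
		c.MaxAckPending > 0 &&
		c.MaxAckPending < minSafeAckPending
}
```

A check like this could run in CI against declared consumer definitions, so the risky combination is caught before deployment rather than after a consumer stalls.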
