CRE-2025-0020

Self-hosted PostgreSQL HA: WAL Streaming & HA Controller Crisis (Replication Slot Loss, Disk Full, Etcd Quorum Failure)
Severity: High
Impact: 10/10
Mitigation: 6/10

Description

Detects high-severity failures in self-hosted PostgreSQL high-availability clusters managed by Patroni, Zalando's Patroni-based tooling, or similar HA controllers.

This rule targets catastrophic conditions that break replication or cluster consensus:

  • WAL streaming failures due to missing replication slots (usually after disk-full or crash events)
  • Persistent errors resolving the endpoints of the HA controller's configuration store (etcd/Consul), and loss of that store's quorum
  • Disk saturation leading to WAL write errors and replication breakage

Cause

  • Replication slot(s) "patroniN" are missing or cannot be created because the disk is full or data is corrupted
  • PostgreSQL cannot stream WAL (Write-Ahead Log) to replicas, causing FATAL errors
  • DNS/name resolution failures for the HA controller's configuration store (etcd/Consul), or a full outage of that cluster (quorum lost)
  • A full disk on the primary prevents WAL writes or checkpointing
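
To make the causes above concrete, here is a minimal sketch of how log lines might be sorted into these failure classes. The regular expressions are assumptions for illustration and are not the patterns this rule actually ships with; the category names and the `classify` helper are hypothetical.

```python
import re

# Illustrative patterns only (assumed, not taken from the rule definition).
# They match message fragments commonly emitted by PostgreSQL and by the
# system resolver; tune them against your own logs before relying on them.
PATTERNS = {
    "replication_slot_missing": re.compile(r'replication slot "[^"]+" does not exist'),
    "disk_full": re.compile(r"No space left on device"),
    "dcs_name_resolution": re.compile(
        r"(Name or service not known|Temporary failure in name resolution)"
    ),
}


def classify(line: str) -> list[str]:
    """Return the names of every failure class whose pattern matches the line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


if __name__ == "__main__":
    sample = (
        "FATAL:  could not start WAL streaming: "
        'ERROR:  replication slot "patroni1" does not exist'
    )
    print(classify(sample))
```

A single line can match more than one class, which is why the helper returns a list rather than a single label.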

Mitigation

PREVENTION:

  • Monitor disk usage on all PostgreSQL nodes, especially WAL and archive directories
  • Set up alerting for replication lag and missing replication slots
  • Run the HA controller's configuration store (etcd/Consul) on redundant, reliable nodes
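
The first two prevention checks can be sketched as follows. This is a minimal example under stated assumptions: the 85% threshold, the `/var/lib/postgresql` path, and the expected slot names are hypothetical, and the SQL string is only shown, not executed, so no database driver is required. `pg_replication_slots` is the standard PostgreSQL view listing existing slots.

```python
import shutil

# Query an operator or monitoring agent could run to list existing slots.
SLOT_QUERY = "SELECT slot_name, active FROM pg_replication_slots;"

# Hypothetical alert threshold; choose one that suits your WAL volume.
DISK_ALERT_PERCENT = 85.0


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def missing_slots(expected: set[str], existing: set[str]) -> set[str]:
    """Slots that should exist (per cluster membership) but were not found."""
    return expected - existing


if __name__ == "__main__":
    # "." stands in for the data or WAL directory, e.g. /var/lib/postgresql.
    percent = disk_usage_percent(".")
    if percent >= DISK_ALERT_PERCENT:
        print(f"ALERT: disk at {percent:.1f}%")

    # `existing` would normally come from running SLOT_QUERY on the primary.
    gone = missing_slots({"patroni1", "patroni2"}, {"patroni1"})
    if gone:
        print(f"ALERT: missing replication slots: {sorted(gone)}")
```

Alerting on disk usage before it reaches 100% matters most here, because a full disk is also what tends to cause the slot loss in the first place.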

RESPONSE:

  • Restore or recreate the missing replication slots
  • Free up disk space, then restart the affected PostgreSQL instances
  • Restore etcd/Consul cluster quorum; check container and network status
  • Perform a manual failover if automatic recovery fails
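
As a rough illustration of the response steps, the sketch below builds the SQL for recreating a physical replication slot and lists a few commands commonly used for follow-up checks. It assumes physical (streaming) slots; the helper name `recreate_slot_sql` is hypothetical, and in many setups Patroni recreates its own slots once the node is healthy, so manual creation is a fallback rather than the default path.

```python
import re

# PostgreSQL slot names may contain only lowercase letters, digits, and
# underscores; validating first also keeps the name safe to embed in SQL.
_SLOT_NAME = re.compile(r"^[a-z0-9_]+$")

# Commands often used to inspect cluster state after the fix (shown as
# strings only; run them by hand with your own endpoints and config).
FOLLOW_UP_COMMANDS = [
    "etcdctl endpoint health",
    "patronictl list",
]


def recreate_slot_sql(slot_name: str) -> str:
    """Build the statement that recreates a physical replication slot."""
    if not _SLOT_NAME.match(slot_name):
        raise ValueError(f"invalid replication slot name: {slot_name!r}")
    return f"SELECT pg_create_physical_replication_slot('{slot_name}');"


if __name__ == "__main__":
    for name in ["patroni1", "patroni2"]:
        print(recreate_slot_sql(name))
    for command in FOLLOW_UP_COMMANDS:
        print(command)
```

Free disk space before recreating any slot: slot creation itself can fail while the disk is still full.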

References