PREQUEL-2025-0020
Too many replicas scheduled on the same node
Severity: High
Impact: 8/10
Mitigation: 2/10
Description
80% or more of a deployment's replica pods are scheduled on the same Kubernetes node. If that node shuts down or fails, the service loses most or all of its replicas at once and suffers an outage or severe degradation.
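As a rough illustration of the condition, the concentration can be detected with kube-state-metrics and a Prometheus alerting rule. The sketch below is an assumption, not this rule's official detector: it groups pods by their owning ReplicaSet (a stand-in for the Deployment), and the group name, alert name, and `for` duration are placeholders.

```yaml
# Hypothetical Prometheus alerting rule (assumes kube-state-metrics is scraped).
# kube_pod_info exposes each pod's node plus its owner via
# created_by_kind / created_by_name, which we use as a proxy for the Deployment.
groups:
  - name: replica-distribution                  # placeholder group name
    rules:
      - alert: ReplicasConcentratedOnOneNode    # placeholder alert name
        expr: |
          (
            # largest share of a workload's pods found on any single node
            max by (namespace, created_by_name) (
              count by (namespace, created_by_name, node) (
                kube_pod_info{created_by_kind="ReplicaSet"}
              )
            )
            /
            count by (namespace, created_by_name) (
              kube_pod_info{created_by_kind="ReplicaSet"}
            )
          ) >= 0.8
          # ignore single-replica workloads, which are always at 100% by definition
          and count by (namespace, created_by_name) (
            kube_pod_info{created_by_kind="ReplicaSet"}
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "80% or more of a workload's replicas are running on a single node"
```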
Mitigation
- Add a **`podAntiAffinity`** rule or a **`topologySpreadConstraints`** stanza to the Deployment so that at most a configurable share of replicas can land on the same `kubernetes.io/hostname` (or `topology.kubernetes.io/zone`) value (see the manifest sketch after this list).
- Review and relax **`nodeSelector` / `nodeAffinity` / taints & tolerations** so replicas have more than one viable node.
- **Scale the node pool** (or uncordon nodes) so enough schedulable nodes exist for the spread constraints to place replicas on distinct hosts.
- Define a **`PodDisruptionBudget`** to guarantee a minimum number of replicas remains available during voluntary disruptions and drains (see the second sketch below).
- Continuously **monitor replica distribution** (e.g., via PromQL or a controller) and alert when the ≥80% threshold is crossed; auto-trigger a reschedule if needed.
- During maintenance, **cordon & drain nodes gradually** so the scheduler is forced to redistribute Pods rather than piling replacements onto the same host.
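A minimal sketch of the first mitigation, assuming a hypothetical Deployment named `web` with the label `app: web` (names, replica count, and image are placeholders): a hard `topologySpreadConstraints` entry keeps replicas evenly spread across nodes, and a soft `podAntiAffinity` term is shown as a complementary option.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical Deployment name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Hard constraint: the replica-count difference between any two nodes
      # may not exceed maxSkew; otherwise the new pod stays Pending.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # use topology.kubernetes.io/zone to spread across zones
          whenUnsatisfiable: DoNotSchedule      # ScheduleAnyway makes this a soft preference
          labelSelector:
            matchLabels:
              app: web
      # Soft complement: prefer nodes that do not already run a replica.
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.27      # placeholder image
```

With `DoNotSchedule`, a pod that would violate the skew stays Pending, so pair this with enough schedulable nodes (third bullet above); `ScheduleAnyway` trades strictness for schedulability.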
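And a matching `PodDisruptionBudget`, again assuming the hypothetical `app: web` selector; tune `minAvailable` (or `maxUnavailable`) to the workload's replica count.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                  # hypothetical name
spec:
  minAvailable: 2                # at least 2 replicas must survive voluntary disruptions
  selector:
    matchLabels:
      app: web
```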