CRE-2025-0061
Karpenter Stability Issues on EKS During Leader Election
Medium | Impact: 7/10 | Mitigation: 4/10
Description
- EKS generally copes with steady, predictable scaling, but can become unstable during large-scale autoscaling events, when many workloads and nodes start up or shut down at the same time.
- This instability affects components that implement leader election using the Kubernetes API, such as:
  - aws-load-balancer-controller
  - karpenter
  - keda-operator
  - ebs-csi-controller
  - efs-csi-controller
Cause
- During periods of high cluster activity, the etcd database can enter a defragmentation phase. While it runs, etcd is temporarily read-only, which blocks write operations such as the lease updates used for Kubernetes leader election.
- The Kubernetes API server becomes overwhelmed and cannot complete PUT requests in time, causing clients to time out.
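To make the failure mode concrete, here is a minimal sketch of how slow or failing lease updates turn into lost leadership. It is a simplified model, not the actual client-go logic: the `ElectionTiming` class and `renews_in_time` function are hypothetical, and the 15s/10s/2s values are commonly used leader election defaults that the affected components are only assumed to share.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ElectionTiming:
    """Leader election timings in seconds (assumed common defaults)."""
    lease_duration: float = 15.0  # how long followers wait before taking over
    renew_deadline: float = 10.0  # how long the leader keeps trying to renew
    retry_period: float = 2.0     # pause between renewal attempts


def renews_in_time(attempts: Iterable[Tuple[float, bool]],
                   timing: ElectionTiming = ElectionTiming()) -> bool:
    """Return True if one renewal succeeds before the renew deadline.

    Each attempt is (duration_seconds, succeeded). A blocked or overloaded
    API server shows up as long durations and failed attempts.
    """
    elapsed = 0.0
    for duration, succeeded in attempts:
        elapsed += duration
        if elapsed > timing.renew_deadline:
            # The deadline passed while waiting on the API server:
            # the leader gives up leadership.
            return False
        if succeeded:
            return True
        # Failed attempt: wait before trying again.
        elapsed += timing.retry_period
    # Ran out of attempts without a successful renewal.
    return False


if __name__ == "__main__":
    healthy = [(0.05, True)]
    overloaded = [(4.0, False), (4.0, False), (4.0, True)]
    print("healthy:", renews_in_time(healthy))
    print("overloaded:", renews_in_time(overloaded))
```

In this model a single fast PUT keeps the lease, while three 4-second attempts exceed the 10-second renew deadline even though the last one would have succeeded. A controller that loses its lease this way typically exits and restarts.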
Mitigation
- Use Kubernetes API Priority and Fairness (FlowSchema and PriorityLevelConfiguration) to prioritize leader election traffic during high load.
- Assign the `workload-high` priority level to requests from critical components such as the Karpenter controller.
- Monitor etcd size and schedule regular defragmentation to reduce unplanned contention.
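The API Priority and Fairness mitigation above could look roughly like the following sketch, which builds a FlowSchema manifest as JSON for use with `kubectl apply -f`. The schema name, the `karpenter` service account in `kube-system`, and the matching precedence of 1000 are illustrative assumptions, not values from this entry. Adjust them to your installation, and use `flowcontrol.apiserver.k8s.io/v1beta3` on clusters older than Kubernetes 1.29.

```python
import json


def leader_election_flow_schema(name: str,
                                service_account: str,
                                namespace: str,
                                priority_level: str = "workload-high",
                                matching_precedence: int = 1000) -> dict:
    """Build a FlowSchema that routes a service account's Lease traffic
    to the given priority level. All argument values are caller-supplied
    assumptions about the cluster."""
    return {
        "apiVersion": "flowcontrol.apiserver.k8s.io/v1",
        "kind": "FlowSchema",
        "metadata": {"name": name},
        "spec": {
            "priorityLevelConfiguration": {"name": priority_level},
            # Lower values are matched first; 1000 is an assumed value.
            "matchingPrecedence": matching_precedence,
            "distinguisherMethod": {"type": "ByUser"},
            "rules": [{
                "subjects": [{
                    "kind": "ServiceAccount",
                    "serviceAccount": {
                        "name": service_account,
                        "namespace": namespace,
                    },
                }],
                # Leader election reads and writes Lease objects.
                "resourceRules": [{
                    "verbs": ["get", "create", "update"],
                    "apiGroups": ["coordination.k8s.io"],
                    "resources": ["leases"],
                    "namespaces": ["*"],
                }],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; check the service account your Karpenter
    # deployment actually runs as.
    manifest = leader_election_flow_schema(
        name="karpenter-leader-election",
        service_account="karpenter",
        namespace="kube-system",
    )
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it creates the FlowSchema. Whether it takes effect depends on its precedence relative to the FlowSchemas already present in the cluster, which `kubectl get flowschemas` lists in matching order.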