CRE-2025-0061

Karpenter Stability Issues on EKS During Leader Election
Severity: Medium
Impact: 7/10
Mitigation: 4/10

Description

  • EKS handles steady, predictable scale, but can become unstable during large-scale autoscaling events, when many workloads and nodes start up or shut down at the same time.
  • This instability affects components that implement leader election using the Kubernetes API, such as:
      • aws-load-balancer-controller
      • karpenter
      • keda-operator
      • ebs-csi-controller
      • efs-csi-controller

Cause

  • During high cluster activity, the etcd database enters a defragmentation phase, making it temporarily read‑only and blocking write operations such as Kubernetes leader election updates.
  • The Kubernetes API server becomes overwhelmed and cannot complete PUT requests in time, so leader election clients time out while renewing their leases.
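
The effect of a blocked write path on a leader can be illustrated with a small model. This is a simplified sketch, not the actual client-go renew loop: it assumes a 10-second renew deadline and a 2-second retry period (illustrative defaults; a given controller may be configured differently) and assumes every lease update fails for the first `stall_seconds` after the last successful renewal. The function name `leader_survives` is hypothetical.

```python
# Simplified model of a leader-election renew loop (NOT client-go's real code).
# renew_deadline and retry_period are assumed values; check the flags of the
# controller you actually run.

def leader_survives(stall_seconds: float,
                    renew_deadline: float = 10.0,
                    retry_period: float = 2.0) -> bool:
    """Return True if some renew attempt succeeds before the deadline.

    The API server is assumed to fail every lease write for the first
    `stall_seconds` after the last successful renewal.
    """
    elapsed = 0.0
    while elapsed < renew_deadline:
        if elapsed >= stall_seconds:  # writes are accepted again
            return True
        elapsed += retry_period       # wait, then retry the lease update
    return False                      # deadline passed: leadership is given up


if __name__ == "__main__":
    for stall in (3.0, 8.0, 12.0):
        print(f"write stall of {stall:4.1f}s -> survives: {leader_survives(stall)}")
```

Under these assumptions, a write stall shorter than the renew deadline is absorbed by retries, while a longer stall makes the leader give up its lease, which typically shows up as a controller restart or a leadership change.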

Mitigation

  • Use Kubernetes API Priority and Fairness (FlowSchema and PriorityLevelConfiguration) to prioritize leader election traffic during high load.
  • Assign the `workload-high` priority level to requests from critical components such as the Karpenter controller.
  • Monitor etcd size and schedule regular defragmentation to reduce unplanned contention.
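
As a rough illustration of the first two points, the sketch below builds a FlowSchema manifest that routes lease requests from one service account to the `workload-high` priority level and prints it as JSON, which `kubectl apply -f -` accepts. The service account name and namespace (`karpenter` in `kube-system`), the matching precedence of 1000, and the helper name `build_flow_schema` are assumptions for illustration; adjust them to your installation. The `flowcontrol.apiserver.k8s.io/v1` API version is also an assumption, since older clusters serve only beta versions of this API.

```python
import json


def build_flow_schema(service_account: str, namespace: str,
                      priority_level: str = "workload-high",
                      precedence: int = 1000) -> dict:
    """Build a FlowSchema that prioritizes lease traffic for one service account.

    All argument values are supplied by the caller; nothing here is read
    from a live cluster.
    """
    return {
        "apiVersion": "flowcontrol.apiserver.k8s.io/v1",  # assumed; may be a beta version on older clusters
        "kind": "FlowSchema",
        "metadata": {"name": f"{service_account}-leader-election"},
        "spec": {
            "priorityLevelConfiguration": {"name": priority_level},
            # Lower numbers are matched first; 1000 is an arbitrary example value.
            "matchingPrecedence": precedence,
            "distinguisherMethod": {"type": "ByUser"},
            "rules": [{
                "subjects": [{
                    "kind": "ServiceAccount",
                    "serviceAccount": {"name": service_account,
                                       "namespace": namespace},
                }],
                # Leader election is stored in Lease objects, so only those are matched.
                "resourceRules": [{
                    "verbs": ["get", "create", "update"],
                    "apiGroups": ["coordination.k8s.io"],
                    "resources": ["leases"],
                    "namespaces": ["*"],
                }],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names: replace with the service account your controller uses.
    manifest = build_flow_schema("karpenter", "kube-system")
    print(json.dumps(manifest, indent=2))
```

Restricting the rule to `leases` keeps the rest of the controller's traffic at its existing priority, so only the small, latency-sensitive leader election requests are promoted.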

References