
CRE-2025-0112

AWS VPC CNI Node IP Pool Depletion Crisis
Severity: Critical
Impact: 10/10
Mitigation: 4/10


Description

Critical AWS VPC CNI node IP pool depletion detected, causing cascading pod scheduling failures. This pattern indicates severe subnet IP address exhaustion combined with ENI allocation failures, leading to a complete breakdown of cluster networking. The failure sequence shows ipamd errors, kubelet scheduling failures, and controller-level pod creation blocks that leave the cluster unable to deploy new workloads, scale existing services, or recover from node failures.

This is one of the most severe Kubernetes infrastructure failures, often requiring immediate manual intervention such as subnet expansion, secondary CIDR provisioning, or emergency workload termination to restore cluster functionality.
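The symptoms above typically surface in the aws-node (ipamd) logs, in kubelet sandbox-creation events, and as pods stuck in Pending. The commands below are an illustrative triage sketch; the exact error strings vary by VPC CNI and Kubernetes version, so adjust the grep patterns to match your logs.

```bash
# Illustrative triage sketch -- error strings vary by VPC CNI / Kubernetes version.

# 1. ipamd-side symptom: ENI/IP allocation failures in the aws-node DaemonSet.
kubectl logs -n kube-system -l k8s-app=aws-node --tail=500 \
  | grep -Ei 'InsufficientFreeAddressesInSubnet|failed to allocate|no available IP'

# 2. Kubelet/runtime-side symptom: pod sandbox creation failing for lack of an IP.
kubectl get events --all-namespaces --field-selector reason=FailedCreatePodSandBox \
  | grep -i 'assign an IP'

# 3. Scheduling-side symptom: pods stuck Pending across namespaces.
kubectl get pods --all-namespaces --field-selector status.phase=Pending
```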

Mitigation

IMMEDIATE EMERGENCY RESPONSE:
- Identify affected subnets: `aws ec2 describe-subnets --filters "Name=vpc-id,Values=$(aws eks describe-cluster --name CLUSTER --query cluster.resourcesVpcConfig.vpcId --output text)" --query 'Subnets[*].[SubnetId,AvailableIpAddressCount,CidrBlock]' --output table`
- Check ENI allocation status: `aws ec2 describe-network-interfaces --filters "Name=status,Values=in-use" --query 'length(NetworkInterfaces)'`
- Scale down non-critical workloads immediately: `kubectl scale deployment NON_CRITICAL_APP --replicas=0`
- Monitor VPC CNI daemon logs: `kubectl logs -n kube-system -l k8s-app=aws-node --follow`

RECOVERY ACTIONS (execute in order):
1. Associate a secondary VPC CIDR: `aws ec2 associate-vpc-cidr-block --vpc-id VPC_ID --cidr-block 100.64.0.0/16`
2. Create additional subnets with enhanced-discovery tags:
   ```bash
   i=1
   for az in a b c; do
     aws ec2 create-subnet \
       --vpc-id VPC_ID \
       --cidr-block 100.64.${i}.0/24 \
       --availability-zone us-west-2${az} \
       --tag-specifications 'ResourceType=subnet,Tags=[{Key=kubernetes.io/role/cni,Value=1},{Key=kubernetes.io/cluster/CLUSTER_NAME,Value=shared}]'
     i=$((i + 1))
   done
   ```
3. Enable prefix delegation for maximum IP efficiency (a capacity sketch follows this section): `kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true WARM_PREFIX_TARGET=1`
4. Optimize the warm pool configuration: `kubectl set env daemonset aws-node -n kube-system WARM_IP_TARGET=3 MINIMUM_IP_TARGET=1 WARM_ENI_TARGET=1`
5. Force a VPC CNI restart to discover the new subnets: `kubectl rollout restart daemonset/aws-node -n kube-system && kubectl rollout status daemonset/aws-node -n kube-system --timeout=300s`
6. Verify recovery: `kubectl get pods --all-namespaces | grep Pending && kubectl get nodes -o wide`

PREVENTION AND MONITORING:
- Implement subnet IP monitoring: CloudWatch alarm on `AvailableIpAddressCount < 50` (a monitoring sketch follows this section)
- Enable Enhanced Subnet Discovery (VPC CNI v1.18.0+): `kubectl set env daemonset aws-node -n kube-system ENABLE_SUBNET_DISCOVERY=true`
- Set up automated capacity planning with 6-month growth projections
- Configure the cluster autoscaler with IP-aware node provisioning
- Implement emergency runbooks for IP exhaustion scenarios
- Consider IPv6 adoption for long-term scalability (requires an IPv6-family cluster; the IP family cannot be changed after cluster creation): `kubectl set env daemonset aws-node -n kube-system ENABLE_IPv6=true`
- Monitor warm pool efficiency: `kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="WARM_IP_TARGET")].value}'`
- Set up automated secondary-CIDR provisioning triggers
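To gauge how much headroom prefix delegation (recovery step 3) actually buys, the arithmetic below estimates per-node pod capacity from an instance type's ENI and per-ENI IPv4 limits. The m5.large limits shown (3 ENIs, 10 IPv4 addresses per ENI) and the max-pods formula mirror AWS's published guidance, but treat this as a sketch and substitute the limits for your own instance types.

```bash
#!/usr/bin/env bash
# Sketch: estimate per-node pod capacity with and without prefix delegation.
# Limits below are for m5.large; look up your instance type with:
#   aws ec2 describe-instance-types --instance-types m5.large \
#     --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]'

ENIS=3
IPS_PER_ENI=10

# Standard secondary-IP mode: one IP per ENI is the primary, plus 2 host-network pods.
standard=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))

# Prefix delegation: each secondary-IP slot holds a /28 prefix (16 addresses).
prefix=$(( ENIS * (IPS_PER_ENI - 1) * 16 + 2 ))

echo "max pods (secondary IPs):     ${standard}"
echo "max pods (prefix delegation): ${prefix} (cap at the recommended 110/250 pods per node)"
```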
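The subnet-IP alarm in the prevention list needs a metric to alarm on, since the available-IP count comes from `ec2 describe-subnets` rather than a built-in per-subnet CloudWatch metric. Below is a minimal sketch that publishes it as a custom metric and creates one alarm per subnet; it assumes a scheduled job with `ec2:DescribeSubnets` and `cloudwatch:PutMetricData`/`PutMetricAlarm` permissions, and the `Custom/VPC` namespace and SNS topic ARN are illustrative placeholders.

```bash
#!/usr/bin/env bash
# Sketch: publish per-subnet available-IP counts as a custom CloudWatch metric
# and alarm when a subnet drops below 50 free addresses.
# "Custom/VPC", the metric name, and the SNS topic ARN are placeholders.

VPC_ID=$(aws eks describe-cluster --name CLUSTER \
  --query cluster.resourcesVpcConfig.vpcId --output text)

aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" \
  --query 'Subnets[*].[SubnetId,AvailableIpAddressCount]' --output text |
while read -r subnet_id available; do
  # Publish one datapoint per subnet (run this loop on a schedule).
  aws cloudwatch put-metric-data \
    --namespace Custom/VPC \
    --metric-name AvailableIpAddressCount \
    --dimensions SubnetId="${subnet_id}" \
    --value "${available}" \
    --unit Count

  # One alarm per subnet: fire when free IPs stay below 50 for 15 minutes.
  aws cloudwatch put-metric-alarm \
    --alarm-name "subnet-ip-low-${subnet_id}" \
    --namespace Custom/VPC \
    --metric-name AvailableIpAddressCount \
    --dimensions Name=SubnetId,Value="${subnet_id}" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 3 \
    --threshold 50 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:us-west-2:123456789012:ip-exhaustion-alerts
done
```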

References