
PREQUEL-2025-0092

AWS CNI intermittent runtime panics and failure to destroy pod network

Severity: High
Impact: 6/10
Mitigation: 4/10


Description

This rule fires when the kubelet reports a series of `FailedKillPod` / `KillPodSandboxError` events that contain `rpc error: code = Unknown desc = failed to destroy network for sandbox…` together with a **SIGSEGV / nil-pointer panic** from `routed-eni-cni-plugin/cni.go` or `PluginMainFuncsWithError`. These messages indicate that the Amazon VPC CNI plugin crashed while tearing down a Pod's network namespace, leaving the sandbox in an indeterminate state.
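To confirm the signature this rule keys on, both halves can usually be surfaced directly: the kubelet events from the API server, and the panic from the CNI plugin log on the affected node. A minimal sketch, assuming the Amazon VPC CNI's default log path (`/var/log/aws-routed-eni/plugin.log`):

```bash
# List the FailedKillPod events this rule matches on, cluster-wide
kubectl get events -A --field-selector reason=FailedKillPod \
  -o custom-columns=NS:.metadata.namespace,POD:.involvedObject.name,MSG:.message

# On the affected node, look for the companion nil-pointer panic in the
# CNI plugin log (default path for the Amazon VPC CNI)
sudo grep -E 'SIGSEGV|nil pointer dereference|PluginMainFuncsWithError' \
  /var/log/aws-routed-eni/plugin.log | tail -n 20
```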

Mitigation

1. **Upgrade the VPC CNI add-on** to the latest patched series
   (v1.19.5-eksbuild.* or newer). AWS explicitly recommends keeping the
   add-on current and using the managed add-on where possible.
2. **Roll out the fix safely** (a verification sketch follows this list):
   ```bash
   # managed add-on
   aws eks update-addon \
     --cluster-name <cluster> \
     --addon-name vpc-cni --addon-version v1.19.5-eksbuild.3
   # or self-managed
   kubectl set image daemonset/aws-node -n kube-system \
     aws-node=602401143452.dkr.ecr.$AWS_REGION.amazonaws.com/amazon-k8s-cni:v1.19.5
   ```
3. **Short-term relief** (see the node-level cleanup sketch after this list):
   * Force-delete stuck Pods: `kubectl delete pod <pod> --grace-period=0 --force`
   * Remove empty cache files (`/var/lib/cni/results/*-eth0`) and
     restart `aws-node` to reclaim IPs.
4. **Prevent recurrence**: avoid mass Pod churn during CNI
   rollouts, ensure the `aws-node` DaemonSet has adequate CPU/memory, and
   routinely run the `aws-cni-support.sh` script before raising AWS
   support cases.
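After the rollout in step 2, it is worth confirming that the nodes are actually running the new version. A minimal verification sketch, assuming the managed add-on and the default `aws-node` DaemonSet name:

```bash
# Managed add-on: confirm the version EKS reports for the add-on
aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni \
  --query 'addon.addonVersion' --output text

# Wait for the DaemonSet rollout to complete, then check the image in use
kubectl rollout status daemonset/aws-node -n kube-system
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```

For step 3, a node-level cleanup sketch, under the assumption that the stale entries are the zero-byte result files left behind by the crashed teardown; `<node>` is a placeholder for the affected worker:

```bash
# On the affected node: delete empty CNI result cache files
# (zero-byte files under /var/lib/cni/results)
sudo find /var/lib/cni/results -name '*-eth0' -size 0 -delete

# Restart the aws-node Pod on that node so IPAM re-syncs its IP accounting
kubectl delete pod -n kube-system -l k8s-app=aws-node \
  --field-selector spec.nodeName=<node>
```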
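Deleting the `aws-node` Pod rather than restarting the node keeps the disruption scoped to the CNI daemon; the DaemonSet controller recreates the Pod immediately, and on startup it re-reads the result cache that was just cleaned.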
