
Tag: Known Problem

This is a documented known problem with known mitigations.

ID · Title · Description · Category · Technology · Tags
CRE-2024-0007
Critical
Impact: 9/10
Mitigation: 8/10
RabbitMQ Mnesia overloaded
The underlying Erlang database, Mnesia, is overloaded (logs contain `WARNING Mnesia is overloaded`).
Category: Message Queue Problems · Technology: rabbitmq · Tags: Known Problem, RabbitMQ, Public
CRE-2024-0008
High
Impact: 9/10
Mitigation: 6/10
RabbitMQ memory alarm
A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables, and processes) has exceeded the configured `vm_memory_high_watermark`. While the alarm is active, the broker applies flow control, blocking publishers and pausing most ingress activity to protect itself from running out of RAM. A publisher-side detection sketch follows this entry.
Category: Message Queue Problems · Technology: rabbitmq · Tags: Known Problem, RabbitMQ, Public
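
A minimal publisher-side sketch of how this alarm surfaces, assuming the Python `pika` client (the entry does not prescribe a client) and a placeholder broker address and queue name: when the memory alarm triggers, the broker sends `connection.blocked`, which the callbacks below report.

```python
# Sketch: detect RabbitMQ memory-alarm flow control from the publisher side
# using pika (assumed client). Host and queue name are placeholders.
import pika

params = pika.ConnectionParameters(
    host="rabbitmq.example.internal",   # placeholder broker address
    blocked_connection_timeout=300,     # abort if the broker blocks us for 5+ minutes
)
connection = pika.BlockingConnection(params)

def on_blocked(conn, method_frame):
    # Fired when the broker sends connection.blocked, e.g. because the
    # vm_memory_high_watermark alarm is active; publishes will now stall.
    print(f"Broker blocked publishing: {method_frame.method.reason}")

def on_unblocked(conn, method_frame):
    print("Broker resumed publishing (alarm cleared)")

connection.add_on_connection_blocked_callback(on_blocked)
connection.add_on_connection_unblocked_callback(on_unblocked)

channel = connection.channel()
channel.queue_declare(queue="work")
channel.basic_publish(exchange="", routing_key="work", body=b"payload")
connection.close()
```

Registering a `blocked_connection_timeout` keeps a publisher from hanging indefinitely while the alarm is active.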
CRE-2024-0014
High
Impact: 8/10
Mitigation: 5/10
RabbitMQ busy distribution port performance issue
The Erlang VM has reported a `busy_dist_port` condition, meaning the send buffer of a distribution port (used for inter-node traffic inside a RabbitMQ cluster) is full. When this happens the scheduler suspends the process owning the port, stalling inter-node replication, management calls, and any RabbitMQ process that must use that port. Throughput drops and latency rises until the buffer drains or the node is restarted.
Category: Message Queue Performance · Technology: rabbitmq · Tags: Known Problem, RabbitMQ, Public
CRE-2024-0016
Low
Impact: 4/10
Mitigation: 2/10
Google Kubernetes Engine metrics agent failing to export metrics
The Google Kubernetes Engine metrics agent is failing to export metrics.
Category: Observability Problems · Technology: gke-metrics-agent · Tags: Known Problem, GKE, Public
CRE-2024-0018
Medium
Impact: 4/10
Mitigation: 5/10
Neutron Open Virtual Network (OVN) high CPU usage
OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100%. Logs show “Dropped … due to excessive rate” or “Unreasonably long … poll interval,” slowing port-binding and network traffic.
Category: Networking Problems · Technology: neutron · Tags: Known Problem, Ovn, Public
CRE-2024-0021
High
Impact: 4/10
Mitigation: 5/10
KEDA operator reconciler ScaledObject panic
KEDA allows for fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
Category: Operator Problems · Technology: Unspecified · Tags: KEDA, Crash, Known Problem, Public
CRE-2024-0043
Medium
Impact: 6/10
Mitigation: 5/10
NGINX Upstream DNS Failure
When an NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail. A resolution pre-check sketch follows this entry.
Category: Proxy Problems · Technology: nginx · Tags: Kafka, Known Problem, Public
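
A small external pre-check sketch (not an NGINX feature) that flags when an upstream hostname stops resolving, so a disappearing DNS record is caught before requests start failing. The upstream hostnames below are hypothetical placeholders.

```python
# Sketch: verify that configured NGINX upstream hostnames still resolve.
# Hostnames are placeholders, not taken from the entry above.
import socket
import sys

UPSTREAMS = ["api-backend.internal", "auth-backend.internal"]  # hypothetical upstreams

failed = []
for host in UPSTREAMS:
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)}
        print(f"{host}: resolves to {sorted(addrs)}")
    except socket.gaierror as err:
        print(f"{host}: DNS resolution failed ({err})", file=sys.stderr)
        failed.append(host)

sys.exit(1 if failed else 0)
```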
CRE-2025-0025
Medium
Impact: 6/10
Mitigation: 5/10
Kafka broker replication mismatch
When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics. A client-side reproduction sketch follows this entry.
Category: Message Queue Problems · Technology: topic-operator · Tags: Kafka, Known Problem, Public
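
A minimal reproduction sketch, assuming the `confluent-kafka` admin client (the entry does not name a client) and a placeholder bootstrap address and topic name. On a cluster with fewer than three live brokers, the request below is rejected with the replication-factor error described above.

```python
# Sketch: attempt to create a topic whose replication factor may exceed the
# number of live brokers. Broker address and topic name are placeholders.
from confluent_kafka import KafkaException
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka.example.internal:9092"})

# replication_factor must not exceed the number of live brokers; on a
# single-broker cluster this request fails with INVALID_REPLICATION_FACTOR.
futures = admin.create_topics([NewTopic("orders", num_partitions=3, replication_factor=3)])

for topic, future in futures.items():
    try:
        future.result()  # raises KafkaException on per-topic errors
        print(f"{topic}: created")
    except KafkaException as err:
        print(f"{topic}: {err}")
```

The usual remedies are lowering the configured replication factor or adding brokers so the factor becomes satisfiable.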
CRE-2025-0112
Critical
Impact: 10/10
Mitigation: 4/10
AWS VPC CNI Node IP Pool Depletion Crisis
Critical AWS VPC CNI node IP pool depletion detected, causing cascading pod scheduling failures. This pattern indicates severe subnet IP address exhaustion combined with ENI allocation failures, leading to complete cluster networking breakdown. The failure sequence shows ipamd errors, kubelet scheduling failures, and controller-level pod creation blocks that render clusters unable to deploy new workloads, scale existing services, or recover from node failures. This represents one of the most severe Kubernetes infrastructure failures, often requiring immediate manual intervention, including subnet expansion, secondary CIDR provisioning, or emergency workload termination, to restore cluster functionality. A subnet capacity-check sketch follows this entry.
Category: VPC CNI Problems · Technology: aws-vpc-cni · Tags: AWS, EKS, Kubernetes, Networking, VPC CNI, AWS CNI, IP Exhaustion, ENI Allocation, Subnet Exhaustion, Pod Scheduling Failure, Cluster Paralysis, AWS API Limits, Known Problem, Critical Infrastructure, Service Outage, Cascading Failure, Capacity Exceeded, Scalability Issue, Revenue Impact, Compliance Violation, Threshold Exceeded, Infrastructure, Public
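
A minimal capacity-check sketch using `boto3`, reporting how many free addresses remain in each subnet the CNI can allocate from. The region, cluster tag, and alert threshold are placeholder assumptions.

```python
# Sketch: list free IPs per cluster subnet to spot approaching exhaustion.
# Region, cluster tag value, and LOW_WATER threshold are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

resp = ec2.describe_subnets(
    Filters=[{"Name": "tag:kubernetes.io/cluster/my-cluster", "Values": ["shared", "owned"]}]
)

LOW_WATER = 50  # arbitrary alert threshold for free IPs per subnet
for subnet in resp["Subnets"]:
    free = subnet["AvailableIpAddressCount"]
    flag = "LOW" if free < LOW_WATER else "ok"
    print(f'{subnet["SubnetId"]} {subnet["CidrBlock"]}: {free} free IPs [{flag}]')
```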
CRE-2025-0119
High
Impact: 8/10
Mitigation: 7/10
Kubernetes Pod Disruption Budget (PDB) Violation During Rolling Updates
During rolling updates, a deployment's `maxUnavailable` setting can conflict with a Pod Disruption Budget's `minAvailable` requirement, causing service outages by terminating too many pods simultaneously and violating the availability guarantees. This can also occur during node drains, cluster autoscaling, or maintenance operations. A pre-flight compatibility check sketch follows this entry.
Category: Kubernetes Problems · Technology: kubernetes · Tags: K8s, Known Problem, Misconfiguration, Operational error, High Availability
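
A minimal pre-flight check sketch, assuming the official Kubernetes Python client, a RollingUpdate strategy with integer (non-percentage) `maxUnavailable`/`minAvailable` values, and placeholder resource names: the update respects the PDB only if `replicas - maxUnavailable >= minAvailable`.

```python
# Sketch: compare a Deployment's rolling-update settings against its PDB.
# "web", "web-pdb", and "prod" are placeholder names; integer values assumed.
from kubernetes import client, config

config.load_kube_config()

apps = client.AppsV1Api()
policy = client.PolicyV1Api()

dep = apps.read_namespaced_deployment("web", "prod")
pdb = policy.read_namespaced_pod_disruption_budget("web-pdb", "prod")

replicas = dep.spec.replicas or 1
# Assumes strategy.rolling_update is set and values are plain integers.
max_unavailable = int(dep.spec.strategy.rolling_update.max_unavailable or 0)
min_available = int(pdb.spec.min_available or 0)

# During a rolling update up to max_unavailable replicas may be down at once;
# that must still leave at least min_available pods running.
if replicas - max_unavailable < min_available:
    print(f"Conflict: {replicas} replicas - maxUnavailable {max_unavailable} "
          f"< PDB minAvailable {min_available}")
else:
    print("Rolling-update settings respect the PDB")
```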