
Tag: Known Problem

This is a documented known problem with known mitigations.

ID · Title · Description · Category · Technology · Tags
CRE-2024-0007
Critical
Impact: 9/10
Mitigation: 8/10
RabbitMQ Mnesia overloaded
The underlying Erlang database, Mnesia, is overloaded (logs contain `WARNING Mnesia is overloaded`).
Category: Message Queue Problems · Technology: rabbitmq · Tags: Known Problem, RabbitMQ, Public
CRE-2024-0008
High
Impact: 9/10
Mitigation: 6/10
RabbitMQ memory alarm
A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables, and processes) has exceeded the configured `vm_memory_high_watermark`. While the alarm is active, the broker applies flow control, blocking publishers and pausing most ingress activity to protect itself from running out of RAM. A publisher-side detection sketch follows this entry.
Category: Message Queue Problems · Technology: rabbitmq · Tags: Known Problem, RabbitMQ, Public
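
A minimal publisher-side sketch of how this alarm surfaces, assuming the Python `pika` client (the entry does not prescribe a client) and a placeholder broker address and queue name: when the memory alarm triggers, the broker sends `connection.blocked`, which the callbacks below report.

```python
# Sketch: detect RabbitMQ memory-alarm flow control from the publisher side
# using pika (assumed client). Host and queue name are placeholders.
import pika

params = pika.ConnectionParameters(
    host="rabbitmq.example.internal",   # placeholder broker address
    blocked_connection_timeout=300,     # abort if the broker blocks us for 5+ minutes
)
connection = pika.BlockingConnection(params)

def on_blocked(conn, method_frame):
    # Fired when the broker sends connection.blocked, e.g. because the
    # vm_memory_high_watermark alarm is active; publishes will now stall.
    print(f"Broker blocked publishing: {method_frame.method.reason}")

def on_unblocked(conn, method_frame):
    print("Broker resumed publishing (alarm cleared)")

connection.add_on_connection_blocked_callback(on_blocked)
connection.add_on_connection_unblocked_callback(on_unblocked)

channel = connection.channel()
channel.queue_declare(queue="work")
channel.basic_publish(exchange="", routing_key="work", body=b"payload")
connection.close()
```

Registering a `blocked_connection_timeout` keeps a publisher from hanging indefinitely while the alarm is active.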
CRE-2024-0014
High
Impact: 8/10
Mitigation: 5/10
RabbitMQ busy distribution port performance issue
The Erlang VM has reported a `busy_dist_port` condition, meaning the send buffer of a distribution port (used for inter-node traffic inside a RabbitMQ cluster) is full. When this happens the scheduler suspends the process owning the port, stalling inter-node replication, management calls, and any RabbitMQ process that must use that port. Throughput drops and latency rises until the buffer drains or the node is restarted.
Category: Message Queue Performance · Technology: rabbitmq · Tags: Known Problem, RabbitMQ, Public
CRE-2024-0016
Low
Impact: 4/10
Mitigation: 2/10
Google Kubernetes Engine metrics agent failing to export metrics
The Google Kubernetes Engine metrics agent is failing to export metrics.
Category: Observability Problems · Technology: gke-metrics-agent · Tags: Known Problem, GKE, Public
CRE-2024-0018
Medium
Impact: 4/10
Mitigation: 5/10
Neutron Open Virtual Network (OVN) high CPU usage
OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100%. Logs show “Dropped … due to excessive rate” or “Unreasonably long … poll interval,” slowing port-binding and network traffic.
Category: Networking Problems · Technology: neutron · Tags: Known Problem, Ovn, Public
CRE-2024-0021
High
Impact: 4/10
Mitigation: 5/10
KEDA operator reconciler ScaledObject panic
KEDA allows for fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
Category: Operator Problems · Technology: Unspecified · Tags: KEDA, Crash, Known Problem, Public
CRE-2024-0043
Medium
Impact: 6/10
Mitigation: 5/10
NGINX Upstream DNS Failure
When an NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail. A resolution pre-check sketch follows this entry.
Category: Proxy Problems · Technology: nginx · Tags: Kafka, Known Problem, Public
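
A small external pre-check sketch (not an NGINX feature) that flags when an upstream hostname stops resolving, so a disappearing DNS record is caught before requests start failing. The upstream hostnames below are hypothetical placeholders.

```python
# Sketch: verify that configured NGINX upstream hostnames still resolve.
# Hostnames are placeholders, not taken from the entry above.
import socket
import sys

UPSTREAMS = ["api-backend.internal", "auth-backend.internal"]  # hypothetical upstreams

failed = []
for host in UPSTREAMS:
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)}
        print(f"{host}: resolves to {sorted(addrs)}")
    except socket.gaierror as err:
        print(f"{host}: DNS resolution failed ({err})", file=sys.stderr)
        failed.append(host)

sys.exit(1 if failed else 0)
```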
CRE-2025-0025
Medium
Impact: 6/10
Mitigation: 5/10
Kafka broker replication mismatch
When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics. A client-side reproduction sketch follows this entry.
Category: Message Queue Problems · Technology: topic-operator · Tags: Kafka, Known Problem, Public
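
A minimal reproduction sketch, assuming the `confluent-kafka` admin client (the entry does not name a client) and a placeholder bootstrap address and topic name. On a cluster with fewer than three live brokers, the request below is rejected with the replication-factor error described above.

```python
# Sketch: attempt to create a topic whose replication factor may exceed the
# number of live brokers. Broker address and topic name are placeholders.
from confluent_kafka import KafkaException
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka.example.internal:9092"})

# replication_factor must not exceed the number of live brokers; on a
# single-broker cluster this request fails with INVALID_REPLICATION_FACTOR.
futures = admin.create_topics([NewTopic("orders", num_partitions=3, replication_factor=3)])

for topic, future in futures.items():
    try:
        future.result()  # raises KafkaException on per-topic errors
        print(f"{topic}: created")
    except KafkaException as err:
        print(f"{topic}: {err}")
```

The usual remedies are lowering the configured replication factor or adding brokers so the factor becomes satisfiable.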
CRE-2025-0112
Critical
Impact: 10/10
Mitigation: 4/10
AWS VPC CNI Node IP Pool Depletion Crisis
Critical AWS VPC CNI node IP pool depletion detected, causing cascading pod scheduling failures. This pattern indicates severe subnet IP address exhaustion combined with ENI allocation failures, leading to complete cluster networking breakdown. The failure sequence shows ipamd errors, kubelet scheduling failures, and controller-level pod creation blocks that render clusters unable to deploy new workloads, scale existing services, or recover from node failures. This represents one of the most severe Kubernetes infrastructure failures, often requiring immediate manual intervention, including subnet expansion, secondary CIDR provisioning, or emergency workload termination, to restore cluster functionality. A subnet capacity-check sketch follows this entry.
Category: VPC CNI Problems · Technology: aws-vpc-cni · Tags: AWS, EKS, Kubernetes, Networking, VPC CNI, AWS CNI, IP Exhaustion, ENI Allocation, Subnet Exhaustion, Pod Scheduling Failure, Cluster Paralysis, AWS API Limits, Known Problem, Critical Infrastructure, Service Outage, Cascading Failure, Capacity Exceeded, Scalability Issue, Revenue Impact, Compliance Violation, Threshold Exceeded, Infrastructure, Public
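
A minimal capacity-check sketch using `boto3`, reporting how many free addresses remain in each subnet the CNI can allocate from. The region, cluster tag, and alert threshold are placeholder assumptions.

```python
# Sketch: list free IPs per cluster subnet to spot approaching exhaustion.
# Region, cluster tag value, and LOW_WATER threshold are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

resp = ec2.describe_subnets(
    Filters=[{"Name": "tag:kubernetes.io/cluster/my-cluster", "Values": ["shared", "owned"]}]
)

LOW_WATER = 50  # arbitrary alert threshold for free IPs per subnet
for subnet in resp["Subnets"]:
    free = subnet["AvailableIpAddressCount"]
    flag = "LOW" if free < LOW_WATER else "ok"
    print(f'{subnet["SubnetId"]} {subnet["CidrBlock"]}: {free} free IPs [{flag}]')
```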
CRE-2025-0119
High
Impact: 8/10
Mitigation: 7/10
Kubernetes Pod Disruption Budget (PDB) Violation During Rolling Updates
During rolling updates, a deployment's `maxUnavailable` setting can conflict with a Pod Disruption Budget's `minAvailable` requirement, causing service outages by terminating too many pods simultaneously and violating the availability guarantees. This can also occur during node drains, cluster autoscaling, or maintenance operations. A pre-flight compatibility check sketch follows this entry.
Category: Kubernetes Problems · Technology: kubernetes · Tags: K8s, Known Problem, Misconfiguration, Operational error, High Availability
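
A minimal pre-flight check sketch, assuming the official Kubernetes Python client, a RollingUpdate strategy with integer (non-percentage) `maxUnavailable`/`minAvailable` values, and placeholder resource names: the update respects the PDB only if `replicas - maxUnavailable >= minAvailable`.

```python
# Sketch: compare a Deployment's rolling-update settings against its PDB.
# "web", "web-pdb", and "prod" are placeholder names; integer values assumed.
from kubernetes import client, config

config.load_kube_config()

apps = client.AppsV1Api()
policy = client.PolicyV1Api()

dep = apps.read_namespaced_deployment("web", "prod")
pdb = policy.read_namespaced_pod_disruption_budget("web-pdb", "prod")

replicas = dep.spec.replicas or 1
# Assumes strategy.rolling_update is set and values are plain integers.
max_unavailable = int(dep.spec.strategy.rolling_update.max_unavailable or 0)
min_available = int(pdb.spec.min_available or 0)

# During a rolling update up to max_unavailable replicas may be down at once;
# that must still leave at least min_available pods running.
if replicas - max_unavailable < min_available:
    print(f"Conflict: {replicas} replicas - maxUnavailable {max_unavailable} "
          f"< PDB minAvailable {min_available}")
else:
    print("Rolling-update settings respect the PDB")
```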