
Tag: High Availability

Problems related to high-availability systems and failover

| ID | Severity | Impact | Mitigation | Title | Description | Category | Technology | Tags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CRE-2025-0070 | Critical | 10/10 | 6/10 | Kafka Under-Replicated Partitions Crisis | Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics. | Message Queue Problems | kafka | Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation |
| CRE-2025-0071 | High | 9/10 | 8/10 | CoreDNS unavailable | CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage. | Kubernetes Problems | kubernetes | Kubernetes, Networking, DNS, High Availability |
| CRE-2025-0075 | Critical | 10/10 | 6/10 | Nginx Upstream Failure Cascade Crisis | Detects critical Nginx upstream failure cascades that lead to complete service unavailability. This advanced rule identifies comprehensive upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. The rule uses optimized regex patterns for maximum detection coverage while maintaining high performance and low false-positive rates. It captures both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context. | Load Balancer Problems | nginx | Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure |
| CRE-2025-0076 | High | 0/10 | 9/10 | SlurmDBD Database Connection Lost | Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt. | HPC Database Problems | slurm | SLURM, SlurmDBD, MySQL, High Availability |
| CRE-2025-0119 | High | 8/10 | 7/10 | Kubernetes Pod Disruption Budget (PDB) Violation During Rolling Updates | During rolling updates, when a deployment's maxUnavailable setting conflicts with a Pod Disruption Budget's minAvailable requirement, it can cause service outages by terminating too many pods simultaneously, violating the availability guarantees. This can also occur during node drains, cluster autoscaling, or maintenance operations. | Kubernetes Problems | kubernetes | K8s, Known Problem, Misconfiguration, Operational error, High Availability |
| CRE-2025-0121 | Critical | 10/10 | 7/10 | NGINX Ingress Controller SSL Certificate Failure | Critical NGINX Ingress Controller SSL certificate validation failure detected. This pattern indicates cascading SSL failures where certificate verification errors lead to upstream connection failures and service unavailability. The failure sequence shows SSL handshake failures, certificate verification errors, and resulting HTTP error responses that affect client connectivity. | Load Balancer Problems | nginx | Nginx, Ingress Controller, SSL Certificate, TLS Handshake, Certificate Verification, Load Balancer, Kubernetes, Security, High Availability, Service Unavailability |
| CRE-2025-0122 | Critical | 10/10 | 6/10 | AWS VPC CNI IP Address Exhaustion Crisis | Critical AWS VPC CNI IP address exhaustion detected. This pattern indicates cascading failures where subnet IP exhaustion leads to ENI allocation failures, pod scheduling failures, and complete service unavailability. The failure sequence shows IP allocation errors, ENI attachment failures, and resulting pod startup failures that affect cluster scalability and workload deployment. | Networking Problems | aws-vpc-cni | AWS, VPC CNI, Kubernetes, Networking, IP Exhaustion, ENI Allocation, Pod Scheduling, Cluster Scaling, High Availability, Service Unavailability |
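
The maxUnavailable versus minAvailable conflict described in CRE-2025-0119 comes down to simple arithmetic, which the following Python sketch illustrates. It is a simplified model rather than the rule's actual detection logic: the function names `resolve` and `rollout_violates_pdb` are hypothetical, and the sketch ignores maxSurge, pod readiness timing, and concurrent node drains. It assumes the documented Kubernetes rounding behavior, where a percentage maxUnavailable on a Deployment rounds down and a percentage minAvailable on a PDB rounds up.

```python
def resolve(value, total, round_up):
    """Resolve an int or a percentage string such as "25%" against `total`."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * total
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_violates_pdb(replicas, max_unavailable, min_available):
    """Return True if a rolling update may drop availability below the PDB floor.

    Hypothetical helper: compares the lowest pod count the rollout strategy
    tolerates with the pod count the PDB requires.
    """
    # Deployment maxUnavailable percentages round down.
    allowed_down = resolve(max_unavailable, replicas, round_up=False)
    # PDB minAvailable percentages round up.
    required_up = resolve(min_available, replicas, round_up=True)
    lowest_available = replicas - allowed_down
    return lowest_available < required_up


if __name__ == "__main__":
    # 4 replicas, up to 2 may be down, but the PDB wants 3 available.
    print(rollout_violates_pdb(4, 2, 3))
    # 10 replicas, 25% may be down (2 pods), PDB wants 80% (8 pods).
    print(rollout_violates_pdb(10, "25%", "80%"))
```

A check like this could run in CI against rendered manifests, flagging a Deployment and PDB pair before it reaches a cluster.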