
Tag: High Availability

Problems related to high-availability systems and failover

ID | Title | Description | Category | Technology | Tags
CRE-2025-0070
Critical
Impact: 10/10
Mitigation: 6/10
Kafka Under-Replicated Partitions Crisis
Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics.
Category: Message Queue Problems | Technology: kafka | Tags: Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation
CRE-2025-0071
High
Impact: 9/10
Mitigation: 8/10
CoreDNS unavailable
CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage.
Category: Kubernetes Problems | Technology: kubernetes | Tags: Kubernetes, Networking, DNS, High Availability
CRE-2025-0075
Critical
Impact: 10/10
Mitigation: 6/10
Nginx Upstream Failure Cascade Crisis
Detects critical Nginx upstream failure cascades that lead to complete service unavailability. This advanced rule identifies comprehensive upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. The rule uses optimized regex patterns for maximum detection coverage while maintaining high performance and low false-positive rates. It captures both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context.
Category: Load Balancer Problems | Technology: nginx | Tags: Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure
CRE-2025-0076
High
Impact: 0/10
Mitigation: 9/10
SlurmDBD Database Connection Lost
Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.
Category: HPC Database Problems | Technology: slurm | Tags: SLURM, SlurmDBD, MySQL, High Availability
CRE-2025-0119
High
Impact: 8/10
Mitigation: 7/10
Kubernetes Pod Disruption Budget (PDB) Violation During Rolling Updates
During rolling updates, when a deployment's maxUnavailable setting conflicts with a Pod Disruption Budget's minAvailable requirement, it can cause service outages by terminating too many pods simultaneously, violating the availability guarantees. This can also occur during node drains, cluster autoscaling, or maintenance operations.
Category: Kubernetes Problems | Technology: kubernetes | Tags: K8s, Known Problem, Misconfiguration, Operational error, High Availability
CRE-2025-0121
Critical
Impact: 10/10
Mitigation: 7/10
NGINX Ingress Controller SSL Certificate Failure
Critical NGINX Ingress Controller SSL certificate validation failure detected. This pattern indicates cascading SSL failures where certificate verification errors lead to upstream connection failures and service unavailability. The failure sequence shows SSL handshake failures, certificate verification errors, and resulting HTTP error responses that affect client connectivity.
Category: Load Balancer Problems | Technology: nginx | Tags: Nginx, Ingress Controller, SSL Certificate, TLS Handshake, Certificate Verification, Load Balancer, Kubernetes, Security, High Availability, Service Unavailability
CRE-2025-0122
Critical
Impact: 10/10
Mitigation: 6/10
AWS VPC CNI IP Address Exhaustion Crisis
Critical AWS VPC CNI IP address exhaustion detected. This pattern indicates cascading failures where subnet IP exhaustion leads to ENI allocation failures, pod scheduling failures, and complete service unavailability. The failure sequence shows IP allocation errors, ENI attachment failures, and resulting pod startup failures that affect cluster scalability and workload deployment.
Category: Networking Problems | Technology: aws-vpc-cni | Tags: AWS, VPC CNI, Kubernetes, Networking, IP Exhaustion, ENI Allocation, Pod Scheduling, Cluster Scaling, High Availability, Service Unavailability
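
Several entries above describe a sequence rather than a single log line. CRE-2025-0075, for example, looks for an Nginx upstream failure followed by an HTTP 5xx response within a 60-second window. The sketch below illustrates that kind of windowed correlation in Python. It is not the rule's actual implementation: the regexes, the `find_cascades` helper, and the `(timestamp, line)` input shape are all assumptions made for illustration, and the real rule's patterns are not shown in this listing.

```python
import re
from datetime import datetime, timedelta

# Hypothetical patterns based on common Nginx error-log messages;
# the actual regexes used by the rule are not given in this listing.
UPSTREAM_FAILURE = re.compile(
    r"upstream timed out"
    r"|connect\(\) failed"
    r"|no live upstreams"
    r"|could not be resolved"
    r"|SSL_do_handshake\(\) failed"
    r"|upstream prematurely closed connection"
)
# Matches the status field of a combined-format access-log line, e.g. '" 502 '.
HTTP_5XX = re.compile(r'" (5\d\d) ')

WINDOW = timedelta(seconds=60)  # window length taken from the rule description


def find_cascades(events):
    """Return (failure_time, error_time) pairs.

    `events` is an iterable of (datetime, log_line) tuples sorted by time.
    A pair is reported when a 5xx response follows the most recent
    upstream failure by at most WINDOW.
    """
    cascades = []
    last_failure = None
    for ts, line in events:
        if UPSTREAM_FAILURE.search(line):
            last_failure = ts  # remember the latest root-cause event
        elif HTTP_5XX.search(line) and last_failure is not None:
            if ts - last_failure <= WINDOW:
                cascades.append((last_failure, ts))
            last_failure = None  # each failure is paired at most once
    return cascades


t0 = datetime(2025, 1, 1, 12, 0, 0)
sample = [
    (t0, "[error] 1#1: *1 connect() failed (111: Connection refused) "
         "while connecting to upstream"),
    (t0 + timedelta(seconds=10), '10.0.0.1 - - "GET / HTTP/1.1" 502 157'),
    (t0 + timedelta(seconds=200), '10.0.0.1 - - "GET / HTTP/1.1" 503 0'),
]
result = find_cascades(sample)
```

With the sample above, only the 502 that arrives ten seconds after the connection failure is paired with it. The later 503 has no preceding upstream failure and is ignored. Pairing each failure at most once keeps a burst of 5xx responses from producing duplicate matches, which is one plausible way to keep false positives low.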