Tag: High Availability
Problems related to high-availability systems and failover
ID | Title | Description | Category | Technology | Tags |
---|---|---|---|---|---|
CRE-2025-0070 Critical Impact: 10/10 Mitigation: 6/10 | Kafka Under-Replicated Partitions Crisis | Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics. | Message Queue Problems | kafka | Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation |
CRE-2025-0071 High Impact: 9/10 Mitigation: 8/10 | CoreDNS unavailable | CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage. | Kubernetes Problems | kubernetes | Kubernetes, Networking, DNS, High Availability |
CRE-2025-0075 Critical Impact: 10/10 Mitigation: 6/10 | Nginx Upstream Failure Cascade Crisis | Detects critical Nginx upstream failure cascades that lead to complete service unavailability. This advanced rule identifies comprehensive upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. The rule uses optimized regex patterns for maximum detection coverage while maintaining high performance and low false-positive rates. It captures both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context. | Load Balancer Problems | nginx | Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure |
CRE-2025-0076 High Impact: 0/10 Mitigation: 9/10 | SlurmDBD Database Connection Lost | Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt. | HPC Database Problems | slurm | SLURM, SlurmDBD, MySQL, High Availability |
CRE-2025-0119 High Impact: 8/10 Mitigation: 7/10 | Kubernetes Pod Disruption Budget (PDB) Violation During Rolling Updates | During rolling updates, when a deployment's maxUnavailable setting conflicts with a Pod Disruption Budget's minAvailable requirement, it can cause service outages by terminating too many pods simultaneously, violating the availability guarantees. This can also occur during node drains, cluster autoscaling, or maintenance operations. | Kubernetes Problems | kubernetes | K8s, Known Problem, Misconfiguration, Operational error, High Availability |
CRE-2025-0121 Critical Impact: 10/10 Mitigation: 7/10 | NGINX Ingress Controller SSL Certificate Failure | Critical NGINX Ingress Controller SSL certificate validation failure detected. This pattern indicates cascading SSL failures where certificate verification errors lead to upstream connection failures and service unavailability. The failure sequence shows SSL handshake failures, certificate verification errors, and resulting HTTP error responses that affect client connectivity. | Load Balancer Problems | nginx | Nginx, Ingress Controller, SSL Certificate, TLS Handshake, Certificate Verification, Load Balancer, Kubernetes, Security, High Availability, Service Unavailability |
CRE-2025-0122 Critical Impact: 10/10 Mitigation: 6/10 | AWS VPC CNI IP Address Exhaustion Crisis | Critical AWS VPC CNI IP address exhaustion detected. This pattern indicates cascading failures where subnet IP exhaustion leads to ENI allocation failures, pod scheduling failures, and complete service unavailability. The failure sequence shows IP allocation errors, ENI attachment failures, and resulting pod startup failures that affect cluster scalability and workload deployment. | Networking Problems | aws-vpc-cni | AWS, VPC CNI, Kubernetes, Networking, IP Exhaustion, ENI Allocation, Pod Scheduling, Cluster Scaling, High Availability, Service Unavailability |
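
The conflict described in CRE-2025-0119 comes down to simple arithmetic: if a rollout is allowed to take more pods offline than the Pod Disruption Budget can tolerate, availability can drop below the budget. The sketch below is illustrative only and is not the rule's detection logic. The function names `resolve` and `rollout_can_violate_pdb` are hypothetical, and the rounding behaviour (a Deployment's percentage `maxUnavailable` rounds down, a PDB's percentage `minAvailable` rounds up) is an assumption based on how the Kubernetes documentation describes those fields.

```python
def resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as "25%" into a pod count.

    Integer arithmetic is used so the rounding is exact.
    """
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        if round_up:
            return -(-scaled // 100)  # ceiling division
        return scaled // 100          # floor division
    return int(value)


def rollout_can_violate_pdb(replicas, max_unavailable, min_available):
    """Return True if a rollout may leave fewer pods than the PDB requires.

    Assumption: maxUnavailable rounds down, minAvailable rounds up.
    """
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    required = resolve(min_available, replicas, round_up=True)
    return replicas - unavailable < required
```

For instance, with 4 replicas, `maxUnavailable: 2`, and `minAvailable: 3`, a rollout may leave only 2 pods running, so the function reports a conflict. With 10 replicas, `maxUnavailable: "25%"`, and `minAvailable: "80%"`, at most 2 pods go offline and 8 remain, which still satisfies the budget.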
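
CRE-2025-0075 describes a two-step sequence: an upstream failure followed by an HTTP 5xx response within a 60-second window. The following is a minimal sketch of that kind of windowed correlation, not the rule's actual implementation. It assumes log lines have already been classified into hypothetical event kinds `"upstream_failure"` and `"http_5xx"`; the function name `find_cascades` is likewise made up for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=60)  # window length taken from the rule description


def find_cascades(events, window=WINDOW):
    """Pair each upstream failure with the first 5xx that follows it in time.

    events: iterable of (timestamp, kind) tuples, where kind is the
    hypothetical label "upstream_failure" or "http_5xx".
    Returns a list of (failure_time, error_time) pairs.
    """
    cascades = []
    pending = None  # timestamp of the most recent unmatched upstream failure
    for ts, kind in sorted(events):
        if kind == "upstream_failure":
            pending = ts
        elif kind == "http_5xx" and pending is not None:
            if ts - pending <= window:
                cascades.append((pending, ts))
            pending = None  # a 5xx consumes the pending failure either way
    return cascades
```

A failure at 12:00:00 followed by a 5xx at 12:00:30 is reported as a cascade, while a failure followed by a 5xx two minutes later is not. Resetting `pending` after every 5xx keeps one failure from being matched to several errors; a production rule might choose differently.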