Tag: High Availability
Problems related to high-availability systems and failover
ID | Title | Description | Category | Technology | Tags |
---|---|---|---|---|---|
CRE-2025-0070 Critical Impact: 10/10 Mitigation: 6/10 | Kafka Under-Replicated Partitions Crisis | Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics. | Message Queue Problems | kafka | Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation |
CRE-2025-0071 High Impact: 9/10 Mitigation: 8/10 | CoreDNS unavailable | CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage. | Kubernetes Problems | kubernetes | Kubernetes, Networking, DNS, High Availability |
CRE-2025-0075 Critical Impact: 10/10 Mitigation: 6/10 | Nginx Upstream Failure Cascade Crisis | Detects critical Nginx upstream failure cascades that lead to complete service unavailability. This advanced rule identifies comprehensive upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. The rule uses optimized regex patterns for maximum detection coverage while maintaining high performance and low false-positive rates. It captures both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context. | load-balancer-problem | nginx | Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure |
CRE-2025-0076 High Impact: 0/10 Mitigation: 9/10 | SlurmDBD Database Connection Lost | Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt. | HPC Database Problems | slurm | SLURM, SlurmDBD, database-problem, MySQL, High Availability |
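
The CRE-2025-0075 description above pairs a root-cause event (an upstream failure) with a user-facing event (an HTTP 5xx response) inside a 60-second window. A minimal Python sketch of that correlation is shown below. It is not the rule's actual implementation: the regular expressions, the `detect_cascade` function name, and the input shape (time-ordered `(timestamp, line)` pairs) are assumptions made for illustration only.

```python
import re
from datetime import datetime, timedelta

# Hypothetical patterns, for illustration only; the real rule's regexes
# are not given in the table above.
UPSTREAM_FAILURE = re.compile(
    r"upstream timed out"
    r"|connect\(\) failed"
    r"|no live upstreams"
    r"|could not be resolved"
    r"|SSL_do_handshake\(\) failed"
    r"|upstream prematurely closed connection"
)
# Assumes an access-log layout where the status code follows the quoted request.
HTTP_5XX = re.compile(r'" (5\d{2}) ')

WINDOW = timedelta(seconds=60)  # correlation window from the rule description


def detect_cascade(events):
    """Pair each 5xx response with the most recent upstream failure.

    events: iterable of (datetime, log_line) tuples in time order.
    Returns a list of (failure_line, error_line) tuples where the 5xx
    response occurred no more than WINDOW after the upstream failure.
    """
    matches = []
    last_failure = None  # most recent (timestamp, line) upstream failure seen
    for ts, line in events:
        if UPSTREAM_FAILURE.search(line):
            last_failure = (ts, line)
        elif HTTP_5XX.search(line) and last_failure is not None:
            if ts - last_failure[0] <= WINDOW:
                matches.append((last_failure[1], line))
    return matches
```

Keeping only the most recent failure keeps the sketch short; a production rule would more likely track every failure in the window and group matches per upstream.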