
Tag: High Availability

Problems related to high-availability systems and failover

ID | Title | Description | Category | Technology | Tags
CRE-2025-0070
Critical
Impact: 10/10
Mitigation: 6/10
Title: Kafka Under-Replicated Partitions Crisis
Description: Critical Kafka cluster degradation detected: multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics.
Category: Message Queue Problems
Technology: kafka
Tags: Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation
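To make the under-replicated condition concrete: a partition counts as under-replicated when its ISR is smaller than its assigned replica set. The following is a minimal sketch of that check, not the rule's actual implementation. The `under_replicated` helper, the dictionary layout, and the topic names are hypothetical and chosen only for illustration.

```python
def under_replicated(partitions):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose in-sync replica set is smaller than its assigned replica set."""
    flagged = []
    for p in partitions:
        # Brokers that are assigned as replicas but have dropped out of the ISR.
        missing = sorted(set(p["replicas"]) - set(p["isr"]))
        if missing:
            flagged.append((p["topic"], p["partition"], missing))
    return flagged


# Hypothetical metadata snapshot: broker 2 has become unavailable.
partitions = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "orders", "partition": 1, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "payments", "partition": 0, "replicas": [2, 3, 1], "isr": [3, 1]},
]

print(under_replicated(partitions))
# → [('orders', 0, [2]), ('payments', 0, [2])]
```

When the same broker ID is missing from many partitions across several topics, that matches the broker-failure pattern described above.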
CRE-2025-0071
High
Impact: 9/10
Mitigation: 8/10
Title: CoreDNS unavailable
Description: CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage.
Category: Kubernetes Problems
Technology: kubernetes
Tags: Kubernetes, Networking, DNS, High Availability
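The two conditions in this description can be sketched as a check over Kubernetes API objects that have already been fetched and parsed into dictionaries. This is an illustrative sketch, not the rule's detection logic. It assumes the standard `status.availableReplicas` field of a Deployment and the `subsets[].addresses` list of an Endpoints object; the `coredns_unavailable` helper and the sample objects are hypothetical.

```python
def coredns_unavailable(deployment, endpoints):
    """Return True if the Deployment reports no available replicas
    or the Endpoints object lists no ready addresses."""
    # availableReplicas is omitted from the status when it is zero.
    available = (deployment.get("status") or {}).get("availableReplicas") or 0
    # Only 'addresses' holds ready endpoints; 'notReadyAddresses' is ignored.
    ready = sum(len(s.get("addresses") or []) for s in endpoints.get("subsets") or [])
    return available == 0 or ready == 0


# Hypothetical objects: replicas exist, but none is ready to serve DNS.
deployment = {"status": {"replicas": 2, "unavailableReplicas": 2}}
endpoints = {"subsets": [{"notReadyAddresses": [{"ip": "10.0.0.5"}]}]}

print(coredns_unavailable(deployment, endpoints))  # → True
```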
CRE-2025-0075
Critical
Impact: 10/10
Mitigation: 6/10
Title: Nginx Upstream Failure Cascade Crisis
Description: Detects critical Nginx upstream failure cascades that lead to complete service unavailability. This advanced rule identifies comprehensive upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. The rule uses optimized regex patterns for maximum detection coverage while maintaining high performance and low false-positive rates. It captures both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context.
Category: load-balancer-problem
Technology: nginx
Tags: Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure
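The "upstream failure followed by HTTP 5xx within a 60-second window" correlation can be sketched roughly as follows. This is not the rule's actual regex set: the patterns below are a small illustrative subset of common Nginx error-log messages, the status match is a simplification that assumes the combined access log format, and the `find_cascades` helper and its pre-parsed `(timestamp, line)` input are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Illustrative subset of upstream failure messages (not the rule's own patterns).
UPSTREAM_FAILURE = re.compile(
    r"upstream timed out|connect\(\) failed|no live upstreams|"
    r"could not be resolved|SSL_do_handshake\(\) failed|"
    r"upstream sent invalid header|upstream prematurely closed connection"
)
# Status code directly after the quoted request line (combined log format assumed).
HTTP_5XX = re.compile(r'" 5\d\d ')

WINDOW = timedelta(seconds=60)


def find_cascades(events):
    """events: (datetime, log_line) pairs sorted by time.
    Returns (failure_time, error_time) pairs where a 5xx response
    follows the most recent upstream failure within WINDOW."""
    cascades = []
    last_failure = None
    for ts, line in events:
        if UPSTREAM_FAILURE.search(line):
            last_failure = ts
        elif HTTP_5XX.search(line) and last_failure is not None:
            if ts - last_failure <= WINDOW:
                cascades.append((last_failure, ts))
    return cascades


t0 = datetime(2025, 1, 1, 12, 0, 0)
events = [
    (t0, "connect() failed (111: Connection refused) while connecting to upstream"),
    (t0 + timedelta(seconds=30), '"GET / HTTP/1.1" 502 157 "-" "curl/8.0"'),
    (t0 + timedelta(seconds=120), '"GET / HTTP/1.1" 502 157 "-" "curl/8.0"'),
]

print(len(find_cascades(events)))  # → 1
```

Only the 502 at 30 seconds is paired with the failure; the one at 120 seconds falls outside the window. Pairing the root-cause line with the user-facing error is what gives the incident both halves of its context.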
CRE-2025-0076
High
Impact: 0/10
Mitigation: 9/10
Title: SlurmDBD Database Connection Lost
Description: Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.
Category: HPC Database Problems
Technology: slurm
Tags: SLURM, SlurmDBD, database-problem, MySQL, High Availability