Skip to main content

4 docs tagged with "high-availability"

View all tags

CRE-2025-0070

Critical Kafka cluster degradation detected\: Multiple partitions have lost replicas due to broker failure,

CRE-2025-0071

CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster\-wide DNS outage.

CRE-2025-0075

Detects critical Nginx upstream failure cascades that lead to complete service unavailability.

CRE-2025-0076

Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.