CRE-2025-0076

SlurmDBD Database Connection Lost
Severity: High
Mitigation: 9/10

Tags: SLURM, SlurmDBD, database-problem, MySQL, High Availability

Description

Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.

Mitigation

**Immediate Actions:**

1. Start the MySQL container again (use `docker restart mysql` instead if it is still running but unresponsive):
   ```bash
   docker start mysql
   ```
2. Confirm MySQL is healthy:
   ```bash
   docker logs --tail 20 mysql
   ```
3. Restart the Slurm services to re-establish connections:
   ```bash
   docker restart slurmdbd slurmctld
   ```
4. Check the `slurmdbd` and `slurmctld` logs for any lingering errors.

**Long-term Fixes:**

- Deploy MySQL on a dedicated, persistent host or a highly available service.
- Monitor MySQL health (CPU, memory, disk) and configure automatic restart.
- Configure slurmdbd retry and timeout parameters in `slurmdbd.conf` to better tolerate transient database outages. The names given here are `DBTimeout` and `DBConnectTimeout`; confirm the exact option names against the `slurmdbd.conf` man page for your Slurm version before relying on them.
- Consider a hot backup slurmdbd node or clustering MySQL.
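
The immediate actions above can be chained into one script so that the Slurm services are restarted only once MySQL answers again. The following is a minimal sketch, not a definitive procedure. It assumes the container names `mysql`, `slurmdbd`, and `slurmctld` from the steps above, and it assumes `mysqladmin` is available inside the MySQL container. The function names and the `MYSQL_WAIT_ATTEMPTS` and `MYSQL_WAIT_DELAY` variables are hypothetical, introduced only for illustration.

```bash
#!/usr/bin/env bash
# Sketch: recover Slurm accounting after the MySQL container goes down.
# Container names come from the mitigation steps; retry values are assumptions.

# Poll the MySQL container until `mysqladmin ping` reports the server is alive.
# Arguments: container name, max attempts (default 30), delay in seconds (default 2).
wait_for_mysql() {
  local container="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: the image ships mysqladmin; ping exits 0 once the server responds.
    if docker exec "$container" mysqladmin ping --silent >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Start MySQL, wait for it to respond, then restart the Slurm daemons.
# Argument: MySQL container name (default "mysql").
recover_slurm_db() {
  local db="${1:-mysql}"
  docker start "$db" || return 1
  if ! wait_for_mysql "$db" "${MYSQL_WAIT_ATTEMPTS:-30}" "${MYSQL_WAIT_DELAY:-2}"; then
    echo "MySQL container '$db' did not become reachable" >&2
    return 1
  fi
  docker restart slurmdbd slurmctld
}
```

Calling `recover_slurm_db` with no arguments targets a container named `mysql`. Waiting for a successful ping before restarting `slurmdbd` and `slurmctld` avoids restarting them against a database that is still initializing. Checking the daemon logs (step 4) remains a manual step.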
