CRE-2025-0076
SlurmDBD Database Connection Lost
Severity: High
Mitigation: 9/10
Description
Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.
Cause
The MySQL server becomes unreachable (e.g., the container is stopped or crashes), so slurmdbd and slurmctld cannot connect. Consequently, job state updates and cluster accounting operations fail.
Mitigation
Immediate Actions:
- Restart the MySQL container:
  `docker start mysql`
- Confirm MySQL is healthy:
  `docker logs mysql --tail 20`
- Restart Slurm services to re-establish connections:
  `docker restart slurmdbd slurmctld`
- Check `slurmdbd` and `slurmctld` logs for any lingering errors.
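The Immediate Actions above can be chained into one script. The following is a minimal sketch, not a tested runbook: it reuses the container names `mysql`, `slurmdbd`, and `slurmctld` from the steps above, and it assumes that `mysqladmin` is present inside the MySQL image. The `retry`, `mysql_ready`, and `recover_slurm_db` helpers and the `DOCKER` override variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch of the recovery sequence. Container names come from the steps above.
DOCKER="${DOCKER:-docker}"   # hypothetical override, handy for dry runs

# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS is used up.
retry() {
  _attempts=$1
  _delay=$2
  shift 2
  _n=1
  while :; do
    "$@" && return 0
    [ "$_n" -ge "$_attempts" ] && return 1
    _n=$((_n + 1))
    sleep "$_delay"
  done
}

# Succeeds once the MySQL server inside the container answers a ping.
# Assumption: mysqladmin is installed in the image.
mysql_ready() {
  "$DOCKER" exec mysql mysqladmin ping --silent >/dev/null 2>&1
}

# Start MySQL, wait for it to answer, then restart the Slurm daemons.
recover_slurm_db() {
  "$DOCKER" start mysql || return 1
  retry 30 2 mysql_ready || { echo "MySQL did not become ready" >&2; return 1; }
  "$DOCKER" restart slurmdbd slurmctld
}

# Run the sequence only when the script is called with "run".
if [ "${1:-}" = "run" ]; then
  recover_slurm_db
fi
```

Saved as, say, `recover.sh`, it would be invoked as `sh recover.sh run`. Waiting for the ping before restarting the Slurm containers avoids restarting them while MySQL is still initializing; the log check in the last step is still worth doing by hand.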
Long-term Fixes:
- Deploy MySQL on a dedicated, persistent host or highly available service.
- Monitor MySQL health (CPU/memory/disk) and configure automatic restart.
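- Configure slurmdbd retry and timeout parameters in `slurmdbd.conf` (listed here as `DBTimeout` and `DBConnectTimeout`; verify the exact names against the `slurmdbd.conf` man page for your Slurm version) to better tolerate transient database outages.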
- Consider a hot backup slurmdbd node or clustering MySQL.
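For the containerized setup used in the Immediate Actions, the automatic-restart and monitoring items could start from something like the sketch below. It covers only container liveness, not CPU, memory, or disk metrics. The `docker update --restart` and `docker inspect --format` commands are standard Docker CLI; the function names and the `DOCKER` override variable are hypothetical, and the container names are assumed to match those used above.

```shell
#!/bin/sh
# Sketch of a restart policy plus a liveness check for the three containers.
DOCKER="${DOCKER:-docker}"   # hypothetical override, handy for dry runs

# Ask Docker to restart a container automatically unless it was stopped on purpose.
enable_auto_restart() {
  "$DOCKER" update --restart unless-stopped "$1"
}

# Print "running", "stopped", or "missing" for a container name.
container_state() {
  _running=$("$DOCKER" inspect --format '{{.State.Running}}' "$1" 2>/dev/null) || {
    echo missing
    return 0
  }
  if [ "$_running" = "true" ]; then
    echo running
  else
    echo stopped
  fi
}

# Report each container's state; return non-zero if any is not running,
# so the script can be called from cron or a monitoring agent.
check_stack() {
  _status=0
  for _name in mysql slurmdbd slurmctld; do
    _state=$(container_state "$_name")
    echo "$_name: $_state"
    [ "$_state" = "running" ] || _status=1
  done
  return "$_status"
}

# Run the check only when the script is called with "check".
if [ "${1:-}" = "check" ]; then
  check_stack
fi
```

Calling `enable_auto_restart mysql` once sets the restart policy; the `check` mode is meant to be run periodically, with a non-zero exit status used as the alert condition.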