CRE-2025-0076
SlurmDBD Database Connection Lost
Severity: High
Mitigation: 9/10
Description
Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.
Mitigation
**Immediate Actions:**

1. Start the MySQL container if it has stopped:

   ```bash
   docker start mysql
   ```

2. Confirm MySQL is healthy:

   ```bash
   docker logs mysql --tail 20
   ```

3. Restart the Slurm services so they re-establish their database connections:

   ```bash
   docker restart slurmdbd slurmctld
   ```

4. Check the `slurmdbd` and `slurmctld` logs for any lingering errors.

**Long-term Fixes:**

- Deploy MySQL on a dedicated, persistent host or a highly available service.
- Monitor MySQL health (CPU, memory, disk) and configure automatic restart.
- Configure slurmdbd retry and timeout parameters (`DBTimeout`, `DBConnectTimeout`) in `slurmdbd.conf` to better tolerate transient database outages.
- Consider a hot backup slurmdbd node or clustering MySQL.
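The immediate actions above can be chained into one script so that the Slurm services are restarted only after MySQL accepts connections again. This is a minimal sketch, not a definitive implementation. The container names `mysql`, `slurmdbd`, and `slurmctld` come from the steps above. The helper names (`retry`, `mysql_ready`, `recover`), the `--run` argument, the 30 attempts at 2-second intervals, and the availability of `mysqladmin` inside the MySQL container are all assumptions to adjust for your environment.

```bash
#!/usr/bin/env bash
# Sketch of the immediate recovery steps. Helper names, retry counts, and the
# --run argument are illustrative assumptions, not part of Slurm or Docker.
set -euo pipefail

# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS runs out.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Assumption: mysqladmin is present in the container image and can ping the
# server without extra credentials. Add -u/-p options if your setup needs them.
mysql_ready() {
  docker exec mysql mysqladmin ping --silent >/dev/null 2>&1
}

recover() {
  # Step 1: bring the database container back.
  docker start mysql

  # Step 2: wait until the server answers before touching Slurm.
  if ! retry 30 2 mysql_ready; then
    echo "MySQL did not become ready; inspect 'docker logs mysql'" >&2
    return 1
  fi

  # Step 3: restart the Slurm services so they reconnect.
  docker restart slurmdbd slurmctld

  # Step 4: print recent log lines so lingering errors are visible.
  docker logs slurmdbd --tail 20
  docker logs slurmctld --tail 20
}

# Only act when explicitly asked to, so the file can be sourced safely.
if [[ "${1:-}" == "--run" ]]; then
  recover
fi
```

Saved as, say, `recover-slurm-db.sh` (a hypothetical filename), it would be invoked as `bash recover-slurm-db.sh --run`. Waiting on `mysqladmin ping` instead of restarting the Slurm containers immediately avoids a second round of connection failures while MySQL is still initializing.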