CRE-2025-0076
SlurmDBD Database Connection Lost
High
Mitigation: 9/10
Description
Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt.
Mitigation
**Immediate Actions:**

1. Start the stopped MySQL container (use `docker restart mysql` instead if it is running but unresponsive):
   ```bash
   docker start mysql
   ```
2. Confirm MySQL is healthy:
   ```bash
   docker logs mysql --tail 20
   ```
3. Restart Slurm services to re-establish connections:
   ```bash
   docker restart slurmdbd slurmctld
   ```
4. Check `slurmdbd` and `slurmctld` logs for any lingering errors.

**Long-term Fixes:**

- Deploy MySQL on a dedicated, persistent host or highly available service.
- Monitor MySQL health (CPU/memory/disk) and configure automatic restart.
- Configure slurmdbd retry and timeout parameters (`DBTimeout`, `DBConnectTimeout`; confirm the exact names supported by your Slurm version) in `slurmdbd.conf` to better tolerate transient database outages.
- Consider a hot backup slurmdbd node or clustering MySQL.
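The immediate actions above can be chained so that the Slurm services are restarted only once MySQL is actually answering. The following is a minimal sketch, not a tested runbook. The helper names `retry_until` and `recover_slurm_db` are hypothetical, the container names are taken from the steps above, and it assumes the `mysqladmin` client is available inside the MySQL image (MariaDB images may ship a differently named client).

```bash
#!/usr/bin/env bash
# Sketch: wait for MySQL to respond before restarting the Slurm daemons.
# retry_until and recover_slurm_db are hypothetical helper names.

# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

recover_slurm_db() {
  # Start the stopped database container.
  docker start mysql || return 1

  # `mysqladmin ping` exits 0 once the server is up and accepting
  # connections. Assumption: the client exists inside the container.
  if retry_until 30 2 docker exec mysql mysqladmin ping --silent; then
    docker restart slurmdbd slurmctld
  else
    echo "MySQL did not become reachable; Slurm services not restarted" >&2
    return 1
  fi
}
```

Source the file and call `recover_slurm_db` to run the sequence. The retry count (30) and delay (2 seconds) are arbitrary starting values. For the automatic-restart suggestion under long-term fixes, one option on a plain Docker host is a restart policy such as `docker update --restart unless-stopped mysql`.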