CRE-2025-0081
Temporal Server Fails Persistence on Read-Only DatabaseCriticalImpact: 9/10Mitigation: 8/10
Description
Detects critical failures where Temporal Server is unable to perform essential database write operations (e.g., starting workflows, recording history, completing tasks) because its underlying SQL database (e.g., PostgreSQL) is in a read-only state. This leads to a halt or severe degradation in workflow processing and can cause cluster instability.
Mitigation
IMMEDIATE TRIAGE (0-10 minutes): 1. Check Temporal Server logs for errors like "cannot execute INSERT/UPDATE/DELETE in a read-only transaction" or "no usable database connection found". Example: `docker logs temporal_container_name | grep -E -i "read-only transaction|no usable database connection"` 2. Check PostgreSQL (or relevant SQL DB) logs for errors indicating read-only status or write failures. 3. Attempt a simple write operation directly on the database using a psql/SQL client with Temporal's credentials to confirm its read-only status. Example (PostgreSQL): `psql -U temporal_user -d temporal_db -c "CREATE TABLE test_write (id int); DROP TABLE test_write;"` 4. Check current database read-only settings (PostgreSQL): `psql -U temporal_user -d temporal_db -c "SHOW default_transaction_read_only;"` `psql -U temporal_user -d temporal_db -c "SELECT pg_is_in_recovery();"` (if it's a replica that should be primary). EMERGENCY RESPONSE (10-30 minutes): 1. Confirm the database is read-only. 2. Switch the database back to read-write mode. - For PostgreSQL: Connect as a superuser and run: `ALTER DATABASE <temporal_db_name> SET default_transaction_read_only = false;` 3. Force Temporal Server to re-establish connections: Terminate existing database connections from Temporal server nodes. - For PostgreSQL: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '<temporal_db_name>' AND application_name <> 'psql' AND pid <> pg_backend_pid();` 4. Monitor Temporal Server logs: Verify that "read-only transaction" errors cease and "no usable database connection" errors resolve. 5. Restart Temporal Server components (history, matching, frontend services) if they don't recover automatically after DB connections are restored to read-write. This is often the quickest way to ensure they pick up fresh, healthy connections. Example: `docker-compose -f <your_compose_file> restart temporal` RECOVERY ACTIONS (30-90 minutes): 1. Verify new workflows can be started and existing workflows can make progress. 2. Check for any workflows that might have gotten stuck or failed permanently during the read-only period and require manual retry or reconciliation. 3. Investigate the root cause of why the database became read-only. 4. Document the incident. PREVENTION STRATEGIES: - Implement robust database monitoring to alert on read-only states, low disk space, or high replication lag (if applicable). - Review and strengthen database operational procedures and automation to prevent accidental read-only configurations. - Ensure proper failover procedures for database replicas promote them to read-write primaries correctly. - Use configuration management to enforce desired database settings. - Regularly test database failover and recovery scenarios.