CRE-2025-0075
Nginx Upstream Failure Cascade Crisis
Severity: Critical | Impact: 10/10 | Mitigation: 6/10
Description
Detects critical Nginx upstream failure cascades that lead to complete service unavailability.
The rule identifies upstream failure patterns, including DNS resolution failures, connection
timeouts, SSL/TLS handshake errors, protocol violations, and upstream server unavailability,
followed by HTTP 5xx error responses within a 60-second window.
Its regex patterns are written for broad detection coverage while keeping evaluation cost and
the false-positive rate low. The rule captures both the root cause (upstream failures) and the
user-facing impact (HTTP errors) to provide complete incident context.
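For illustration, the shell sketch below approximates the rule's two-sided check by hand: it counts recent upstream failure messages in the error log and 5xx responses in the access log. The log paths, the regular expression, and the assumption of the default combined access log format are not taken from the rule itself, and a fixed line window stands in for the rule's 60-second correlation.

```bash
#!/usr/bin/env bash
# Rough manual approximation of the rule's logic. Assumptions: default log
# paths, default "combined" access log format, and a last-1000-line window
# instead of a strict 60-second correlation.
ERROR_LOG="/var/log/nginx/error.log"
ACCESS_LOG="/var/log/nginx/access.log"

# Common nginx upstream failure messages: timeouts, refused connections,
# DNS resolution errors, TLS handshake failures, and exhausted pools.
UPSTREAM_RE='upstream timed out|connect\(\) failed .* while connecting to upstream|no live upstreams|could not be resolved|SSL_do_handshake\(\) failed|upstream prematurely closed'

failures=$(tail -n 1000 "$ERROR_LOG" | grep -Ec "$UPSTREAM_RE")
# In the combined log format the response status is the 9th field.
errors_5xx=$(tail -n 1000 "$ACCESS_LOG" | awk '$9 ~ /^5[0-9][0-9]$/' | wc -l)

echo "upstream failures: ${failures}  HTTP 5xx responses: ${errors_5xx}"
if [ "$failures" -gt 0 ] && [ "$errors_5xx" -gt 0 ]; then
    echo "Upstream failures and 5xx responses observed together: possible cascade"
fi
```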
Cause
ROOT CAUSES:
- Complete backend server cluster failure (crash, hang, resource exhaustion)
- Critical database outages affecting all application instances
- Network partitions isolating nginx from backend infrastructure
- DNS resolution failures for backend service discovery
- SSL/TLS certificate expiration or misconfiguration
- Load balancer configuration errors or upstream pool exhaustion
- Infrastructure failures (cloud provider outages, data center issues)
- Cascading failures across microservice dependencies
- Memory/CPU exhaustion across entire backend server fleet
- Container orchestration failures (Kubernetes, Docker Swarm)
- Security policy changes blocking backend communication
Mitigation
IMMEDIATE TRIAGE (0-5 minutes):
- Check nginx error logs: `tail -f /var/log/nginx/error.log | grep -E "upstream|backend"`
- Verify backend service health: `curl -f http://backend-server:port/health`
- Test network connectivity: `ping backend-server && telnet backend-server port`
- Validate nginx configuration (reloads only if the test passes): `nginx -t && nginx -s reload`
- Monitor system resources: `top`, `free -h`, `df -h`
- Review recent deployments and configuration changes (the checks above are combined into one script after this list)
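The individual checks can be run in a single pass; the sketch below is a minimal example in which the backend host, port, and health path are placeholders for the real environment.

```bash
#!/usr/bin/env bash
# Minimal first-pass triage. BACKEND_HOST, BACKEND_PORT, and the /health
# path are placeholders; run with privileges sufficient to read nginx logs.
BACKEND_HOST="backend-server"
BACKEND_PORT="8080"

echo "== Recent upstream/backend errors =="
tail -n 50 /var/log/nginx/error.log | grep -E "upstream|backend" || echo "none found"

echo "== Backend health endpoint =="
curl -fsS -m 5 "http://${BACKEND_HOST}:${BACKEND_PORT}/health" || echo "health check FAILED"

echo "== Network reachability =="
ping -c 3 "$BACKEND_HOST" || echo "ping FAILED"
timeout 3 bash -c "</dev/tcp/${BACKEND_HOST}/${BACKEND_PORT}" \
  && echo "port ${BACKEND_PORT} open" || echo "port ${BACKEND_PORT} unreachable"

echo "== Nginx configuration =="
nginx -t

echo "== Local resources =="
free -h
df -h
```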
EMERGENCY RESPONSE (5-15 minutes):
- Activate incident response team and communication channels
- Serve an emergency maintenance page to inform users
- Scale up healthy backend instances if available
- Restart failed backend services with health checks
- Consider nginx failover to backup upstream pools (see the configuration sketch after this list)
- Enable nginx static content serving for critical pages
- Implement traffic throttling to reduce backend load
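For the failover, maintenance-page, and throttling items above, the sketch below writes one possible nginx snippet and applies it only if it validates. The upstream name, pool members, rate limit, and paths are placeholders, and the snippet assumes the standard conf.d include into the http context; it is an illustration, not a drop-in configuration.

```bash
#!/usr/bin/env bash
# Illustrative emergency measures: backup upstream, maintenance page,
# and request throttling. All names and limits are placeholders.
cat > /etc/nginx/conf.d/emergency.conf <<'EOF'
# Throttle incoming traffic to protect recovering backends.
limit_req_zone $binary_remote_addr zone=emergency:10m rate=10r/s;

upstream app_backend {
    server primary-backend:8080 max_fails=3 fail_timeout=10s;
    # Used only when all primary servers are marked unavailable.
    server backup-backend:8080 backup;
}

server {
    listen 80;

    location / {
        limit_req zone=emergency burst=20 nodelay;
        proxy_pass http://app_backend;
        # Intercept backend 5xx as well as nginx-generated gateway errors.
        proxy_intercept_errors on;
        error_page 502 503 504 = @maintenance;
    }

    # Static maintenance page served when the upstream pool is failing.
    location @maintenance {
        root /var/www/maintenance;
        try_files /index.html =503;
    }
}
EOF

# Apply only if the configuration is valid.
nginx -t && nginx -s reload
```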
RECOVERY ACTIONS (15-60 minutes):
- Identify and resolve root cause (database, network, resources)
- Gradually restore backend services with canary deployments (one nginx-level approach is sketched after this list)
- Monitor error rates and response times during recovery
- Validate end-to-end functionality before full restoration
- Update monitoring thresholds based on incident learnings
- Document incident timeline and resolution steps
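One nginx-level way to stage traffic back onto a restored backend is weight-based splitting in the upstream pool, sketched below with placeholder hosts and weights; canary releases are often handled instead by deployment tooling or a service mesh.

```bash
#!/usr/bin/env bash
# Hypothetical weighted canary: roughly 10% of requests go to the freshly
# restored backend, the rest to the known-good one. Widen the split as
# error rates stay flat. An upstream name may be defined only once, so
# this replaces any earlier definition of app_backend.
cat > /etc/nginx/conf.d/canary-upstream.conf <<'EOF'
upstream app_backend {
    server stable-backend:8080   weight=9;
    server restored-backend:8080 weight=1 max_fails=1 fail_timeout=5s;
}
EOF

nginx -t && nginx -s reload
```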
PREVENTION STRATEGIES:
- Implement comprehensive health checks with circuit breakers
- Configure nginx upstream backup servers and failover logic (see the configuration sketch after this list)
- Set up multi-region deployment with automatic failover
- Implement graceful degradation and static content fallbacks
- Monitor backend dependencies and cascade failure risks
- Establish automated scaling policies for traffic spikes
- Regularly test disaster recovery procedures and validate runbooks
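As a baseline for the nginx-side prevention items, the sketch below combines passive health checks, bounded failover retries, and stale-cache fallback; hosts, timeouts, and paths are placeholders, and in practice these directives would be merged into the existing server and upstream blocks. Note that open-source nginx provides only passive health checks (max_fails/fail_timeout); active health checks and full circuit breaking require NGINX Plus or an external component.

```bash
#!/usr/bin/env bash
# Preventive baseline: passive health checks, bounded failover, and stale
# cache fallback. Hosts, timeouts, and cache paths are placeholders.
cat > /etc/nginx/conf.d/resilient-upstream.conf <<'EOF'
proxy_cache_path /var/cache/nginx/app levels=1:2 keys_zone=app_cache:10m
                 max_size=1g inactive=10m;

upstream app_backend {
    # Passive health checks: mark a server down for 30s after 3 failures.
    server backend-a:8080 max_fails=3 fail_timeout=30s;
    server backend-b:8080 max_fails=3 fail_timeout=30s;
    server backup-backend:8080 backup;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Fail over to the next server on errors and timeouts, but bound
        # retries so a dead pool does not multiply request latency.
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_connect_timeout 2s;
        proxy_read_timeout 10s;

        # Graceful degradation: serve stale cached responses while the
        # backends are failing or being refreshed.
        proxy_cache app_cache;
        proxy_cache_use_stale error timeout updating http_502 http_503 http_504;
    }
}
EOF

nginx -t && nginx -s reload
```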