CRE-2025-0075
Nginx Upstream Failure Cascade CrisisCriticalImpact: 10/10Mitigation: 6/10
Description
Detects critical Nginx upstream failure cascades that lead to complete service unavailability. This advanced rule identifies comprehensive upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. The rule uses optimized regex patterns for maximum detection coverage while maintaining high performance and low false-positive rates. It captures both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context.
Mitigation
IMMEDIATE TRIAGE (0-5 minutes): - Check nginx error logs: `tail -f /var/log/nginx/error.log | grep -E "upstream|backend"` - Verify backend service health: `curl -f http://backend-server:port/health` - Test network connectivity: `ping backend-servers && telnet backend-server port` - Check nginx configuration: `nginx -t && nginx -s reload` - Monitor system resources: `top`, `free -h`, `df -h` - Review recent deployments and configuration changes EMERGENCY RESPONSE (5-15 minutes): 1. Activate incident response team and communication channels 2. Implement emergency maintenance page to inform users 3. Scale up healthy backend instances if available 4. Restart failed backend services with health checks 5. Consider nginx failover to backup upstream pools 6. Enable nginx static content serving for critical pages 7. Implement traffic throttling to reduce backend load RECOVERY ACTIONS (15-60 minutes): 1. Identify and resolve root cause (database, network, resources) 2. Gradually restore backend services with canary deployments 3. Monitor error rates and response times during recovery 4. Validate end-to-end functionality before full restoration 5. Update monitoring thresholds based on incident learnings 6. Document incident timeline and resolution steps PREVENTION STRATEGIES: - Implement comprehensive health checks with circuit breakers - Configure nginx upstream backup servers and failover logic - Set up multi-region deployment with automatic failover - Implement graceful degradation and static content fallbacks - Monitor backend dependencies and cascade failure risks - Establish automated scaling policies for traffic spikes - Regular disaster recovery testing and runbook validation