CRE-2025-0075
Nginx Upstream Failure Cascade Crisis
Severity: Critical | Impact: 10/10 | Mitigation: 6/10
Description
Detects critical Nginx upstream failure cascades that lead to complete service unavailability.
The rule identifies upstream failure patterns, including DNS resolution failures, connection
timeouts, SSL/TLS handshake errors, protocol violations, and upstream server unavailability,
followed by HTTP 5xx error responses within a 60-second window.
Its regex patterns are written for broad detection coverage while keeping evaluation cost and
the false-positive rate low. The rule captures both the root cause (upstream failures) and the
user-facing impact (HTTP errors) to provide complete incident context.
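For illustration, the shell sketch below approximates the rule's two-sided check by hand: it counts recent upstream failure messages in the error log and 5xx responses in the access log. The log paths, the regular expression, and the assumption of the default combined access log format are not taken from the rule itself, and a fixed line window stands in for the rule's 60-second correlation.

```bash
#!/usr/bin/env bash
# Rough manual approximation of the rule's logic. Assumptions: default log
# paths, default "combined" access log format, and a last-1000-line window
# instead of a strict 60-second correlation.
ERROR_LOG="/var/log/nginx/error.log"
ACCESS_LOG="/var/log/nginx/access.log"

# Common nginx upstream failure messages: timeouts, refused connections,
# DNS resolution errors, TLS handshake failures, and exhausted pools.
UPSTREAM_RE='upstream timed out|connect\(\) failed .* while connecting to upstream|no live upstreams|could not be resolved|SSL_do_handshake\(\) failed|upstream prematurely closed'

failures=$(tail -n 1000 "$ERROR_LOG" | grep -Ec "$UPSTREAM_RE")
# In the combined log format the response status is the 9th field.
errors_5xx=$(tail -n 1000 "$ACCESS_LOG" | awk '$9 ~ /^5[0-9][0-9]$/' | wc -l)

echo "upstream failures: ${failures}  HTTP 5xx responses: ${errors_5xx}"
if [ "$failures" -gt 0 ] && [ "$errors_5xx" -gt 0 ]; then
    echo "Upstream failures and 5xx responses observed together: possible cascade"
fi
```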
Cause
ROOT CAUSES:
- Complete backend server cluster failure (crash, hang, resource exhaustion)
- Critical database outages affecting all application instances
- Network partitions isolating nginx from backend infrastructure
- DNS resolution failures for backend service discovery
- SSL/TLS certificate expiration or misconfiguration
- Load balancer configuration errors or upstream pool exhaustion
- Infrastructure failures (cloud provider outages, data center issues)
- Cascading failures across microservice dependencies
- Memory/CPU exhaustion across entire backend server fleet
- Container orchestration failures (Kubernetes, Docker Swarm)
- Security policy changes blocking backend communication
Mitigation
IMMEDIATE TRIAGE (0-5 minutes):
- Check nginx error logs: `tail -f /var/log/nginx/error.log | grep -E "upstream|backend"`
- Verify backend service health: `curl -f http://backend-server:port/health`
- Test network connectivity: `ping backend-server && telnet backend-server port`
- Validate nginx configuration (reloads only if the test passes): `nginx -t && nginx -s reload`
- Monitor system resources: `top`, `free -h`, `df -h`
- Review recent deployments and configuration changes (the checks above are combined into one script after this list)
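The individual checks can be run in a single pass; the sketch below is a minimal example in which the backend host, port, and health path are placeholders for the real environment.

```bash
#!/usr/bin/env bash
# Minimal first-pass triage. BACKEND_HOST, BACKEND_PORT, and the /health
# path are placeholders; run with privileges sufficient to read nginx logs.
BACKEND_HOST="backend-server"
BACKEND_PORT="8080"

echo "== Recent upstream/backend errors =="
tail -n 50 /var/log/nginx/error.log | grep -E "upstream|backend" || echo "none found"

echo "== Backend health endpoint =="
curl -fsS -m 5 "http://${BACKEND_HOST}:${BACKEND_PORT}/health" || echo "health check FAILED"

echo "== Network reachability =="
ping -c 3 "$BACKEND_HOST" || echo "ping FAILED"
timeout 3 bash -c "</dev/tcp/${BACKEND_HOST}/${BACKEND_PORT}" \
  && echo "port ${BACKEND_PORT} open" || echo "port ${BACKEND_PORT} unreachable"

echo "== Nginx configuration =="
nginx -t

echo "== Local resources =="
free -h
df -h
```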
EMERGENCY RESPONSE (5-15 minutes):
- Activate incident response team and communication channels
- Serve an emergency maintenance page to inform users
- Scale up healthy backend instances if available
- Restart failed backend services with health checks
- Consider nginx failover to backup upstream pools (see the configuration sketch after this list)
- Enable nginx static content serving for critical pages
- Implement traffic throttling to reduce backend load
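For the failover, maintenance-page, and throttling items above, the sketch below writes one possible nginx snippet and applies it only if it validates. The upstream name, pool members, rate limit, and paths are placeholders, and the snippet assumes the standard conf.d include into the http context; it is an illustration, not a drop-in configuration.

```bash
#!/usr/bin/env bash
# Illustrative emergency measures: backup upstream, maintenance page,
# and request throttling. All names and limits are placeholders.
cat > /etc/nginx/conf.d/emergency.conf <<'EOF'
# Throttle incoming traffic to protect recovering backends.
limit_req_zone $binary_remote_addr zone=emergency:10m rate=10r/s;

upstream app_backend {
    server primary-backend:8080 max_fails=3 fail_timeout=10s;
    # Used only when all primary servers are marked unavailable.
    server backup-backend:8080 backup;
}

server {
    listen 80;

    location / {
        limit_req zone=emergency burst=20 nodelay;
        proxy_pass http://app_backend;
        # Intercept backend 5xx as well as nginx-generated gateway errors.
        proxy_intercept_errors on;
        error_page 502 503 504 = @maintenance;
    }

    # Static maintenance page served when the upstream pool is failing.
    location @maintenance {
        root /var/www/maintenance;
        try_files /index.html =503;
    }
}
EOF

# Apply only if the configuration is valid.
nginx -t && nginx -s reload
```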
RECOVERY ACTIONS (15-60 minutes):
- Identify and resolve root cause (database, network, resources)
- Gradually restore backend services with canary deployments (one nginx-level approach is sketched after this list)
- Monitor error rates and response times during recovery
- Validate end-to-end functionality before full restoration
- Update monitoring thresholds based on incident learnings
- Document incident timeline and resolution steps
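One nginx-level way to stage traffic back onto a restored backend is weight-based splitting in the upstream pool, sketched below with placeholder hosts and weights; canary releases are often handled instead by deployment tooling or a service mesh.

```bash
#!/usr/bin/env bash
# Hypothetical weighted canary: roughly 10% of requests go to the freshly
# restored backend, the rest to the known-good one. Widen the split as
# error rates stay flat. An upstream name may be defined only once, so
# this replaces any earlier definition of app_backend.
cat > /etc/nginx/conf.d/canary-upstream.conf <<'EOF'
upstream app_backend {
    server stable-backend:8080   weight=9;
    server restored-backend:8080 weight=1 max_fails=1 fail_timeout=5s;
}
EOF

nginx -t && nginx -s reload
```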
PREVENTION STRATEGIES:
- Implement comprehensive health checks with circuit breakers
- Configure nginx upstream backup servers and failover logic (see the configuration sketch after this list)
- Set up multi-region deployment with automatic failover
- Implement graceful degradation and static content fallbacks
- Monitor backend dependencies and cascade failure risks
- Establish automated scaling policies for traffic spikes
- Regularly test disaster recovery procedures and validate runbooks
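As a baseline for the nginx-side prevention items, the sketch below combines passive health checks, bounded failover retries, and stale-cache fallback; hosts, timeouts, and paths are placeholders, and in practice these directives would be merged into the existing server and upstream blocks. Note that open-source nginx provides only passive health checks (max_fails/fail_timeout); active health checks and full circuit breaking require NGINX Plus or an external component.

```bash
#!/usr/bin/env bash
# Preventive baseline: passive health checks, bounded failover, and stale
# cache fallback. Hosts, timeouts, and cache paths are placeholders.
cat > /etc/nginx/conf.d/resilient-upstream.conf <<'EOF'
proxy_cache_path /var/cache/nginx/app levels=1:2 keys_zone=app_cache:10m
                 max_size=1g inactive=10m;

upstream app_backend {
    # Passive health checks: mark a server down for 30s after 3 failures.
    server backend-a:8080 max_fails=3 fail_timeout=30s;
    server backend-b:8080 max_fails=3 fail_timeout=30s;
    server backup-backend:8080 backup;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Fail over to the next server on errors and timeouts, but bound
        # retries so a dead pool does not multiply request latency.
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_connect_timeout 2s;
        proxy_read_timeout 10s;

        # Graceful degradation: serve stale cached responses while the
        # backends are failing or being refreshed.
        proxy_cache app_cache;
        proxy_cache_use_stale error timeout updating http_502 http_503 http_504;
    }
}
EOF

nginx -t && nginx -s reload
```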