Skip to main content

CRE-2025-0118

Envoy proxy unable to connect to upstream servicesHigh
Impact: 7/10
Mitigation: 7/10

CRE-2025-0118View on GitHub

Description

This rule detects when Envoy proxy is experiencing consistent failures connecting to upstream services, resulting in HTTP 503 (Service Unavailable) or 504 (Gateway Timeout) errors. These errors are typically accompanied by \"UH\" (upstream service unhealthy) or \"UT\" (upstream request timeout) response flags in Envoy access logs, indicating backend service connectivity issues that require immediate attention.\n

Mitigation

**Immediate response:**\n- **Check Envoy Admin Stats**: Access the Envoy admin interface (e.g., `http://localhost:9901/stats`) to get detailed statistics about the health of your clusters.\n- **Inspect Envoy Logs**: Analyze the Envoy logs for patterns in the 503 and 504 errors. Look for the response flags `UH` and `UT`.\n- **Check Upstream Service Health**: Directly inspect the health of the upstream services that are failing. Check their logs and resource utilization (CPU, memory, etc.).\n- **Review recent changes**: Check for recent deployments or configuration changes that may have caused the issue\n\n**Resolution steps:**\n1. **Restart or scale upstream services**: If services are down or overloaded, restart them or increase capacity\n2. **Review Envoy configuration**: Examine `envoy.yaml` for proper cluster definitions, timeouts, circuit breaker settings, and health checks\n3. **Verify network connectivity**: Ensure DNS resolution and network paths between Envoy and upstream services are functioning\n4. **Reset circuit breakers**: If tripped, wait for automatic reset or manually clear them as appropriate\n\n**Prevention measures:**\n- **Implement comprehensive health checks**: Configure active health monitoring to detect service issues proactively\n- **Optimize circuit breaker settings**: Balance failure detection sensitivity with system stability\n- **Establish monitoring and alerting**: Set up alerts on key Envoy metrics (upstream_rq_5xx, upstream_cx_total) for early problem detection\n- **Implement auto-scaling**: Configure automatic scaling for upstream services to handle load variations\n- **Validate timeout configurations**: Ensure timeout values are appropriate for actual service response times\n

References