CRE-2025-0039
OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability
Medium | Impact: 5/10 | Mitigation: 3/10
Description
The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. Retry logic is typically enabled, so individual failures are usually retried, but repeated failures can delay delivery or cause backpressure in the pipeline.
Cause
Exporter components (e.g., `otlphttp`, `sumologic`, `splunk_hec`, `datadog`) may encounter:
- `context deadline exceeded`: the HTTP request took longer than the configured timeout.
- `502 Bad Gateway`: the remote server returned a generic proxy or load balancer error.
These issues are commonly caused by backend service instability, DNS timeouts, or networking issues in the cluster.
Mitigation
- Ensure `retry_on_failure` is enabled for all exporters.
- Tune `timeout` and `sending_queue` settings to absorb temporary backend disruptions.
- Monitor export failures and dropped items via the Collector's internal metrics (e.g., `otelcol_exporter_send_failed_spans`, `otelcol_exporter_enqueue_failed_spans`).
- Consider buffering via Kafka or persistent queues for high-reliability workloads.
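The settings named above might be combined as in the following minimal sketch. It is not a recommended configuration: the endpoint and the storage directory are hypothetical, all values are illustrative and need tuning against real traffic, and the `file_storage` extension is assumed to be available (it ships with the contrib distribution of the Collector). Option names and queue sizing semantics can differ between Collector versions, so check them against the version in use.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

extensions:
  # Assumption: contrib distribution; the directory is a hypothetical path
  # that must exist and be writable by the Collector.
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  otlphttp:
    endpoint: https://otlp.example.com:4318  # hypothetical backend endpoint
    # Per-request timeout; requests exceeding it fail with
    # "context deadline exceeded".
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait before the first retry
      max_interval: 30s       # upper bound on the backoff interval
      max_elapsed_time: 300s  # give up on a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000        # illustrative; size to the outage you want to absorb
      storage: file_storage   # persist queued data across Collector restarts

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

A larger `queue_size` and a longer `max_elapsed_time` let the Collector ride out longer backend outages, at the cost of more memory or disk usage and later delivery. Once the queue is full or the retry budget is exhausted, data is dropped, which is what the failure metrics are meant to surface.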