CRE-2025-0039

OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability
Severity: Medium
Impact: 5/10
Mitigation: 3/10

Description

The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. Because retry logic is typically enabled, individual failures are usually recoverable, but repeated failures delay delivery and can fill the sending queue, which applies backpressure to the pipeline and may eventually cause data to be dropped.


Cause

Exporter components (e.g., `otlphttp`, `sumologic`, `splunk_hec`, `datadog`) may encounter:

  • `context deadline exceeded`: the HTTP request took longer than the configured timeout.
  • `502 Bad Gateway`: the remote server returned a generic proxy or load balancer error.

These issues are commonly caused by backend service instability, DNS timeouts, or networking issues in the cluster.


Mitigation

  • Ensure `retry_on_failure` is enabled for all exporters.
  • Tune `timeout` and `sending_queue` settings to absorb temporary backend disruptions.
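  • Monitor failed sends and dropped items via the Collector's internal metrics (e.g., `exporter/send_failed_spans`, exposed on the Prometheus endpoint as `otelcol_exporter_send_failed_spans`, and `otelcol_exporter_enqueue_failed_spans` for items rejected by a full queue).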
  • Consider buffering via Kafka or persistent queues for high-reliability workloads.
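
The settings named above can be sketched as a Collector configuration. This is a minimal illustration, not a recommended production setup: the endpoint URL and storage directory are placeholders, the numeric values are assumptions to be tuned per workload, and the `file_storage` extension is assumed to be available (it ships with the contrib distribution).

```yaml
# Illustrative sketch only; endpoint, directory, and values are assumptions.
extensions:
  file_storage:
    # Placeholder path; the directory must exist and be writable by the Collector.
    directory: /var/lib/otelcol/file_storage

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    # Placeholder backend endpoint.
    endpoint: https://otlp.example.com:4318
    # Per-request timeout; exceeding it surfaces as `context deadline exceeded`.
    timeout: 30s
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # wait before the first retry
      max_interval: 30s        # upper bound on the backoff interval
      max_elapsed_time: 300s   # give up on a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10        # parallel senders draining the queue
      queue_size: 5000         # batches buffered while the backend is unavailable
      storage: file_storage    # persist the queue to disk so it survives restarts

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

A larger `queue_size` and a longer `max_elapsed_time` absorb longer outages at the cost of memory or disk usage and delivery delay. Omitting the `storage` line keeps the queue in memory only, so buffered data is lost if the Collector restarts.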