
PREQUEL-2025-0072

OTel Collector Dropped Data Due to High Memory Usage
Severity: Low
Impact: 3/10
Mitigation: 2/10


Description

The OpenTelemetry Collector’s **memory_limiter** processor (added by default in most distro Helm charts) protects the process RSS by monitoring the Go heap and refusing incoming data once the *soft limit* (default 85% of container/VM memory) is exceeded. After a queue or exporter exhausts its retry budget you’ll see log records such as:

```
no more retries left: rpc error: code = Unavailable
desc = data refused due to high memory usage
```

The dropped batches can be traces, metrics, or logs, depending on which pipeline hit the limit.
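For context, a minimal sketch of where the processor sits in a pipeline. The receiver, exporter, and endpoint here are illustrative placeholders, not taken from this report; the key point is that `memory_limiter` should be the first processor so it can refuse data before anything downstream buffers it:

```yaml
# Minimal sketch: memory_limiter placed first in the pipeline.
# Component names and the endpoint are placeholders.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 85          # soft/hard limits derived from container memory
    spike_limit_percentage: 20
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter first
      exporters: [otlp]
```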

Mitigation

1. **Raise the memory limits** on the Deployment:

   ```yaml
   resources:
     requests:
       memory: "512Mi"
     limits:
       memory: "1Gi"
   ```

2. **Tune the memory_limiter** to match available head-room:

   ```yaml
   processors:
     memory_limiter:
       check_interval: 2s
       limit_mib: 800        # hard stop
       spike_limit_mib: 150
   ```

3. **Add a ballast** (Kubernetes example):

   ```yaml
   env:
     - name: MEMORY_BALLAST_SIZE_MIB
       value: "256"
   ```

   and reference it in the Collector args (`--mem-ballast-size-mib=${MEMORY_BALLAST_SIZE_MIB}`).

4. **Scale out or shard pipelines** – run dedicated *traces-collector* vs *metrics-collector* instances so one noisy pipeline can’t starve the rest (see the sketch after this list).

5. **Short-circuit large batches** – use the `batch` processor with a smaller `send_batch_max_size` (e.g., 2048 spans) to keep payloads predictable (also shown below).
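A rough sketch combining points 4 and 5: the configuration a dedicated traces-only Collector instance might run, with the batch size capped. Component names and the endpoint are placeholders, and a separate metrics-collector would carry the metrics pipeline:

```yaml
# Sketch of a dedicated traces-only Collector (point 4) with a capped
# batch size (point 5). Names/endpoint are illustrative only.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 2s
    limit_mib: 800
    spike_limit_mib: 150
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048    # keep payloads predictable

exporters:
  otlp:
    endpoint: traces-backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```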