
Category: Observability Problems

Problems related to observability, such as monitoring, logging, and tracing.

Each entry below lists an ID, severity, impact, mitigation, title, description, category, technology, and tags.

CRE-2024-0016
Severity: Low
Impact: 4/10
Mitigation: 2/10
Title: Google Kubernetes Engine metrics agent failing to export metrics
Description: The Google Kubernetes Engine metrics agent is failing to export metrics.
Category: Observability Problems
Technology: gke-metrics-agent
Tags: Known Problem, GKE, Public

CRE-2025-0028
Severity: Low
Impact: 6/10
Mitigation: 1/10
Title: OpenTelemetry Python fails to detach context token across async boundaries
Description: In OpenTelemetry Python, detaching a context token that was created in a different context can raise a `ValueError`. This occurs when asynchronous operations, such as generators or coroutines, are finalized in a different context than the one in which they were created, leading to context management errors and potential trace data loss.
Category: Observability Problems
Technology: opentelemetry-python
Tags: Opentelemetry, Python, Contextvars, Async, Observability, Public

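The failure mode above can be reproduced with plain `contextvars`, the standard-library mechanism that OpenTelemetry Python's default runtime context is built on. The sketch below is illustrative only and is not OpenTelemetry code; the variable and function names are invented for the example. It creates a token in one task's context and resets it while finalizing a generator in another task, which raises the same `ValueError`:

```python
import asyncio
import contextvars

# Stand-in for the ContextVar that OpenTelemetry's default runtime context
# manages internally: attach() records a Token, detach() calls reset() on it.
active_ctx = contextvars.ContextVar("active_ctx", default=None)

async def traced_generator():
    # The token is created in the context of the task that starts the generator.
    token = active_ctx.set("span-A")
    try:
        yield 1
    finally:
        # If the generator is finalized from a different context, reset() raises:
        #   ValueError: <Token ...> was created in a different Context
        active_ctx.reset(token)

async def finalize(agen):
    # Runs in a separate task whose context is a *copy* of the parent's,
    # i.e. a different Context object than the one that created the token.
    await agen.aclose()

async def main():
    gen = traced_generator()
    await gen.__anext__()  # token created in main()'s task context
    try:
        await asyncio.create_task(finalize(gen))
    except ValueError as err:
        print(f"context detach failed: {err}")

asyncio.run(main())
```
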
CRE-2025-0032
Severity: Low
Impact: 2/10
Mitigation: 4/10
Title: Loki generates excessive logs when memcached service port name is incorrect
Description: Loki instances using memcached for caching may emit excessive warning or error logs when the configured `memcached_client` service port name does not match the actual Kubernetes service port name. This does not cause a crash or failure, but it results in noisy logs and ineffective caching behavior.
Category: Observability Problems
Technology: loki
Tags: Loki, Memcached, Configuration, Service, Cache, Known Issue, Kubernetes, Public

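When `memcached_client` discovery uses DNS SRV records, its `service` field must match the named port on the Kubernetes Service. The sketch below pairs the two; the hostnames, namespace, and the exact config path (which varies across Loki versions) are assumptions for illustration:

```yaml
# Loki configuration fragment (config path varies by Loki version; illustrative only)
chunk_store_config:
  chunk_cache_config:
    memcached_client:
      host: memcached.loki.svc.cluster.local
      service: memcached-client   # must match the port *name* on the Service below
---
# Kubernetes Service backing the cache
apiVersion: v1
kind: Service
metadata:
  name: memcached
  namespace: loki
spec:
  clusterIP: None                 # headless, so each memcached pod is discoverable via SRV
  selector:
    app: memcached
  ports:
    - name: memcached-client      # the name Loki's `service:` field refers to
      port: 11211
      targetPort: 11211
```
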
CRE-2025-0033
Severity: Low
Impact: 7/10
Mitigation: 4/10
Title: OpenTelemetry Collector refuses to scrape due to memory pressure
Description: The OpenTelemetry Collector may refuse to ingest metrics during a Prometheus scrape if it exceeds its configured memory limits. When the `memory_limiter` processor is enabled, the Collector actively drops data to prevent out-of-memory errors, resulting in log messages indicating that data was refused due to high memory usage.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Otel Collector, Prometheus, Memory, Metrics, Backpressure, Data Loss, Known Issue, Public

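A minimal sketch of the `memory_limiter` setup described above, assuming a Prometheus receiver and an OTLP/HTTP exporter purely for illustration; the limit values are placeholders to be sized against the Collector's actual memory budget:

```yaml
processors:
  memory_limiter:
    check_interval: 1s      # how often memory usage is sampled
    limit_mib: 512          # hard limit above which incoming data is refused
    spike_limit_mib: 128    # headroom below the hard limit for sudden spikes
  batch: {}

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]   # memory_limiter should be first in the chain
      exporters: [otlphttp]
```
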
CRE-2025-0034
Severity: Medium
Impact: 6/10
Mitigation: 2/10
Title: Datadog agent disabled due to missing API key
Description: If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages.
Category: Observability Problems
Technology: datadog
Tags: Datadog, Configuration, Api Key, Observability, Environment, Telemetry, Known Issue, Public

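Two common ways to supply the key are sketched below: an `api_key` entry in the Agent's `datadog.yaml`, or the `DD_API_KEY` environment variable for containerized Agents. The Secret name and key are placeholders:

```yaml
# datadog.yaml
api_key: <your-datadog-api-key>
site: datadoghq.com
---
# Kubernetes container spec fragment: inject the key from a Secret instead
env:
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret   # placeholder Secret name
        key: api-key           # placeholder key within the Secret
```
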
CRE-2025-0036
Severity: Low
Impact: 6/10
Mitigation: 3/10
Title: OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target
Description: The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the Collector drops these payloads unless retry behavior is explicitly enabled.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Otel Collector, Exporter, Payload, Batch, Drop, Observability, Telemetry, Known Issue, Public

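The usual mitigation is to cap batch sizes below the backend's payload limit with the `batch` processor and, as the entry notes, explicitly enable retry behavior on the exporter. A sketch assuming an OTLP/HTTP exporter with a placeholder endpoint and sizes:

```yaml
processors:
  batch:
    send_batch_size: 2048
    send_batch_max_size: 2048   # hard cap so a single export stays under the backend's payload limit

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder
    retry_on_failure:
      enabled: true
```
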
CRE-2025-0037
Severity: Low
Impact: 8/10
Mitigation: 4/10
Title: OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator
Description: The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings but the internal value is nil or of an incompatible type, leading to a runtime `SIGSEGV` segmentation fault that crashes the Collector.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Crash, Prometheus, Otel Collector, Exporter, Panic, Translation, Attribute, Nil Pointer, Known Issue, Public

CRE-2025-0038
Severity: Low
Impact: 5/10
Mitigation: 3/10
Title: Loki fails to cache entries due to Memcached out-of-memory error
Description: Grafana Loki may emit errors when attempting to write to a Memcached backend that has run out of available memory. This results in dropped index or query cache entries, which can degrade query performance but does not interrupt ingestion.
Category: Observability Problems
Technology: loki
Tags: Loki, Memcached, Cache, Memory, Infrastructure, Known Issue, Public

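Memcached's cache size is set with its `-m` flag (in megabytes); when it runs as a Kubernetes container, the container memory limit should leave headroom above that value. A sketch with placeholder sizes:

```yaml
# Kubernetes container spec fragment (values are placeholders)
containers:
  - name: memcached
    image: memcached:1.6
    args: ["-m", "1024", "-I", "2m"]   # 1 GiB cache memory, 2 MiB max item size
    resources:
      limits:
        memory: 1536Mi                 # headroom above the -m cache limit
```
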
CRE-2025-0039
Severity: Medium
Impact: 5/10
Mitigation: 3/10
Title: OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability
Description: The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. While retry logic is typically enabled, repeated failures can introduce delay or backpressure.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Otel Collector, Exporter, Timeout, Retry, Network, Telemetry, Known Issue, Public

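Retry and queueing behavior are configured per exporter through `retry_on_failure` and `sending_queue`. A sketch assuming an OTLP/HTTP exporter; the endpoint, intervals, and queue size are placeholders:

```yaml
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder
    timeout: 30s                # per-request timeout before "context deadline exceeded"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s    # give up after this much total retrying
    sending_queue:
      enabled: true
      queue_size: 5000          # buffers batches while the backend is unavailable
```
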
CRE-2025-0043
Severity: Medium
Impact: 4/10
Mitigation: 2/10
Title: Grafana fails to load plugin due to missing signature
Description: Grafana may reject custom or third-party plugins at runtime if they are not digitally signed. When plugin signature validation is enabled (the default since Grafana 8), unsigned plugins are blocked and logged as validation errors during startup or plugin loading.
Category: Observability Problems
Technology: grafana
Tags: Grafana, Plugin, Validation, Signature, Configuration, Security, Known Issue, Public

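For plugins that are intentionally unsigned (for example, an in-house panel), Grafana can be told to allow specific plugin IDs; the plugin ID below is a placeholder:

```ini
# grafana.ini
[plugins]
# Comma-separated plugin IDs allowed to load without a valid signature.
allow_loading_unsigned_plugins = my-org-custom-panel
```

The same setting can be supplied as the environment variable GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS in containerized deployments.
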
CRE-2025-0090
Severity: Low
Impact: 0/10
Mitigation: 0/10
Title: Loki Log Line Exceeds Max Size Limit
Description: Alloy detects that Loki is dropping log lines because they exceed the configured maximum line size. This typically indicates that applications are emitting extremely long log entries, which Loki is configured to reject by default.
Category: Observability Problems
Technology: alloy log
Tags: Alloy, Loki, Logs, Observability, Grafana

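The relevant Loki settings live under `limits_config`; the values below are placeholders illustrating how to raise the limit or truncate oversized lines instead of dropping them:

```yaml
limits_config:
  max_line_size: 256KB          # lines larger than this are rejected
  max_line_size_truncate: true  # truncate oversized lines instead of dropping them
```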