
Category: Observability Problems

Problems related to observability, such as monitoring, logging, and tracing.

Each entry below lists an ID, severity, impact, mitigation, title, description, category, technology, and tags.

CRE-2024-0016
Severity: Low
Impact: 4/10
Mitigation: 2/10
Title: Google Kubernetes Engine metrics agent failing to export metrics
Description: The Google Kubernetes Engine metrics agent is failing to export metrics.
Category: Observability Problems
Technology: gke-metrics-agent
Tags: Known Problem, GKE, Public

CRE-2025-0028
Severity: Low
Impact: 6/10
Mitigation: 1/10
Title: OpenTelemetry Python fails to detach context token across async boundaries
Description: In OpenTelemetry Python, detaching a context token that was created in a different context can raise a `ValueError`. This occurs when asynchronous operations, such as generators or coroutines, are finalized in a different context than the one in which they were created, leading to context management errors and potential trace data loss.
Category: Observability Problems
Technology: opentelemetry-python
Tags: Opentelemetry, Python, Contextvars, Async, Observability, Public

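The failure mode above can be reproduced with plain `contextvars`, the standard-library mechanism that OpenTelemetry Python's default runtime context is built on. The sketch below is illustrative only and is not OpenTelemetry code; the variable and function names are invented for the example. It creates a token in one task's context and resets it while finalizing a generator in another task, which raises the same `ValueError`:

```python
import asyncio
import contextvars

# Stand-in for the ContextVar that OpenTelemetry's default runtime context
# manages internally: attach() records a Token, detach() calls reset() on it.
active_ctx = contextvars.ContextVar("active_ctx", default=None)

async def traced_generator():
    # The token is created in the context of the task that starts the generator.
    token = active_ctx.set("span-A")
    try:
        yield 1
    finally:
        # If the generator is finalized from a different context, reset() raises:
        #   ValueError: <Token ...> was created in a different Context
        active_ctx.reset(token)

async def finalize(agen):
    # Runs in a separate task whose context is a *copy* of the parent's,
    # i.e. a different Context object than the one that created the token.
    await agen.aclose()

async def main():
    gen = traced_generator()
    await gen.__anext__()  # token created in main()'s task context
    try:
        await asyncio.create_task(finalize(gen))
    except ValueError as err:
        print(f"context detach failed: {err}")

asyncio.run(main())
```
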
CRE-2025-0032
Severity: Low
Impact: 2/10
Mitigation: 4/10
Title: Loki generates excessive logs when memcached service port name is incorrect
Description: Loki instances using memcached for caching may emit excessive warning or error logs when the configured `memcached_client` service port name does not match the actual Kubernetes service port name. This does not cause a crash or failure, but it results in noisy logs and ineffective caching behavior.
Category: Observability Problems
Technology: loki
Tags: Loki, Memcached, Configuration, Service, Cache, Known Issue, Kubernetes, Public

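When `memcached_client` discovery uses DNS SRV records, its `service` field must match the named port on the Kubernetes Service. The sketch below pairs the two; the hostnames, namespace, and the exact config path (which varies across Loki versions) are assumptions for illustration:

```yaml
# Loki configuration fragment (config path varies by Loki version; illustrative only)
chunk_store_config:
  chunk_cache_config:
    memcached_client:
      host: memcached.loki.svc.cluster.local
      service: memcached-client   # must match the port *name* on the Service below
---
# Kubernetes Service backing the cache
apiVersion: v1
kind: Service
metadata:
  name: memcached
  namespace: loki
spec:
  clusterIP: None                 # headless, so each memcached pod is discoverable via SRV
  selector:
    app: memcached
  ports:
    - name: memcached-client      # the name Loki's `service:` field refers to
      port: 11211
      targetPort: 11211
```
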
CRE-2025-0033
Severity: Low
Impact: 7/10
Mitigation: 4/10
Title: OpenTelemetry Collector refuses to scrape due to memory pressure
Description: The OpenTelemetry Collector may refuse to ingest metrics during a Prometheus scrape if it exceeds its configured memory limits. When the `memory_limiter` processor is enabled, the Collector actively drops data to prevent out-of-memory errors, resulting in log messages indicating that data was refused due to high memory usage.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Otel Collector, Prometheus, Memory, Metrics, Backpressure, Data Loss, Known Issue, Public

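A minimal sketch of the `memory_limiter` setup described above, assuming a Prometheus receiver and an OTLP/HTTP exporter purely for illustration; the limit values are placeholders to be sized against the Collector's actual memory budget:

```yaml
processors:
  memory_limiter:
    check_interval: 1s      # how often memory usage is sampled
    limit_mib: 512          # hard limit above which incoming data is refused
    spike_limit_mib: 128    # headroom below the hard limit for sudden spikes
  batch: {}

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]   # memory_limiter should be first in the chain
      exporters: [otlphttp]
```
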
CRE-2025-0034
Severity: Medium
Impact: 6/10
Mitigation: 2/10
Title: Datadog agent disabled due to missing API key
Description: If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages.
Category: Observability Problems
Technology: datadog
Tags: Datadog, Configuration, Api Key, Observability, Environment, Telemetry, Known Issue, Public

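Two common ways to supply the key are sketched below: an `api_key` entry in the Agent's `datadog.yaml`, or the `DD_API_KEY` environment variable for containerized Agents. The Secret name and key are placeholders:

```yaml
# datadog.yaml
api_key: <your-datadog-api-key>
site: datadoghq.com
---
# Kubernetes container spec fragment: inject the key from a Secret instead
env:
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret   # placeholder Secret name
        key: api-key           # placeholder key within the Secret
```
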
CRE-2025-0036
Severity: Low
Impact: 6/10
Mitigation: 3/10
Title: OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target
Description: The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the Collector drops these payloads unless retry behavior is explicitly enabled.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Otel Collector, Exporter, Payload, Batch, Drop, Observability, Telemetry, Known Issue, Public

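The usual mitigation is to cap batch sizes below the backend's payload limit with the `batch` processor and, as the entry notes, explicitly enable retry behavior on the exporter. A sketch assuming an OTLP/HTTP exporter with a placeholder endpoint and sizes:

```yaml
processors:
  batch:
    send_batch_size: 2048
    send_batch_max_size: 2048   # hard cap so a single export stays under the backend's payload limit

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder
    retry_on_failure:
      enabled: true
```
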
CRE-2025-0037
Severity: Low
Impact: 8/10
Mitigation: 4/10
Title: OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator
Description: The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings but the internal value is nil or of an incompatible type, leading to a runtime `SIGSEGV` segmentation fault that crashes the Collector.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Crash, Prometheus, Otel Collector, Exporter, Panic, Translation, Attribute, Nil Pointer, Known Issue, Public

CRE-2025-0038
Severity: Low
Impact: 5/10
Mitigation: 3/10
Title: Loki fails to cache entries due to Memcached out-of-memory error
Description: Grafana Loki may emit errors when attempting to write to a Memcached backend that has run out of available memory. This results in dropped index or query cache entries, which can degrade query performance but does not interrupt ingestion.
Category: Observability Problems
Technology: loki
Tags: Loki, Memcached, Cache, Memory, Infrastructure, Known Issue, Public

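Memcached's cache size is set with its `-m` flag (in megabytes); when it runs as a Kubernetes container, the container memory limit should leave headroom above that value. A sketch with placeholder sizes:

```yaml
# Kubernetes container spec fragment (values are placeholders)
containers:
  - name: memcached
    image: memcached:1.6
    args: ["-m", "1024", "-I", "2m"]   # 1 GiB cache memory, 2 MiB max item size
    resources:
      limits:
        memory: 1536Mi                 # headroom above the -m cache limit
```
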
CRE-2025-0039
Severity: Medium
Impact: 5/10
Mitigation: 3/10
Title: OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability
Description: The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. While retry logic is typically enabled, repeated failures can introduce delay or backpressure.
Category: Observability Problems
Technology: opentelemetry-collector
Tags: Otel Collector, Exporter, Timeout, Retry, Network, Telemetry, Known Issue, Public

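Retry and queueing behavior are configured per exporter through `retry_on_failure` and `sending_queue`. A sketch assuming an OTLP/HTTP exporter; the endpoint, intervals, and queue size are placeholders:

```yaml
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder
    timeout: 30s                # per-request timeout before "context deadline exceeded"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s    # give up after this much total retrying
    sending_queue:
      enabled: true
      queue_size: 5000          # buffers batches while the backend is unavailable
```
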
CRE-2025-0043
Severity: Medium
Impact: 4/10
Mitigation: 2/10
Title: Grafana fails to load plugin due to missing signature
Description: Grafana may reject custom or third-party plugins at runtime if they are not digitally signed. When plugin signature validation is enabled (the default since Grafana 8), unsigned plugins are blocked and logged as validation errors during startup or plugin loading.
Category: Observability Problems
Technology: grafana
Tags: Grafana, Plugin, Validation, Signature, Configuration, Security, Known Issue, Public

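For plugins that are intentionally unsigned (for example, an in-house panel), Grafana can be told to allow specific plugin IDs; the plugin ID below is a placeholder:

```ini
# grafana.ini
[plugins]
# Comma-separated plugin IDs allowed to load without a valid signature.
allow_loading_unsigned_plugins = my-org-custom-panel
```

The same setting can be supplied as the environment variable GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS in containerized deployments.
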
CRE-2025-0090
Severity: Low
Impact: 0/10
Mitigation: 0/10
Title: Loki Log Line Exceeds Max Size Limit
Description: Alloy detects that Loki is dropping log lines because they exceed the configured maximum line size. This typically indicates that applications are emitting extremely long log entries, which Loki is configured to reject by default.
Category: Observability Problems
Technology: alloy log
Tags: Alloy, Loki, Logs, Observability, Grafana

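The relevant Loki settings live under `limits_config`; the values below are placeholders illustrating how to raise the limit or truncate oversized lines instead of dropping them:

```yaml
limits_config:
  max_line_size: 256KB          # lines larger than this are rejected
  max_line_size_truncate: true  # truncate oversized lines instead of dropping them
```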