Tag: Public

Open source CREs contributed by the problem detection community

ID	Title	Description	Category	Technology	Tags
CRE-2024-0007 Critical Impact: 9/10 Mitigation: 8/10	RabbitMQ Mnesia overloaded	Mnesia, the Erlang database that RabbitMQ uses for its internal metadata, is overloaded (`WARNING Mnesia is overloaded`).	Message Queue Problems	rabbitmq	Known Problem RabbitMQ Public
CRE-2024-0008 High Impact: 9/10 Mitigation: 6/10	RabbitMQ memory alarm	A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables, and processes) has exceeded the configured `vm_memory_high_watermark`. While the alarm is active the broker applies flow-control, blocking publishers and pausing most ingress activity to protect itself from running out of RAM.	Message Queue Problems	rabbitmq	Known Problem RabbitMQ Public
CRE-2024-0014 High Impact: 8/10 Mitigation: 5/10	RabbitMQ busy distribution port performance issue	The Erlang VM has reported a `busy_dist_port` condition, meaning the send buffer of a distribution port (used for inter-node traffic inside a RabbitMQ cluster) is full. When this happens the scheduler suspends the process owning the port, stalling inter-node replication, management calls, and any RabbitMQ process that must use that port. Throughput drops and latency rises until the buffer drains or the node is restarted.	Message Queue Performance	rabbitmq	Known Problem RabbitMQ Public
CRE-2024-0016 Low Impact: 4/10 Mitigation: 2/10	Google Kubernetes Engine metrics agent failing to export metrics	The Google Kubernetes Engine metrics agent is failing to export metrics.	Observability Problems	gke-metrics-agent	Known Problem GKE Public
CRE-2024-0018 Medium Impact: 4/10 Mitigation: 5/10	Neutron Open Virtual Network (OVN) high CPU usage	OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100%. Logs show “Dropped … due to excessive rate” or “Unreasonably long … poll interval,” slowing port-binding and network traffic.	Networking Problems	neutron	Known Problem Ovn Public
CRE-2024-0019 Low Impact: 3/10 Mitigation: 2/10	Alloy entries too far behind	Grafana Alloy can get into a state where it writes more error messages than it can process. The problem is compounded when Alloy collects its own error logs, which include the warnings that it can no longer keep up. This can consume several GB of storage per day.	Storage	alloy	Grafana Alloy Loki Public
CRE-2024-0020 Medium Impact: 5/10 Mitigation: 2/10	Grafana Alloy Loki fanout crash	The Grafana Alloy Loki fanout crashes when the number of log files exceeds the number of ingesters.	Storage	alloy	Grafana Alloy Loki Public
CRE-2024-0021 High Impact: 4/10 Mitigation: 5/10	KEDA operator reconciler ScaledObject panic	KEDA allows for fine-grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.	Operator Problems	Unspecified	KEDA Crash Known Problem Public
CRE-2024-0043 Medium Impact: 6/10 Mitigation: 5/10	NGINX Upstream DNS Failure	When an NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail.	Proxy Problems	nginx	Kafka Known Problem Public
CRE-2025-0025 Medium Impact: 6/10 Mitigation: 5/10	Kafka broker replication mismatch	When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics.	Message Queue Problems	topic-operator	Kafka Known Problem Public
CRE-2025-0026 Low Impact: 6/10 Mitigation: 1/10	AWS EBS CSI Driver fails to detach volume when VolumeAttachment has empty nodeName	In clusters using the AWS EBS CSI driver, the controller may fail to detach a volume if the associated VolumeAttachment resource has an empty `spec.nodeName`. This results in a log error and skipped detachment, which may block PVC reuse or node cleanup.	Storage	eks-nodeagent	ebs csi AWS Storage Public
CRE-2025-0027 Low Impact: 7/10 Mitigation: 2/10	Neutron Open Virtual Network (OVN) and Virtual Interface (VIF) allows port binding to dead agents, causing VIF plug timeouts	In OpenStack deployments using Neutron with the OVN ML2 driver, ports could be bound to agents that were not alive. This behavior led to virtual machines experiencing network interface plug timeouts during provisioning, as the port binding would not complete successfully.	Networking Problems	neutron	Neutron Ovn Timeout Networking Openstack Known Issue Public
CRE-2025-0028 Low Impact: 6/10 Mitigation: 1/10	OpenTelemetry Python fails to detach context token across async boundaries	In OpenTelemetry Python, detaching a context token that was created in a different context can raise a `ValueError`. This occurs when asynchronous operations, such as generators or coroutines, are finalized in a different context than they were created, leading to context management errors and potential trace data loss.	Observability Problems	opentelemetry-python	Opentelemetry Python Contextvars Async Observability Public
CRE-2025-0029 Low Impact: 6/10 Mitigation: 5/10	Loki fails to retrieve AWS credentials when specifying S3 endpoint with IRSA	When deploying Grafana Loki with AWS S3 as the storage backend and specifying a custom S3 endpoint (e.g., for FIPS compliance or GovCloud regions), Loki may fail to retrieve AWS credentials via IAM Roles for Service Accounts (IRSA). This results in errors during startup or when attempting to upload index tables, preventing Loki from functioning correctly.	Storage	loki	Loki S3 AWS Irsa Storage Authentication Helm Public
CRE-2025-0030 Medium Impact: 6/10 Mitigation: 2/10	SQLAlchemy create_engine fails when password contains special characters like @	SQLAlchemy applications using `create_engine()` may fail to connect to a database if the username or password contains special characters (e.g., `@`, `:`, `/`, `#`). These characters must be URL-encoded when included in the database connection string. Failure to encode them leads to parsing errors or incorrect credential usage.	Orm	sqlalchemy	Sqlalchemy Configuration Password Uri Escaping Connection Known Issue Public
CRE-2025-0031 Medium Impact: 5/10 Mitigation: 5/10	Django returns DisallowedHost error for untrusted HTTP_HOST headers	Django applications may return a "DisallowedHost" error when receiving requests with an unrecognized or missing Host header. This typically occurs in production environments where reverse proxies, load balancers, or external clients send requests using an unexpected domain or IP address. Django blocks these requests unless the domain is explicitly listed in `ALLOWED_HOSTS`.	Framework Problems	django	Django Disallowedhost Configuration Web Security Host Header Public
CRE-2025-0032 Low Impact: 2/10 Mitigation: 4/10	Loki generates excessive logs when memcached service port name is incorrect	Loki instances using memcached for caching may emit excessive warning or error logs when the configured `memcached_client` service port name does not match the actual Kubernetes service port. This does not cause a crash or failure, but it results in noisy logs and ineffective caching behavior.	Observability Problems	loki	Loki Memcached Configuration Service Cache Known Issue Kubernetes Public
CRE-2025-0033 Low Impact: 7/10 Mitigation: 4/10	OpenTelemetry Collector refuses to scrape due to memory pressure	The OpenTelemetry Collector may refuse to ingest metrics during a Prometheus scrape if it exceeds its configured memory limits. When the `memory_limiter` processor is enabled, the Collector actively drops data to prevent out-of-memory errors, resulting in log messages indicating that data was refused due to high memory usage.	Observability Problems	opentelemetry-collector	Otel Collector Prometheus Memory Metrics Backpressure Data Loss Known Issue Public
CRE-2025-0034 Medium Impact: 6/10 Mitigation: 2/10	Datadog agent disabled due to missing API key	If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages.	Observability Problems	datadog	Datadog Configuration Api Key Observability Environment Telemetry Known Issue Public
CRE-2025-0035 Critical Impact: 7/10 Mitigation: 6/10	psycopg2 SSL error due to thread or forked process state	Applications using psycopg2 with OpenTelemetry instrumentation or threading may fail with SSL-related errors such as "decryption failed or bad record mac". This often occurs when a database connection is created before a fork or from an unsafe thread context, causing the SSL state to become invalid.	Database Problems	django	Ssl Psycopg2 Fork Threads Django Instrumentation Opentelemetry Known Issue Public
CRE-2025-0036 Low Impact: 6/10 Mitigation: 3/10	OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target	The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the collector drops these payloads unless retry behavior is explicitly enabled.	Observability Problems	opentelemetry-collector	Otel Collector Exporter Payload Batch Drop Observability Telemetry Known Issue Public
CRE-2025-0037 Low Impact: 8/10 Mitigation: 4/10	OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator	The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings, but the internal representation is nil or incompatible, leading to a runtime `SIGSEGV` segmentation fault and crashing the collector.	Observability Problems	opentelemetry-collector	Crash Prometheus Otel Collector Exporter Panic Translation Attribute Nil Pointer Known Issue Public
CRE-2025-0038 Low Impact: 5/10 Mitigation: 3/10	Loki fails to cache entries due to Memcached out-of-memory error	Grafana Loki may emit errors when attempting to write to a Memcached backend that has run out of available memory. This results in dropped index or query cache entries, which can degrade query performance but does not interrupt ingestion.	Observability Problems	loki	Loki Memcached Cache Memory Infrastructure Known Issue Public
CRE-2025-0039 Medium Impact: 5/10 Mitigation: 3/10	OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability	The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. While retry logic is typically enabled, repeated failures can introduce delay or backpressure.	Observability Problems	opentelemetry-collector	Otel Collector Exporter Timeout Retry Network Telemetry Known Issue Public
CRE-2025-0040 Low Impact: 6/10 Mitigation: 4/10	Neutron Open Virtual Network (OVN) fails to bind logical switch due to race condition during load balancer creation	During load balancer creation or other operations involving logical router and logical switch associations, Neutron OVN may raise a `RowNotFound` exception when attempting to reference a logical switch that has just been deleted. This leads to a port binding failure and a rollback of the affected operation.	Networking Problems	neutron	Neutron Ovn Openstack Load Balancer Logical Switch Ovsdb Known Issue Public
CRE-2025-0041 Low Impact: 5/10 Mitigation: 4/10	redis-py client fails with AttributeError when reused across async or process contexts	In redis-py v5.x, sharing a single Redis client across async tasks or subprocesses can result in `AttributeError: 'NoneType' object has no attribute 'getpid'`. This typically occurs when the client or connection pool is reused across forks or when event loop context is lost, especially in async frameworks or multiprocessing setups.	Cache Problems	redis-py	Redis Redis Py Python Async Multiprocessing Context Attributeerror Known Issue Public
CRE-2025-0042 Critical Impact: 7/10 Mitigation: 5/10	PostgreSQL transaction fails with deadlock detected error in psycopg2 and Django	Applications using Django with PostgreSQL and psycopg2 may encounter `deadlock detected` errors under concurrent write-heavy workloads. PostgreSQL raises this error when two or more transactions block each other cyclically while waiting for locks, and one must be aborted. Django surfaces this as an `OperationalError`, and the affected transaction is rolled back.	Database Problems	django	PostgreSQL Psycopg2 Django Transaction Deadlock Operational error Public Known Issue
CRE-2025-0043 Medium Impact: 4/10 Mitigation: 2/10	Grafana fails to load plugin due to missing signature	Grafana may reject custom or third-party plugins at runtime if they are not digitally signed. When plugin signature validation is enabled (default since Grafana 8+), unsigned plugins are blocked and logged as validation errors during startup or plugin loading.	Observability Problems	grafana	Grafana Plugin Validation Signature Configuration Security Known Issue Public
CRE-2025-0044 High Impact: 9/10 Mitigation: 1/10	NGINX Config Uses Insecure TLS Ciphers	Detects NGINX configuration files that advertise obsolete and cryptographically weak ciphers (RC4-MD5, RC4-SHA, DES-CBC3-SHA). These ciphers are vulnerable to several well-known attacks—including BEAST, BAR-Mitzvah, Lucky-13, and statistical biases in RC4—placing any client–server communication at risk of interception or tampering.	Insecure Configuration	nginx	Nginx Weak Ciphers Security Configuration TLS Known Issue Public
CRE-2025-0045 Medium Impact: 4/10 Mitigation: 4/10	NATS Authorization Failure Detected	The NATS server has emitted an Authorization Violation log entry, meaning a client attempted to connect, publish, subscribe, or perform another operation for which it lacks permission. Intermittent violations often point to misconfiguration or start-up chaos. However, sustained or widespread violations can signal credential expiry or missing secrets.	Authorization Problems	nats	NATS Security Authorization Public
CRE-2025-0046 Medium Impact: 4/10 Mitigation: 4/10	NATS Permissions Violation Detected	The NATS server has emitted a Permissions Violation log entry, meaning a client attempted to publish or subscribe to a subject for which it lacks permission.	Authorization Problems	nats	NATS Security Authorization Public
CRE-2025-0048 Low Impact: 5/10 Mitigation: 3/10	Kubelet node not ready due to a DNS hostname resolution failure	A Kubernetes worker node has entered the NotReady state.	Kubernetes Problems	kubelet	Kubelet Kubernetes DNS Public
CRE-2025-0049 Low Impact: 2/10 Mitigation: 8/10	NATS Payload Size Too Big	The NATS server is configured to publish messages with payloads that may exceed the recommended maximum of 8 MB (the server’s default hard limit is 1 MB but it can be raised to 64 MB). Large messages put disproportionate pressure on broker memory, network buffers, and client back-pressure mechanisms. This warning signals NATS is at risk of degraded throughput, slow consumers, and forced connection closures intended to protect cluster stability.	Message Queue Problems	nats	NATS Public
CRE-2025-0056 Medium Impact: 8/10 Mitigation: 3/10	NGINX worker connections limit exceeded	NGINX has reported that the configured worker_connections limit has been reached. This indicates that the web server has exhausted the available connection slots for handling concurrent client requests. When this limit is reached, new connection attempts may be rejected until existing connections are closed, causing service degradation or outages.	Web Server Problems	nginx	Nginx Capacity Issue Web Server Configuration Public
CRE-2025-0073 High Impact: 9/10 Mitigation: 6/10	Redis Rejects Writes Due to Reaching 'maxmemory' Limit	The Redis instance has reached its configured 'maxmemory' limit. Because its active memory management policy does not permit the eviction of existing keys to free up space (as is the case when the 'noeviction' policy is in effect, which is often the default), Redis rejects new write commands by sending an "OOM command not allowed" error to the client.	Database Problems	redis-cli	Redis Redis CLI Memory Pressure Memory Data Loss Public
CRE-2025-0077 High Impact: 9/10 Mitigation: 7/10	PostgreSQL Fails to Extend File Due to Disk Full	PostgreSQL logs an error when it cannot extend a data file (table/index) because the filesystem is out of disk space. This prevents writes requiring new allocation.	Database Problems	postgresql	PostgreSQL Disk Full Write Failure Public
CRE-2025-0112 Critical Impact: 10/10 Mitigation: 4/10	AWS VPC CNI Node IP Pool Depletion Crisis	Critical AWS VPC CNI node IP pool depletion detected causing cascading pod scheduling failures. This pattern indicates severe subnet IP address exhaustion combined with ENI allocation failures, leading to complete cluster networking breakdown. The failure sequence shows ipamd errors, kubelet scheduling failures, and controller-level pod creation blocks that render clusters unable to deploy new workloads, scale existing services, or recover from node failures. This represents one of the most severe Kubernetes infrastructure failures, often requiring immediate manual intervention including subnet expansion, secondary CIDR provisioning, or emergency workload termination to restore cluster functionality.	VPC CNI Problems	aws-vpc-cni	AWS EKS Kubernetes Networking VPC CNI AWS CNI IP Exhaustion ENI Allocation Subnet Exhaustion Pod Scheduling Failure Cluster Paralysis AWS API Limits Known Problem Critical Infrastructure Service Outage Cascading Failure Capacity Exceeded Scalability Issue Revenue Impact Compliance Violation Threshold Exceeded Infrastructure Public
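
Example for CRE-2025-0030 (SQLAlchemy credentials with special characters): the entry above says reserved characters in a username or password must be URL-encoded before they go into the connection string. The following is a minimal sketch of that step using only the Python standard library. The helper name `build_db_url`, the host, the database name, and the sample credentials are all hypothetical, and the `postgresql+psycopg2` driver prefix is just one plausible choice.

```python
from urllib.parse import quote_plus, unquote_plus, urlsplit


def build_db_url(user, password, host, port, database,
                 driver="postgresql+psycopg2"):
    """Return a connection URL with the credentials percent-encoded.

    Hypothetical helper for illustration only: quote_plus() escapes
    characters such as '@', ':', '/' and '#', which would otherwise be
    read as URL delimiters.
    """
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Sample values (assumptions, not taken from the CRE rule itself).
raw_password = "p@ss:w/rd#1"
url = build_db_url("app_user", raw_password, "db.example.internal", 5432, "appdb")
print(url)

# Round trip: the host parses cleanly and the password decodes to the original.
parts = urlsplit(url)
decoded_password = unquote_plus(parts.password)
print(parts.hostname, parts.port, decoded_password == raw_password)
```

The resulting string can then be passed to `create_engine()`. SQLAlchemy 1.4 and later also provide `sqlalchemy.engine.URL.create()`, which accepts the credentials as separate arguments and may be preferable when that version is available.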