Public CREs
Welcome to the Public CRE feed, where you can explore and discover open-source CREs by tag, category, or other details. Use the tabs below to navigate between the different views.
- Categories
- Tags
- Technologies
- CREs
Observability Problems
11 CREs
Problems related to observability, like monitoring, logging, and tracing
Message Queue Problems
6 CREs
Problems related to message queues, like Kafka, RabbitMQ, NATS, and others
Database Problems
4 CREs
Problems related to databases, like MySQL, PostgreSQL, MongoDB, and others
Storage
4 CREs
Disk, object storage, or volume-related issues that impact data availability.
Networking Problems
3 CREs
Connectivity, DNS, or routing issues affecting system communication.
Authorization Problems
3 CREs
Problems related to authorization
Authorization Systems
3 CREs
Failures in systems that manage access control, identity, or permissions. This includes tools like SpiceDB, OPA, or Auth0 where schema, policy, or integration issues can block authentication or authorization flows.
Kubernetes Problems
2 CREs
Problems related to Kubernetes
Web Server Problems
2 CREs
Problems related to web servers
Data Streaming Platforms
2 CREs
Failures in distributed streaming data platforms used for real-time event processing. This includes platforms like Redpanda, Apache Kafka, Pulsar, and compatible systems where startup, configuration, or operational issues can disrupt data streaming pipelines and impact downstream applications relying on event-driven architectures.
Proxy Problems
1 CREs
Problems related to proxies, like NGINX, HAProxy, and others
Operator Problems
1 CREs
Problems related to operators
Orm
1 CREs
Object-relational mapper (ORM) issues that impact data availability.
Cache Problems
1 CREs
Cache related problems
Framework Problems
1 CREs
Problems in frameworks such as Django
Insecure Configuration
1 CREs
Problems related to insecure configuration
Web Server Problem
1 CREs
Problems related to web servers
Load Balancer Problems
1 CREs
Problems related to load balancers
Proxy Timeout Problems
1 CREs
Problems related to proxy timeouts
Configuration Problem
1 CREs
Problems related to system or application configurations
Monitoring Problem
1 CREs
Problems related to system or application monitoring
Incompatibility Problem
1 CREs
Problems due to incompatible components or versions.
Logging Problems
1 CREs
Issues related to logging mechanisms and processes.
Provisioning Problems
1 CREs
Issues related to the provisioning of resources and infrastructure.
Stability Problems
1 CREs
Issues that affect the stability and uptime of systems and services.
Task Management Problems
1 CREs
Issues related to the management and execution of tasks and workflows.
Ubuntu Desktop Problems
1 CREs
Problems related to Ubuntu Desktop
HPC Database Problems
1 CREs
Database issues specific to high-performance computing systems like SLURM
In-Memory Database Problems
1 CREs
Problems specific to in-memory data stores (e.g. Redis, Memcached)
Kubernetes Storage Problems
1 CREs
Problems related to container storage in Kubernetes
Redpanda Problems
1 CREs
Problems related to Redpanda cluster failures, including node loss, quorum loss, and data availability impact.
Temporal Server Failure
1 CREs
Temporal Server fails persistence on a read-only database
Public
35 CREs
Open source CREs contributed by the problem detection community
Known Issue
16 CREs
Problems already identified and documented as known issues
Configuration
11 CREs
Problems caused by incorrect or missing configuration settings
Known Problem
7 CREs
This is a documented known problem with known mitigations
Nginx
7 CREs
Problems related to Nginx, such as weak ciphers, configuration errors, or performance issues
Loki
6 CREs
Problems with Grafana Loki
Security
6 CREs
Misconfigurations or vulnerabilities in authentication, authorization, or encryption.
Observability
5 CREs
Problems in observability tooling, such as unintended performance impact or missing telemetry
Authorization
5 CREs
Problems related to authorization, such as missing or invalid credentials, or misconfigurations
AWS
4 CREs
Amazon Web Services
PostgreSQL
4 CREs
Problems with PostgreSQL
Otel Collector
4 CREs
Failures in OpenTelemetry Collector pipelines or exporters.
Redis
4 CREs
Issues involving Redis availability, eviction policies, or timeouts.
Grafana
4 CREs
Problems related to Grafana services that may impact performance or telemetry collection and storage
Kubernetes
4 CREs
Problems related to Kubernetes, such as pod failures, API errors, or scheduling issues
High Availability
4 CREs
Problems related to high-availability systems and failover
SpiceDB
4 CREs
Problems related to SpiceDB authorization service, including schema corruption, permission failures, and database connectivity issues
Startup Failure
4 CREs
Problems related to application or service startup failures, such as missing dependencies or configuration errors
Kafka
3 CREs
Problems with Apache Kafka
Crash
3 CREs
Problems with applications crashing
RabbitMQ
3 CREs
Problems with RabbitMQ
Ovn
3 CREs
Issues in Open Virtual Network components used with SDN setups.
Telemetry
3 CREs
Issues with emitting, collecting, or transforming observability data.
Timeout
3 CREs
Operations that exceeded their allotted execution window.
Data Loss
3 CREs
Problems where data is lost or dropped due to system failures or processing errors
Datadog
3 CREs
Problems related to Datadog integration, such as missing metrics, reporting failures, or misconfigurations
Django
3 CREs
Problems related to the Django framework, such as view errors, middleware faults, or misconfigurations
Exporter
3 CREs
Problems related to metric exporters, such as missing, malformed, or unreported metric data
Load Balancer
3 CREs
Problems related to load balancers, such as misrouting, unhealthy backends, or configuration faults
Memory
3 CREs
Problems related to memory usage, such as leaks, pressure, or out-of-memory crashes
Networking
3 CREs
Problems within networking components, such as interface misconfigurations or routing errors
NATS
3 CREs
Problems related to NATS, such as authorization failures, message loss, or configuration issues
Alloy
3 CREs
Problems related to Grafana Alloy, such as Loki fanout crashes or log entries falling too far behind.
Redpanda
3 CREs
Issues specifically related to the Redpanda streaming platform.
Karpenter
2 CREs
Problems with Karpenter
KEDA
2 CREs
Problems with KEDA Operator
Openstack
2 CREs
Problems specific to OpenStack infrastructure components and deployments.
Opentelemetry
2 CREs
Errors or gaps in tracing and metrics collection using OpenTelemetry libraries.
Plugin
2 CREs
Failures or misbehavior in third-party or custom plugin systems.
Prometheus
2 CREs
Problems with scraping, rule evaluation, or querying Prometheus data.
Psycopg2
2 CREs
Python client errors related to connecting or querying PostgreSQL using psycopg2.
Python
2 CREs
General Python runtime errors or stack traces.
Storage
2 CREs
Failures in block, object, or ephemeral storage backends.
Validation
2 CREs
Input or schema validation failures in form submissions or APIs.
Async
2 CREs
Problems related to asynchronous execution, such as hung tasks, race conditions, or callback errors
Authentication
2 CREs
Problems related to user or service authentication, such as invalid tokens or failed logins
Cache
2 CREs
Problems related to caching mechanisms, including stale data, cache misses, or eviction faults
Memcached
2 CREs
Problems related to Memcached, such as cache evictions, connection errors, or stale entries
Neutron
2 CREs
Problems related to OpenStack Neutron, such as network provisioning or connectivity failures
DNS
2 CREs
Problems related to DNS, such as hostname resolution failures, or DNS server misconfigurations
Proxy
2 CREs
Problems related to proxy configurations or usage
READONLY
2 CREs
Errors when writing to a read-only Redis replica.
Cluster Degradation
2 CREs
Problems related to cluster availability
Temporal
2 CREs
Problems related to Temporal
Node Down
2 CREs
Problems related to nodes going down in a cluster, impacting availability and performance
Data Availability
2 CREs
Problems related to data availability in distributed systems, such as loss of access to critical data
Database Corruption
2 CREs
Problems where database tables, schemas, or data become corrupted, leading to missing relations or inaccessible data
GKE
1 CREs
Google Kubernetes Engine
EKS
1 CREs
Amazon Elastic Kubernetes Service
Celery
1 CREs
Problems with Celery
Errors
1 CREs
Problems with application errors
Misconfiguration
1 CREs
Problems with misconfigurations
Operational error
1 CREs
A runtime issue caused by system-level factors like resource limits or connectivity.
Ovsdb
1 CREs
Failures involving the OVSDB (Open vSwitch Database) protocol or schema.
Panic
1 CREs
Crashes due to unrecoverable errors, especially in Go or Rust applications.
Password
1 CREs
Problems with password policies, validation, or storage.
Redis CLI
1 CREs
Problems with the Redis command-line interface, such as connection issues, command errors or rejections.
Redis Py
1 CREs
Errors with the `redis-py` client library in Python.
Retry
1 CREs
Logic or policy failures when retrying failed operations.
S3
1 CREs
Errors related to object access, buckets, or permissions in Amazon S3.
Service
1 CREs
Failures at the service or API layer of an application.
Signature
1 CREs
Problems with signing or verifying cryptographic signatures.
Sqlalchemy
1 CREs
Errors in SQLAlchemy ORM usage, session handling, or migrations.
Ssl
1 CREs
SSL/TLS handshake errors or expired/invalid certificates.
Threads
1 CREs
Race conditions, deadlocks, or errors in multithreaded environments.
Transaction
1 CREs
Database or service transaction failures due to commits or rollbacks.
Translation
1 CREs
Errors in i18n/l10n string resolution or missing language assets.
Uri
1 CREs
Malformed or invalid Uniform Resource Identifier usage.
Web
1 CREs
Browser-facing issues in HTTP, HTML, or frontend integration layers.
ebs
1 CREs
Problems with Amazon EBS (Elastic Block Store).
csi
1 CREs
Container Storage Interface (CSI)
Api Key
1 CREs
Problems related to API keys, such as missing, invalid, or expired credentials
Attribute
1 CREs
Problems related to missing or unexpected object attributes, causing attribute access failures
Attributeerror
1 CREs
Problems where code fails due to attribute lookup errors, such as missing attributes on objects
Backpressure
1 CREs
Problems where producers overwhelm consumers, causing resource exhaustion or unhandled pressure
Batch
1 CREs
Problems related to batch processing, such as job failures, incorrect batch sizing, or order issues
Connection
1 CREs
Problems related to network connections, such as timeouts, refusals, or resets
Context
1 CREs
Problems related to context propagation, such as lost, overwritten, or mismatched context values
Contextvars
1 CREs
Problems specifically with Python context variables, such as improper isolation or missing context
Deadlock
1 CREs
Problems where threads or processes enter deadlock, preventing further progress
Disallowedhost
1 CREs
Problems where incoming requests are blocked due to disallowed Host header settings
Drop
1 CREs
Problems where messages or data are unexpectedly dropped or discarded
Environment
1 CREs
Problems related to environment variables or runtime environment settings
Escaping
1 CREs
Problems related to improper escaping of strings or data, leading to injection or parsing issues
Fork
1 CREs
Problems related to process forking, such as unsafe forks or resource duplication
Helm
1 CREs
Problems related to Helm deployments, such as chart rendering failures or template errors
Host Header
1 CREs
Problems due to incorrect or malicious Host header values
Infrastructure
1 CREs
Problems at the infrastructure level, such as resource outages or provisioning failures
Instrumentation
1 CREs
Problems related to instrumentation code, such as missing spans, broken traces, or metric gaps
Irsa
1 CREs
Problems related to IAM Roles for Service Accounts (IRSA), such as permission denials or misbindings
Logical Switch
1 CREs
Problems related to logical switch configurations in virtual networking
Memory Pressure
1 CREs
Problems where applications or services experience high memory usage, leading to performance degradation or crashes
Metrics
1 CREs
Problems related to metrics collection or reporting, such as missing, delayed, or incorrect data
Multiprocessing
1 CREs
Problems related to multiprocessing, such as process spawning failures or inter-process communication issues
Network
1 CREs
Problems related to network communication, such as packet loss, latency spikes, or unreachable hosts
Nil Pointer
1 CREs
Problems where code dereferences nil pointers, causing runtime crashes
Payload
1 CREs
Problems related to message payloads, such as malformed data or size limit violations
TLS
1 CREs
Problems related to TLS, such as weak ciphers, configuration errors, or performance issues
Weak Ciphers
1 CREs
Problems related to weak ciphers, such as RC4, DES, or MD5
Kubelet
1 CREs
Problems related to Kubelet, such as node not ready, or pod failures
Upstream Failure
1 CREs
Problems where Nginx cannot successfully forward requests to backend services
Connection Refused
1 CREs
Problems where a connection attempt is rejected by the target server
Buffer
1 CREs
Problems related to buffering
Capacity Issue
1 CREs
Problems related to system capacity
Connectivity
1 CREs
Problems related to network connectivity
Header Size
1 CREs
Problems related to the size of headers
Upload Limits
1 CREs
Problems related to upload size limits
Web Server
1 CREs
Problems related to web server configurations or issues
Admission Controller
1 CREs
Problems related to Kubernetes admission controllers
Disk Monitor
1 CREs
Problems related to disk monitoring
Disk Full
1 CREs
Problems related to disk full errors, such as insufficient space for writes or data storage
Kombu
1 CREs
Problems related to the Kombu messaging library
Backend Issue
1 CREs
Problems related to the backend systems or services
CWS
1 CREs
Problems related to Cloud Workload Security
Monitoring
1 CREs
Problems related to system or application monitoring
Permissions
1 CREs
Problems related to user or system permissions
Log Noise
1 CREs
Problems related to excessive or irrelevant log entries that obscure meaningful information.
PostgreSQL
1 CREs
Problems related to the PostgreSQL database system.
Silent Failure
1 CREs
Problems that do not produce visible errors or logs, making them hard to detect.
Terraform
1 CREs
Problems related to the Terraform infrastructure as code tool.
Version Incompatibility
1 CREs
Problems arising from incompatible versions of software components or libraries.
VPC CNI
1 CREs
Problems related to the VPC CNI (Container Network Interface) plugin.
webhook
1 CREs
Problems related to webhooks.
Ubuntu
1 CREs
Problems related to Ubuntu, such as package updates, or desktop issues
Gnome
1 CREs
Problems related to Gnome, such as input lag, or performance issues
Nvidia
1 CREs
Problems related to Nvidia, such as driver issues, or performance issues
SLURM
1 CREs
Problems related to SLURM workload manager
SlurmDBD
1 CREs
Problems related to SLURM Database Daemon
MySQL
1 CREs
Problems related to MySQL database
Write Failure
1 CREs
Problems where writes to a database or storage system fail due to insufficient space or other issues
Out of Memory
1 CREs
Errors due to Redis (or another in-memory store) exhausting its configured RAM.
Persistence
1 CREs
Issues around writing data to disk (RDB/AOF) or failing to persist.
RDB
1 CREs
Redis RDB snapshot errors (e.g. BGSAVE failures).
MISCONF
1 CREs
Redis "MISCONF" errors (stop-writes due to snapshot or AOF failures).
ACL
1 CREs
Redis ACL (NOPERM) permission-denied events.
NFS
1 CREs
Problems related to NFS (network file systems)
securityContext
1 CREs
Problems related to Kubernetes securityContext
Broker Failure
1 CREs
Problems related to Kafka broker failures
Replication
1 CREs
Replication failures, lag, or divergence in stateful systems.
Reverse Proxy
1 CREs
Problems related to reverse proxy configurations or issues
Service Outage
1 CREs
Problems related to service outages, such as complete service unavailability or critical failures
Cascading Failure
1 CREs
Problems related to cascading failures, where one failure leads to multiple dependent failures
Worker Problems
1 CREs
Problems related to process workers
gRPC
1 CREs
Problems related to gRPC
Streaming Data
1 CREs
Problems related to streaming data platforms and systems
Cluster Failure
1 CREs
Problems related to cluster failures, including node loss, quorum loss, and data availability impact
Quorum Loss
1 CREs
Problems related to loss of quorum in distributed systems, impacting consensus and availability
RPC
1 CREs
Remote Procedure Call errors or connectivity issues (includes timeouts, client-request failures, handler-not-found, etc.).
Raft
1 CREs
Issues related to the Raft consensus protocol—leader elections, step-downs, append-entries rejections, vote requests/replies, etc.
Migration Failure
1 CREs
Errors caused by failed or skipped database migrations
Schema Error
1 CREs
Missing or corrupted database schema elements such as tables or columns
Permission Failure
1 CREs
Problems where authorization checks fail due to system issues rather than legitimate access denials
Logs
1 CREs
Problems with log processing
Distributed System
1 CREs
Problems specific to distributed systems, including coordination, consistency, and network partition issues
Datastore
1 CREs
Problems with data storage systems, such as databases or object stores
Container Crash
1 CREs
Failures causing container crashes or unexpected terminations.
Memory Exhaustion
1 CREs
Failures due to running out of memory or excessive memory consumption.
Configuration Failure
1 CREs
Problems caused by incorrect or invalid configuration settings.
Streaming Platform
1 CREs
Issues related to distributed streaming platforms and their operations.
Kafka Compatible
1 CREs
Problems affecting Kafka-compatible systems or APIs, impacting interoperability.
Permission Denied
1 CREs
Failures caused by insufficient access rights or permission errors.
SIGKILL
1 CREs
Failures caused by processes being terminated with a SIGKILL signal.
nginx
8 CREs
opentelemetry-collector
4 CREs
rabbitmq
3 CREs
neutron
3 CREs
loki
3 CREs
django
3 CREs
datadog
3 CREs
nats
3 CREs
spicedb
3 CREs
alloy
2 CREs
eks-nodeagent
2 CREs
redis
2 CREs
karpenter
2 CREs
postgresql
2 CREs
redpanda
2 CREs
gke-metrics-agent
1 CREs
Unspecified
1 CREs
topic-operator
1 CREs
opentelemetry-python
1 CREs
sqlalchemy
1 CREs
redis-py
1 CREs
grafana
1 CREs
kubelet
1 CREs
terraform
1 CREs
syslog
1 CREs
manifest
1 CREs
kafka
1 CREs
kubernetes
1 CREs
redis-cli
1 CREs
worker
1 CREs
slurm
1 CREs
temporal
1 CREs
alloy log
1 CREs
application-logs
1 CREs
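Each CRE in the table below pairs a failure description with a detectable log signature. As a minimal illustrative sketch only — the rule ids and regexes here are hypothetical and much simpler than the real CRE rule format — matching such signatures against a log stream might look like:

```python
import re

# Hypothetical single-regex signatures inspired by entries in this feed.
# Real CRE rules carry structured metadata (category, tags, mitigation), not just a pattern.
SIGNATURES = {
    "rabbitmq-mnesia-overloaded": re.compile(r"\*\* WARNING \*\* Mnesia is overloaded"),
    "nginx-no-live-upstreams": re.compile(r"no live upstreams"),
}

def match_cres(log_lines):
    """Return the set of rule ids whose pattern appears anywhere in the log lines."""
    hits = set()
    for line in log_lines:
        for rule_id, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.add(rule_id)
    return hits
```

For example, `match_cres(["connect() failed; no live upstreams while connecting to upstream"])` would flag the NGINX rule.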
ID | Title | Description | Category | Tags
---|---|---|---|---
CRE-2024-0007 (Critical, Impact 9/10, Mitigation 8/10) | RabbitMQ Mnesia overloaded recovering persistent queues | The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. The underlying Erlang process, Mnesia, is overloaded (`** WARNING ** Mnesia is overloaded`). | Message Queue Problems | Known Problem, RabbitMQ, Public
CRE-2024-0008 (High, Impact 9/10, Mitigation 6/10) | RabbitMQ memory alarm | A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables, and processes) has exceeded the configured `vm_memory_high_watermark`. While the alarm is active, the broker applies flow control, blocking publishers and pausing most ingress activity to protect itself from running out of RAM. | Message Queue Problems | Known Problem, RabbitMQ, Public
CRE-2024-0016 (Low, Impact 4/10, Mitigation 2/10) | Google Kubernetes Engine metrics agent failing to export metrics | The Google Kubernetes Engine metrics agent is failing to export metrics. | Observability Problems | Known Problem, GKE, Public
CRE-2024-0018 (Medium, Impact 4/10, Mitigation 5/10) | Neutron Open Virtual Network (OVN) high CPU usage | OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100%. Logs show “Dropped … due to excessive rate” or “Unreasonably long … poll interval,” slowing port binding and network traffic. | Networking Problems | Known Problem, Ovn, Public
CRE-2024-0021 (High, Impact 4/10, Mitigation 5/10) | KEDA operator reconciler ScaledObject panic | KEDA provides fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads, serving as a Kubernetes metrics server and letting users define autoscaling rules with a dedicated custom resource definition. In this CRE, the KEDA operator panics while reconciling a ScaledObject resource. | Operator Problems | KEDA, Crash, Known Problem, Public
CRE-2024-0043 (Medium, Impact 6/10, Mitigation 5/10) | NGINX Upstream DNS Failure | When an NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail. | Proxy Problems | Kafka, Known Problem, Public
CRE-2025-0019 (Low, Impact 3/10, Mitigation 2/10) | Alloy entries too far behind | Grafana Alloy can get into a state where it writes more error messages than it can process. The problem is compounded when Alloy is collecting its own error logs, which include the warnings that it can no longer keep up. This can consume several GB of storage per day. | Storage | Grafana, Alloy, Loki, Public
CRE-2025-0020 (Medium, Impact 5/10, Mitigation 2/10) | Grafana Alloy Loki fanout crash | Grafana Alloy's Loki fanout crashes when the number of log files exceeds the number of ingesters. | Storage | Grafana, Alloy, Loki, Public
CRE-2025-0025 (Medium, Impact 6/10, Mitigation 5/10) | Kafka broker replication mismatch | When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics. | Message Queue Problems | Kafka, Known Problem, Public
CRE-2025-0026 (Low, Impact 6/10, Mitigation 1/10) | AWS EBS CSI Driver fails to detach volume when VolumeAttachment has empty nodeName | In clusters using the AWS EBS CSI driver, the controller may fail to detach a volume if the associated VolumeAttachment resource has an empty `spec.nodeName`. This results in a log error and skipped detachment, which may block PVC reuse or node cleanup. | Storage | ebs, csi, AWS, Storage, Public
CRE-2025-0027 (Low, Impact 7/10, Mitigation 2/10) | Neutron Open Virtual Network (OVN) and Virtual Interface (VIF) allows port binding to dead agents, causing VIF plug timeouts | In OpenStack deployments using Neutron with the OVN ML2 driver, ports could be bound to agents that were not alive. This behavior led to virtual machines experiencing network interface plug timeouts during provisioning, as the port binding would not complete successfully. | Networking Problems | Neutron, Ovn, Timeout, Networking, Openstack, Known Issue, Public
CRE-2025-0028 (Low, Impact 6/10, Mitigation 1/10) | OpenTelemetry Python fails to detach context token across async boundaries | In OpenTelemetry Python, detaching a context token that was created in a different context can raise a `ValueError`. This occurs when asynchronous operations, such as generators or coroutines, are finalized in a different context than the one they were created in, leading to context management errors and potential trace data loss. | Observability Problems | Opentelemetry, Python, Contextvars, Async, Observability, Public
CRE-2025-0029 (Low, Impact 6/10, Mitigation 5/10) | Loki fails to retrieve AWS credentials when specifying S3 endpoint with IRSA | When deploying Grafana Loki with AWS S3 as the storage backend and specifying a custom S3 endpoint (e.g., for FIPS compliance or GovCloud regions), Loki may fail to retrieve AWS credentials via IAM Roles for Service Accounts (IRSA). This results in errors during startup or when attempting to upload index tables, preventing Loki from functioning correctly. | Storage | Loki, S3, AWS, Irsa, Storage, Authentication, Helm, Public
CRE-2025-0030 (Medium, Impact 6/10, Mitigation 2/10) | SQLAlchemy create_engine fails when password contains special characters like @ | SQLAlchemy applications using `create_engine()` may fail to connect to a database if the username or password contains special characters (e.g., `@`, `:`, `/`, `#`). These characters must be URL-encoded when included in the database connection string. Failure to encode them leads to parsing errors or incorrect credential usage. | Orm | Sqlalchemy, Configuration, Password, Uri, Escaping, Connection, Known Issue, Public
CRE-2025-0031 (Medium, Impact 5/10, Mitigation 5/10) | Django returns DisallowedHost error for untrusted HTTP_HOST headers | Django applications may return a "DisallowedHost" error when receiving requests with an unrecognized or missing Host header. This typically occurs in production environments where reverse proxies, load balancers, or external clients send requests using an unexpected domain or IP address. Django blocks these requests unless the domain is explicitly listed in `ALLOWED_HOSTS`. | Framework Problems | Django, Disallowedhost, Configuration, Web, Security, Host Header, Public
CRE-2025-0032 (Low, Impact 2/10, Mitigation 4/10) | Loki generates excessive logs when memcached service port name is incorrect | Loki instances using memcached for caching may emit excessive warning or error logs when the configured `memcached_client` service port name does not match the actual Kubernetes service port. This does not cause a crash or failure, but it results in noisy logs and ineffective caching behavior. | Observability Problems | Loki, Memcached, Configuration, Service, Cache, Known Issue, Kubernetes, Public
CRE-2025-0033 (Low, Impact 7/10, Mitigation 4/10) | OpenTelemetry Collector refuses to scrape due to memory pressure | The OpenTelemetry Collector may refuse to ingest metrics during a Prometheus scrape if it exceeds its configured memory limits. When the `memory_limiter` processor is enabled, the Collector actively drops data to prevent out-of-memory errors, resulting in log messages indicating that data was refused due to high memory usage. | Observability Problems | Otel Collector, Prometheus, Memory, Metrics, Backpressure, Data Loss, Known Issue, Public
CRE-2025-0034 (Medium, Impact 6/10, Mitigation 2/10) | Datadog agent disabled due to missing API key | If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages. | Observability Problems | Datadog, Configuration, Api Key, Observability, Environment, Telemetry, Known Issue, Public
CRE-2025-0035 (Critical, Impact 7/10, Mitigation 6/10) | psycopg2 SSL error due to thread or forked process state | Applications using psycopg2 with OpenTelemetry instrumentation or threading may fail with SSL-related errors such as "decryption failed or bad record mac". This often occurs when a database connection is created before a fork or from an unsafe thread context, causing the SSL state to become invalid. | Database Problems | Ssl, Psycopg2, Fork, Threads, Django, Instrumentation, Opentelemetry, Known Issue, Public
CRE-2025-0036 (Low, Impact 6/10, Mitigation 3/10) | OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target | The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the collector drops these payloads unless retry behavior is explicitly enabled. | Observability Problems | Otel Collector, Exporter, Payload, Batch, Drop, Observability, Telemetry, Known Issue, Public
CRE-2025-0037 (Low, Impact 8/10, Mitigation 4/10) | OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator | The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings, but the internal representation is nil or incompatible, leading to a runtime `SIGSEGV` segmentation fault and crashing the collector. | Observability Problems | Crash, Prometheus, Otel Collector, Exporter, Panic, Translation, Attribute, Nil Pointer, Known Issue, Public
CRE-2025-0038 (Low, Impact 5/10, Mitigation 3/10) | Loki fails to cache entries due to Memcached out-of-memory error | Grafana Loki may emit errors when attempting to write to a Memcached backend that has run out of available memory. This results in dropped index or query cache entries, which can degrade query performance but does not interrupt ingestion. | Observability Problems | Loki, Memcached, Cache, Memory, Infrastructure, Known Issue, Public
CRE-2025-0039 (Medium, Impact 5/10, Mitigation 3/10) | OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability | The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. While retry logic is typically enabled, repeated failures can introduce delay or backpressure. | Observability Problems | Otel Collector, Exporter, Timeout, Retry, Network, Telemetry, Known Issue, Public
CRE-2025-0040 (Low, Impact 6/10, Mitigation 4/10) | Neutron Open Virtual Network (OVN) fails to bind logical switch due to race condition during load balancer creation | During load balancer creation or other operations involving logical router and logical switch associations, Neutron OVN may raise a `RowNotFound` exception when attempting to reference a logical switch that has just been deleted. This leads to a port binding failure and a rollback of the affected operation. | Networking Problems | Neutron, Ovn, Openstack, Load Balancer, Logical Switch, Ovsdb, Known Issue, Public
CRE-2025-0041 Low Impact: 5/10 Mitigation: 4/10 | redis-py client fails with AttributeError when reused across async or process contexts | - In redis-py v5.x, sharing a single Redis client across async tasks or subprocesses can result in: - `AttributeError: ''NoneType'' object has no attribute ''getpid''`. - This typically occurs when the client or connection pool is reused across forks or when event loop context is lost, especially in async frameworks or multiprocessing setups. | Cache Problems | RedisRedis PyPythonAsyncMultiprocessingContextAttributeerrorKnown IssuePublic |
CRE-2025-0042 Critical Impact: 7/10 Mitigation: 5/10 | PostgreSQL transaction fails with deadlock detected error in psycopg2 and Django | - Applications using Django with PostgreSQL and psycopg2 may encounter `deadlock detected` errors under concurrent write-heavy workloads. - PostgreSQL raises this error when two or more transactions block each other cyclically while waiting for locks, and one must be aborted. - Django surfaces this as an `OperationalError`, and the affected transaction is rolled back. | Database Problems | PostgreSQLPsycopg2DjangoTransactionDeadlockOperational errorPublicKnown Issue |
CRE-2025-0043 Medium Impact: 4/10 Mitigation: 2/10 | Grafana fails to load plugin due to missing signature | Grafana may reject custom or third-party plugins at runtime if they are not digitally signed. When plugin signature validation is enabled (default since Grafana 8+), unsigned plugins are blocked and logged as validation errors during startup or plugin loading. | Observability Problems | GrafanaPluginValidationSignatureConfigurationSecurityKnown IssuePublic |
CRE-2025-0044 High Impact: 9/10 Mitigation: 1/10 | NGINX Config Uses Insecure TLS Ciphers | Detects NGINX configuration files that advertise obsolete and cryptographically weak ciphers (RC4-MD5, RC4-SHA, DES-CBC3-SHA). These ciphers are vulnerable to several well-known attacks—including BEAST, BAR-Mitzvah, Lucky-13, and statistical biases in RC4—placing any client–server communication at risk of interception or tampering. | Insecure Configuration | NginxWeak CiphersSecurityConfigurationTLSKnown IssuePublic |
CRE-2025-0045 Medium Impact: 4/10 Mitigation: 4/10 | NATS Authorization Failure Detected | The NATS server has emitted an **Authorization Violation** log entry, meaning a client attempted to connect, publish, subscribe, or perform another operation for which it lacks permission. Intermittent violations often point to misconfiguration or start-up races, while sustained or widespread violations can signal expired credentials or missing secrets. | Authorization Problems | NATS, Security, Authorization, Public |
CRE-2025-0046 Medium Impact: 4/10 Mitigation: 4/10 | NATS Permissions Violation Detected | The NATS server has emitted a **Permissions Violation** log entry, meaning a client attempted to publish or subscribe to a subject for which it lacks permission. | Authorization Problems | NATS, Security, Authorization, Public |
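For both NATS violation types above, the usual fix is to grant the connecting user exactly the subject permissions it needs. A hedged sketch of a nats-server configuration (user name, password placeholder, and subjects are illustrative):

```
authorization {
  users = [
    {
      user: "orders-service"
      password: "change-me"        # use bcrypt hashes or creds files in production
      permissions: {
        # Allow publishing and subscribing on the service's own subject tree,
        # plus the _INBOX prefix needed for request/reply.
        publish:   ["orders.>"]
        subscribe: ["orders.>", "_INBOX.>"]
      }
    }
  ]
}
```

Verify the effective permissions against the server's own documentation for your NATS version, since decentralized JWT/account setups configure this differently.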
CRE-2025-0048 Low Impact: 5/10 Mitigation: 3/10 | Kubelet node not ready due to a DNS hostname resolution failure | A Kubernetes worker node has entered the **NotReady** state because the kubelet cannot resolve a required hostname via DNS. | Kubernetes Problems | Kubelet, Kubernetes, DNS, Public |
CRE-2025-0049 Low Impact: 2/10 Mitigation: 8/10 | NATS Payload Size Too Big | The NATS server is configured to accept message payloads that may exceed the recommended maximum of 8 MB (the server’s default hard limit is 1 MB, but it can be raised to 64 MB). Large messages put disproportionate pressure on broker memory, network buffers, and client back-pressure mechanisms. This warning signals that NATS is at risk of degraded throughput, slow consumers, and forced connection closures intended to protect cluster stability. | Message Queue Problems | NATS, Public |
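The limit is controlled by the server's `max_payload` setting; a sketch of a nats-server configuration keeping it within the recommended ceiling (the value shown is an example, not a requirement):

```
# Default is 1MB; values above ~8MB are discouraged even though the server
# permits up to 64MB.
max_payload: 8MB
```

For genuinely large transfers, chunking the data or passing an object-store reference through NATS is generally safer than raising this limit.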
CRE-2025-0051 High Impact: 9/10 Mitigation: 5/10 | NGINX No Live Upstreams Available | NGINX is reporting that all backend servers in an upstream group are unavailable. This means that NGINX cannot route requests to any of its configured backend servers, resulting in client-facing errors. | Load Balancer Problems | Nginx, Load Balancer, Upstream Failure, Connectivity |
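Tuning the upstream's failure accounting and keeping a spare server can soften this failure mode. A sketch (addresses are placeholders):

```nginx
upstream backend {
    # A server is marked unavailable after 3 failures within 30s,
    # then retried after the fail_timeout window.
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.13:8080 backup;  # only used when all primaries are down
}
```

If "no live upstreams" recurs, the root cause is usually the backends themselves failing health, not the NGINX settings.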
CRE-2025-0053 Medium Impact: 5/10 Mitigation: 3/10 | NGINX Client Upload Size Limit Exceeded | The NGINX server is receiving upload requests with bodies that exceed the configured size limits. This occurs when clients attempt to send files or data larger than what the server is configured to accept. | Web Server Problem | Nginx, Upload Limits, Configuration |
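The relevant directive is `client_max_body_size` (default 1m), which can be set at the `http`, `server`, or `location` level; the value below is an example:

```nginx
server {
    # Reject request bodies above 50 MB with HTTP 413.
    client_max_body_size 50m;

    location /uploads/ {
        # A narrower scope can override the server-wide value.
        client_max_body_size 200m;
    }
}
```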
CRE-2025-0054 Medium Impact: 7/10 Mitigation: 5/10 | NGINX upstream connection timeout | NGINX reports an upstream timeout error when it cannot establish or maintain a connection to backend services within the configured timeout threshold. This occurs when backend services are unresponsive, overloaded, or when the timeout values are set too low for normal operating conditions. The error indicates that NGINX attempted to proxy a request to an upstream server, but the connection or read operation timed out before completion. | Proxy Timeout Problems | Nginx, Timeout, Proxy, Backend Issue, Networking |
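The three proxy timeout directives cover connect, send, and read phases separately; a sketch with example values (tune them to the backend's real latency profile):

```nginx
location /api/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;   # TCP/TLS handshake to the upstream
    proxy_send_timeout    60s;  # gap between two successive writes to it
    proxy_read_timeout    60s;  # gap between two successive reads from it
}
```

Raising timeouts hides slowness rather than fixing it; pair any increase with investigation of why the upstream is slow.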
CRE-2025-0055 Medium Impact: 8/10 Mitigation: 3/10 | Nginx upstream buffer size too small | Nginx reports that an upstream server is sending headers that exceed the configured buffer size limits. This typically happens when the upstream application sends responses with large headers, cookies, or other header fields that don't fit in the default buffer allocation. When this occurs, Nginx cannot properly proxy the response to clients, resulting in HTTP errors. | Web Server Problems | Nginx, Configuration, Proxy, Header Size, Buffer |
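The response header must fit in a single `proxy_buffer_size` buffer; enlarging it (and the general buffers) resolves the error. Example values below are a sketch, not a recommendation for every workload:

```nginx
location / {
    proxy_pass http://backend;
    proxy_buffer_size 16k;        # must hold the entire upstream response header
    proxy_buffers 8 16k;          # buffers for the response body
    proxy_busy_buffers_size 32k;  # portion that may be busy sending to the client
}
```

Oversized headers are often a symptom of bloated cookies or tokens; trimming those is the more durable fix.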
CRE-2025-0056 Medium Impact: 8/10 Mitigation: 3/10 | NGINX worker connections limit exceeded | NGINX has reported that the configured worker_connections limit has been reached. This indicates that the web server has exhausted the available connection slots for handling concurrent client requests. When this limit is reached, new connection attempts may be rejected until existing connections are closed, causing service degradation or outages. | Web Server Problems | Nginx, Capacity Issue, Web Server, Configuration, Public |
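Raising the limit means raising both `worker_connections` and the worker's file-descriptor ceiling, since each connection consumes at least one descriptor (proxied connections consume two). A sketch:

```nginx
worker_processes auto;        # one worker per CPU core
worker_rlimit_nofile 65535;   # per-worker open-file limit must exceed connections

events {
    worker_connections 8192;  # max simultaneous connections per worker
}
```

Total capacity is roughly `worker_processes × worker_connections`, shared between client and upstream connections.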
CRE-2025-0057 Low Impact: 3/10 Mitigation: 1/10 | Verbose Logging in AWS Network Policy Agent During Policy Verdicts | - When using AWS Network Policy Agent with VPC CNI addon v1.17.1, the log message `failed to get caller` may appear frequently. - This behavior correlates with policy verdicts being evaluated, and the volume increases in environments with higher traffic or more active policies. - The issue does not indicate functional failure, but it increases log volume and may obscure real issues. | Logging Problems | AWS, VPC CNI, Log Noise |
CRE-2025-0058 Medium Impact: 7/10 Mitigation: 4/10 | Celery Worker Stops Consuming Tasks After Redis Restart | - When Redis is restarted, Celery workers using Redis as a broker may stop consuming tasks without exiting or logging a fatal error. - Although Celery Beat continues to publish tasks successfully, the worker remains in a broken state until manually restarted. - This results in a silent backlog of scheduled but unprocessed tasks. | Task Management Problems | Celery, Silent Failure, Redis, Kombu |
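Mitigations typically combine broker reconnection settings with a transport-level health check so dead sockets are noticed. The snippet below is a hedged sketch of Celery's lowercase settings names; verify each against the Celery and kombu versions you actually run.

```python
# Sketch of Celery app configuration aimed at surviving a Redis restart.
# All keys are standard Celery settings, but semantics vary by version.
settings = {
    "broker_connection_retry": True,             # retry lost broker connections
    "broker_connection_retry_on_startup": True,  # needed on Celery >= 5.3
    "broker_connection_max_retries": None,       # None = retry forever
    "broker_transport_options": {
        "health_check_interval": 10,  # kombu probes the Redis socket every 10s
    },
}
```

In practice you would pass these via `app.conf.update(**settings)`; pairing them with an external liveness check on the worker guards against any remaining silent-stall cases.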
CRE-2025-0059 Low Impact: 6/10 Mitigation: 2/10 | Datadog CWS Instrumentation webhook registration fails without service account | - Datadog Cluster Agent fails to register its CWS (Container Workload Security) instrumentation webhook when running in `remote_copy` mode without a configured service account. | Configuration Problem | Datadog, CWS, Admission Controller, Webhook, Configuration, Known Issue |
CRE-2025-0060 Low Impact: 5/10 Mitigation: 2/10 | Datadog OpenMetrics Scrape Returns 404 | - The Datadog Agent is unable to scrape metrics from an OpenMetrics endpoint, returning a 404 Not Found error. - This typically indicates that the target service is either not exposing the `/metrics` path, the port or path is misconfigured, or the service is not running. | Monitoring Problem | Observability, Datadog |
CRE-2025-0061 Medium Impact: 7/10 Mitigation: 4/10 | Karpenter Stability Issues on EKS During Leader Election | - EKS clusters may handle steady, predictable scale well but struggle during large-scale autoscaling events, when many workloads and nodes spin up or down simultaneously. - This instability affects components that implement leader election using the Kubernetes API, such as: - aws-load-balancer-controller - karpenter - keda-operator - ebs-csi-controller - efs-csi-controller | Stability Problems | Karpenter, KEDA, AWS, EKS |
CRE-2025-0062 Medium Impact: 6/10 Mitigation: 2/10 | Karpenter Version Incompatible with Kubernetes Version | - Karpenter logs an error when its current version is not compatible with the running Kubernetes control plane version. - This results in provisioning failures and indicates a required upgrade to align compatibility. - The issue is surfaced via structured logs from the controller. | Incompatibility Problem | Version Incompatibility, Karpenter |
CRE-2025-0063 Medium Impact: 6/10 Mitigation: 3/10 | RabbitMQ disk monitor fails to initialize | - RabbitMQ's disk monitor process cannot start or retrieve free-space metrics, preventing it from detecting low-disk conditions. | Message Queue Problems | RabbitMQ, Disk Monitor, Monitoring, Plugin |
CRE-2025-0064 High Impact: 5/10 Mitigation: 2/10 | Terraform Cloud Authentication Failure | - This error occurs when Terraform Cloud authentication fails due to missing or invalid API tokens, workspace or organization misconfiguration, or insufficient permissions. | Provisioning Problems | Terraform, Permissions, Authentication |
CRE-2025-0068 Low Impact: 6/10 Mitigation: 5/10 | Gnome input lag on Nvidia Ubuntu desktops | Keyboard input or the entire screen may freeze at times on systems using the Nvidia Xorg driver with GNOME. | Ubuntu Desktop Problems | Gnome, Nvidia, Ubuntu |
CRE-2025-0069 Medium Impact: 6/10 Mitigation: 4/10 | Kubernetes fsGroup ignored on NFS volumes | Pods that mount NFS volumes and set `securityContext.fsGroup` still have the directory owned by `root:root`. The kubelet does not chown the share, so non-root containers fail with "Permission denied". | Kubernetes Storage Problems | Kubernetes, NFS, securityContext |
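Since the kubelet will not apply `fsGroup` ownership to NFS, a common workaround is to fix permissions from an init container running as root. A sketch (UID/GID, volume name, and mount path are illustrative):

```yaml
# Pod spec fragment: chown the NFS mount before the app container starts.
initContainers:
  - name: fix-perms
    image: busybox
    command: ["sh", "-c", "chown -R 1000:2000 /data && chmod -R g+rwX /data"]
    volumeMounts:
      - name: nfs-vol
        mountPath: /data
```

Alternatively, export the share with ownership matching the pod's runAsUser/fsGroup so no chown is needed at all.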
CRE-2025-0070 Critical Impact: 10/10 Mitigation: 6/10 | Kafka Under-Replicated Partitions Crisis | Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics. | Message Queue Problems | Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation |
CRE-2025-0071 High Impact: 9/10 Mitigation: 8/10 | CoreDNS unavailable | CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage. | Kubernetes Problems | Kubernetes, Networking, DNS, High Availability |
CRE-2025-0072 Critical Impact: 10/10 Mitigation: 7/10 | Redis Out-Of-Memory → Persistence Crash → Replica/ACL Write Failures | Detects a cascade of critical Redis failure modes in a single session: - Redis refuses writes when maxmemory is exceeded (OOM). - RDB snapshot (BGSAVE) fails (MISCONF) due to simulated full-disk. - Replica refuses writes (READONLY). - ACL denies a write (NOPERM). | In-Memory Database Problems | Redis, Out of Memory, Persistence, RDB, MISCONF, READONLY, ACL, Security |
CRE-2025-0073 High Impact: 9/10 Mitigation: 6/10 | Redis Rejects Writes Due to Reaching 'maxmemory' Limit | The Redis instance has reached its configured 'maxmemory' limit. Because its active memory management policy does not permit the eviction of existing keys to free up space (as is the case when the 'noeviction' policy is in effect, which is often the default), Redis rejects new write commands by sending an "OOM command not allowed" error to the client. | Database Problems | Redis, Redis CLI, Memory Pressure, Memory, Data Loss, Public |
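When Redis is used as a cache, switching the eviction policy away from `noeviction` lets it reclaim space instead of refusing writes. A redis.conf sketch (the limit and policy shown are examples; the right policy depends on whether data loss via eviction is acceptable):

```
maxmemory 2gb
# Evict least-recently-used keys across the whole keyspace instead of
# returning "OOM command not allowed". Use noeviction (and alerting) when
# Redis holds data you cannot afford to drop.
maxmemory-policy allkeys-lru
```

The same settings can be applied at runtime with `CONFIG SET maxmemory-policy allkeys-lru`.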
CRE-2025-0074 Critical Impact: 10/10 Mitigation: 6/10 | Temporal Worker → Server Downtime → Connection Refused Failure | Detects failure when a Temporal worker is unable to reach the Temporal server. - This typically occurs during startup or after server downtime. - Worker log contains gRPC error: "connection refused". | workflow-orchestration-connectivity | Temporal, Worker Problems, gRPC, Connection Refused, Startup Failure |
CRE-2025-0075 Critical Impact: 10/10 Mitigation: 6/10 | Nginx Upstream Failure Cascade Crisis | Detects critical Nginx upstream failure cascades that lead to complete service unavailability. The rule identifies upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. Its regex patterns are tuned for broad detection coverage with low false-positive rates, capturing both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context. | load-balancer-problem | Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure |
CRE-2025-0076 High Impact: 0/10 Mitigation: 9/10 | SlurmDBD Database Connection Lost | Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt. | HPC Database Problems | SLURM, SlurmDBD, database-problem, MySQL, High Availability |
CRE-2025-0077 High Impact: 9/10 Mitigation: 7/10 | PostgreSQL Fails to Extend File Due to Disk Full | PostgreSQL logs an error when it cannot extend a data file (table/index) because the filesystem is out of disk space. This prevents writes requiring new allocation. | Database Problems | PostgreSQL, Disk Full, Write Failure, Public |
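Once space has been freed or the filesystem extended, standard catalog queries help identify what consumed it; a sketch using PostgreSQL's built-in size functions:

```sql
-- Overall size of the current database.
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Ten largest relations (tables and indexes, including TOAST).
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```

Bloated tables found this way often respond to `VACUUM (FULL)` or partitioning, which addresses the recurrence rather than just the symptom.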
CRE-2025-0078 High Impact: 10/10 Mitigation: 2/10 | SpiceDB Database Schema Failures: Missing Core Tables | Detects critical SpiceDB database schema failures caused by missing core tables like `metadata`, `alembic_version`, or `relation_tuple_transaction`. These errors often stem from incomplete migrations, startup race conditions, or schema corruption, resulting in a complete breakdown of SpiceDB authorization capabilities. | Authorization Systems | SpiceDB, Migration Failure, Schema Error, PostgreSQL |
CRE-2025-0079 Critical Impact: 10/10 Mitigation: 3/10 | SpiceDB Database Corruption: Critical Table Loss | Detects catastrophic SpiceDB database corruption where critical core tables like `alembic_version` and `relation_tuple_transaction` are missing or dropped. This represents complete database corruption that renders SpiceDB unable to perform any authorization operations, causing total permission system failure. | Authorization Systems | SpiceDB, Database Corruption, Authorization, PostgreSQL |
CRE-2025-0080 High Impact: 0/10 Mitigation: 9/10 | Redpanda High Severity Issues | Detects when Redpanda hits any of these on startup or early runtime: 1. Fails to create its crash_reports directory (POSIX error 13). 2. Heartbeat or node-status RPC failures indicating a broker is down. 3. Raft group failure. 4. Data center failure. | Data Streaming Platforms | Redpanda, Startup Failure, Permission Failure, RPC, Raft, Node Down, Cluster Degradation, Data Availability, Database Corruption |
CRE-2025-0081 Critical Impact: 9/10 Mitigation: 8/10 | Temporal Server Fails Persistence on Read-Only Database | Detects critical failures where Temporal Server is unable to perform essential database write operations (e.g., starting workflows, recording history, completing tasks) because its underlying SQL database (e.g., PostgreSQL) is in a read-only state. This leads to a halt or severe degradation in workflow processing and can cause cluster instability. | Temporal Server Failure | Temporal, PostgreSQL, READONLY |
CRE-2025-0085 High Impact: 8/10 Mitigation: 7/10 | SpiceDB Schema Validation Failures Block Authorization Updates | Detects SpiceDB schema validation failures that prevent authorization logic updates and deployments. These failures occur when invalid schema definitions are submitted, including syntax errors, circular dependencies, type conflicts, or malformed permission expressions, blocking critical authorization system updates. | Authorization Problems | SpiceDB, Authorization, Configuration, Validation, Crash, Startup Failure |
CRE-2025-0090 Low Impact: 0/10 Mitigation: 0/10 | Loki Log Line Exceeds Max Size Limit | Alloy detects that Loki is dropping log lines because they exceed the configured maximum line size. This typically indicates that applications are emitting extremely long log entries, which Loki is configured to reject by default. | Observability Problems | Alloy, Loki, Logs, Observability, Grafana |
CRE-2025-0099 High Impact: 8/10 Mitigation: 7/10 | Redpanda Crash Due to Memory Exhaustion and Startup Failures | Redpanda streaming platform crashes due to a combination of system-level failures including permission denied errors for performance monitoring subsystems, missing critical configuration files, and memory allocation failures. | Data Streaming Platforms | Redpanda, Container Crash, Memory Exhaustion, Configuration Failure, Streaming Platform, Kafka Compatible, Permission Denied, SIGKILL |
CRE-2025-0102 High Impact: 0/10 Mitigation: 0/10 | Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted | - The Redpanda streaming data platform is experiencing a severe, cascading failure. - This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down. - Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability. | Redpanda Problems | Redpanda, Streaming Data, Cluster Failure, Node Down, Quorum Loss, Data Availability, Errors, Distributed System |
CRE-2025-0105 High Impact: 9/10 Mitigation: 3/10 | SpiceDB Datastore Startup Failure | Detects critical failures where a SpiceDB instance cannot start due to an invalid schema or an uninitialized datastore during the bootstrap process. This is a common configuration error that prevents the service from initializing and serving requests, leading to a total service outage. | Authorization Systems | SpiceDB, Authorization, Datastore, Misconfiguration, Startup Failure |