Public CREs
Welcome to the Public CRE feed, where you can explore and discover open-source CREs by tag, category, or other details. Use the tabs below to navigate between the different views.
- Categories
- Tags
- Technologies
- CREs
Observability Problems
11 CREs
Problems related to observability, like monitoring, logging, and tracing
Message Queue Problems
6 CREs
Problems related to message queues, like Kafka, RabbitMQ, NATS, and others
Database Problems
4 CREs
Problems related to databases, like MySQL, PostgreSQL, MongoDB, and others
Storage
4 CREs
Disk, object storage, or volume-related issues that impact data availability.
Networking Problems
3 CREs
Connectivity, DNS, or routing issues affecting system communication.
Authorization Problems
3 CREs
Problems related to authorization
Authorization Systems
3 CREs
Failures in systems that manage access control, identity, or permissions. This includes tools like SpiceDB, OPA, or Auth0 where schema, policy, or integration issues can block authentication or authorization flows.
Kubernetes Problems
2 CREs
Problems related to Kubernetes
Web Server Problems
2 CREs
Problems related to web servers
Data Streaming Platforms
2 CREs
Failures in distributed streaming data platforms used for real-time event processing. This includes platforms like Redpanda, Apache Kafka, Pulsar, and compatible systems where startup, configuration, or operational issues can disrupt data streaming pipelines and impact downstream applications relying on event-driven architectures.
Proxy Problems
1 CREs
Problems related to proxies, like NGINX, HAProxy, and others
Operator Problems
1 CREs
Problems related to operators
Orm
1 CREs
Object-relational mapper (ORM) issues that impact data availability.
Cache Problems
1 CREs
Cache related problems
Framework Problems
1 CREs
Problems in frameworks such as Django
Insecure Configuration
1 CREs
Problems related to insecure configuration
Web Server Problem
1 CREs
Problems related to web servers
Load Balancer Problems
1 CREs
Problems related to load balancers
Proxy Timeout Problems
1 CREs
Problems related to proxy timeouts
Configuration Problem
1 CREs
Problems related to system or application configurations
Monitoring Problem
1 CREs
Problems related to system or application monitoring
Incompatibility Problem
1 CREs
Problems due to incompatible components or versions.
Logging Problems
1 CREs
Issues related to logging mechanisms and processes.
Provisioning Problems
1 CREs
Issues related to the provisioning of resources and infrastructure.
Stability Problems
1 CREs
Issues that affect the stability and uptime of systems and services.
Task Management Problems
1 CREs
Issues related to the management and execution of tasks and workflows.
Ubuntu Desktop Problems
1 CREs
Problems related to Ubuntu Desktop
HPC Database Problems
1 CREs
Database issues specific to high-performance computing systems like SLURM
In-Memory Database Problems
1 CREs
Problems specific to in-memory data stores (e.g. Redis, Memcached)
Kubernetes Storage Problems
1 CREs
Problems related to container storage in Kubernetes
Redpanda Problems
1 CREs
Problems related to Redpanda cluster failures, including node loss, quorum loss, and data availability impact.
Temporal Server Failure
1 CREs
Temporal Server fails persistence on a read-only database
Public
35 CREs
Open source CREs contributed by the problem detection community
Known Issue
16 CREs
Problems already identified and documented as known issues
Configuration
11 CREs
Problems caused by incorrect or missing configuration settings
Known Problem
7 CREs
This is a documented known problem with known mitigations
Nginx
7 CREs
Problems related to Nginx, such as weak ciphers, configuration errors, or performance issues
Loki
6 CREs
Problems with Grafana Loki
Security
6 CREs
Misconfigurations or vulnerabilities in authentication, authorization, or encryption.
Observability
5 CREs
Problems in observability tooling, such as unintended performance impact or missing telemetry
Authorization
5 CREs
Problems related to authorization, such as missing or invalid credentials, or misconfigurations
AWS
4 CREs
Amazon Web Services
PostgreSQL
4 CREs
Problems with PostgreSQL
Otel Collector
4 CREs
Failures in OpenTelemetry Collector pipelines or exporters.
Redis
4 CREs
Issues involving Redis availability, eviction policies, or timeouts.
Grafana
4 CREs
Problems related to Grafana services that may impact performance or telemetry collection and storage
Kubernetes
4 CREs
Problems related to Kubernetes, such as pod failures, API errors, or scheduling issues
High Availability
4 CREs
Problems related to high-availability systems and failover
SpiceDB
4 CREs
Problems related to SpiceDB authorization service, including schema corruption, permission failures, and database connectivity issues
Startup Failure
4 CREs
Problems related to application or service startup failures, such as missing dependencies or configuration errors
Kafka
3 CREs
Problems with Apache Kafka
Crash
3 CREs
Problems with applications crashing
RabbitMQ
3 CREs
Problems with RabbitMQ
Ovn
3 CREs
Issues in Open Virtual Network components used with SDN setups.
Telemetry
3 CREs
Issues with emitting, collecting, or transforming observability data.
Timeout
3 CREs
Operations that exceeded their allotted execution window.
Data Loss
3 CREs
Problems where data is lost or dropped due to system failures or processing errors
Datadog
3 CREs
Problems related to Datadog integration, such as missing metrics, reporting failures, or misconfigurations
Django
3 CREs
Problems related to the Django framework, such as view errors, middleware faults, or misconfigurations
Exporter
3 CREs
Problems related to metric exporters, such as missing, malformed, or unreported metric data
Load Balancer
3 CREs
Problems related to load balancers, such as misrouting, unhealthy backends, or configuration faults
Memory
3 CREs
Problems related to memory usage, such as leaks, pressure, or out-of-memory crashes
Networking
3 CREs
Problems within networking components, such as interface misconfigurations or routing errors
NATS
3 CREs
Problems related to NATS, such as authorization failures, message loss, or configuration issues
Alloy
3 CREs
Problems related to Grafana Alloy, such as Loki fanout crashes or log entries falling too far behind.
Redpanda
3 CREs
Issues specifically related to the Redpanda streaming platform.
Karpenter
2 CREs
Problems with Karpenter
KEDA
2 CREs
Problems with KEDA Operator
Openstack
2 CREs
Problems specific to OpenStack infrastructure components and deployments.
Opentelemetry
2 CREs
Errors or gaps in tracing and metrics collection using OpenTelemetry libraries.
Plugin
2 CREs
Failures or misbehavior in third-party or custom plugin systems.
Prometheus
2 CREs
Problems with scraping, rule evaluation, or querying Prometheus data.
Psycopg2
2 CREs
Python client errors related to connecting or querying PostgreSQL using psycopg2.
Python
2 CREs
General Python runtime errors or stack traces.
Storage
2 CREs
Failures in block, object, or ephemeral storage backends.
Validation
2 CREs
Input or schema validation failures in form submissions or APIs.
Async
2 CREs
Problems related to asynchronous execution, such as hung tasks, race conditions, or callback errors
Authentication
2 CREs
Problems related to user or service authentication, such as invalid tokens or failed logins
Cache
2 CREs
Problems related to caching mechanisms, including stale data, cache misses, or eviction faults
Memcached
2 CREs
Problems related to Memcached, such as cache evictions, connection errors, or stale entries
Neutron
2 CREs
Problems related to OpenStack Neutron, such as network provisioning or connectivity failures
DNS
2 CREs
Problems related to DNS, such as hostname resolution failures, or DNS server misconfigurations
Proxy
2 CREs
Problems related to proxy configurations or usage
READONLY
2 CREs
Errors when writing to a read-only Redis replica.
Cluster Degradation
2 CREs
Problems related to cluster availability
Temporal
2 CREs
Problems related to Temporal
Node Down
2 CREs
Problems related to nodes going down in a cluster, impacting availability and performance
Data Availability
2 CREs
Problems related to data availability in distributed systems, such as loss of access to critical data
Database Corruption
2 CREs
Problems where database tables, schemas, or data become corrupted, leading to missing relations or inaccessible data
GKE
1 CREs
Google Kubernetes Engine
EKS
1 CREs
Amazon Elastic Kubernetes Service
Celery
1 CREs
Problems with Celery
Errors
1 CREs
Problems with application errors
Misconfiguration
1 CREs
Problems with misconfigurations
Operational error
1 CREs
A runtime issue caused by system-level factors like resource limits or connectivity.
Ovsdb
1 CREs
Failures involving the OVSDB (Open vSwitch Database) protocol or schema.
Panic
1 CREs
Crashes due to unrecoverable errors, especially in Go or Rust applications.
Password
1 CREs
Problems with password policies, validation, or storage.
Redis CLI
1 CREs
Problems with the Redis command-line interface, such as connection issues, command errors or rejections.
Redis Py
1 CREs
Errors with the `redis-py` client library in Python.
Retry
1 CREs
Logic or policy failures when retrying failed operations.
S3
1 CREs
Errors related to object access, buckets, or permissions in Amazon S3.
Service
1 CREs
Failures at the service or API layer of an application.
Signature
1 CREs
Problems with signing or verifying cryptographic signatures.
Sqlalchemy
1 CREs
Errors in SQLAlchemy ORM usage, session handling, or migrations.
Ssl
1 CREs
SSL/TLS handshake errors or expired/invalid certificates.
Threads
1 CREs
Race conditions, deadlocks, or errors in multithreaded environments.
Transaction
1 CREs
Database or service transaction failures due to commits or rollbacks.
Translation
1 CREs
Errors in i18n/l10n string resolution or missing language assets.
Uri
1 CREs
Malformed or invalid Uniform Resource Identifier usage.
Web
1 CREs
Browser-facing issues in HTTP, HTML, or frontend integration layers.
ebs
1 CREs
Problems with Amazon EBS (Elastic Block Store).
csi
1 CREs
Container Storage Interface (CSI)
Api Key
1 CREs
Problems related to API keys, such as missing, invalid, or expired credentials
Attribute
1 CREs
Problems related to missing or unexpected object attributes, causing attribute access failures
Attributeerror
1 CREs
Problems where code fails due to attribute lookup errors, such as missing attributes on objects
Backpressure
1 CREs
Problems where producers overwhelm consumers, causing resource exhaustion or unhandled pressure
Batch
1 CREs
Problems related to batch processing, such as job failures, incorrect batch sizing, or order issues
Connection
1 CREs
Problems related to network connections, such as timeouts, refusals, or resets
Context
1 CREs
Problems related to context propagation, such as lost, overwritten, or mismatched context values
Contextvars
1 CREs
Problems specifically with Python context variables, such as improper isolation or missing context
Deadlock
1 CREs
Problems where threads or processes enter deadlock, preventing further progress
Disallowedhost
1 CREs
Problems where incoming requests are blocked due to disallowed Host header settings
Drop
1 CREs
Problems where messages or data are unexpectedly dropped or discarded
Environment
1 CREs
Problems related to environment variables or runtime environment settings
Escaping
1 CREs
Problems related to improper escaping of strings or data, leading to injection or parsing issues
Fork
1 CREs
Problems related to process forking, such as unsafe forks or resource duplication
Helm
1 CREs
Problems related to Helm deployments, such as chart rendering failures or template errors
Host Header
1 CREs
Problems due to incorrect or malicious Host header values
Infrastructure
1 CREs
Problems at the infrastructure level, such as resource outages or provisioning failures
Instrumentation
1 CREs
Problems related to instrumentation code, such as missing spans, broken traces, or metric gaps
Irsa
1 CREs
Problems related to IAM Roles for Service Accounts (IRSA), such as permission denials or misbindings
Logical Switch
1 CREs
Problems related to logical switch configurations in virtual networking
Memory Pressure
1 CREs
Problems where applications or services experience high memory usage, leading to performance degradation or crashes
Metrics
1 CREs
Problems related to metrics collection or reporting, such as missing, delayed, or incorrect data
Multiprocessing
1 CREs
Problems related to multiprocessing, such as process spawning failures or inter-process communication issues
Network
1 CREs
Problems related to network communication, such as packet loss, latency spikes, or unreachable hosts
Nil Pointer
1 CREs
Problems where code dereferences nil pointers, causing runtime crashes
Payload
1 CREs
Problems related to message payloads, such as malformed data or size limit violations
TLS
1 CREs
Problems related to TLS, such as weak ciphers, configuration errors, or performance issues
Weak Ciphers
1 CREs
Problems related to weak ciphers, such as RC4, DES, or MD5
Kubelet
1 CREs
Problems related to Kubelet, such as node not ready, or pod failures
Upstream Failure
1 CREs
Problems where Nginx cannot successfully forward requests to backend services
Connection Refused
1 CREs
Problems where a connection attempt is rejected by the target server
Buffer
1 CREs
Problems related to buffering
Capacity Issue
1 CREs
Problems related to system capacity
Connectivity
1 CREs
Problems related to network connectivity
Header Size
1 CREs
Problems related to the size of headers
Upload Limits
1 CREs
Problems related to upload size limits
Web Server
1 CREs
Problems related to web server configurations or issues
Admission Controller
1 CREs
Problems related to Kubernetes admission controllers
Disk Monitor
1 CREs
Problems related to disk monitoring
Disk Full
1 CREs
Problems related to disk full errors, such as insufficient space for writes or data storage
Kombu
1 CREs
Problems related to the Kombu messaging library
Backend Issue
1 CREs
Problems related to the backend systems or services
CWS
1 CREs
Problems related to Cloud Workload Security
Monitoring
1 CREs
Problems related to system or application monitoring
Permissions
1 CREs
Problems related to user or system permissions
Log Noise
1 CREs
Problems related to excessive or irrelevant log entries that obscure meaningful information.
PostgreSQL
1 CREs
Problems related to the PostgreSQL database system.
Silent Failure
1 CREs
Problems that do not produce visible errors or logs, making them hard to detect.
Terraform
1 CREs
Problems related to the Terraform infrastructure as code tool.
Version Incompatibility
1 CREs
Problems arising from incompatible versions of software components or libraries.
VPC CNI
1 CREs
Problems related to the VPC CNI (Container Network Interface) plugin.
webhook
1 CREs
Problems related to webhooks.
Ubuntu
1 CREs
Problems related to Ubuntu, such as package updates, or desktop issues
Gnome
1 CREs
Problems related to Gnome, such as input lag, or performance issues
Nvidia
1 CREs
Problems related to Nvidia, such as driver issues, or performance issues
SLURM
1 CREs
Problems related to SLURM workload manager
SlurmDBD
1 CREs
Problems related to SLURM Database Daemon
MySQL
1 CREs
Problems related to MySQL database
Write Failure
1 CREs
Problems where writes to a database or storage system fail due to insufficient space or other issues
Out of Memory
1 CREs
Errors due to Redis (or another in-memory store) exhausting its configured RAM.
Persistence
1 CREs
Issues around writing data to disk (RDB/AOF) or failing to persist.
RDB
1 CREs
Redis RDB snapshot errors (e.g. BGSAVE failures).
MISCONF
1 CREs
Redis "MISCONF" errors (stop-writes due to snapshot or AOF failures).
ACL
1 CREs
Redis ACL (NOPERM) permission-denied events.
NFS
1 CREs
Problems related to NFS (network file systems)
securityContext
1 CREs
Problems related to Kubernetes securityContext
Broker Failure
1 CREs
Problems related to Kafka broker failures
Replication
1 CREs
Replication failures, lag, or divergence in stateful systems.
Reverse Proxy
1 CREs
Problems related to reverse proxy configurations or issues
Service Outage
1 CREs
Problems related to service outages, such as complete service unavailability or critical failures
Cascading Failure
1 CREs
Problems related to cascading failures, where one failure leads to multiple dependent failures
Worker Problems
1 CREs
Problems related to process workers
gRPC
1 CREs
Problems related to gRPC
Streaming Data
1 CREs
Problems related to streaming data platforms and systems
Cluster Failure
1 CREs
Problems related to cluster failures, including node loss, quorum loss, and data availability impact
Quorum Loss
1 CREs
Problems related to loss of quorum in distributed systems, impacting consensus and availability
RPC
1 CREs
Remote Procedure Call errors or connectivity issues (includes timeouts, client-request failures, handler-not-found, etc.).
Raft
1 CREs
Issues related to the Raft consensus protocol—leader elections, step-downs, append-entries rejections, vote requests/replies, etc.
Migration Failure
1 CREs
Errors caused by failed or skipped database migrations
Schema Error
1 CREs
Missing or corrupted database schema elements such as tables or columns
Permission Failure
1 CREs
Problems where authorization checks fail due to system issues rather than legitimate access denials
Logs
1 CREs
Problems with log processing
Distributed System
1 CREs
Problems specific to distributed systems, including coordination, consistency, and network partition issues
Datastore
1 CREs
Problems with data storage systems, such as databases or object stores
Container Crash
1 CREs
Failures causing container crashes or unexpected terminations.
Memory Exhaustion
1 CREs
Failures due to running out of memory or excessive memory consumption.
Configuration Failure
1 CREs
Problems caused by incorrect or invalid configuration settings.
Streaming Platform
1 CREs
Issues related to distributed streaming platforms and their operations.
Kafka Compatible
1 CREs
Problems affecting Kafka-compatible systems or APIs, impacting interoperability.
Permission Denied
1 CREs
Failures caused by insufficient access rights or permission errors.
SIGKILL
1 CREs
Failures caused by processes being terminated with a SIGKILL signal.
nginx
8 CREs
opentelemetry-collector
4 CREs
rabbitmq
3 CREs
neutron
3 CREs
loki
3 CREs
django
3 CREs
datadog
3 CREs
nats
3 CREs
spicedb
3 CREs
alloy
2 CREs
eks-nodeagent
2 CREs
redis
2 CREs
karpenter
2 CREs
postgresql
2 CREs
redpanda
2 CREs
gke-metrics-agent
1 CREs
Unspecified
1 CREs
topic-operator
1 CREs
opentelemetry-python
1 CREs
sqlalchemy
1 CREs
redis-py
1 CREs
grafana
1 CREs
kubelet
1 CREs
terraform
1 CREs
syslog
1 CREs
manifest
1 CREs
kafka
1 CREs
kubernetes
1 CREs
redis-cli
1 CREs
worker
1 CREs
slurm
1 CREs
temporal
1 CREs
alloy log
1 CREs
application-logs
1 CREs
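Each CRE in the table below pairs a failure description with a detectable log signature. As a minimal illustrative sketch only — the rule ids and regexes here are hypothetical and much simpler than the real CRE rule format — matching such signatures against a log stream might look like:

```python
import re

# Hypothetical single-regex signatures inspired by entries in this feed.
# Real CRE rules carry structured metadata (category, tags, mitigation), not just a pattern.
SIGNATURES = {
    "rabbitmq-mnesia-overloaded": re.compile(r"\*\* WARNING \*\* Mnesia is overloaded"),
    "nginx-no-live-upstreams": re.compile(r"no live upstreams"),
}

def match_cres(log_lines):
    """Return the set of rule ids whose pattern appears anywhere in the log lines."""
    hits = set()
    for line in log_lines:
        for rule_id, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.add(rule_id)
    return hits
```

For example, `match_cres(["connect() failed; no live upstreams while connecting to upstream"])` would flag the NGINX rule.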
ID | Title | Description | Category | Tags
---|---|---|---|---
CRE-2024-0007 (Critical, Impact 9/10, Mitigation 8/10) | RabbitMQ Mnesia overloaded recovering persistent queues | The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. The underlying Erlang process, Mnesia, is overloaded (`** WARNING ** Mnesia is overloaded`). | Message Queue Problems | Known Problem, RabbitMQ, Public
CRE-2024-0008 (High, Impact 9/10, Mitigation 6/10) | RabbitMQ memory alarm | A RabbitMQ node has entered the “memory alarm” state because the total memory used by the Erlang VM (plus allocated binaries, ETS tables, and processes) has exceeded the configured `vm_memory_high_watermark`. While the alarm is active, the broker applies flow control, blocking publishers and pausing most ingress activity to protect itself from running out of RAM. | Message Queue Problems | Known Problem, RabbitMQ, Public
CRE-2024-0016 (Low, Impact 4/10, Mitigation 2/10) | Google Kubernetes Engine metrics agent failing to export metrics | The Google Kubernetes Engine metrics agent is failing to export metrics. | Observability Problems | Known Problem, GKE, Public
CRE-2024-0018 (Medium, Impact 4/10, Mitigation 5/10) | Neutron Open Virtual Network (OVN) high CPU usage | OVN daemons (e.g., ovn-controller) are stuck in a tight poll loop, driving CPU to 100%. Logs show “Dropped … due to excessive rate” or “Unreasonably long … poll interval,” slowing port binding and network traffic. | Networking Problems | Known Problem, Ovn, Public
CRE-2024-0021 (High, Impact 4/10, Mitigation 5/10) | KEDA operator reconciler ScaledObject panic | KEDA provides fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads, serving as a Kubernetes metrics server and letting users define autoscaling rules with a dedicated custom resource definition. In this CRE, the KEDA operator panics while reconciling a ScaledObject resource. | Operator Problems | KEDA, Crash, Known Problem, Public
CRE-2024-0043 (Medium, Impact 6/10, Mitigation 5/10) | NGINX Upstream DNS Failure | When an NGINX upstream becomes unreachable or its DNS entry disappears, NGINX requests begin to fail. | Proxy Problems | Kafka, Known Problem, Public
CRE-2025-0019 (Low, Impact 3/10, Mitigation 2/10) | Alloy entries too far behind | Grafana Alloy can get into a state where it writes more error messages than it can process. The problem is compounded when Alloy is collecting its own error logs, which include the warnings that it can no longer keep up. This can consume several GB of storage per day. | Storage | Grafana, Alloy, Loki, Public
CRE-2025-0020 (Medium, Impact 5/10, Mitigation 2/10) | Grafana Alloy Loki fanout crash | Grafana Alloy's Loki fanout crashes when the number of log files exceeds the number of ingesters. | Storage | Grafana, Alloy, Loki, Public
CRE-2025-0025 (Medium, Impact 6/10, Mitigation 5/10) | Kafka broker replication mismatch | When the configured replication factor for a Kafka topic is greater than the actual number of brokers in the cluster, Kafka repeatedly fails to assign partitions and logs replication-related errors. This results in persistent warnings or an `InvalidReplicationFactorException` when the broker tries to create internal or user-defined topics. | Message Queue Problems | Kafka, Known Problem, Public
CRE-2025-0026 (Low, Impact 6/10, Mitigation 1/10) | AWS EBS CSI Driver fails to detach volume when VolumeAttachment has empty nodeName | In clusters using the AWS EBS CSI driver, the controller may fail to detach a volume if the associated VolumeAttachment resource has an empty `spec.nodeName`. This results in a log error and skipped detachment, which may block PVC reuse or node cleanup. | Storage | ebs, csi, AWS, Storage, Public
CRE-2025-0027 (Low, Impact 7/10, Mitigation 2/10) | Neutron Open Virtual Network (OVN) and Virtual Interface (VIF) allows port binding to dead agents, causing VIF plug timeouts | In OpenStack deployments using Neutron with the OVN ML2 driver, ports could be bound to agents that were not alive. This behavior led to virtual machines experiencing network interface plug timeouts during provisioning, as the port binding would not complete successfully. | Networking Problems | Neutron, Ovn, Timeout, Networking, Openstack, Known Issue, Public
CRE-2025-0028 (Low, Impact 6/10, Mitigation 1/10) | OpenTelemetry Python fails to detach context token across async boundaries | In OpenTelemetry Python, detaching a context token that was created in a different context can raise a `ValueError`. This occurs when asynchronous operations, such as generators or coroutines, are finalized in a different context than the one they were created in, leading to context management errors and potential trace data loss. | Observability Problems | Opentelemetry, Python, Contextvars, Async, Observability, Public
CRE-2025-0029 (Low, Impact 6/10, Mitigation 5/10) | Loki fails to retrieve AWS credentials when specifying S3 endpoint with IRSA | When deploying Grafana Loki with AWS S3 as the storage backend and specifying a custom S3 endpoint (e.g., for FIPS compliance or GovCloud regions), Loki may fail to retrieve AWS credentials via IAM Roles for Service Accounts (IRSA). This results in errors during startup or when attempting to upload index tables, preventing Loki from functioning correctly. | Storage | Loki, S3, AWS, Irsa, Storage, Authentication, Helm, Public
CRE-2025-0030 (Medium, Impact 6/10, Mitigation 2/10) | SQLAlchemy create_engine fails when password contains special characters like @ | SQLAlchemy applications using `create_engine()` may fail to connect to a database if the username or password contains special characters (e.g., `@`, `:`, `/`, `#`). These characters must be URL-encoded when included in the database connection string. Failure to encode them leads to parsing errors or incorrect credential usage. | Orm | Sqlalchemy, Configuration, Password, Uri, Escaping, Connection, Known Issue, Public
CRE-2025-0031 (Medium, Impact 5/10, Mitigation 5/10) | Django returns DisallowedHost error for untrusted HTTP_HOST headers | Django applications may return a "DisallowedHost" error when receiving requests with an unrecognized or missing Host header. This typically occurs in production environments where reverse proxies, load balancers, or external clients send requests using an unexpected domain or IP address. Django blocks these requests unless the domain is explicitly listed in `ALLOWED_HOSTS`. | Framework Problems | Django, Disallowedhost, Configuration, Web, Security, Host Header, Public
CRE-2025-0032 (Low, Impact 2/10, Mitigation 4/10) | Loki generates excessive logs when memcached service port name is incorrect | Loki instances using memcached for caching may emit excessive warning or error logs when the configured `memcached_client` service port name does not match the actual Kubernetes service port. This does not cause a crash or failure, but it results in noisy logs and ineffective caching behavior. | Observability Problems | Loki, Memcached, Configuration, Service, Cache, Known Issue, Kubernetes, Public
CRE-2025-0033 (Low, Impact 7/10, Mitigation 4/10) | OpenTelemetry Collector refuses to scrape due to memory pressure | The OpenTelemetry Collector may refuse to ingest metrics during a Prometheus scrape if it exceeds its configured memory limits. When the `memory_limiter` processor is enabled, the Collector actively drops data to prevent out-of-memory errors, resulting in log messages indicating that data was refused due to high memory usage. | Observability Problems | Otel Collector, Prometheus, Memory, Metrics, Backpressure, Data Loss, Known Issue, Public
CRE-2025-0034 (Medium, Impact 6/10, Mitigation 2/10) | Datadog agent disabled due to missing API key | If the Datadog agent or client libraries do not detect a configured API key, they will skip sending metrics, logs, and events. This results in a silent failure of observability reporting, often visible only through startup log messages. | Observability Problems | Datadog, Configuration, Api Key, Observability, Environment, Telemetry, Known Issue, Public
CRE-2025-0035 (Critical, Impact 7/10, Mitigation 6/10) | psycopg2 SSL error due to thread or forked process state | Applications using psycopg2 with OpenTelemetry instrumentation or threading may fail with SSL-related errors such as "decryption failed or bad record mac". This often occurs when a database connection is created before a fork or from an unsafe thread context, causing the SSL state to become invalid. | Database Problems | Ssl, Psycopg2, Fork, Threads, Django, Instrumentation, Opentelemetry, Known Issue, Public
CRE-2025-0036 (Low, Impact 6/10, Mitigation 3/10) | OpenTelemetry Collector drops data due to 413 Payload Too Large from exporter target | The OpenTelemetry Collector may drop telemetry data when an exporter backend responds with a 413 Payload Too Large error. This typically happens when large batches of metrics, logs, or traces exceed the maximum payload size accepted by the backend. By default, the collector drops these payloads unless retry behavior is explicitly enabled. | Observability Problems | Otel Collector, Exporter, Payload, Batch, Drop, Observability, Telemetry, Known Issue, Public
CRE-2025-0037 (Low, Impact 8/10, Mitigation 4/10) | OpenTelemetry Collector panics on nil attribute value in Prometheus Remote Write translator | The OpenTelemetry Collector can panic due to a nil pointer dereference in the Prometheus Remote Write exporter. The issue occurs when attribute values are assumed to be strings, but the internal representation is nil or incompatible, leading to a runtime `SIGSEGV` segmentation fault and crashing the collector. | Observability Problems | Crash, Prometheus, Otel Collector, Exporter, Panic, Translation, Attribute, Nil Pointer, Known Issue, Public
CRE-2025-0038 (Low, Impact 5/10, Mitigation 3/10) | Loki fails to cache entries due to Memcached out-of-memory error | Grafana Loki may emit errors when attempting to write to a Memcached backend that has run out of available memory. This results in dropped index or query cache entries, which can degrade query performance but does not interrupt ingestion. | Observability Problems | Loki, Memcached, Cache, Memory, Infrastructure, Known Issue, Public
CRE-2025-0039 (Medium, Impact 5/10, Mitigation 3/10) | OpenTelemetry Collector exporter experiences retryable errors due to backend unavailability | The OpenTelemetry Collector may intermittently fail to export telemetry data when the backend API is unavailable or overloaded. These failures manifest as timeouts (`context deadline exceeded`) or transient HTTP 502 responses. While retry logic is typically enabled, repeated failures can introduce delay or backpressure. | Observability Problems | Otel Collector, Exporter, Timeout, Retry, Network, Telemetry, Known Issue, Public
CRE-2025-0040 (Low, Impact 6/10, Mitigation 4/10) | Neutron Open Virtual Network (OVN) fails to bind logical switch due to race condition during load balancer creation | During load balancer creation or other operations involving logical router and logical switch associations, Neutron OVN may raise a `RowNotFound` exception when attempting to reference a logical switch that has just been deleted. This leads to a port binding failure and a rollback of the affected operation. | Networking Problems | Neutron, Ovn, Openstack, Load Balancer, Logical Switch, Ovsdb, Known Issue, Public
CRE-2025-0041 Low Impact: 5/10 Mitigation: 4/10 | redis-py client fails with AttributeError when reused across async or process contexts | - In redis-py v5.x, sharing a single Redis client across async tasks or subprocesses can result in: - `AttributeError: ''NoneType'' object has no attribute ''getpid''`. - This typically occurs when the client or connection pool is reused across forks or when event loop context is lost, especially in async frameworks or multiprocessing setups. | Cache Problems | RedisRedis PyPythonAsyncMultiprocessingContextAttributeerrorKnown IssuePublic |
CRE-2025-0042 Critical Impact: 7/10 Mitigation: 5/10 | PostgreSQL transaction fails with deadlock detected error in psycopg2 and Django | - Applications using Django with PostgreSQL and psycopg2 may encounter `deadlock detected` errors under concurrent write-heavy workloads. - PostgreSQL raises this error when two or more transactions block each other cyclically while waiting for locks, and one must be aborted. - Django surfaces this as an `OperationalError`, and the affected transaction is rolled back. | Database Problems | PostgreSQLPsycopg2DjangoTransactionDeadlockOperational errorPublicKnown Issue |
CRE-2025-0043 Medium Impact: 4/10 Mitigation: 2/10 | Grafana fails to load plugin due to missing signature | Grafana may reject custom or third-party plugins at runtime if they are not digitally signed. When plugin signature validation is enabled (default since Grafana 8+), unsigned plugins are blocked and logged as validation errors during startup or plugin loading. | Observability Problems | GrafanaPluginValidationSignatureConfigurationSecurityKnown IssuePublic |
CRE-2025-0044 High Impact: 9/10 Mitigation: 1/10 | NGINX Config Uses Insecure TLS Ciphers | Detects NGINX configuration files that advertise obsolete and cryptographically weak ciphers (RC4-MD5, RC4-SHA, DES-CBC3-SHA). These ciphers are vulnerable to several well-known attacks—including BEAST, BAR-Mitzvah, Lucky-13, and statistical biases in RC4—placing any client–server communication at risk of interception or tampering. | Insecure Configuration | NginxWeak CiphersSecurityConfigurationTLSKnown IssuePublic |
CRE-2025-0045 Medium Impact: 4/10 Mitigation: 4/10 | NATS Authorization Failure Detected | The NATS server has emitted an **Authorization Violation** log entry, meaning a client attempted to connect, publish, subscribe, or perform another operation for which it lacks permission. Intermittent violations often point to misconfiguration or start-up races, while sustained or widespread violations can signal expired credentials or missing secrets. | Authorization Problems | NATS, Security, Authorization, Public |
CRE-2025-0046 Medium Impact: 4/10 Mitigation: 4/10 | NATS Permissions Violation Detected | The NATS server has emitted a **Permissions Violation** log entry, meaning a client attempted to publish or subscribe to a subject for which it lacks permission. | Authorization Problems | NATS, Security, Authorization, Public |
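For both NATS violation types above, the usual fix is to grant the connecting user exactly the subject permissions it needs. A hedged sketch of a nats-server configuration (user name, password placeholder, and subjects are illustrative):

```
authorization {
  users = [
    {
      user: "orders-service"
      password: "change-me"        # use bcrypt hashes or creds files in production
      permissions: {
        # Allow publishing and subscribing on the service's own subject tree,
        # plus the _INBOX prefix needed for request/reply.
        publish:   ["orders.>"]
        subscribe: ["orders.>", "_INBOX.>"]
      }
    }
  ]
}
```

Verify the effective permissions against the server's own documentation for your NATS version, since decentralized JWT/account setups configure this differently.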
CRE-2025-0048 Low Impact: 5/10 Mitigation: 3/10 | Kubelet node not ready due to a DNS hostname resolution failure | A Kubernetes worker node has entered the **NotReady** state because the kubelet cannot resolve a required hostname via DNS. | Kubernetes Problems | Kubelet, Kubernetes, DNS, Public |
CRE-2025-0049 Low Impact: 2/10 Mitigation: 8/10 | NATS Payload Size Too Big | The NATS server is configured to accept message payloads that may exceed the recommended maximum of 8 MB (the server’s default hard limit is 1 MB, but it can be raised to 64 MB). Large messages put disproportionate pressure on broker memory, network buffers, and client back-pressure mechanisms. This warning signals that NATS is at risk of degraded throughput, slow consumers, and forced connection closures intended to protect cluster stability. | Message Queue Problems | NATS, Public |
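The limit is controlled by the server's `max_payload` setting; a sketch of a nats-server configuration keeping it within the recommended ceiling (the value shown is an example, not a requirement):

```
# Default is 1MB; values above ~8MB are discouraged even though the server
# permits up to 64MB.
max_payload: 8MB
```

For genuinely large transfers, chunking the data or passing an object-store reference through NATS is generally safer than raising this limit.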
CRE-2025-0051 High Impact: 9/10 Mitigation: 5/10 | NGINX No Live Upstreams Available | NGINX is reporting that all backend servers in an upstream group are unavailable. This means that NGINX cannot route requests to any of its configured backend servers, resulting in client-facing errors. | Load Balancer Problems | Nginx, Load Balancer, Upstream Failure, Connectivity |
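Tuning the upstream's failure accounting and keeping a spare server can soften this failure mode. A sketch (addresses are placeholders):

```nginx
upstream backend {
    # A server is marked unavailable after 3 failures within 30s,
    # then retried after the fail_timeout window.
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.13:8080 backup;  # only used when all primaries are down
}
```

If "no live upstreams" recurs, the root cause is usually the backends themselves failing health, not the NGINX settings.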
CRE-2025-0053 Medium Impact: 5/10 Mitigation: 3/10 | NGINX Client Upload Size Limit Exceeded | The NGINX server is receiving upload requests with bodies that exceed the configured size limits. This occurs when clients attempt to send files or data larger than what the server is configured to accept. | Web Server Problem | Nginx, Upload Limits, Configuration |
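The relevant directive is `client_max_body_size` (default 1m), which can be set at the `http`, `server`, or `location` level; the value below is an example:

```nginx
server {
    # Reject request bodies above 50 MB with HTTP 413.
    client_max_body_size 50m;

    location /uploads/ {
        # A narrower scope can override the server-wide value.
        client_max_body_size 200m;
    }
}
```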
CRE-2025-0054 Medium Impact: 7/10 Mitigation: 5/10 | NGINX upstream connection timeout | NGINX reports an upstream timeout error when it cannot establish or maintain a connection to backend services within the configured timeout threshold. This occurs when backend services are unresponsive, overloaded, or when the timeout values are set too low for normal operating conditions. The error indicates that NGINX attempted to proxy a request to an upstream server, but the connection or read operation timed out before completion. | Proxy Timeout Problems | Nginx, Timeout, Proxy, Backend Issue, Networking |
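The three proxy timeout directives cover connect, send, and read phases separately; a sketch with example values (tune them to the backend's real latency profile):

```nginx
location /api/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;   # TCP/TLS handshake to the upstream
    proxy_send_timeout    60s;  # gap between two successive writes to it
    proxy_read_timeout    60s;  # gap between two successive reads from it
}
```

Raising timeouts hides slowness rather than fixing it; pair any increase with investigation of why the upstream is slow.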
CRE-2025-0055 Medium Impact: 8/10 Mitigation: 3/10 | Nginx upstream buffer size too small | Nginx reports that an upstream server is sending headers that exceed the configured buffer size limits. This typically happens when the upstream application sends responses with large headers, cookies, or other header fields that don't fit in the default buffer allocation. When this occurs, Nginx cannot properly proxy the response to clients, resulting in HTTP errors. | Web Server Problems | Nginx, Configuration, Proxy, Header Size, Buffer |
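The response header must fit in a single `proxy_buffer_size` buffer; enlarging it (and the general buffers) resolves the error. Example values below are a sketch, not a recommendation for every workload:

```nginx
location / {
    proxy_pass http://backend;
    proxy_buffer_size 16k;        # must hold the entire upstream response header
    proxy_buffers 8 16k;          # buffers for the response body
    proxy_busy_buffers_size 32k;  # portion that may be busy sending to the client
}
```

Oversized headers are often a symptom of bloated cookies or tokens; trimming those is the more durable fix.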
CRE-2025-0056 Medium Impact: 8/10 Mitigation: 3/10 | NGINX worker connections limit exceeded | NGINX has reported that the configured worker_connections limit has been reached. This indicates that the web server has exhausted the available connection slots for handling concurrent client requests. When this limit is reached, new connection attempts may be rejected until existing connections are closed, causing service degradation or outages. | Web Server Problems | Nginx, Capacity Issue, Web Server, Configuration, Public |
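Raising the limit means raising both `worker_connections` and the worker's file-descriptor ceiling, since each connection consumes at least one descriptor (proxied connections consume two). A sketch:

```nginx
worker_processes auto;        # one worker per CPU core
worker_rlimit_nofile 65535;   # per-worker open-file limit must exceed connections

events {
    worker_connections 8192;  # max simultaneous connections per worker
}
```

Total capacity is roughly `worker_processes × worker_connections`, shared between client and upstream connections.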
CRE-2025-0057 Low Impact: 3/10 Mitigation: 1/10 | Verbose Logging in AWS Network Policy Agent During Policy Verdicts | - When using AWS Network Policy Agent with VPC CNI addon v1.17.1, the log message `failed to get caller` may appear frequently. - This behavior correlates with policy verdicts being evaluated, and the volume increases in environments with higher traffic or more active policies. - The issue does not indicate functional failure, but it increases log volume and may obscure real issues. | Logging Problems | AWS, VPC CNI, Log Noise |
CRE-2025-0058 Medium Impact: 7/10 Mitigation: 4/10 | Celery Worker Stops Consuming Tasks After Redis Restart | - When Redis is restarted, Celery workers using Redis as a broker may stop consuming tasks without exiting or logging a fatal error. - Although Celery Beat continues to publish tasks successfully, the worker remains in a broken state until manually restarted. - This results in a silent backlog of scheduled but unprocessed tasks. | Task Management Problems | Celery, Silent Failure, Redis, Kombu |
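Mitigations typically combine broker reconnection settings with a transport-level health check so dead sockets are noticed. The snippet below is a hedged sketch of Celery's lowercase settings names; verify each against the Celery and kombu versions you actually run.

```python
# Sketch of Celery app configuration aimed at surviving a Redis restart.
# All keys are standard Celery settings, but semantics vary by version.
settings = {
    "broker_connection_retry": True,             # retry lost broker connections
    "broker_connection_retry_on_startup": True,  # needed on Celery >= 5.3
    "broker_connection_max_retries": None,       # None = retry forever
    "broker_transport_options": {
        "health_check_interval": 10,  # kombu probes the Redis socket every 10s
    },
}
```

In practice you would pass these via `app.conf.update(**settings)`; pairing them with an external liveness check on the worker guards against any remaining silent-stall cases.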
CRE-2025-0059 Low Impact: 6/10 Mitigation: 2/10 | Datadog CWS Instrumentation webhook registration fails without service account | - Datadog Cluster Agent fails to register its CWS (Container Workload Security) instrumentation webhook when running in `remote_copy` mode without a configured service account. | Configuration Problem | Datadog, CWS, Admission Controller, Webhook, Configuration, Known Issue |
CRE-2025-0060 Low Impact: 5/10 Mitigation: 2/10 | Datadog OpenMetrics Scrape Returns 404 | - The Datadog Agent is unable to scrape metrics from an OpenMetrics endpoint, returning a 404 Not Found error. - This typically indicates that the target service is either not exposing the `/metrics` path, the port or path is misconfigured, or the service is not running. | Monitoring Problem | Observability, Datadog |
CRE-2025-0061 Medium Impact: 7/10 Mitigation: 4/10 | Karpenter Stability Issues on EKS During Leader Election | - EKS clusters may handle steady, predictable scale well but struggle during large-scale autoscaling events, when many workloads and nodes spin up or down simultaneously. - This instability affects components that implement leader election using the Kubernetes API, such as: - aws-load-balancer-controller - karpenter - keda-operator - ebs-csi-controller - efs-csi-controller | Stability Problems | Karpenter, KEDA, AWS, EKS |
CRE-2025-0062 Medium Impact: 6/10 Mitigation: 2/10 | Karpenter Version Incompatible with Kubernetes Version | - Karpenter logs an error when its current version is not compatible with the running Kubernetes control plane version. - This results in provisioning failures and indicates a required upgrade to align compatibility. - The issue is surfaced via structured logs from the controller. | Incompatibility Problem | Version Incompatibility, Karpenter |
CRE-2025-0063 Medium Impact: 6/10 Mitigation: 3/10 | RabbitMQ disk monitor fails to initialize | - RabbitMQ's disk monitor process cannot start or retrieve free-space metrics, preventing it from detecting low-disk conditions. | Message Queue Problems | RabbitMQ, Disk Monitor, Monitoring, Plugin |
CRE-2025-0064 High Impact: 5/10 Mitigation: 2/10 | Terraform Cloud Authentication Failure | - This error occurs when Terraform Cloud authentication fails due to missing or invalid API tokens, workspace or organization misconfiguration, or insufficient permissions. | Provisioning Problems | Terraform, Permissions, Authentication |
CRE-2025-0068 Low Impact: 6/10 Mitigation: 5/10 | Gnome input lag on Nvidia Ubuntu desktops | Keyboard input or the entire screen may freeze at times on systems using the Nvidia Xorg driver with GNOME. | Ubuntu Desktop Problems | Gnome, Nvidia, Ubuntu |
CRE-2025-0069 Medium Impact: 6/10 Mitigation: 4/10 | Kubernetes fsGroup ignored on NFS volumes | Pods that mount NFS volumes and set `securityContext.fsGroup` still have the directory owned by `root:root`. The kubelet does not chown the share, so non-root containers fail with "Permission denied". | Kubernetes Storage Problems | Kubernetes, NFS, securityContext |
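Since the kubelet will not apply `fsGroup` ownership to NFS, a common workaround is to fix permissions from an init container running as root. A sketch (UID/GID, volume name, and mount path are illustrative):

```yaml
# Pod spec fragment: chown the NFS mount before the app container starts.
initContainers:
  - name: fix-perms
    image: busybox
    command: ["sh", "-c", "chown -R 1000:2000 /data && chmod -R g+rwX /data"]
    volumeMounts:
      - name: nfs-vol
        mountPath: /data
```

Alternatively, export the share with ownership matching the pod's runAsUser/fsGroup so no chown is needed at all.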
CRE-2025-0070 Critical Impact: 10/10 Mitigation: 6/10 | Kafka Under-Replicated Partitions Crisis | Critical Kafka cluster degradation detected: Multiple partitions have lost replicas due to broker failure, resulting in an under-replicated state. This pattern indicates a broker has become unavailable, causing partition leadership changes and In-Sync Replica (ISR) shrinkage across multiple topics. | Message Queue Problems | Kafka, Replication, Data Loss, High Availability, Broker Failure, Cluster Degradation |
CRE-2025-0071 High Impact: 9/10 Mitigation: 8/10 | CoreDNS unavailable | CoreDNS deployment is unavailable or has no ready endpoints, indicating an imminent cluster-wide DNS outage. | Kubernetes Problems | Kubernetes, Networking, DNS, High Availability |
CRE-2025-0072 Critical Impact: 10/10 Mitigation: 7/10 | Redis Out-Of-Memory → Persistence Crash → Replica/ACL Write Failures | Detects a cascade of critical Redis failure modes in a single session: - Redis refuses writes when maxmemory is exceeded (OOM). - RDB snapshot (BGSAVE) fails (MISCONF) due to simulated full-disk. - Replica refuses writes (READONLY). - ACL denies a write (NOPERM). | In-Memory Database Problems | Redis, Out of Memory, Persistence, RDB, MISCONF, READONLY, ACL, Security |
CRE-2025-0073 High Impact: 9/10 Mitigation: 6/10 | Redis Rejects Writes Due to Reaching 'maxmemory' Limit | The Redis instance has reached its configured 'maxmemory' limit. Because its active memory management policy does not permit the eviction of existing keys to free up space (as is the case when the 'noeviction' policy is in effect, which is often the default), Redis rejects new write commands by sending an "OOM command not allowed" error to the client. | Database Problems | Redis, Redis CLI, Memory Pressure, Memory, Data Loss, Public |
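When Redis is used as a cache, switching the eviction policy away from `noeviction` lets it reclaim space instead of refusing writes. A redis.conf sketch (the limit and policy shown are examples; the right policy depends on whether data loss via eviction is acceptable):

```
maxmemory 2gb
# Evict least-recently-used keys across the whole keyspace instead of
# returning "OOM command not allowed". Use noeviction (and alerting) when
# Redis holds data you cannot afford to drop.
maxmemory-policy allkeys-lru
```

The same settings can be applied at runtime with `CONFIG SET maxmemory-policy allkeys-lru`.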
CRE-2025-0074 Critical Impact: 10/10 Mitigation: 6/10 | Temporal Worker → Server Downtime → Connection Refused Failure | Detects failure when a Temporal worker is unable to reach the Temporal server. - This typically occurs during startup or after server downtime. - Worker log contains gRPC error: "connection refused". | workflow-orchestration-connectivity | Temporal, Worker Problems, gRPC, Connection Refused, Startup Failure |
CRE-2025-0075 Critical Impact: 10/10 Mitigation: 6/10 | Nginx Upstream Failure Cascade Crisis | Detects critical Nginx upstream failure cascades that lead to complete service unavailability. The rule identifies upstream failure patterns including DNS resolution failures, connection timeouts, SSL/TLS handshake errors, protocol violations, and server unavailability, followed by HTTP 5xx error responses within a 60-second window. Its regex patterns are tuned for broad detection coverage with low false-positive rates, capturing both the root cause (upstream failures) and the user-facing impact (HTTP errors) to provide complete incident context. | load-balancer-problem | Nginx, Reverse Proxy, Service Outage, High Availability, Load Balancer, Cascading Failure |
CRE-2025-0076 High Impact: 0/10 Mitigation: 9/10 | SlurmDBD Database Connection Lost | Detects when Slurm's accounting daemon (slurmdbd) or controller (slurmctld) loses connection to its MySQL database, causing job scheduling and recording to halt. | HPC Database Problems | SLURM, SlurmDBD, database-problem, MySQL, High Availability |
CRE-2025-0077 High Impact: 9/10 Mitigation: 7/10 | PostgreSQL Fails to Extend File Due to Disk Full | PostgreSQL logs an error when it cannot extend a data file (table/index) because the filesystem is out of disk space. This prevents writes requiring new allocation. | Database Problems | PostgreSQL, Disk Full, Write Failure, Public |
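Once space has been freed or the filesystem extended, standard catalog queries help identify what consumed it; a sketch using PostgreSQL's built-in size functions:

```sql
-- Overall size of the current database.
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Ten largest relations (tables and indexes, including TOAST).
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```

Bloated tables found this way often respond to `VACUUM (FULL)` or partitioning, which addresses the recurrence rather than just the symptom.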
CRE-2025-0078 High Impact: 10/10 Mitigation: 2/10 | SpiceDB Database Schema Failures: Missing Core Tables | Detects critical SpiceDB database schema failures caused by missing core tables like `metadata`, `alembic_version`, or `relation_tuple_transaction`. These errors often stem from incomplete migrations, startup race conditions, or schema corruption, resulting in a complete breakdown of SpiceDB authorization capabilities. | Authorization Systems | SpiceDB, Migration Failure, Schema Error, PostgreSQL |
CRE-2025-0079 Critical Impact: 10/10 Mitigation: 3/10 | SpiceDB Database Corruption: Critical Table Loss | Detects catastrophic SpiceDB database corruption where critical core tables like `alembic_version` and `relation_tuple_transaction` are missing or dropped. This represents complete database corruption that renders SpiceDB unable to perform any authorization operations, causing total permission system failure. | Authorization Systems | SpiceDB, Database Corruption, Authorization, PostgreSQL |
CRE-2025-0080 High Impact: 0/10 Mitigation: 9/10 | Redpanda High Severity Issues | Detects when Redpanda hits any of these on startup or early runtime: 1. Fails to create its crash_reports directory (POSIX error 13). 2. Heartbeat or node-status RPC failures indicating a broker is down. 3. Raft group failure. 4. Data center failure. | Data Streaming Platforms | Redpanda, Startup Failure, Permission Failure, RPC, Raft, Node Down, Cluster Degradation, Data Availability, Database Corruption |
CRE-2025-0081 Critical Impact: 9/10 Mitigation: 8/10 | Temporal Server Fails Persistence on Read-Only Database | Detects critical failures where Temporal Server is unable to perform essential database write operations (e.g., starting workflows, recording history, completing tasks) because its underlying SQL database (e.g., PostgreSQL) is in a read-only state. This leads to a halt or severe degradation in workflow processing and can cause cluster instability. | Temporal Server Failure | Temporal, PostgreSQL, READONLY |
CRE-2025-0085 High Impact: 8/10 Mitigation: 7/10 | SpiceDB Schema Validation Failures Block Authorization Updates | Detects SpiceDB schema validation failures that prevent authorization logic updates and deployments. These failures occur when invalid schema definitions are submitted, including syntax errors, circular dependencies, type conflicts, or malformed permission expressions, blocking critical authorization system updates. | Authorization Problems | SpiceDB, Authorization, Configuration, Validation, Crash, Startup Failure |
CRE-2025-0090 Low Impact: 0/10 Mitigation: 0/10 | Loki Log Line Exceeds Max Size Limit | Alloy detects that Loki is dropping log lines because they exceed the configured maximum line size. This typically indicates that applications are emitting extremely long log entries, which Loki is configured to reject by default. | Observability Problems | Alloy, Loki, Logs, Observability, Grafana |
CRE-2025-0099 High Impact: 8/10 Mitigation: 7/10 | Redpanda Crash Due to Memory Exhaustion and Startup Failures | Redpanda streaming platform crashes due to a combination of system-level failures including permission denied errors for performance monitoring subsystems, missing critical configuration files, and memory allocation failures. | Data Streaming Platforms | Redpanda, Container Crash, Memory Exhaustion, Configuration Failure, Streaming Platform, Kafka Compatible, Permission Denied, SIGKILL |
CRE-2025-0102 High Impact: 0/10 Mitigation: 0/10 | Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted | - The Redpanda streaming data platform is experiencing a severe, cascading failure. - This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down. - Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability. | Redpanda Problems | Redpanda, Streaming Data, Cluster Failure, Node Down, Quorum Loss, Data Availability, Errors, Distributed System |
CRE-2025-0105 High Impact: 9/10 Mitigation: 3/10 | SpiceDB Datastore Startup Failure | Detects critical failures where a SpiceDB instance cannot start due to an invalid schema or an uninitialized datastore during the bootstrap process. This is a common configuration error that prevents the service from initializing and serving requests, leading to a total service outage. | Authorization Systems | SpiceDB, Authorization, Datastore, Misconfiguration, Startup Failure |