Commercial CREs
Welcome to the Commercial CRE feed, where you can explore and discover commercial CREs by category, tag, or technology. The listing is organized into four sections: Categories, Tags, Technologies, and the full table of CREs.
Categories
Observability Problems
6 CREs
Problems related to observability, like monitoring, logging, and tracing
Container Security
6 CREs
Problems related to container security, such as image vulnerabilities, deprecated repositories, insecure registries, or container image policy violations
GraphQL Problems
5 CREs
Problems related to GraphQL
Jetty Problems
4 CREs
Problems related to Java Jetty
Ingress Problems
4 CREs
Problems related to Ingress
Resource Management
4 CREs
Problems related to resource management in Kubernetes, such as missing CPU/memory requests, limits, or resource allocation issues
Istio Problems
3 CREs
Problems related to Istio
OTEL Problems
3 CREs
Problems related to OTEL
Message Queue Problems
2 CREs
Problems related to message queues, like Kafka, RabbitMQ, NATS, and others
Memory Problems
2 CREs
Problems related to memory
ArgoCD Problems
2 CREs
Problems related to ArgoCD
Kubernetes Problems
2 CREs
Problems related to Kubernetes
AWS Problems
2 CREs
Problems related to AWS
Kubernetes Provisioning Problems
2 CREs
Problems related to Kubernetes node provisioning and scaling, such as autoscaler failures, capacity issues, or provisioner configuration problems
Kubernetes Best Practices
2 CREs
Problems related to violations of Kubernetes best practices, such as missing health checks, resource specifications, or security configurations
API Service Problems
1 CREs
Problems related to API services, such as GraphQL validation errors, REST API issues, or service communication failures
Proxy Problems
1 CREs
Problems related to proxies, like NGINX, HAProxy, and others
Networking Problems
1 CREs
Connectivity, DNS, or routing issues affecting system communication.
Service Mesh Monitoring
1 CREs
Problems related to service mesh monitoring
Storage Problems
1 CREs
Problems related to storage
Service Mesh Problems
1 CREs
Problems related to service mesh
MongoDB Problems
1 CREs
Problems related to MongoDB
SQL Problems
1 CREs
Problems related to SQL
Fault Tolerance Problems
1 CREs
Problems related to fault tolerance
Kafka Problems
1 CREs
Problems related to Kafka
Secrets Problems
1 CREs
Problems related to secrets
Clickhouse Problems
1 CREs
Problems related to Clickhouse
Postgres Problems
1 CREs
Problems related to Postgres
Kubernetes Networking Problems
1 CREs
Problems related to Kubernetes networking, including Ingress, Services, and traffic routing
Traefik Problems
1 CREs
Problems related to Traefik
Prometheus Problems
1 CREs
Problems related to Prometheus
NATS Problems
1 CREs
Problems related to NATS.io
Application Error
1 CREs
Problems related to application errors
Continuous Delivery Problems
1 CREs
Problems related to continuous delivery and deployment pipelines
High Availability Problems
1 CREs
Problems related to high availability, such as cluster communication failures, quorum loss, or split-brain scenarios
Database Integrity Problems
1 CREs
Problems related to database integrity constraints, such as not-null violations, unique constraint violations, or foreign key violations
Message Broker Errors
1 CREs
Problems related to message brokers, such as message size limits, connection issues, or configuration problems
Database Problems
1 CREs
Problems related to databases, like MySQL, PostgreSQL, MongoDB, and others
Workflow Service Problems
1 CREs
Problems related to workflow orchestration services, such as task execution failures, archival issues, or service coordination problems
Policy Enforcement Issues
1 CREs
Problems related to policy enforcement systems, such as admission controllers, policy engines, or security policy validation failures
Autoscaling Problems
1 CREs
Problems related to autoscaling behavior and policies, including HPA/VPA/Karpenter budget limits or scale failures
Data Storage Problems
1 CREs
Problems related to data storage systems, such as Elasticsearch indexing failures, field limit exceeded, or data persistence issues
Tags
Kubernetes
24 CREs
Problems related to Kubernetes, such as pod failures, API errors, or scheduling issues
Configuration
7 CREs
Problems caused by incorrect or missing configuration settings
Deployment
7 CREs
Problems related to application deployment, such as pod scheduling failures, resource constraints, or configuration deployment issues
Errors
6 CREs
Problems with application errors
Observability
6 CREs
Problems in observability tooling, such as unintended performance impact or missing telemetry
GraphQL
6 CREs
Problems related to GraphQL, such as Apollo GraphQL errors.
Bitnami
6 CREs
Problems related to Bitnami container images, repositories, and deployment issues
Container Images
6 CREs
Problems related to container image pulls, registry issues, or image deployment failures
AWS
5 CREs
Amazon Web Services
Loki
5 CREs
Problems with Grafana Loki
Nginx
5 CREs
Problems related to Nginx, such as weak ciphers, configuration errors, or performance issues
Istio
5 CREs
Problems related to Istio, such as Istio Ingress Gateway or Istio sidecar issues.
Exceptions
5 CREs
Problems related to exceptions, such as unhandled exceptions, or uncaught exceptions.
Image Pulls
5 CREs
Problems related to container image pull operations, registry connectivity, or image availability
Docker Hub
5 CREs
Problems related to Docker Hub registry, such as image pull issues, repository access, or registry policy changes
OOM
4 CREs
Problems related to Out of Memory (OOM), such as process OOM, or container OOM.
Jetty
4 CREs
Problems related to Java Jetty, such as Jetty HTTP 500 errors, or Jetty LDAP timeout.
Ingress
4 CREs
Problems related to Ingress, such as Ingress timeout, or Ingress connection timeout.
Threshold Exceeded
3 CREs
An external API limit has been exceeded and API requests are currently failing or being throttled
Known Problem
3 CREs
This is a documented known problem with known mitigations
Kafka
3 CREs
Problems with Apache Kafka
PostgreSQL
3 CREs
Problems with PostgreSQL
Prometheus
3 CREs
Problems with scraping, rule evaluation, or querying Prometheus data.
Storage
3 CREs
Failures in block, object, or ephemeral storage backends.
Timeout
3 CREs
Operations that exceeded their allotted execution window.
Datadog
3 CREs
Problems related to Datadog integration, such as missing metrics, reporting failures, or misconfigurations
Apollo
3 CREs
Problems related to Apollo, such as Apollo GraphQL errors.
Error
3 CREs
Problems related to errors, such as errors in the application, or errors in the infrastructure.
ArgoCD
3 CREs
Problems related to ArgoCD, such as ArgoCD applications in a sync loop.
OTEL
3 CREs
Problems related to OpenTelemetry, such as OpenTelemetry Collector export failures or timeouts.
Karpenter
2 CREs
Problems with Karpenter
Cache
2 CREs
Problems related to caching mechanisms, including stale data, cache misses, or eviction faults
Data Loss
2 CREs
Problems where data is lost or dropped due to system failures or processing errors
Memcached
2 CREs
Problems related to Memcached, such as cache evictions, connection errors, or stale entries
Memory
2 CREs
Problems related to memory usage, such as leaks, pressure, or out-of-memory crashes
Metrics
2 CREs
Problems related to metrics collection or reporting, such as missing, delayed, or incorrect data
Networking
2 CREs
Problems within networking components, such as interface misconfigurations or routing errors
Telepresence
2 CREs
Problems related to Telepresence, such as Telepresence.io Traffic Manager or Traffic Agent issues.
Certificate
2 CREs
Problems related to certificates, such as TLS handshake errors, or expired certificates.
Database
2 CREs
Problems related to databases, such as PostgreSQL, or MySQL.
LDAP
2 CREs
Problems related to LDAP, such as LDAP timeout, or LDAP connection timeout.
Autoscaling
2 CREs
Problems related to autoscaling, such as Karpenter or Cluster Autoscaler failures.
Configuration Issue
2 CREs
Problems related to configuration issues, such as misconfigured settings, invalid parameters, or missing configuration
Ingester
2 CREs
Problems related to data ingestion components, such as Loki ingesters, log processors, or data pipeline components
Deprecated Repository
2 CREs
Problems related to using deprecated or unsupported container image repositories
Scheduling
2 CREs
Problems related to pod or container scheduling decisions and resource allocation
Performance
2 CREs
Problems related to application or system performance degradation
Health Checks
2 CREs
Problems related to application health check configuration or monitoring
Reliability
2 CREs
Problems related to service reliability, availability, or fault tolerance
Continuous Delivery
1 CREs
Problems related to continuous delivery processes, pipelines, and deployment automation
GitOps
1 CREs
Problems related to GitOps practices, tools, and workflows for infrastructure and application deployment
API Error
1 CREs
Problems related to API errors, such as validation failures, malformed requests, or service communication issues
EKS
1 CREs
Amazon Elastic Kubernetes Service
Crash
1 CREs
Problems with applications crashing
Misconfiguration
1 CREs
Problems with misconfigurations
Panic
1 CREs
Crashes due to unrecoverable errors, especially in Go or Rust applications.
Security
1 CREs
Misconfigurations or vulnerabilities in authentication, authorization, or encryption.
Service
1 CREs
Failures at the service or API layer of an application.
Telemetry
1 CREs
Issues with emitting, collecting, or transforming observability data.
Validation
1 CREs
Input or schema validation failures in form submissions or APIs.
Ingress Resource
1 CREs
Problems related to Kubernetes Ingress resources and routing rules.
Path Validation
1 CREs
Problems related to validation of URL paths and routing patterns.
ALB
1 CREs
Problems related to AWS Application Load Balancer (ALB).
Routing
1 CREs
Problems related to traffic routing and path matching.
Backpressure
1 CREs
Problems where producers overwhelm consumers, causing resource exhaustion or unhandled pressure
Django
1 CREs
Problems related to the Django framework, such as view errors, middleware faults, or misconfigurations
Grafana
1 CREs
Problems related to Grafana services, that may impact performance, or telemetry collection and storage
Known Issue
1 CREs
Problems already identified and documented as known issues
Network
1 CREs
Problems related to network communication, such as packet loss, latency spikes, or unreachable hosts
NATS
1 CREs
Problems related to NATS, such as authorization failures, message loss, or configuration issues
Public
1 CREs
Open source CREs contributed by the problem detection community
DNS
1 CREs
Problems related to DNS, such as hostname resolution failures, or DNS server misconfigurations
Strimzi
1 CREs
Problems related to Strimzi, such as Kafka Topic Operator thread blocking, or Kafka Topic Operator not being able to create or update topics.
API Throttling
1 CREs
Problems related to API throttling, such as excessive client-side throttling, or API server throttling.
Traffic Manager
1 CREs
Problems related to Telepresence.io Traffic Manager, such as excessive client-side throttling, or API server throttling.
Envoy
1 CREs
Problems related to Envoy, such as proxy configuration or metrics scraping failures.
Service Mesh
1 CREs
Problems related to service mesh, such as Istio, or Envoy.
WAL
1 CREs
Problems related to the Write-Ahead Log (WAL)
Disk Space
1 CREs
Problems related to disk space, such as volumes filling up or running out of space.
Out of Disk Space
1 CREs
Problems where a disk has run out of space, such as a write-ahead log filling its volume.
Disk Full
1 CREs
Problems caused by a full disk, such as writes failing after disk space is exhausted.
Tracing
1 CREs
Problems related to tracing, such as Jaeger, or Zipkin.
Kiali
1 CREs
Problems related to Kiali, such as Kiali not being able to fetch Istio traces.
Sync
1 CREs
Problems related to syncing, such as ArgoCD applications in a sync loop.
nestjs
1 CREs
Problems related to NestJS Node.js framework, such as unhandled exceptions in resolvers, dependency injection failures, misconfigured modules, or errors surfaced through internal helpers like external-context-creator.js.
Java
1 CREs
Problems related to Java, such as Java exceptions, or Java errors.
SQL
1 CREs
Problems related to SQL, such as SQL errors, or SQL timeout.
MongoDB
1 CREs
Problems related to MongoDB, such as MongoDB timeout, or MongoDB connection timeout.
Replica
1 CREs
Problems related to replicas, such as replicas not being scheduled, or replicas not being ready.
Clickhouse
1 CREs
Problems related to ClickHouse, such as network errors or replica failures during large queries.
Secrets
1 CREs
Problems related to secrets management, such as failed secret retrieval or access errors.
Access Denied
1 CREs
Problems where access is denied, such as IAM policy or permission misconfigurations.
Network Errors
1 CREs
Problems related to network errors, such as connection failures or unreachable peers.
XDS
1 CREs
Problems related to xDS, such as control-plane connection errors or stream failures.
Fargate
1 CREs
Problems related to AWS Fargate, such as workloads failing to schedule on Fargate nodes.
Traefik
1 CREs
Problems related to Traefik, such as license validation or configuration failures.
Loadbalancer
1 CREs
Problems related to load balancers, such as target registration or health check failures.
Security Group
1 CREs
Problems related to security groups, such as missing or misconfigured cluster-ownership tags.
AWS Loadbalancer Controller
1 CREs
Problems related to the AWS Load Balancer Controller, such as reconciliation failures for Ingress or TargetGroupBinding resources.
Capacity
1 CREs
Problems related to capacity constraints, quotas, and limits
Budgets
1 CREs
Problems related to cost or resource budgets and budget enforcement
Runtime Error
1 CREs
Problems related to runtime errors, such as unhandled exceptions, or application crashes
Application Exception
1 CREs
Problems related to application exceptions, such as unhandled exceptions, or application crashes
Custom Resource
1 CREs
Problems related to Kubernetes custom resources, such as CRD validation errors or controller failures
Ruby
1 CREs
Problems related to Ruby applications, such as runtime errors, exceptions, or framework-specific issues
Vault
1 CREs
Problems related to HashiCorp Vault, such as unsealing failures, authentication issues, or secret management problems
Raft
1 CREs
Problems related to Raft consensus protocol, such as leader election failures, quorum loss, or cluster communication issues
Consensus
1 CREs
Problems related to distributed consensus mechanisms, such as quorum loss, split-brain scenarios, or leader election failures
Data Error
1 CREs
Problems related to data errors, such as malformed data, encoding issues, or data validation failures
Producer Error
1 CREs
Problems related to message producers, such as message size limits, connection issues, or configuration problems
Data Integrity
1 CREs
Problems related to data integrity, such as constraint violations, data validation failures, or data consistency issues
Unicode
1 CREs
Problems related to Unicode encoding, decoding, or escape sequences in data or application logic
Temporal
1 CREs
Problems related to Temporal workflow orchestration service, including worker, server, and visibility issues
Archival
1 CREs
Problems related to data archival processes, storage, or retrieval operations
Data Retention
1 CREs
Issues involving data lifecycle management, retention policies, or cleanup processes
Policy Management
1 CREs
Issues related to policy definition, enforcement, validation, or compliance in systems like Kyverno, OPA, or other policy engines
Kyverno
1 CREs
Issues specific to Kyverno policy engine, including policy validation, admission control, and JMESPath query failures
Data Transforms
1 CREs
Problems related to data transforms, such as Redpanda or Kafka data transform failures.
Pod Termination
1 CREs
Problems related to pod termination, such as pods stuck terminating or sandbox teardown failures.
WebAssembly
1 CREs
Problems related to WebAssembly, such as WebAssembly-based features being disabled or failing to run.
Cloudflare
1 CREs
Problems related to Cloudflare services, such as DNS API changes, authentication issues, or configuration problems
Cert-Manager
1 CREs
Problems related to cert-manager, such as certificate generation failures, ACME challenge issues, or DNS provider integration problems
API Deprecation
1 CREs
Problems caused by deprecated API endpoints, removed features, or breaking changes in external service APIs
Elasticsearch
1 CREs
Problems related to Elasticsearch, such as indexing failures, field limit exceeded, or cluster communication issues
Logstash
1 CREs
Problems related to Logstash, such as pipeline failures, output errors, or configuration issues
Indexing Failure
1 CREs
Problems related to data indexing failures, such as Elasticsearch indexing errors, field limit exceeded, or mapping issues
Object Size Limit
1 CREs
Problems related to object size limits being exceeded, such as cache objects, message payloads, or data entries exceeding configured size thresholds
Compactor
1 CREs
Problems related to data compaction processes, such as Loki compactor, database compaction, or log file consolidation operations
Schema
1 CREs
Problems related to database or storage schema issues, such as schema mismatches, validation failures, or migration problems
Index
1 CREs
Problems related to database or storage indexes, such as index corruption, missing indexes, or index configuration issues
Replication
1 CREs
Problems related to data replication, such as replication factor mismatches, replica failures, or synchronization issues
Dev Only
1 CREs
Problems related to development-only images or resources being used inappropriately in production
Image Pull Errors
1 CREs
Problems related to failures when pulling container images from registries
Repository Deprecation
1 CREs
Problems related to container image repositories being deprecated or discontinued
Migration Required
1 CREs
Problems requiring immediate migration to supported alternatives or new systems
Migration Planning
1 CREs
Problems related to planning and executing migrations between systems or services
Catalog Changes
1 CREs
Problems related to changes in service or image catalogs affecting existing deployments
CPU Requests
1 CREs
Problems related to CPU resource requests in container or pod specifications
CPU Limits
1 CREs
Problems related to CPU resource limits in container or pod specifications
Memory Requests
1 CREs
Problems related to memory resource requests in container or pod specifications
Memory Limits
1 CREs
Problems related to memory resource limits in container or pod specifications
Memory Exhaustion
1 CREs
Problems related to memory exhaustion causing node instability or service disruption
Resource Exhaustion
1 CREs
Problems related to CPU or other resource exhaustion causing performance degradation
Liveness Probe
1 CREs
Problems related to liveness probe configuration or health check failures
Readiness Probe
1 CREs
Problems related to readiness probe configuration or traffic routing issues
Availability
1 CREs
Problems related to service availability, uptime, or accessibility
Traffic Routing
1 CREs
Problems related to traffic routing decisions and load balancing
Technologies
v1
7 CREs
kubernetes
6 CREs
graphql
6 CREs
jetty
5 CREs
ingress-nginx
4 CREs
oom
3 CREs
argocd
3 CREs
istio
3 CREs
traffic-manager
2 CREs
prometheus
2 CREs
datadog
2 CREs
otel-collector
2 CREs
aws-load-balancer-controller
2 CREs
karpenter
2 CREs
ingester
2 CREs
distributor
2 CREs
loki
1 CREs
kiali
1 CREs
sql
1 CREs
pymongo
1 CREs
dru
1 CREs
external-secrets
1 CREs
clickhouse
1 CREs
traefik
1 CREs
nats
1 CREs
otel-operator
1 CREs
aws-cluster-autoscaler
1 CREs
ruby
1 CREs
vault
1 CREs
python
1 CREs
celery
1 CREs
psycopg2
1 CREs
kyverno
1 CREs
temporal
1 CREs
redpanda
1 CREs
aws-cni
1 CREs
cert-manager
1 CREs
logstash
1 CREs
compactor
1 CREs
CREs
ID | Title | Description | Category | Tags |
---|---|---|---|---|
prequel-2024-0006 Medium Impact: 8/10 Mitigation: 2/10 | Kafka Topic Operator Thread Blocked | There is a known issue in the Strimzi Kafka Topic Operator where the operator thread can become blocked. This can cause the operator to stop processing events and can lead to a backlog of events. This can cause the operator to become unresponsive and can lead to liveness probe failures and restarts of the Strimzi Kafka Topic Operator. | Message Queue Problems | Known ProblemKafkaStrimzi |
prequel-2025-0001 Critical Impact: 7/10 Mitigation: 3/10 | Telepresence.io Traffic Manager Excessive Client-side Kubernetes API Throttling | One or more cluster components (kubectl sessions, operators, controllers, CI/CD jobs, etc.) hit the **default client-side rate-limiter in client-go** (QPS = 5, Burst = 10). The client logs messages such as `Waited for ‹N›s due to client-side throttling, not priority and fairness` and delays each request until a token is available. Although the API server itself may still have spare capacity, and Priority & Fairness queueing is not the bottleneck, end-user actions and controllers feel sluggish or appear to “stall”. | Kubernetes Problems | KubernetesTelepresenceTraffic ManagerAPI Throttling |
prequel-2025-0002 Medium Impact: 7/10 Mitigation: 3/10 | Envoy metrics scraping failure with unexpected EOF | Prometheus is failing to scrape and write Envoy metrics from Istio sidecars due to an unexpected EOF error. This occurs when trying to collect metrics from services that don't have proper protocol selection configured in their Kubernetes Service definition | Service Mesh Monitoring | PrometheusIstioEnvoyMetricsService MeshKubernetes |
prequel-2025-0003 Low Impact: 4/10 Mitigation: 5/10 | Loki WAL Out of Disk Space | Loki is experiencing an out of disk space error due to the WAL (Write-Ahead Logging) filling up the disk. This can happen when the WAL is not properly configured or when the disk is full. | Storage Problems | LokiWALDisk SpaceOut of Disk SpaceDisk Full |
prequel-2025-0004 Low Impact: 7/10 Mitigation: 8/10 | Process Out of Memory | A pod OOM (Out Of Memory) crash occurs when a container inside a pod tries to use more memory than has been allocated to it, causing the container to be terminated by the operating system. | Memory Problems | OOM, Crash |
prequel-2025-0005 High Impact: 3/10 Mitigation: 3/10 | Kiali Unable to Fetch Istio Traces | Kiali is unable to fetch Istio traces due to a configuration error. | Service Mesh Problems | IstioTracingKiali |
prequel-2025-0006 Low Impact: 3/10 Mitigation: 7/10 | Apollo GraphQL Error | An application using Apollo GraphQL is experiencing an error. | GraphQL Problems | ApolloGraphQLError |
prequel-2025-0007 High Impact: 3/10 Mitigation: 7/10 | GraphQL "Cannot read properties of undefined" error | Indicates an error in a subgraph service query during query execution in a federated service. | GraphQL Problems | ApolloGraphQLError |
prequel-2025-0008 High Impact: 3/10 Mitigation: 7/10 | Apollo GraphQL DOWNSTREAM_SERVICE_ERROR | Indicates an error in a subgraph service query during query execution in a federated service. | GraphQL Problems | ApolloGraphQLError |
prequel-2025-0009 Low Impact: 4/10 Mitigation: 3/10 | ArgoCD Excessive Syncs | ArgoCD applications are repeatedly reconciling and syncing (a reconciliation storm). | ArgoCD Problems | ArgoCD, Sync |
prequel-2025-0010 High Impact: 8/10 Mitigation: 4/10 | Telepresence agent-injector certificate reload failure | Telepresence 2.5.x versions suffer from a critical TLS handshake error between the mutating webhook and the agent injector. When the certificate is rotated or regenerated, the agent-injector pod fails to reload the new certificate, causing all admission requests to fail with "remote error: tls: bad certificate". This effectively breaks the traffic manager's ability to inject the agent into workloads, preventing Telepresence from functioning properly. | Kubernetes Problems | Known ProblemTelepresenceKubernetesCertificate |
prequel-2025-0011 Medium Impact: 7/10 Mitigation: 5/10 | GraphQL internal server error due to record not found | The application is experiencing internal server errors when GraphQL operations attempt to access records that do not exist in the database. This occurs when GraphQL queries reference entities that have been deleted, were never created, or are inaccessible due to permission issues. Instead of handling these cases gracefully with proper error responses, the API is escalating them to internal server errors that may impact client applications and user experience. | GraphQL Problems | GraphQLDatabaseErrors |
prequel-2025-0012 High Impact: 6/10 Mitigation: 5/10 | GraphQL internal server error due to unhandled exception in NestJS resolver | The application is generating internal server errors during GraphQL operations due to uncaught exceptions in resolver logic. These errors are not properly handled or transformed into structured GraphQL responses, resulting in unexpected 500-level failures for client applications. Stack traces often reference NestJS internal files like `external-context-creator.js`, indicating the framework attempted to execute resolver logic but encountered an exception that was not intercepted by the application code. | GraphQL Problems | GraphQLErrorsnestjs |
prequel-2025-0013 Critical Impact: 9/10 Mitigation: 6/10 | Deployment Replica OOM Caused HTTP 5xx Error | A deployment replica was OOM-killed, causing HTTP 5xx errors. | Memory Problems | OOM, Errors |
prequel-2025-0014 Medium Impact: 2/10 Mitigation: 3/10 | Jetty IllegalStateException | A session object in an application thread is possibly being accessed outside the scope of a request. | Jetty Problems | JettyExceptionsErrors |
prequel-2025-0015 Medium Impact: 4/10 Mitigation: 5/10 | Java SQL Batch Exception | A SQL batch exception occurred. | SQL Problems | JavaSQLExceptions |
prequel-2025-0016 Medium Impact: 3/10 Mitigation: 4/10 | MongoDB Server Timeouts | A MongoDB server timeout occurred. | MongoDB Problems | MongoDBTimeoutExceptions |
prequel-2025-0017 Medium Impact: 3/10 Mitigation: 4/10 | Jetty HTTP 500 Errors | A Jetty HTTP 500 error occurred. | Jetty Problems | JettyErrors |
prequel-2025-0018 Low Impact: 5/10 Mitigation: 6/10 | Jetty LDAP Timeout | A Jetty LDAP timeout occurred. | Jetty Problems | JettyLDAPTimeout |
prequel-2025-0019 Medium Impact: 6/10 Mitigation: 7/10 | Jetty LDAP Closed Exception | A Jetty LDAP closed exception occurred. | Jetty Problems | JettyLDAPExceptions |
prequel-2025-0020 High Impact: 8/10 Mitigation: 2/10 | Too many replicas scheduled on the same node | 80% or more of a deployment's replica pods are scheduled on the same Kubernetes node. If this node shuts down or experiences a problem, the service will experience an outage. A topology spread constraint sketch appears after this table. | Fault Tolerance Problems | Replica, Kubernetes |
prequel-2025-0021 High Impact: 8/10 Mitigation: 3/10 | Kafka Streams Exception | A Kafka Streams exception occurred. One or more source topics were missing during a Kafka rebalance. | Kafka Problems | KafkaExceptions |
prequel-2025-0022 High Impact: 5/10 Mitigation: 4/10 | External Secrets Access Denied due to IAM Policy | External Secrets access denied due to IAM policy misconfiguration. | Secrets Problems | SecretsAccess Denied |
prequel-2025-0023 High Impact: 8/10 Mitigation: 2/10 | Clickhouse Keeper Network Errors | Large ClickHouse queries can consume a significant amount of resources, triggering several NETWORK_ERROR or NO_REPLICA_HAS_PART errors. | Clickhouse Problems | ClickhouseNetwork Errors |
prequel-2025-0024 High Impact: 6/10 Mitigation: 7/10 | Istio Traffic Timeout | Connections routed through **ztunnel** stop after the default 10s deadline. Ztunnel logs show `error access connection complete ... error="io error: deadline has elapsed"` or `error="connection timed out, maybe a NetworkPolicy is blocking HBONE port 15008"` while clients see 504 Gateway Timeout or connection-reset errors. The issue is limited to workloads enrolled in Ambient mode; sidecar-injected or “no-mesh” pods continue to work. | Istio Problems | IstioTimeout |
prequel-2025-0025 Low Impact: 3/10 Mitigation: 6/10 | Istio CNI Ztunnel Connection Failure | The CNI plugin is not connected to Ztunnel. For pods in the mesh, Istio will run a CNI plugin during the pod 'sandbox' creation. This configures the networking rules. This may intermittently fail, in which case Kubernetes will automatically retry. | Istio Problems | Istio |
prequel-2025-0026 Low Impact: 3/10 Mitigation: 6/10 | Istio XDS GRPC Failure | Envoy sidecars or Ambient **ztunnel** keep retrying the control-plane stream and log ``` XDS client connection error: gRPC connection error:status: Unknown, message: "...", source: tcp connect error: Connection refused (os error 111) ``` or ``` ... source: tcp connect error: deadline has elapsed ``` The proxies never reach “ADS stream established”, so no configuration, certificates, or policy updates are delivered until this is mitigated. | Istio Problems | IstioXDS |
prequel-2025-0027 Low Impact: 5/10 Mitigation: 2/10 | Ingress Nginx Prefix Wildcard Error | The NGINX Ingress Controller rejects an Ingress manifest whose `pathType: Prefix` value contains a wildcard (`*`). Log excerpt: ``` ingress: default/api prefix path shouldn't contain wildcards ``` When the controller refuses the rule, it omits it from the generated `nginx.conf`; clients receive **404 / 502** responses even though the manifest was accepted by the Kubernetes API server. The problem appears most often after upgrading to ingress-nginx ≥ 1.8, where stricter validation was added. See the Ingress path sketch after this table. | Ingress Problems | Nginx, Ingress, Kubernetes |
prequel-2025-0028 Low Impact: 2/10 Mitigation: 2/10 | Datadog Postgres Check Exception | The Datadog Agent’s *Postgres* integration throws an uncaught Python traceback while trying to run an `EXPLAIN (FORMAT JSON)` against a sampled query. After the first failure the underlying **psycopg2** cursor is closed, and every subsequent collection cycle logs ``` Traceback … File ".../datadog_checks/postgres/explain_parameterized_queries.py", … psycopg2.InterfaceError: cursor already closed ``` The check status flips to **ERROR**, and query metrics / samples stop flowing. | Postgres Problems | PostgreSQLDatadog |
prequel-2025-0071 Critical Impact: 8/10 Mitigation: 4/10 | CPU Cores Cause Silent ingress-nginx Worker Crashes | The ingress-nginx controller's NGINX worker processes are crashing silently because the controller spawns more workers than the CPU limits specified for this deployment can support. | Proxy Problems | Nginx, Known Problem |
prequel-2025-0072 Low Impact: 3/10 Mitigation: 2/10 | OTel Collector Dropped Data Due to High Memory Usage | The OpenTelemetry Collector’s **memory_limiter** processor (added by default in most distro Helm charts) protects the process RSS by monitoring the Go heap and rejecting exports once the *soft limit* (default 85 % of container/VM memory) is exceeded. After a queue/exporter exhausts its retry budget you’ll see log records such as: ``` no more retries left: rpc error: code = Unavailable desc = data refused due to high memory usage ``` The batches being dropped can be traces, metrics, or logs, depending on which pipeline hit the limit. | OTEL Problems | OTEL, Memory, Backpressure |
prequel-2025-0073 Low Impact: 5/10 Mitigation: 1/10 | OTel Collector Resource Detection Failure | The **resource_detection** processor fails while trying to determine basic host attributes and repeatedly logs: ``` failed getting OS type: failed to fetch Docker OS type: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? ``` The Collector keeps running but exports traces, metrics, or logs without mandatory resource labels, leading to data loss or mis-grouping in the backend. | OTEL Problems | OTELKnown Issue |
prequel-2025-0074 Low Impact: 8/10 Mitigation: 1/10 | Traefik License Expired | Traefik Enterprise (or Traefik Hub-enabled Proxy) periodically “pings” Traefik’s SaaS platform to validate the node-level licence token. When the licence or trial period lapses the process logs ``` Unable to ping platform error="your trial or license expired, contact sales if you want to enable your account" ``` and disables all commercial-only features (dashboards, enterprise plugins, distributed rate-limits, Hub service directory). Plain reverse-proxy routes may continue for a short grace period, but new configuration reloads are rejected. | Traefik Problems | Traefik |
prequel-2025-0075 Low Impact: 2/10 Mitigation: 5/10 | Prometheus Config Reload Failed | The **prometheus-config-reloader** sidecar (used by the Prometheus Operator / kube-prometheus-stack) detected a change in the ConfigMap/Secret but cannot POST to the Prometheus `/-/reload` endpoint. It logs repeatedly: ``` Failed to trigger reload. Retrying. ``` While the main Prometheus container keeps serving traffic, **new scrape configs, alerting rules, and recording rules are NOT applied**, leaving the instance frozen on an outdated configuration set. | Prometheus Problems | Prometheus |
prequel-2025-0076 Medium Impact: 6/10 Mitigation: 4/10 | NATS Route Error caused by DNS Resolution Failure | A NATS server establishes a TCP route, logs **“Route connection created”**, but within milliseconds DNS resolution for its peer fails; the server reports ``` Error trying to connect to route [nats://......]: lookup for host ....... no such host ``` and immediately closes the socket. When this sequence happens repeatedly the cluster oscillates between **full mesh** and **partitioned** states, leading to intermittent publish / subscribe errors and duplicate message deliveries. | NATS Problems | NATSDNS |
prequel-2025-0077 Low Impact: 2/10 Mitigation: 2/10 | OTEL Target Allocator Could Not Find Collector on Fargate Node | The OTEL Collector is not scheduled on the Fargate node. | OTEL Problems | OTELAWSFargate |
prequel-2025-0078 Low Impact: 6/10 Mitigation: 5/10 | AWS LoadBalancer Security Group Failure | While reconciling a TargetGroupBinding the AWS Load Balancer Controller inspects the ENI attached to each pod (IP mode) or worker node (instance mode). If it finds **zero or more than one** security group carrying the cluster-ownership tag `kubernetes.io/cluster/‹cluster-name›: owned`, it aborts and logs: ``` Reconciler error … targetGroupBinding … expected exactly one securityGroup tagged … ``` When this happens the controller never attaches nodes/pods to target groups, so the load balancer comes up with **0 healthy targets**. | AWS Problems | AWSLoadbalancerSecurity Group |
prequel-2025-0079 Medium Impact: 3/10 Mitigation: 3/10 | AWS Cluster Autoscaler Access Denied | **Cluster Autoscaler** tries to fetch node-group metadata to decide whether it can scale a workload-affinityed pod. The call to the EKS control plane fails with ``` Failed to get labels from EKS DescribeNodegroup API for nodegroup ‹name› … AccessDeniedException: User ‹ARN› is not authorized to perform: eks:DescribeNodegroup on resource: arn:aws:eks:‹region›:‹acct›:nodegroup/… ``` Once the error is hit the Autoscaler marks the node-group **Not-Ready for scaling actions**, so pending pods remain unscheduled and scale-down decisions are skipped. | AWS Problems | AWSAutoscaling |
prequel-2025-0080 Medium Impact: 8/10 Mitigation: 4/10 | Ruby NoMethodError - undefined method | A Ruby application has encountered a NoMethodError exception, indicating that code is attempting to call a method that does not exist for a given object. This typically happens when referencing an undefined method, when method names are misspelled, or when interacting with nil/null objects. NoMethodError is one of the most common runtime errors in Ruby applications and can cause immediate crashes or unexpected behavior. | Application Error | RubyRuntime ErrorApplication Exception |
prequel-2025-0081 Medium Impact: 6/10 Mitigation: 4/10 | ArgoCD RawExtension API Field Error with Datadog Operator | ArgoCD application controller fails to process certain custom resources due to being unable to find API fields in struct RawExtension. This commonly affects users deploying Datadog Operator CRDs, resulting in application sync errors for these resources. | Continuous Delivery Problems | ArgoCDKubernetesCustom ResourceDatadog |
prequel-2025-0082 High Impact: 9/10 Mitigation: 7/10 | HashiCorp Vault Raft Cluster Communication Failure | HashiCorp Vault nodes in a Raft cluster are unable to communicate with each other for an extended period. This disrupts the Raft consensus mechanism which is critical for Vault's high availability and data consistency. When nodes can't communicate, the cluster may lose quorum, preventing operations like unsealing, authentication, or secret retrieval. | High Availability Problems | VaultRaftConsensusNetworking |
prequel-2025-0083 Medium Impact: 7/10 Mitigation: 5/10 | GraphQL schema validation failures | GraphQL validation errors occur when client requests fail to comply with the GraphQL schema. These errors typically happen during query parsing and validation phases, before execution begins. Common validation failures include unknown types, missing required arguments, incorrect field usage, or invalid input values. These errors prevent the operation from executing and return error messages that describe the validation problems to the client. | API Service Problems | GraphQLValidationAPI Error |
prequel-2025-0084 Medium Impact: 7/10 Mitigation: 4/10 | PostgreSQL unsupported Unicode escape sequence error | The application encounters errors when PostgreSQL attempts to process strings containing invalid or unsupported Unicode escape sequences. This commonly occurs in applications using psycopg2 to interact with PostgreSQL databases, resulting in queries failing with "unsupported Unicode escape sequence" errors. The underlying issue is that PostgreSQL's string parser attempts to interpret escape sequences like `\uXXXX` according to Unicode standards, but rejects malformed or incomplete sequences. A psycopg2 sketch appears after this table. | Database Problems | PostgreSQL, Unicode, Data Error |
prequel-2025-0085 Medium Impact: 7/10 Mitigation: 5/10 | Kafka message size limit exceeded | The Kafka producer encountered a "Message size too large" error when attempting to send a message to a Kafka broker. This occurs when a message exceeds the configured maximum message size limit on the broker. Kafka has configurable message size limits at both broker and producer levels to protect system stability and prevent resource exhaustion. When this limit is hit, the message is rejected and not stored in the topic. A producer configuration sketch appears after this table. | Message Broker Errors | Kafka, Producer Error, Configuration Issue |
prequel-2025-0086 Medium Impact: 7/10 Mitigation: 3/10 | Database Not-Null Constraint Violation | An application is attempting to insert or update records in a database table with NULL values in columns that have NOT NULL constraints. This causes database operations to fail with integrity errors, typically surfacing as NotNullViolation exceptions in application logs. In Django applications, this commonly appears as django.db.utils.IntegrityError or psycopg2.errors.NotNullViolation when using PostgreSQL. A Django sketch appears after this table. | Database Integrity Problems | Database, PostgreSQL, Django, Data Integrity |
prequel-2025-0087 Medium Impact: 7/10 Mitigation: 5/10 | Kyverno JMESPath query failure due to unknown key | Kyverno policies with JMESPath expressions are failing due to references to keys that don't exist in the target resources. This happens when policies attempt to access object properties that aren't present in the resources being validated, resulting in "Unknown key" errors during policy validation. | Policy Enforcement Issues | KyvernoKubernetesPolicy Management |
prequel-2025-0088 Medium Impact: 7/10 Mitigation: 5/10 | Temporal visibility archival failures | Temporal Server is experiencing failures when attempting to archive workflow visibility records. These failures occur when the system encounters invalid search attribute types, specifically those marked as "Unspecified". Visibility archival is a critical component of Temporal's data retention strategy, allowing historical workflow execution records to be preserved while keeping the primary storage optimized for active workflows. | Workflow Service Problems | TemporalArchivalData Retention |
prequel-2025-0089 Medium Impact: 7/10 Mitigation: 5/10 | Argo CD Manifest Generation Errors | Argo CD is experiencing recurring manifest generation errors. These errors indicate that the GitOps system is unable to properly generate or resolve Kubernetes manifests from the source repositories. When manifest generation fails consistently, applications cannot be properly synchronized, leading to configuration drift and potential deployment failures. | ArgoCD Problems | ArgoCDGitOpsContinuous Delivery |
prequel-2025-0090 High Impact: 8/10 Mitigation: 5/10 | Karpenter version incompatible with Kubernetes version; Pods cannot be scheduled | Karpenter is unable to provision new nodes because the current Karpenter version is not compatible with the cluster's Kubernetes version. This incompatibility causes validation errors in the nodeclass controller and prevents pods from being scheduled properly in the cluster. | Kubernetes Provisioning Problems | AWS, Karpenter, Kubernetes |
prequel-2025-0091 High Impact: 2/10 Mitigation: 2/10 | Redpanda data transforms cannot be used because they are disabled | This rule triggers when Redpanda logs the error `invalid_argument: data transforms disabled - use \\`rpk cluster config set data_transforms_enabled true\\` to enable`. The message indicates that WebAssembly-powered **Data Transforms** are turned off at the cluster level, so any attempt to deploy or run transform functions fails. | Message Queue Problems | Data TransformsWebAssemblyMisconfiguration |
prequel-2025-0092 High Impact: 6/10 Mitigation: 4/10 | AWS CNI intermittent runtime panics and failure to destroy pod network | This rule fires when the kubelet reports a series of `FailedKillPod / KillPodSandboxError` events that contain `rpc error: code = Unknown desc = failed to destroy network for sandbox…` together with a **SIGSEGV / nil-pointer panic** from `routed-eni-cni-plugin/cni.go` or `PluginMainFuncsWithError`. These messages indicate that the Amazon VPC CNI plugin crashed while tearing down a Pod’s network namespace, leaving the sandbox in an indeterminate state. | Kubernetes Provisioning Problems | EKSPod TerminationNetworkPanic |
prequel-2025-0093 Medium Impact: 8/10 Mitigation: 5/10 | aws-load-balancer-controller rejects Ingress resource with wildcard path and Prefix pathType | The aws-load-balancer-controller is unable to translate an Ingress resource into an AWS ALB Listener Rule when the path contains a wildcard (*) and the pathType is set to Prefix. See the Ingress path sketch after this table. | Kubernetes Networking Problems | Kubernetes, AWS Loadbalancer Controller, Ingress Resource, AWS, Networking, Configuration, Path Validation, ALB, Routing |
prequel-2025-0094 High Impact: 8/10 Mitigation: 4/10 | cert-manager Cloudflare DNS cleanup failure | cert-manager is unable to clean up Cloudflare DNS-01 challenges due to a change in the Cloudflare API, which no longer returns zone information in individual DNS records. This breaks the interaction when cert-manager attempts to delete the TXT record, resulting in a failed certificate generation. | Networking Problems | CloudflareCert-ManagerPublicAPI Deprecation |
prequel-2025-0095 High Impact: 7/10 Mitigation: 5/10 | Elasticsearch field limit exceeded causing Logstash indexing failures | Logstash is failing to index events to Elasticsearch due to the total fields limit of 1000 being exceeded. This occurs when the Elasticsearch index has reached its maximum field limit, preventing new fields from being added during document indexing. | Data Storage Problems | ElasticsearchLogstashIndexing Failure |
prequel-2025-0096 Medium Impact: 7/10 Mitigation: 6/10 | Loki Ingester Memcache Object Size Limit Exceeded | Loki ingester encounters "object too large for cache" errors when attempting to store log entries exceeding memcache's configured size limit (typically 1MB). Large log lines remain in the ingester buffer causing continuous failed ingest attempts, pod health degradation, and eventual recycling. The accumulation of oversized entries can lead to buffer exhaustion and ingester instability. | Observability Problems | LokiIngesterMemcachedObject Size LimitCacheStorageObservabilityTelemetryThreshold ExceededData LossConfiguration |
prequel-2025-0097 Medium Impact: 6/10 Mitigation: 5/10 | Loki Compactor Schema Table Mismatch | Loki compactor encounters schema configuration mismatches when it finds index tables in object storage that don't correspond to any configured schema period in the Loki configuration. This causes the compactor to skip compaction for those tables, leading to storage inefficiency and potential query performance degradation. The issue typically occurs after schema migrations, configuration changes, or when legacy data exists with different table naming conventions. | Observability Problems | LokiCompactorSchemaConfigurationStorageObservabilityIndex |
prequel-2025-0098 Medium Impact: 6/10 Mitigation: 4/10 | Loki Pattern Ingester Empty Ring | Loki distributor encounters "empty ring" errors when attempting to send streams to pattern ingesters. This occurs when pattern ingestion is enabled in the configuration but no pattern-ingester pods are running or properly registered in the ring. The distributor's pattern-tee component cannot find any available pattern ingesters to process pattern extraction, leading to high error spam in logs while normal log ingestion continues to function. | Observability Problems | LokiConfigurationObservabilityDeploymentReplication |
prequel-2025-0099 Medium Impact: 6/10 Mitigation: 3/10 | DataDog Agent Remote Configuration Error | DataDog Agent encounters "empty targets meta in director local store" errors when attempting to retrieve remote configuration. This issue affects APM (Application Performance Monitoring) remote configuration functionality in DataDog Agent versions between 7.61.0 and 7.68.0. The error prevents proper retrieval and parsing of remote configuration from DataDog's backend, causing APM tracer libraries to fail when attempting to fetch dynamic configuration updates. | Observability Problems | DatadogObservabilityConfiguration |
prequel-2025-0100 Medium Impact: 6/10 Mitigation: 4/10 | Prometheus ingestion failure due to too many labels | Grafana Mimir's distributor rejects incoming Prometheus series when the number of label names on a single series exceeds the configured per-tenant limit. When this occurs, logs contain the message "received a series whose number of labels exceeds the limit" and the affected samples are dropped. This typically arises from excessive or dynamic labeling in scrape targets or relabeling rules that generate many unique label names per series. To adjust the per-tenant limit, configure the distributor with `-validation.max-label-names-per-series`. When deploying via the `mimir-distributed` Helm chart, set `mimir.structuredConfig.limits.max_label_names_per_series` to a higher value (default is 30). Increase limits cautiously to avoid cardinality explosions and memory pressure. Prefer reducing label names at the source where possible. | Observability Problems | PrometheusGrafanaMetricsConfigurationConfiguration IssueThreshold ExceededObservability |
prequel-2025-0101 Medium Impact: 6/10 Mitigation: 5/10 | Loki Ingester Memcache Out of Memory | Loki ingester reports memcached errors indicating out-of-memory conditions while caching objects, logging messages such as "SERVER_ERROR out of memory storing object". When this occurs, cache writes fail and can lead to degraded ingestion performance, retries, and increased memory pressure on the ingester. | Observability Problems | LokiIngesterMemcachedStorageCacheMemoryData LossThreshold ExceededObservabilityConfiguration |
prequel-2025-0102 High Impact: 7/10 Mitigation: 6/10 | Ingress Nginx HTTP 5XX Error | The ingress-nginx controller is returning HTTP 5XX errors | Ingress Problems | NginxIngressErrors |
prequel-2025-0103 High Impact: 4/10 Mitigation: 5/10 | Ingress Nginx Backend Service Has No Active Endpoints | The ingress-nginx controller has detected that a service does not have any active endpoints. This typically happens when the service selector does not match any pods or the pods are not in a ready state. The controller logs a warning message indicating that the service does not have any active endpoints. | Ingress Problems | NginxIngressKubernetesService |
prequel-2025-0104 Medium Impact: 5/10 Mitigation: 4/10 | Ingress Nginx can't obtain X.509 certificate | The Nginx ingress encountered an error while trying to obtain an X.509 certificate from the Kubernetes secret. | Ingress Problems | KubernetesCertificateNginxIngress |
prequel-2025-0105 Medium Impact: 7/10 Mitigation: 5/10 | Karpenter NodePool budget exceeded; Pods cannot be scheduled | Karpenter is used to automatically provision Kubernetes nodes. NodePools can define a maximum budget for total resource usage to prevent unexpectedly expensive cloud bills. When the budget is reached, Karpenter will stop provisioning new nodes and new pods will fail to schedule. | Autoscaling Problems | KarpenterKubernetesAutoscalingCapacityBudgets |
prequel-2025-0106 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Bitnami Image Pull Events | - Detects Kubernetes events where Bitnami container images are being pulled from Docker Hub. - Monitors image pull operations for Bitnami images across all namespaces. - Identifies usage of Bitnami images that may be affected by upcoming catalog changes. - Tracks container deployments using Bitnami images for migration planning. | Container Security | KubernetesBitnamiContainer ImagesImage PullsDocker HubMigration PlanningCatalog Changes |
prequel-2025-0107 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Bitnami Image Pull Error | - Detects Kubernetes events where Bitnami container image pulls are failing due to repository deprecation. - Monitors image pull failures for Bitnami images as they approach the August 28, 2025 deprecation deadline. - Identifies specific error conditions when Bitnami images become unavailable from deprecated repositories. - Tracks container deployment failures due to Bitnami image repository deprecation. | Container Security | KubernetesBitnamiContainer ImagesImage Pull ErrorsDocker HubRepository DeprecationMigration Required |
prequel-2025-0108 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deprecated Bitnami Repository Image Pulls | - Detects Kubernetes events where container images are being pulled from the deprecated /bitnami repository on Docker Hub. - Monitors image pull operations specifically from docker.io/bitnami/* which will be discontinued. - Identifies usage of the deprecated Bitnami repository that requires immediate migration. - Tracks container deployments using the legacy /bitnami path for urgent migration planning. | Container Security | KubernetesBitnamiDeprecated RepositoryContainer ImagesImage PullsDocker Hub |
prequel-2025-0109 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Legacy Bitnami Repository Image Pulls | - Detects Kubernetes events where container images are being pulled from the unmaintained /bitnamilegacy repository on Docker Hub. - Monitors image pull operations specifically from docker.io/bitnamilegacy/* which is no longer maintained. - Identifies usage of the deprecated Bitnami repository that requires immediate migration. - Tracks container deployments using the legacy /bitnamilegacy path for urgent migration planning. | Container Security | Kubernetes, Bitnami, Container Images, Image Pulls, Docker Hub, Security |
prequel-2025-0110 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Bitnami Secure Image Pull Events - Designed for Non-Prod Usage Only | - Detects Kubernetes events where Bitnami Secure container images are being pulled. - Monitors image pull operations for Bitnami Secure images which cannot be pinned to specific versions. - Identifies usage of Bitnami Secure images that lack version pinning capabilities for production stability. - Tracks container deployments using unpinnable Bitnami Secure images for compliance monitoring. | Container Security | KubernetesBitnamiContainer ImagesImage PullsDev Only |
prequel-2025-0111 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deprecated Bitnami Repository Image Pulls | - Detects Kubernetes events where container images are being pulled from the deprecated /bitnami repository on Docker Hub. - Monitors image pull operations specifically from docker.io/bitnami/* which will be discontinued. - Identifies usage of the deprecated Bitnami repository that requires immediate migration. - Tracks container deployments using the legacy /bitnami path for urgent migration planning. | Container Security | KubernetesBitnamiDeprecated RepositoryContainer ImagesImage PullsDocker Hub |
prequel-2025-0112 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deployment CPU Requests Missing | - Detects Kubernetes Deployment resources without CPU requests configured on containers. - Monitors deployment specifications where containers lack proper CPU request definitions. - Identifies resource management violations that can lead to poor cluster scheduling. - Tracks deployments that may cause resource contention and performance issues. - A read-only detection sketch for this and the related resource/probe checks appears after this table. | Resource Management | Kubernetes, Deployment, CPU Requests, resource-management, Scheduling, Performance |
prequel-2025-0113 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deployment CPU Limits Missing | - Detects Kubernetes Deployment resources without CPU limits configured on containers. - Monitors deployment specifications where containers lack proper CPU limit definitions. - Identifies resource management violations that can lead to resource exhaustion. - Tracks deployments that may consume excessive CPU resources without bounds. | Resource Management | KubernetesDeploymentCPU Limitsresource-managementResource ExhaustionPerformance |
prequel-2025-0114 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deployment Memory Requests Missing | - Detects Kubernetes Deployment resources without memory requests configured on containers. - Monitors deployment specifications where containers lack proper memory request definitions. - Identifies resource management violations that can lead to poor scheduling decisions. - Tracks deployments that may cause memory pressure and OOM conditions. | Resource Management | KubernetesDeploymentMemory Requestsresource-managementSchedulingOOM |
prequel-2025-0115 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deployment Memory Limits Missing | - Detects Kubernetes Deployment resources without memory limits configured on containers. - Monitors deployment specifications where containers lack proper memory limit definitions. - Identifies resource management violations that can lead to memory exhaustion. - Tracks deployments that may consume excessive memory resources without bounds. | Resource Management | KubernetesDeploymentMemory Limitsresource-managementMemory ExhaustionOOM |
prequel-2025-0116 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deployment Liveness Probe Missing | - Detects Kubernetes Deployment resources without liveness probes configured on containers. - Monitors deployment specifications where containers lack proper health check definitions. - Identifies reliability violations that can lead to undetected application failures. - Tracks deployments that may run unhealthy containers without automatic recovery. | Kubernetes Best Practices | KubernetesDeploymentLiveness ProbeHealth ChecksReliabilityAvailability |
prequel-2025-0117 Medium Impact: 0/10 Mitigation: 0/10 | Kubernetes Deployment Readiness Probe Missing | - Detects Kubernetes Deployment resources without readiness probes configured on containers. - Monitors deployment specifications where containers lack proper readiness check definitions. - Identifies reliability violations that can lead to premature traffic routing. - Tracks deployments that may receive traffic before being fully ready to handle requests. | Kubernetes Best Practices | KubernetesDeploymentReadiness ProbeHealth ChecksReliabilityTraffic Routing |
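The sketches below illustrate, in Python, one plausible remediation or check for a few of the CREs above; names, labels, and connection details in them are assumptions, not part of the CREs themselves. For prequel-2025-0020 (too many replicas scheduled on the same node), a common remedy is a pod topology spread constraint; this is a minimal sketch using the kubernetes Python client models and a hypothetical `app: my-service` label.

```python
# Minimal sketch: spread a Deployment's replicas across nodes to reduce the
# chance that a single node failure takes out most of the service.
# Assumes the pods carry a hypothetical label app=my-service.
from kubernetes import client

spread = client.V1TopologySpreadConstraint(
    max_skew=1,                               # allow at most 1 pod of imbalance per node
    topology_key="kubernetes.io/hostname",    # spread across individual nodes
    when_unsatisfiable="ScheduleAnyway",      # use "DoNotSchedule" to enforce strictly
    label_selector=client.V1LabelSelector(match_labels={"app": "my-service"}),
)

# Attach it to the Deployment's pod template before applying:
#   deployment.spec.template.spec.topology_spread_constraints = [spread]
```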
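For prequel-2025-0027 and prequel-2025-0093 (wildcard paths rejected when `pathType: Prefix`), a minimal sketch of the usual fix, again with the kubernetes Python client and hypothetical service names: a Prefix path already matches everything beneath it, so the trailing wildcard can simply be dropped.

```python
# Minimal sketch: an Ingress path that both ingress-nginx and the
# aws-load-balancer-controller accept. "/api" with pathType Prefix matches
# "/api", "/api/v1", etc., so "/api/*" is unnecessary (and rejected).
from kubernetes import client

path = client.V1HTTPIngressPath(
    path="/api",          # not "/api/*"
    path_type="Prefix",
    backend=client.V1IngressBackend(
        service=client.V1IngressServiceBackend(
            name="api",                                   # hypothetical Service name
            port=client.V1ServiceBackendPort(number=80),  # hypothetical port
        )
    ),
)
```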
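For prequel-2025-0084 (unsupported Unicode escape sequence), a minimal psycopg2 sketch of a common trigger and workaround; the DSN, `events` table, and `data` column are hypothetical. PostgreSQL's `jsonb` type rejects the `\u0000` escape with this error, so one widely used mitigation is to strip NUL characters before insertion.

```python
# Minimal sketch: strip NUL characters from JSON payloads before writing
# them to a jsonb column, avoiding "unsupported Unicode escape sequence".
# The DSN, table, and column names are hypothetical.
import json
import psycopg2

def strip_nul(value: str) -> str:
    # PostgreSQL cannot store NUL (\u0000) in text or jsonb values.
    return value.replace("\u0000", "")

payload = {"note": "user input with an embedded NUL byte \u0000"}

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    safe = json.dumps({k: strip_nul(v) for k, v in payload.items()})
    cur.execute("INSERT INTO events (data) VALUES (%s::jsonb)", (safe,))
```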
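For prequel-2025-0085 (Kafka message size limit exceeded), a minimal producer-side sketch assuming the confluent-kafka client and a hypothetical broker address; the point is that producer, topic, and broker limits have to be raised together, or the payload split or compressed.

```python
# Minimal sketch: cap the producer's request size to match the broker/topic
# limits and surface oversized messages explicitly instead of retrying.
from confluent_kafka import Producer, KafkaException

producer = Producer({
    "bootstrap.servers": "kafka:9092",   # hypothetical broker address
    "message.max.bytes": 1_048_576,      # keep aligned with the broker/topic limits
})

def send(topic: str, payload: bytes) -> None:
    try:
        producer.produce(topic, payload)
        producer.flush()
    except (BufferError, KafkaException) as exc:
        # Oversized messages fail locally or at the broker; split or compress
        # the payload, or raise the limits consistently at every level.
        raise RuntimeError(f"failed to produce to {topic}: {exc}") from exc
```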
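For prequel-2025-0086 (not-null constraint violation), a minimal Django sketch with a hypothetical `Order` model showing how the psycopg2 `NotNullViolation` surfaces as `IntegrityError` and one way to guard against it.

```python
# Minimal sketch: assumes a configured Django project with a hypothetical
# shop.models.Order model whose "total" column is NOT NULL.
from django.db import IntegrityError, transaction

def create_order(customer_id, total=None):
    from shop.models import Order  # hypothetical app and model

    try:
        with transaction.atomic():
            # Passing None into a NOT NULL column raises
            # psycopg2.errors.NotNullViolation, which Django re-raises
            # as django.db.utils.IntegrityError.
            return Order.objects.create(customer_id=customer_id, total=total)
    except IntegrityError:
        # Supply an explicit value (or add a model/database default)
        # instead of letting NULL reach the database.
        return Order.objects.create(customer_id=customer_id, total=0)
```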
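For prequel-2025-0112 through prequel-2025-0117 (missing CPU/memory requests and limits, and missing liveness/readiness probes), a minimal read-only sketch of the kind of check these CREs describe, using the kubernetes Python client; it only lists findings and changes nothing.

```python
# Minimal sketch: flag Deployments whose containers are missing CPU/memory
# requests or limits, or liveness/readiness probes. Read-only.
from kubernetes import client, config

def audit_deployments():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    findings = []
    for dep in apps.list_deployment_for_all_namespaces().items:
        where = f"{dep.metadata.namespace}/{dep.metadata.name}"
        for ctr in dep.spec.template.spec.containers:
            requests = (ctr.resources.requests or {}) if ctr.resources else {}
            limits = (ctr.resources.limits or {}) if ctr.resources else {}
            for key in ("cpu", "memory"):
                if key not in requests:
                    findings.append(f"{where} container {ctr.name}: missing {key} request")
                if key not in limits:
                    findings.append(f"{where} container {ctr.name}: missing {key} limit")
            if ctr.liveness_probe is None:
                findings.append(f"{where} container {ctr.name}: missing liveness probe")
            if ctr.readiness_probe is None:
                findings.append(f"{where} container {ctr.name}: missing readiness probe")
    return findings

if __name__ == "__main__":
    for finding in audit_deployments():
        print(finding)
```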