What are Common Reliability Enumerations (CREs)?

Common Reliability Enumerations (CREs) are an open, structured standard for naming and categorizing reliability problems found in production systems. CREs represent the collective knowledge of The Open Problem Detection (and Resolution) Community, where hundreds of engineers and practitioners across startups, enterprises, and critical infrastructure providers discuss how to share, detect, and mitigate reliability problems.

CREs provide a consistent way to describe reliability problems (cause, impact, and mitigation). The CRE schema and taxonomy enable the sharing of reliability intelligence and give teams a vocabulary to discuss recurring problems without reinventing the wheel or diagnosing incidents in isolation.

Just as CVEs (Common Vulnerabilities and Exposures) provide a method to classify and share known threats, CREs offer an equivalent standard for reliability problems.

With CREs, you can:

  • Recognize known failure modes before they escalate
  • Correlate similar issues across services, teams, or companies
  • Drive better postmortems, triage, and tooling decisions
  • Contribute your own findings to an evolving, community-backed index

CREs give teams a common framework to identify, compare, and learn from reliability issues, making visible the patterns that were previously siloed or overlooked.

When paired with rules, CREs become a powerful way to both understand and detect problems.

Example CRE Rule

Below is a simple rule that looks for a sequence of events in a single log source within a window of time, along with a negative condition (an event that must not occur during the window). Try it out on the playground.

cre-2024-0007.yaml
rules:
  - cre:
      id: CRE-2024-0007
      severity: 0
      title: RabbitMQ Mnesia overloaded recovering persistent queues
      category: message-queue-problems
      author: Prequel
      description: |
        - The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot.
      cause: |
        - The Erlang process, Mnesia, is overloaded while recovering persistent queues on boot.
      impact: |
        - RabbitMQ is unable to process any new messages and can cause outages in consumers and producers.
      tags:
        - cre-2024-0007
        - known-problem
        - rabbitmq
      mitigation: |
        - Adjusting mirroring policies to limit the number of mirrored queues
        - Remove high-availability policies from queues
        - Add additional CPU resources and restart the RabbitMQ cluster
        - Use [lazy queues](https://www.rabbitmq.com/docs/lazy-queues) to avoid incurring the costs of writing data to disk
      references:
        - https://groups.google.com/g/rabbitmq-users/c/ekV9tTBRZms/m/1EXw-ruuBQAJ
      applications:
        - name: "rabbitmq"
          version: "3.9.x"
    metadata:
      kind: prequel
      id: 5UD1RZxGC5LJQnVpAkV11A
      generation: 1
    rule:
      sequence:
        window: 30s
        event:
          source: rabbitmq
        order:
          - regex: "Discarding message(.+)in an old incarnation(.+)of this node"
          - "Mnesia is overloaded"
        negate:
          - "SIGTERM received - shutting down"
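
To make the sequence, window, and negate semantics of the rule above more concrete, here is a minimal Python sketch of how such a rule might be evaluated against a batch of timestamped log lines. It is not the actual detection engine. The function name `find_sequence`, the `(timestamp, line)` event shape, and the interpretation that the window starts at the first matched event and that any negate match inside that window cancels the detection are all assumptions made for illustration. Plain strings from the rule are treated as literal substring matches.

```python
import re
from typing import List, Optional, Pattern, Sequence, Tuple

# Hypothetical event shape: (timestamp in seconds, raw log line).
Event = Tuple[float, str]

# Patterns copied from the rule's `order` and `negate` lists.
ORDER: List[Pattern[str]] = [
    re.compile(r"Discarding message(.+)in an old incarnation(.+)of this node"),
    re.compile(re.escape("Mnesia is overloaded")),
]
NEGATE: List[Pattern[str]] = [
    re.compile(re.escape("SIGTERM received - shutting down")),
]
WINDOW_SECONDS = 30.0  # mirrors `window: 30s`


def find_sequence(
    events: Sequence[Event],
    order: Sequence[Pattern[str]] = ORDER,
    negate: Sequence[Pattern[str]] = NEGATE,
    window: float = WINDOW_SECONDS,
) -> Optional[List[Event]]:
    """Return the first run of events matching `order` in sequence within
    `window` seconds, or None if there is no such run or a negate pattern
    appears inside the window (assumed semantics, see lead-in)."""
    ordered = sorted(events, key=lambda e: e[0])
    for i, (start_ts, start_line) in enumerate(ordered):
        if not order[0].search(start_line):
            continue
        matched = [ordered[i]]
        step = 1  # index of the next pattern in `order` to satisfy
        negated = False
        for ts, line in ordered[i + 1:]:
            if ts - start_ts > window:
                break  # past the end of the window
            if any(p.search(line) for p in negate):
                negated = True
                break
            if step < len(order) and order[step].search(line):
                matched.append((ts, line))
                step += 1
        if step == len(order) and not negated:
            return matched
    return None


sample = [
    (0.0, "Discarding message {x} in an old incarnation (3) of this node (1)"),
    (5.0, "Mnesia is overloaded"),
]
hit = find_sequence(sample)
print("detected" if hit else "no detection")
```

Because the sketch scans a finished batch, it can check the whole window for a negate event before reporting. A streaming evaluator would have to hold a candidate match until its window closes before it could rule out the negative condition.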