What are CREs? | Prequel

What are Common Reliability Enumerations (CREs)?

Common Reliability Enumerations (CREs) are an open, structured standard for naming and categorizing reliability problems found in production systems. CREs represent the collective knowledge of the Open Problem Detection (and Resolution) Community, where hundreds of engineers and practitioners across startups, enterprises, and critical infrastructure providers discuss how to share, detect, and mitigate reliability problems.

CREs provide a consistent way to describe reliability problems (cause, impact, and mitigation). The CRE schema and taxonomy enable the sharing of reliability intelligence and give teams a vocabulary to discuss recurring problems without reinventing the wheel or diagnosing incidents in isolation.

Just as CVEs (Common Vulnerabilities and Exposures) provide a method to classify and share known threats, CREs offer an equivalent standard for reliability problems.

With CREs, you can:

Recognize known failure modes before they escalate
Correlate similar issues across services, teams, or companies
Drive better postmortems, triage, and tooling decisions
Contribute your own findings to an evolving, community-backed index

CREs give teams a common framework to identify, compare, and learn from reliability issues, making patterns visible that were previously siloed or overlooked.

When paired with rules, CREs become a powerful way to both understand and detect problems.

Example CRE Rule

Below is a simple rule that looks for a sequence of events in a single log source over a window of time, along with a negative condition (an event that must not occur during the window). Try it out on the playground.

cre-2024-0007.yaml
rules:
  - cre:
      id: CRE-2024-0007
      severity: 0
      title: RabbitMQ Mnesia overloaded recovering persistent queues
      category: message-queue-problems
      author: Prequel
      description: |
        - The RabbitMQ cluster is processing a large number of persistent mirrored queues at boot. 
      cause: |
        - The Erlang process, Mnesia, is overloaded while recovering persistent queues on boot. 
      impact: |
        - RabbitMQ is unable to process any new messages and can cause outages in consumers and producers.
      tags: 
        - cre-2024-0007
        - known-problem
        - rabbitmq
      mitigation: |
        - Adjusting mirroring policies to limit the number of mirrored queues
        - Remove high-availability policies from queues
        - Add additional CPU resources and restart the RabbitMQ cluster
        - Use [lazy queues](https://www.rabbitmq.com/docs/lazy-queues) to avoid incurring the costs of writing data to disk 
      references:
        - https://groups.google.com/g/rabbitmq-users/c/ekV9tTBRZms/m/1EXw-ruuBQAJ
      applications:
        - name: "rabbitmq"
          version: "3.9.x"
    metadata:
      kind: prequel
      id: 5UD1RZxGC5LJQnVpAkV11A
      hash: CAbgxyQnLLP12A6GrRHAdcBsbtstGio1gEAj3kLqyRe9
      generation: 1
    rule:
      sequence:
        window: 30s
        event:
          source: rabbitmq
        order:
          - regex: "Discarding message(.+)in an old incarnation(.+)of this node"
          - "Mnesia is overloaded"
        negate:
          - "SIGTERM received - shutting down"
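
To make the rule's matching behavior concrete, here is a minimal Python sketch of how a sequence rule with a window and a negative condition could be evaluated. It is an illustration only, not Prequel's detection engine. The function name `sequence_matches`, the `(timestamp, message)` event shape, and the choice to start the window at the first matching event are all assumptions made for this example. Plain strings from the rule are treated as literal substrings.

```python
import re
from datetime import datetime, timedelta

# Patterns copied from the rule above. Plain strings are escaped so they
# match literally; only the first entry is a real regex.
ORDER = [
    re.compile(r"Discarding message(.+)in an old incarnation(.+)of this node"),
    re.compile(re.escape("Mnesia is overloaded")),
]
NEGATE = [re.compile(re.escape("SIGTERM received - shutting down"))]
WINDOW = timedelta(seconds=30)


def sequence_matches(events, order, negate, window):
    """Hypothetical matcher (assumed semantics, not Prequel's implementation).

    events: list of (timestamp, message) tuples sorted by timestamp.
    Returns True if every pattern in `order` appears, in order, within
    `window` of the first match, and no `negate` pattern appears in that
    same window.
    """
    for i, (start_ts, start_msg) in enumerate(events):
        if not order[0].search(start_msg):
            continue
        deadline = start_ts + window
        in_window = [(ts, msg) for ts, msg in events[i:] if ts <= deadline]

        # Advance through the ordered patterns, at most one step per message.
        step = 0
        for _, msg in in_window:
            if step < len(order) and order[step].search(msg):
                step += 1
        if step < len(order):
            continue

        # The negative condition suppresses the match.
        if any(p.search(msg) for _, msg in in_window for p in negate):
            continue
        return True
    return False


# Made-up log lines for illustration.
t0 = datetime(2024, 1, 1, 12, 0, 0)
logs = [
    (t0, "Discarding message foo in an old incarnation (1) of this node (2)"),
    (t0 + timedelta(seconds=5), "Mnesia is overloaded: 2 messages pending"),
]
print(sequence_matches(logs, ORDER, NEGATE, WINDOW))
```

In this sketch, adding a "SIGTERM received - shutting down" line inside the 30-second window suppresses the match, which mirrors the intent of the `negate` block: an expected shutdown should not be reported as this problem.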
