PromQL Recording Rules
Introduction
Recording rules are a powerful feature in Prometheus that allow you to pre-compute frequently used or computationally expensive PromQL expressions and save their results as new time series. Instead of computing the same expressions repeatedly at query time, which can be resource-intensive for complex queries, recording rules let you calculate these expressions ahead of time at a specified interval.
Recording rules serve as a form of optimization in your Prometheus monitoring setup, helping to:
- Reduce query-time latency for dashboards and alerts
- Decrease the computational load on your Prometheus server
- Simplify complex queries by breaking them down into pre-computed components
- Store historical data for expressions that would otherwise be calculated on-the-fly
Understanding Recording Rules
A recording rule in Prometheus follows this basic structure:
```yaml
- record: <new_metric_name>
  expr: <promql_expression>
  labels:
    [ <label_name>: <label_value> ]
```
When Prometheus evaluates this rule:
- It computes the result of the `expr` PromQL expression
- Stores the result as a new time series with the name specified in `record`
- Optionally applies additional labels defined in the `labels` section
Recording rules are evaluated at a regular interval (configurable in Prometheus), and the results are stored in the Prometheus database alongside your other metrics.
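The evaluation cadence comes from the global `evaluation_interval` setting in `prometheus.yml` (individual rule groups can override it, as shown later); a minimal sketch with illustrative values:

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped (illustrative value)
  evaluation_interval: 1m   # how often recording and alerting rules are evaluated (illustrative value)
```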
Configuring Recording Rules
Recording rules are defined in YAML configuration files and loaded into Prometheus. These files are typically organized in a rules directory and referenced in your Prometheus configuration.
Basic Setup
First, create a rule file (e.g., `recording_rules.yml`):
```yaml
groups:
  - name: example
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
```
Then, reference this file in your Prometheus configuration (`prometheus.yml`):
```yaml
rule_files:
  - "rules/recording_rules.yml"
```
Rule Groups
Rules are organized into groups, with all rules in a group being evaluated at the same interval. Each group needs a unique name:
```yaml
groups:
  - name: http_requests
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
  - name: api_latency
    rules:
      - record: api:request_latency:p95
        expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, service))
```
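By default, every group is evaluated at the global `evaluation_interval`; if a group needs its own cadence, you can set `interval` on the group itself. A sketch (the 30s value is purely illustrative):

```yaml
groups:
  - name: api_latency
    interval: 30s  # overrides the global evaluation_interval for this group only
    rules:
      - record: api:request_latency:p95
        expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, service))
```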
Naming Conventions
Prometheus recommends using a specific naming convention for recording rules:
`level:metric:operations`
For example:
- `job:http_requests_total:rate5m` - Rate of HTTP requests over 5 minutes, aggregated by job
- `instance:memory_utilization:ratio` - Memory utilization ratio by instance
This convention makes it clear that the metric is derived and helps avoid naming conflicts.
Practical Examples
Let's look at some common use cases for recording rules:
Example 1: HTTP Error Rate
Computing the error rate for HTTP requests:
```yaml
groups:
  - name: http_errors
    rules:
      - record: job:http_errors:ratio_rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
```
This rule calculates the ratio of 5xx errors to total requests over a 5-minute window, grouped by job.
Example 2: Node Exporter CPU Usage
Pre-computing CPU usage from node exporter metrics:
```yaml
groups:
  - name: node_resources
    rules:
      - record: instance:node_cpu:usage_ratio_avg5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```
This creates a new metric showing the average CPU usage (as a ratio) across all CPUs for each instance.
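Once recorded, this series can be queried like any other metric; for example, an illustrative query for the five busiest instances:

```promql
# Top 5 instances by average CPU usage over the last 5 minutes
topk(5, instance:node_cpu:usage_ratio_avg5m)
```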
Example 3: Request Latency Percentiles
Pre-calculating percentiles from a histogram:
```yaml
groups:
  - name: api_latency
    rules:
      - record: api:request_duration_seconds:p50
        expr: histogram_quantile(0.5, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, service))
      - record: api:request_duration_seconds:p90
        expr: histogram_quantile(0.9, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, service))
      - record: api:request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, service))
      - record: api:request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, service))
```
This group pre-computes 50th, 90th, 95th, and 99th percentile latencies for API requests.
Example 4: Aggregating Metrics Across Clusters
Creating cluster-level views from instance-level metrics:
```yaml
groups:
  - name: cluster_resources
    rules:
      - record: cluster:memory_usage:ratio
        expr: sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)
      - record: cluster:cpu:usage_ratio_avg5m
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```
These rules aggregate memory and CPU usage across all nodes to provide cluster-level metrics.
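These cluster-level series also make simple capacity checks easy to express; the 90% threshold below is purely illustrative:

```promql
# Returns a result only when the cluster is using more than 90% of its memory
cluster:memory_usage:ratio > 0.9
```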
Best Practices for Recording Rules
To effectively use recording rules in your Prometheus setup:
1. Only Record What's Necessary
Recording rules consume storage and resources. Focus on:
- Frequently accessed queries
- Computationally expensive expressions
- Metrics needed for alerts
2. Follow the Naming Conventions
The recommended format is `level:metric:operations`:
- The level prefix indicates the aggregation level (e.g., `job`, `instance`, `cluster`)
- The metric name should reflect the source metric
- The operations suffix indicates the type of computation (e.g., `rate5m`, `ratio`)
3. Group Related Rules Together
Organize rules into logical groups that make sense to evaluate together:
- Rules that depend on similar raw metrics
- Rules that will be used together in dashboards
- Rules with similar evaluation needs
4. Add Clear Comments
Documenting complex rules makes them easier to maintain:
```yaml
groups:
  - name: service_slos
    rules:
      # Calculate the error budget consumption rate
      # Target SLO is 99.9% availability over 30 days
      - record: service:slo:error_budget_consumption_rate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))
          / (1 - 0.999)
```
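Since the expression divides the observed error ratio by the error budget (1 - 0.999), a value above 1 means the budget is being consumed faster than the SLO allows. An illustrative check:

```promql
# Error budget is currently being burned faster than the SLO permits
service:slo:error_budget_consumption_rate > 1
```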
5. Use Labels Instead of Many Similar Rules
Instead of creating many similar rules, use labels to differentiate:
```yaml
# Less optimal - repeating similar rules
- record: api_error_rate_users
  expr: sum(rate(api_requests{path="/users", status=~"5.."}[5m])) / sum(rate(api_requests{path="/users"}[5m]))
- record: api_error_rate_products
  expr: sum(rate(api_requests{path="/products", status=~"5.."}[5m])) / sum(rate(api_requests{path="/products"}[5m]))

# Better - use labels
- record: api:error_rate:ratio_rate5m
  expr: sum(rate(api_requests{status=~"5.."}[5m])) by (path) / sum(rate(api_requests[5m])) by (path)
```
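With the labelled version you can still narrow down to a single endpoint at query time; the path value below is illustrative:

```promql
# Error ratio for just the /users endpoint
api:error_rate:ratio_rate5m{path="/users"}
```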
Using Recording Rules in Queries
Once you've defined and loaded your recording rules, you can use the resulting metrics just like any other Prometheus metric:
```promql
# Original complex query
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job) > 0.01

# With recording rule
job:http_errors:ratio_rate5m > 0.01
```
This makes your PromQL queries more concise and efficient.
Debugging Recording Rules
You can check if your recording rules are working correctly by:
- Visiting the `/rules` endpoint in the Prometheus UI, which shows all loaded rules
- Querying for your new metrics directly in the Prometheus expression browser
- Checking for any evaluation errors in the Prometheus logs
If a rule isn't working as expected, verify:
- The syntax of your rule file (it must be valid YAML; `promtool` can check this, as shown below)
- The PromQL expression (should be valid and return data)
- That Prometheus has reloaded the rule files
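You can catch the first two problems before loading anything by running `promtool` (bundled with Prometheus) against the file; the path below is just an example:

```bash
# Checks that the file is valid YAML and that every expr parses as PromQL
promtool check rules rules/recording_rules.yml
```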
Visualizing the Results in Grafana
Here's how to use recording rules effectively in Grafana:
```promql
# Directly use the pre-computed metric
job:http_errors:ratio_rate5m{job="api-server"}

# Or combine with other queries
job:http_errors:ratio_rate5m{job="api-server"} > 0.05
```
The advantage of using recording rules in dashboards is that they load much faster, especially for panels that would otherwise compute complex expressions on each refresh.
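If the dashboard selects the job through a Grafana template variable, the recorded metric slots in the same way; `$job` here is a hypothetical dashboard variable:

```promql
# Hypothetical Grafana panel query driven by a $job template variable
job:http_errors:ratio_rate5m{job="$job"}
```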
Recording Rules vs. Alerts
Recording rules and alert rules are both defined in rule files, but serve different purposes:
| Recording Rules | Alert Rules |
|---|---|
| Pre-compute expressions and store them as new time series | Evaluate expressions and trigger alerts |
| `record:` field defines the new metric name | `alert:` field defines the alert name |
| No notifications, just data storage | Generate notifications when triggered |
| Used for optimization and simplification | Used for monitoring and incident response |
You'll often use recording rules as inputs to alert rules, creating more efficient and maintainable alerting configurations.
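As a sketch of that pattern, an alert rule can consume the `job:http_errors:ratio_rate5m` series recorded earlier; the alert name, threshold, and duration below are all illustrative:

```yaml
groups:
  - name: http_error_alerts
    rules:
      - alert: HighHttpErrorRate          # illustrative alert name
        expr: job:http_errors:ratio_rate5m > 0.05
        for: 10m                          # must stay above the threshold for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error ratio above 5% for job {{ $labels.job }}"
```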
Summary
Recording rules in Prometheus offer a powerful way to optimize your monitoring system by pre-computing complex or frequently used PromQL expressions. They help reduce query latency, decrease server load, and simplify your dashboards and alerts.
Key takeaways:
- Use recording rules for frequently accessed or expensive queries
- Follow naming conventions to maintain clarity
- Group related rules together for logical organization
- Add documentation for complex rules
- Use recording rules as inputs to alerts for better performance
By implementing recording rules effectively, you can build a more efficient and scalable Prometheus monitoring system that handles complex queries with ease.
Exercises
- Create a recording rule that calculates the average memory usage percentage across all your instances.
- Design a set of recording rules that pre-compute the 90th and 99th percentile latencies for different endpoints of your application.
- Write a recording rule that calculates the ratio of 4xx and 5xx errors compared to total requests, broken down by service and endpoint.
- Implement a recording rule that computes the rate of increase of disk usage, which could help predict when you might run out of disk space.
- Create a cluster-level recording rule that aggregates CPU, memory, and network usage across all nodes.