Prometheus and Error Budgets
Introduction
Error budgets are a fundamental concept in Site Reliability Engineering (SRE) that provide a quantitative approach to managing service reliability. They represent the acceptable level of failure or downtime a service can experience before corrective action is required. Prometheus, with its powerful monitoring and alerting capabilities, serves as an excellent tool for implementing and tracking error budgets.
In this guide, we'll explore how to use Prometheus to define, measure, and maintain error budgets for your services, helping you balance reliability with the pace of innovation.
Understanding Error Budgets
What is an Error Budget?
An error budget is the difference between:
- Your Service Level Objective (SLO) - the target reliability you promise to users
- The actual measured reliability of your service
For example, if your SLO states that your API should have 99.9% availability in a month (allowing for about 43 minutes of downtime), and you've only used 10 minutes of that allowed downtime, your remaining error budget would be about 33 minutes.
The Error Budget Formula
The basic formula for calculating an error budget is:
Error Budget = (1 - SLO) × Total Service Time
For a 99.9% SLO over a 30-day month:
Error Budget = (1 - 0.999) × (30 days × 24 hours × 60 minutes)
Error Budget = 0.001 × 43,200 minutes = 43.2 minutes
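If you want to sanity-check these numbers programmatically, here is a minimal Go sketch of the same arithmetic (the function name and the 30-day window are illustrative choices, not anything defined elsewhere in this guide):
package main

import "fmt"

// errorBudgetMinutes returns the allowed downtime in minutes for a given SLO
// over a window of `days` days, i.e. (1 - SLO) * total minutes in the window.
func errorBudgetMinutes(slo float64, days float64) float64 {
    return (1 - slo) * days * 24 * 60
}

func main() {
    fmt.Printf("99.9%% over 30 days: %.1f minutes\n", errorBudgetMinutes(0.999, 30))   // 43.2
    fmt.Printf("99.95%% over 30 days: %.1f minutes\n", errorBudgetMinutes(0.9995, 30)) // 21.6
}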
Why Error Budgets Matter
Error budgets provide several benefits:
- Objective Decision Making: They create a data-driven framework for deciding when to focus on new features versus fixing reliability issues
- Aligning Incentives: They help development and operations teams find the right balance between speed and stability
- Risk Management: They allow for controlled risk-taking without compromising overall service health
- Prioritization: They help teams focus on the most impactful reliability improvements
Implementing Error Budgets with Prometheus
Prerequisites
Before implementing error budgets, you need:
- Clearly defined Service Level Indicators (SLIs) - metrics that measure service health
- Service Level Objectives (SLOs) - target values for your SLIs
- Prometheus set up to collect relevant metrics
- Grafana (optional but recommended) for visualization
Step 1: Define Your SLIs in Prometheus
First, you need to identify and collect the metrics that represent your service's health. Common SLIs include:
- Availability: The percentage of successful requests
- Latency: Response time for requests
- Error Rate: The percentage of failed requests
- Throughput: The number of requests handled
Let's implement a basic availability SLI using Prometheus:
# prometheus.yml
scrape_configs:
  - job_name: 'api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api-server:8080']
Your application should expose metrics that allow you to count total requests and errors:
// Go example with Prometheus client
package main

import (
    "fmt"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter of HTTP requests, labeled by path and status code.
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"path", "status"},
    )
)

func recordMetrics(path string, status int) {
    httpRequestsTotal.WithLabelValues(path, fmt.Sprintf("%d", status)).Inc()
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Handle request...
    recordMetrics(r.URL.Path, 200)
}

func main() {
    http.HandleFunc("/", handler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
Step 2: Create PromQL Queries for SLOs
Once your metrics are being collected, you can create PromQL queries to calculate your SLIs:
# Availability SLI: Percentage of successful requests
(sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
For latency SLO (assuming you have a histogram metric for request duration):
# Latency SLI: Percentage of requests that complete within your target time (e.g., 300ms)
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
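If you plan to reuse these expressions across dashboards and alerts, Prometheus recording rules let you precompute them. Below is a minimal sketch assuming the metric names above; the rule names are illustrative choices, not something defined elsewhere in this guide:
# sli-recording-rules.yml (illustrative rule names)
groups:
  - name: sli-recording
    rules:
      - record: job:sli_availability:ratio_rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))
      - record: job:sli_latency:ratio_rate5m
        expr: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))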
Step 3: Calculate Error Budget Consumption
To calculate error budget consumption, you need to measure how much of your allowed error budget you've used:
# For a 99.9% availability SLO
# First, calculate the error rate (complement of availability)
1 - (sum(rate(http_requests_total{status=~"2.."}[24h])) / sum(rate(http_requests_total[24h])))
# Then, compare to your allowed error rate
# Error budget used (as a percentage)
((1 - (sum(rate(http_requests_total{status=~"2.."}[24h])) / sum(rate(http_requests_total[24h])))) / 0.001) * 100
Here, 0.001 is the allowed error rate for a 99.9% SLO.
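The complement of that figure is the budget you have left. As one illustration, here is the same calculation expressed as remaining budget over a rolling 30-day window (the window choice is an assumption, made to match the monthly budget period):
# Error budget remaining over a rolling 30-day window (as a percentage)
100 - (((1 - (sum(rate(http_requests_total{status=~"2.."}[30d])) / sum(rate(http_requests_total[30d])))) / 0.001) * 100)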
Step 4: Create a Grafana Dashboard
A Grafana dashboard can help visualize your SLOs and error budget consumption. Here's how to structure it:
- Create a panel showing your SLI over time
- Add a panel showing error budget consumption
- Include burn rate (how quickly you're using your error budget); a sample burn-rate query is shown after the dashboard JSON below
- Add remaining error budget for the current period
Let's create a Grafana dashboard using JSON:
{
  "title": "Error Budget Dashboard",
  "panels": [
    {
      "title": "Availability SLI",
      "type": "graph",
      "targets": [
        {
          "expr": "(sum(rate(http_requests_total{status=~\"2..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
          "legendFormat": "Availability (%)"
        }
      ],
      "yaxes": [
        {
          "min": 90,
          "max": 100
        }
      ]
    },
    {
      "title": "Error Budget Consumption",
      "type": "gauge",
      "targets": [
        {
          "expr": "((1 - (sum(rate(http_requests_total{status=~\"2..\"}[30d])) / sum(rate(http_requests_total[30d])))) / 0.001) * 100",
          "legendFormat": "Budget Used (%)"
        }
      ],
      "thresholds": [
        {
          "value": 50,
          "color": "green"
        },
        {
          "value": 80,
          "color": "yellow"
        },
        {
          "value": 90,
          "color": "red"
        }
      ],
      "min": 0,
      "max": 100
    }
  ]
}
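The JSON above covers the SLI and total-consumption panels. For the burn-rate panel mentioned in the list, a query along the following lines could be used (the 1-hour window is an assumption; a value of 1 means you are burning budget exactly as fast as allocated):
# Burn rate: observed error rate over the last hour divided by the allowed error rate (99.9% SLO)
(1 - (sum(rate(http_requests_total{status=~"2.."}[1h])) / sum(rate(http_requests_total[1h])))) / 0.001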
Step 5: Set Up Alerting on Error Budget Consumption
Create alerts that trigger when you're consuming your error budget too quickly:
# prometheus-rules.yml
groups:
  - name: error-budget
    rules:
      - alert: ErrorBudgetBurnRate
        expr: (1 - (sum(rate(http_requests_total{status=~"2.."}[1h])) / sum(rate(http_requests_total[1h])))) > (0.001 * 24)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning too fast"
          description: "Error rate over the last hour is {{ $value | humanizePercentage }}, more than 24x the rate allowed by the 99.9% SLO; at this pace the monthly budget would be gone in roughly 1.25 days."
This alert fires when your error rate exceeds 24 times your allowed rate, a burn rate that would consume the entire 30-day budget in roughly 1.25 days.
Multi-Window, Multi-Burn-Rate Alerts
For more sophisticated error budget monitoring, implement multi-window, multi-burn-rate alerts:
# prometheus-rules.yml
groups:
  - name: error-budget
    rules:
      - alert: ErrorBudgetBurnFast
        expr: (1 - (sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])))) > (0.001 * 144)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning very fast"
          description: "Error budget is being consumed 144x faster than allocated, which would deplete the monthly budget in approximately 5 hours."
      - alert: ErrorBudgetBurnMedium
        expr: (1 - (sum(rate(http_requests_total{status=~"2.."}[30m])) / sum(rate(http_requests_total[30m])))) > (0.001 * 24)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning at medium rate"
          description: "Error budget is being consumed 24x faster than allocated, which would deplete the monthly budget in approximately 1.25 days."
      - alert: ErrorBudgetBurnSlow
        expr: (1 - (sum(rate(http_requests_total{status=~"2.."}[6h])) / sum(rate(http_requests_total[6h])))) > (0.001 * 4)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning steadily"
          description: "Error budget is being consumed 4x faster than allocated, which would deplete the monthly budget in approximately 1 week."
This approach uses different time windows and burn rates to detect both fast-burning issues and slower, sustained problems.
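The version popularized by the Google SRE Workbook goes one step further and pairs each long window with a short window, requiring both to exceed the threshold so that alerts resolve quickly once the problem is fixed. A sketch of what the fast-burn condition might look like with that refinement (the 1h/5m pairing and the 14.4x figure, which corresponds to spending 2% of a 30-day budget in one hour, come from that pattern rather than from the rules above):
# Sketch: fast-burn condition requiring a 14.4x burn rate over both a 1h and a 5m window
(
  (1 - (sum(rate(http_requests_total{status=~"2.."}[1h])) / sum(rate(http_requests_total[1h])))) > (0.001 * 14.4)
)
and
(
  (1 - (sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])))) > (0.001 * 14.4)
)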
Real-World Example: Implementing Error Budgets for an E-commerce API
Let's walk through a complete example for an e-commerce API service:
1. Define SLOs
For our e-commerce API, we'll establish:
- 99.95% availability (about 22 minutes of downtime per 30-day month)
- 95% of requests completing in under 200ms
2. Instrument Your Application
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Request counter labeled by endpoint and status code.
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "ecommerce_http_requests_total",
            Help: "Total number of HTTP requests for e-commerce API",
        },
        []string{"endpoint", "status"},
    )
    // Request latency histogram labeled by endpoint.
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "ecommerce_request_duration_seconds",
            Help:    "Duration of HTTP requests for e-commerce API",
            Buckets: []float64{0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5, 5}, // 10ms to 5s, including the 200ms SLO boundary
        },
        []string{"endpoint"},
    )
)

func apiHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // Process request...
    duration := time.Since(start).Seconds()
    requestDuration.WithLabelValues(r.URL.Path).Observe(duration)
    httpRequestsTotal.WithLabelValues(r.URL.Path, "200").Inc() // the example records a fixed 200 status
}

func main() {
    http.HandleFunc("/api/", apiHandler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
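The handler above hard-codes a "200" status for brevity. To record the status code that is actually written, one option is a small wrapper around http.ResponseWriter; in the sketch below, statusRecorder and instrument are hypothetical names, "strconv" would need to be added to the imports, and the metric calls inside apiHandler would move into the wrapper:
// statusRecorder wraps http.ResponseWriter and remembers the status the handler writes.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler, timing it and labelling the counter with the captured status.
func instrument(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)
        requestDuration.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
        httpRequestsTotal.WithLabelValues(r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

// Usage in main: http.HandleFunc("/api/", instrument(apiHandler))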
3. Calculate Error Budgets with PromQL
For availability:
# Availability SLI
sum(rate(ecommerce_http_requests_total{status=~"2.."}[1h])) / sum(rate(ecommerce_http_requests_total[1h]))
# Error budget consumption for a 99.95% SLO: error rate divided by the allowed 0.05%
(1 - (sum(rate(ecommerce_http_requests_total{status=~"2.."}[1h])) / sum(rate(ecommerce_http_requests_total[1h])))) / 0.0005
For latency:
# Latency SLI (% of requests under 200ms)
sum(rate(ecommerce_request_duration_seconds_bucket{le="0.2"}[1h])) / sum(rate(ecommerce_request_duration_seconds_count[1h]))
# Error budget consumption for a 95% SLO: slow-request rate divided by the allowed 5%
(1 - (sum(rate(ecommerce_request_duration_seconds_bucket{le="0.2"}[1h])) / sum(rate(ecommerce_request_duration_seconds_count[1h])))) / 0.05
4. Using Error Budgets in Practice
- Weekly SRE Review:
  - Review current error budget consumption
  - If over 50% consumed, prioritize reliability work
  - If under 50% consumed, proceed with feature work
- Automated Release Policy:
  - Configure your CI/CD system to check error budget before deploying:
#!/bin/bash
# Script to check error budget before deployment
# Query Prometheus for current error budget consumption
BUDGET_CONSUMED=$(curl -s "http://prometheus:9090/api/v1/query?query=(1+-+(sum(rate(ecommerce_http_requests_total%7Bstatus%3D~%222..%22%7D%5B30d%5D))+%2F+sum(rate(ecommerce_http_requests_total%5B30d%5D))))+%2F+0.0005" | jq '.data.result[0].value[1]' | tr -d '"')
# If budget consumption is over 80%, block deployment
if (( $(echo "$BUDGET_CONSUMED > 0.8" | bc -l) )); then
  echo "Error budget is over 80% consumed. Blocking deployment."
  exit 1
else
  echo "Error budget is at acceptable levels. Proceeding with deployment."
  # Continue with deployment
fi
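How this check is wired in depends on your CI system. As one illustration only, assuming GitHub Actions and that the script above is saved as scripts/check-error-budget.sh (both assumptions), it could run as a gating step before the deploy step:
# .github/workflows/deploy.yml (illustrative)
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check error budget
        run: ./scripts/check-error-budget.sh   # exits non-zero if the budget is over 80% consumed
      - name: Deploy
        run: ./scripts/deploy.sh               # hypothetical deploy script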
Implementing Error Budget Policies
The real power of error budgets comes from the policies you build around them:
Example Error Budget Policy
- Define Error Budget Thresholds:
  - Green: <50% consumed - Accelerate feature development
  - Yellow: 50-80% consumed - Balance features and reliability
  - Red: >80% consumed - Focus on reliability improvements
- Establish Response Actions:
  - Implement a "feature freeze" when the error budget is depleted
  - Require additional testing for changes when less than 30% of the budget remains
  - Auto-approve low-risk changes when more than 70% of the budget remains
- Document the Policy in Version Control:
# error-budget-policy.yaml
service: ecommerce-api
slos:
  - name: availability
    target: "99.95%"
    measurement: sum(rate(ecommerce_http_requests_total{status=~"2.."}[window])) / sum(rate(ecommerce_http_requests_total[window]))
  - name: latency
    target: "95% under 200ms"
    measurement: sum(rate(ecommerce_request_duration_seconds_bucket{le="0.2"}[window])) / sum(rate(ecommerce_request_duration_seconds_count[window]))
error_budget_policies:
  - budget_remaining: "<20%"
    actions:
      - Halt all non-essential deployments
      - Escalate to engineering leadership
      - Form incident team if not already in place
  - budget_remaining: "20-50%"
    actions:
      - Require additional pre-deployment testing
      - Implement canary deployments for all changes
      - Prioritize reliability improvements
  - budget_remaining: ">50%"
    actions:
      - Normal development velocity
      - Standard change management procedures
Advanced Error Budget Concepts
1. Weighted Error Budgets
Not all errors have the same impact. You can weight your error budget by the importance of different endpoints:
# Weighted error budget for critical endpoints
sum(rate(ecommerce_http_requests_total{status!~"2..", endpoint=~"/(cart|checkout)/.*"}[1h])) * 5 +
sum(rate(ecommerce_http_requests_total{status!~"2..", endpoint=~"/browse/.*"}[1h])) * 1
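On its own this expression is a weighted error count; to consume it as a budget you would divide by the same weighting applied to all requests, giving a weighted error ratio. A sketch using the same weights as above:
# Weighted error ratio: weighted failed requests over weighted total requests
(
  sum(rate(ecommerce_http_requests_total{status!~"2..", endpoint=~"/(cart|checkout)/.*"}[1h])) * 5
  + sum(rate(ecommerce_http_requests_total{status!~"2..", endpoint=~"/browse/.*"}[1h])) * 1
)
/
(
  sum(rate(ecommerce_http_requests_total{endpoint=~"/(cart|checkout)/.*"}[1h])) * 5
  + sum(rate(ecommerce_http_requests_total{endpoint=~"/browse/.*"}[1h])) * 1
)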
2. Dynamic SLOs
For seasonal businesses, you might want stricter SLOs during high-traffic periods:
# Dynamic SLO based on traffic
(
sum(rate(ecommerce_http_requests_total[1d])) > 1000000
and (1 - (sum(rate(ecommerce_http_requests_total{status=~"2.."}[1h])) / sum(rate(ecommerce_http_requests_total[1h])))) > 0.0002
)
or
(
sum(rate(ecommerce_http_requests_total[1d])) <= 1000000
and (1 - (sum(rate(ecommerce_http_requests_total{status=~"2.."}[1h])) / sum(rate(ecommerce_http_requests_total[1h])))) > 0.001
)
3. Customer-Centric Error Budgets
Rather than system-level metrics, measure error budgets from the customer perspective:
# User-journey success rate
sum(rate(user_checkout_journey_success[1h])) / sum(rate(user_checkout_journey_attempts[1h]))
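The metrics in this query are not produced by the earlier instrumentation; they would come from journey-level counters in your application. A minimal Go sketch of what those counters might look like (package and function names are illustrative):
package checkout

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Counters behind the user-journey query above: incremented once per checkout
// journey, not once per HTTP request.
var (
    checkoutAttempts = promauto.NewCounter(prometheus.CounterOpts{
        Name: "user_checkout_journey_attempts",
        Help: "Checkout journeys started",
    })
    checkoutSuccess = promauto.NewCounter(prometheus.CounterOpts{
        Name: "user_checkout_journey_success",
        Help: "Checkout journeys completed successfully",
    })
)

// completeCheckout is a hypothetical hook called when a checkout journey ends.
func completeCheckout(orderPlaced bool) {
    checkoutAttempts.Inc()
    if orderPlaced {
        checkoutSuccess.Inc()
    }
}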
Common Pitfalls and Challenges
- Too Many SLOs: Start with just 1-3 key SLOs that genuinely reflect user experience
- Incorrect Time Windows: Use appropriate time windows for your traffic patterns
- Failure to Reset: Remember to reset error budgets at the end of each measurement period
- Ignoring Burn Rate: Monitor not just total consumption but how quickly budget is being depleted
- Lack of Buy-In: Ensure teams understand and commit to the error budget policy
Summary
Error budgets provide a quantitative framework for balancing reliability and innovation. With Prometheus, you can implement a robust error budget system by:
- Defining clear SLOs based on user experience
- Instrumenting your applications to capture relevant metrics
- Creating PromQL queries to calculate SLIs and error budget consumption
- Setting up visualizations and alerts
- Implementing policies that guide decision-making based on error budget status
By following these steps, you can create a data-driven reliability culture that supports both stability and innovation.
Additional Resources
For further learning about error budgets and SRE practices:
- Read Google's SRE Book, particularly the chapters on SLOs and Error Budgets
- Explore Prometheus documentation for more advanced PromQL queries
- Research "Multi-Window, Multi-Burn-Rate" alerting patterns
- Practice with the examples in this guide on a test service
Exercises
- Set up basic availability and latency SLOs for a simple web service
- Create a Grafana dashboard to visualize your error budgets
- Implement a multi-window alert system for different burn rates
- Draft an error budget policy for your team
- Simulate an outage and observe how it affects your error budget
Remember, the goal of error budgets isn't perfect reliability, but rather the right level of reliability for your users and business needs.