Alert Provisioning
Introduction
Alert provisioning in Grafana allows you to automate the creation and management of alerting rules and notification policies through configuration files rather than the UI. This approach is particularly valuable for large-scale deployments, consistent environments, and infrastructure-as-code practices.
In this guide, we'll explore how to provision alerts in Grafana, the benefits of this approach, and practical examples to help you implement alert provisioning in your own monitoring environments.
What is Alert Provisioning?
Alert provisioning is the practice of defining your Grafana alerting configuration as code or configuration files that can be version-controlled, automated, and deployed consistently across multiple Grafana instances. This contrasts with manual alert creation through the Grafana UI.
Key Benefits
- Version Control: Track changes to your alerting configuration over time
- Automation: Integrate with CI/CD pipelines for automated deployments
- Consistency: Ensure identical alert configurations across environments
- Scalability: Efficiently manage large numbers of alerts
- Disaster Recovery: Quickly restore alert configurations if needed
Alert Provisioning Methods
This guide covers two primary methods for provisioning alerts:
- File-based provisioning: Using YAML configuration files
- Terraform: Using the Grafana Terraform provider
Let's explore each approach.
File-Based Alert Provisioning
Grafana can load alert rules and notification policies from YAML files stored in specific directories.
Directory Structure
Grafana reads alerting provisioning files from the alerting directory under its provisioning path, which is /etc/grafana/provisioning/alerting/ by default. This guide organizes the files as follows:
/etc/grafana/provisioning/alerting/
├── rules/
│   ├── cpu_alerts.yaml
│   └── memory_alerts.yaml
└── notification_policies/
    └── team_policies.yaml
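If Grafana runs in a container, the files have to reach that directory somehow. The following is a minimal Docker Compose sketch, not a definitive setup: the service name, the published port, and the host directory ./provisioning/alerting are assumptions made for illustration.

```yaml
# docker-compose.yaml (hypothetical file; names and paths are assumptions)
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      # Host directory holding the alerting YAML files, mounted read-only
      # at Grafana's default alerting provisioning path.
      - ./provisioning/alerting:/etc/grafana/provisioning/alerting:ro
```

Grafana applies file-based provisioning at startup, so restart the container after changing the files. If your Grafana version does not pick up files in subdirectories, place the YAML files directly in the alerting directory.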
Provisioning Alert Rules
Alert rules are defined in YAML files with a specific structure. Here's an example:
apiVersion: 1
groups:
  - name: CPU Usage Alerts
    folder: Infrastructure
    interval: 60s
    rules:
      # uid is a hypothetical identifier; it must be unique within the organization.
      - uid: high_cpu_usage
        title: High CPU Usage
        condition: B
        data:
          # Query A: CPU usage percentage per instance over the last 10 minutes.
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'
              refId: A
          # Expression B: fire when the average of A is above 80.
          - refId: B
            datasourceUid: __expr__
            model:
              type: classic_conditions
              conditions:
                - evaluator:
                    params: [80]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: avg
              refId: B
        noDataState: OK
        execErrState: Error
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 5 minutes. Current value: {{ $value }}%"
Provisioning Notification Policies
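Notification policies define how alerts are routed to contact points, which the policy file refers to as receivers. Each nested route applies only to alerts whose labels satisfy all of its matchers. Here's an example: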
apiVersion: 1
policies:
  - receiver: default-email
    group_by: ['alertname']
    routes:
      - receiver: ops-team-slack
        group_by: ['alertname', 'instance']
        matchers:
          - severity =~ "critical|warning"
          - category = "system"
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        mute_time_intervals:
          - weekends
      - receiver: dev-team-slack
        group_by: ['alertname', 'job']
        matchers:
          - category = "application"
        group_wait: 45s
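The policy above refers to three receivers (default-email, ops-team-slack, dev-team-slack) and one mute timing (weekends), all of which must exist before the policy can be applied. The following sketch shows how they might be provisioned alongside it. The UIDs, the email address, and the Slack webhook URLs are placeholders, not values from this guide, and both sections are shown in one file only for brevity.

```yaml
apiVersion: 1

# Contact points referenced by the `receiver` fields in the policy tree.
contactPoints:
  - orgId: 1
    name: default-email
    receivers:
      - uid: default_email            # hypothetical UID
        type: email
        settings:
          addresses: ops@example.com  # placeholder address
  - orgId: 1
    name: ops-team-slack
    receivers:
      - uid: ops_team_slack           # hypothetical UID
        type: slack
        settings:
          url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK  # placeholder
  - orgId: 1
    name: dev-team-slack
    receivers:
      - uid: dev_team_slack           # hypothetical UID
        type: slack
        settings:
          url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK  # placeholder

# Mute timing referenced by `mute_time_intervals` in the ops-team-slack route.
muteTimes:
  - orgId: 1
    name: weekends
    time_intervals:
      - weekdays: ['saturday', 'sunday']
```

With these in place, a warning-level system alert is routed to ops-team-slack except on weekends, an application alert goes to dev-team-slack, and anything that matches neither route falls back to default-email.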
Provisioning with Terraform
For teams already using Terraform for infrastructure management, Grafana's Terraform provider offers another way to provision alerts.
Installing the Grafana Terraform Provider
Add the following to your Terraform configuration:
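terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "1.28.0"
    }
  }
}
# API key or service account token, supplied outside version control
# (for example through the TF_VAR_grafana_api_key environment variable).
variable "grafana_api_key" {
  type      = string
  sensitive = true
}
provider "grafana" {
  url  = "http://grafana.example.com"
  auth = var.grafana_api_key
}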
Creating Alert Rules with Terraform
Here's an example of defining an alert rule with Terraform:
# Folder that holds the rule group; the resource below references its UID.
resource "grafana_folder" "infrastructure" {
  title = "Infrastructure"
}
resource "grafana_rule_group" "cpu_alerts" {
  name             = "CPU Usage Alerts"
  folder_uid       = grafana_folder.infrastructure.uid
  interval_seconds = 60
  org_id           = 1 # owning organization; 1 is Grafana's default org
  rule {
    name           = "High CPU Usage"
    for            = "5m"
    condition      = "B"
    no_data_state  = "OK"
    exec_err_state = "Error"
    # Query A: CPU usage percentage per instance over the last 10 minutes.
    data {
      ref_id         = "A"
      datasource_uid = "PBFA97CFB590B2093"
      relative_time_range {
        from = 600
        to   = 0
      }
      model = jsonencode({
        expr  = "100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])))"
        refId = "A"
      })
    }
    # Expression B: fire when the average of A is above 80.
    data {
      ref_id         = "B"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type = "classic_conditions"
        conditions = [{
          evaluator = {
            params = [80]
            type   = "gt"
          }
          operator = {
            type = "and"
          }
          query = {
            params = ["A"]
          }
          reducer = {
            type = "avg"
          }
        }]
        refId = "B"
      })
    }
    labels = {
      severity = "warning"
      category = "system"
    }
    annotations = {
      summary     = "High CPU usage on {{ $labels.instance }}"
      description = "CPU usage has been above 80% for 5 minutes. Current value: {{ $value }}%"
    }
  }
}
Practical Example: Implementing Alert Provisioning for a Web Application
Let's walk through a real-world example of setting up provisioned alerts for a web application environment.
Scenario
You manage a web application with:
- Frontend services
- Backend API services
- Database servers
- Messaging queue
Step 1: Define Your Alert Rules Structure
Create a directory structure for your alert provisioning:
/provisioning/
├── alerting/
│   ├── rules/
│   │   ├── frontend_alerts.yaml
│   │   ├── backend_alerts.yaml