Custom Metric Collection
Introduction
In the world of monitoring, predefined metrics can take you only so far. To gain deeper insights into your specific applications and services, you'll often need to create and collect custom metrics tailored to your unique use cases.
Custom metric collection in Prometheus allows you to instrument your code with specific measurements that matter to your application's performance, health, and business logic. Whether you're tracking user logins, order processing times, or memory usage patterns specific to your application, custom metrics provide the visibility you need.
In this guide, we'll explore how to define, implement, and collect custom metrics using Prometheus client libraries, understand best practices, and see real-world applications of custom metric collection.
Understanding Prometheus Metric Types
Before diving into creating custom metrics, let's understand the four fundamental metric types in Prometheus:
Counter
A counter is a cumulative metric that represents a single monotonically increasing value. Counters can only increase or be reset to zero (usually when the process restarts).
Use cases:
- Number of requests processed
- Number of errors
- Total tasks completed
Gauge
A gauge represents a single numerical value that can arbitrarily go up and down.
Use cases:
- Memory usage
- Current temperature
- Number of active connections
Histogram
A histogram samples observations and counts them in configurable buckets. It also provides a sum of all observed values.
Use cases:
- Request durations
- Response sizes
- Latency measurements
Summary
Similar to a histogram, a summary samples observations. Instead of buckets, it calculates configurable quantiles over a sliding time window.
Use cases:
- Request durations with quantile calculations
- When you need precise percentile measurements
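To make these types concrete, here is a minimal sketch using the Python client (the metric name and buckets are illustrative): observing values on a histogram updates its cumulative bucket counts along with a running count and sum.

```python
from prometheus_client import CollectorRegistry, Histogram

# A dedicated registry keeps the example self-contained
registry = CollectorRegistry()
h = Histogram('demo_request_duration_seconds', 'Demo request duration',
              buckets=[0.1, 0.5, 1.0], registry=registry)

h.observe(0.3)
h.observe(0.7)

# _count is the number of observations; _sum is their total
count = registry.get_sample_value('demo_request_duration_seconds_count')
total = registry.get_sample_value('demo_request_duration_seconds_sum')
# Buckets are cumulative: le="0.5" counts all observations <= 0.5
bucket_05 = registry.get_sample_value('demo_request_duration_seconds_bucket', {'le': '0.5'})
```

Here count is 2, total is 1.0, and the 0.5 bucket holds only the 0.3 observation. This cumulative-bucket layout is what PromQL's histogram_quantile() operates on.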
Creating Custom Metrics with Client Libraries
Prometheus offers client libraries for many programming languages. Let's explore how to create custom metrics in some popular ones:
Go
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Counter example
    requestCounter := promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "The total number of processed requests",
    })

    // Gauge example
    connectionGauge := promauto.NewGauge(prometheus.GaugeOpts{
        Name: "myapp_active_connections",
        Help: "The current number of active connections",
    })

    // Histogram example
    durationHistogram := promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_request_duration_seconds",
        Help:    "Request duration distribution",
        Buckets: prometheus.LinearBuckets(0.01, 0.05, 10), // 10 buckets, from 0.01 to 0.46 seconds
    })

    // Simulate some metrics
    go func() {
        for {
            requestCounter.Inc()
            connectionGauge.Set(float64(100 + time.Now().Second()))
            durationHistogram.Observe(0.1 + float64(time.Now().Nanosecond())/1e9)
            time.Sleep(1 * time.Second)
        }
    }()

    // Expose metrics on /metrics endpoint
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":2112", nil)
}
Python
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import random
import time

# Create metrics
REQUEST_COUNT = Counter('myapp_requests_total', 'Total app requests')
ACTIVE_CONNECTIONS = Gauge('myapp_active_connections', 'Number of active connections')
REQUEST_DURATION = Histogram('myapp_request_duration_seconds',
                             'Request duration in seconds',
                             buckets=[0.01, 0.05, 0.1, 0.5, 1, 5])

# Start server
start_http_server(8000)

# Generate some metrics
while True:
    # Increment counter
    REQUEST_COUNT.inc()
    # Set gauge to random value
    connection_count = random.randint(80, 120)
    ACTIVE_CONNECTIONS.set(connection_count)
    # Observe histogram value
    duration = random.random() * 0.5
    REQUEST_DURATION.observe(duration)
    time.sleep(1)
Java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;
import java.util.Random;

public class CustomMetricsExample {
    static final Counter requestCounter = Counter.build()
        .name("myapp_requests_total")
        .help("Total requests processed")
        .register();

    static final Gauge connectionGauge = Gauge.build()
        .name("myapp_active_connections")
        .help("Current number of active connections")
        .register();

    static final Histogram requestDuration = Histogram.build()
        .name("myapp_request_duration_seconds")
        .help("Request duration distribution")
        .buckets(0.01, 0.05, 0.1, 0.5, 1, 5)
        .register();

    public static void main(String[] args) throws IOException, InterruptedException {
        HTTPServer server = new HTTPServer(8000);
        Random random = new Random();
        while (true) {
            // Increment counter
            requestCounter.inc();
            // Update gauge
            connectionGauge.set(80 + random.nextInt(41));
            // Record histogram value
            requestDuration.observe(random.nextDouble() * 0.5);
            Thread.sleep(1000);
        }
    }
}
Node.js
const express = require('express');
const client = require('prom-client');

const app = express();

// Create a Registry to register the metrics
const register = new client.Registry();
client.collectDefaultMetrics({ register });

// Create custom metrics
const requestCounter = new client.Counter({
  name: 'myapp_requests_total',
  help: 'Total number of requests',
  registers: [register]
});

const connectionGauge = new client.Gauge({
  name: 'myapp_active_connections',
  help: 'Number of active connections',
  registers: [register]
});

const requestDuration = new client.Histogram({
  name: 'myapp_request_duration_seconds',
  help: 'Request duration distribution',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register]
});

// Simulate metrics
setInterval(() => {
  requestCounter.inc();
  connectionGauge.set(80 + Math.floor(Math.random() * 41));
  requestDuration.observe(Math.random() * 0.5);
}, 1000);

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8000, () => {
  console.log('Server is running on http://localhost:8000');
});
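These endpoints are only useful once Prometheus scrapes them. A minimal scrape configuration for the examples above might look like this (the job name and targets are assumptions matching the ports used in the examples):

```yaml
scrape_configs:
  - job_name: 'myapp'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:2112', 'localhost:8000']
```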
Best Practices for Custom Metric Collection
1. Naming Conventions
Follow a consistent naming pattern for your metrics:
<namespace>_<subsystem>_<name>_<unit>
Use base units (seconds, bytes) as the unit suffix; counter names additionally end in _total. For example:
http_requests_total
node_memory_usage_bytes
api_request_duration_seconds
2. Labels and Cardinality
Use labels to add dimensions to your metrics, but be cautious about cardinality explosion:
# Good - Low cardinality
HTTP_REQUESTS = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status_code', 'endpoint']
)

# Bad - High cardinality (user_id could have millions of values)
HTTP_REQUESTS = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status_code', 'user_id']
)
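Once labels are declared, you record values through .labels(); each distinct label combination becomes its own time series, which is exactly why cardinality matters. A small sketch using the low-cardinality counter above, registered on its own registry so it runs standalone:

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
HTTP_REQUESTS = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status_code', 'endpoint'],
    registry=registry,
)

# Each distinct label combination is tracked as a separate series
HTTP_REQUESTS.labels(method='GET', status_code='200', endpoint='/api/items').inc()
HTTP_REQUESTS.labels(method='GET', status_code='200', endpoint='/api/items').inc()
HTTP_REQUESTS.labels(method='POST', status_code='500', endpoint='/api/orders').inc()

# Read back one series by its full label set
value = registry.get_sample_value(
    'http_requests_total',
    {'method': 'GET', 'status_code': '200', 'endpoint': '/api/items'},
)
```

The GET series ends up at 2 while the POST series sits at 1; multiply the possible values of each label together and you get the series count Prometheus must store.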
3. Choosing the Right Metric Type
Select the appropriate metric type based on what you're measuring:
- Use counters for events or totals
- Use gauges for current values
- Use histograms for distributions of values, especially latencies
4. Documentation
Always include comprehensive help text for each metric:
requestCounter := promauto.NewCounter(prometheus.CounterOpts{
    Name: "myapp_requests_total",
    Help: "The total number of HTTP requests processed, labeled by method, status code, and endpoint",
})
Real-World Custom Metric Collection Example
Let's build a more comprehensive example that monitors a fictional e-commerce application:
E-commerce Application Metrics
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import random
import time

# Business metrics
CHECKOUT_COUNTER = Counter('ecommerce_checkouts_total',
                           'Total number of completed checkouts')
CART_ABANDONMENT = Counter('ecommerce_cart_abandonments_total',
                           'Total number of abandoned shopping carts')
PRODUCT_VIEWS = Counter('ecommerce_product_views_total',
                        'Product views',
                        ['product_category'])

# Technical metrics
API_REQUEST_DURATION = Histogram('ecommerce_api_request_duration_seconds',
                                 'API request duration in seconds',
                                 ['endpoint'],
                                 buckets=[0.01, 0.05, 0.1, 0.5, 1, 5])
DB_CONNECTION_POOL = Gauge('ecommerce_db_connections_active',
                           'Number of active database connections')
PAYMENT_PROCESSING_TIME = Summary('ecommerce_payment_processing_seconds',
                                  'Time spent processing payments')

# Start Prometheus HTTP server
start_http_server(8000)
print("Metrics available at http://localhost:8000/metrics")

# Simulate application activity
product_categories = ['electronics', 'clothing', 'home', 'food', 'toys']
api_endpoints = ['products', 'cart', 'checkout', 'user', 'search']

while True:
    # Simulate product views
    category = random.choice(product_categories)
    PRODUCT_VIEWS.labels(product_category=category).inc()
    # Simulate checkouts and cart abandonments
    if random.random() < 0.1:  # 10% checkout
        CHECKOUT_COUNTER.inc()
    if random.random() < 0.2:  # 20% abandon cart
        CART_ABANDONMENT.inc()
    # Simulate API requests with different durations
    endpoint = random.choice(api_endpoints)
    duration = 0.05 + (random.random() * 0.3)  # Between 0.05s and 0.35s
    API_REQUEST_DURATION.labels(endpoint=endpoint).observe(duration)
    # Simulate DB connection pool fluctuations
    connections = random.randint(5, 20)
    DB_CONNECTION_POOL.set(connections)
    # Simulate payment processing time
    if random.random() < 0.05:  # 5% of iterations process a payment
        payment_time = 0.5 + (random.random() * 2.0)  # Between 0.5s and 2.5s
        PAYMENT_PROCESSING_TIME.observe(payment_time)
    time.sleep(0.1)  # Generate metrics quickly for demonstration
Resulting Metrics
This example would generate metrics such as:
# HELP ecommerce_checkouts_total Total number of completed checkouts
# TYPE ecommerce_checkouts_total counter
ecommerce_checkouts_total 42
# HELP ecommerce_cart_abandonments_total Total number of abandoned shopping carts
# TYPE ecommerce_cart_abandonments_total counter
ecommerce_cart_abandonments_total 87
# HELP ecommerce_product_views_total Product views
# TYPE ecommerce_product_views_total counter
ecommerce_product_views_total{product_category="electronics"} 132
ecommerce_product_views_total{product_category="clothing"} 98
ecommerce_product_views_total{product_category="home"} 65
ecommerce_product_views_total{product_category="food"} 43
ecommerce_product_views_total{product_category="toys"} 54
# HELP ecommerce_api_request_duration_seconds API request duration in seconds
# TYPE ecommerce_api_request_duration_seconds histogram
...
# HELP ecommerce_db_connections_active Number of active database connections
# TYPE ecommerce_db_connections_active gauge
ecommerce_db_connections_active 12
# HELP ecommerce_payment_processing_seconds Time spent processing payments
# TYPE ecommerce_payment_processing_seconds summary
...
Visualizing Custom Metrics
Once you've collected custom metrics, you can create meaningful dashboards in Grafana. Here's a simple example of PromQL queries for our e-commerce metrics:
- Checkout Conversion Rate:
sum(rate(ecommerce_checkouts_total[5m])) / sum(rate(ecommerce_product_views_total[5m]))
- Cart Abandonment Rate:
sum(rate(ecommerce_cart_abandonments_total[5m])) / (sum(rate(ecommerce_cart_abandonments_total[5m])) + sum(rate(ecommerce_checkouts_total[5m])))
- API Latency by Endpoint (95th Percentile):
histogram_quantile(0.95, sum(rate(ecommerce_api_request_duration_seconds_bucket[5m])) by (endpoint, le))
- Top Product Categories by Views:
topk(3, sum(rate(ecommerce_product_views_total[1h])) by (product_category))
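If you chart these queries frequently, Prometheus recording rules can precompute them on each evaluation cycle. A hedged sketch of a rules file for the conversion-rate query (the group and rule names are assumptions):

```yaml
groups:
  - name: ecommerce_rules
    rules:
      - record: job:ecommerce_checkout_conversion_rate:5m
        expr: sum(rate(ecommerce_checkouts_total[5m])) / sum(rate(ecommerce_product_views_total[5m]))
```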
Push vs. Pull for Custom Metrics
Prometheus primarily uses a pull model where the Prometheus server scrapes metrics endpoints. However, sometimes you need to push metrics:
When to Use Push Gateway
- Short-lived jobs that may complete before scraping
- Batch jobs
- Systems behind firewalls without direct access
Example using the Push Gateway:
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

# Use a dedicated registry so only this job's metrics are pushed
registry = CollectorRegistry()
job_completion = Counter('batch_job_completions_total',
                         'Number of completed batch jobs',
                         registry=registry)

# Do some work
job_completion.inc()

# Push to Pushgateway
push_to_gateway('localhost:9091', job='batch_processor', registry=registry)
Implementing a Custom Collector
Sometimes you need to collect metrics from systems that don't support Prometheus directly. You can implement a custom collector:
from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily, REGISTRY

class CustomCollector(object):
    def collect(self):
        # Yield metrics
        c = CounterMetricFamily('my_custom_counter', 'Description of counter', labels=['label1'])
        c.add_metric(['value1'], 15)
        yield c
        g = GaugeMetricFamily('my_custom_gauge', 'Description of gauge', labels=['label1'])
        g.add_metric(['value1'], 12.3)
        yield g

# Register the custom collector
REGISTRY.register(CustomCollector())
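To sanity-check a collector like this without standing up a server, you can register it on a fresh CollectorRegistry and read the values back (the class is repeated here so the snippet runs on its own; note that the Python client exposes counter samples with a _total suffix):

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

class CustomCollector:
    def collect(self):
        c = CounterMetricFamily('my_custom_counter', 'Description of counter', labels=['label1'])
        c.add_metric(['value1'], 15)
        yield c
        g = GaugeMetricFamily('my_custom_gauge', 'Description of gauge', labels=['label1'])
        g.add_metric(['value1'], 12.3)
        yield g

registry = CollectorRegistry()
registry.register(CustomCollector())

# Counter samples carry the _total suffix in the exposition format
counter_value = registry.get_sample_value('my_custom_counter_total', {'label1': 'value1'})
gauge_value = registry.get_sample_value('my_custom_gauge', {'label1': 'value1'})
```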
Working with Exporters
When you can't instrument your application directly, use or build an exporter that converts metrics from one format to Prometheus format.
Here's a simple example of a custom exporter for a legacy API:
import requests
import time
from prometheus_client import start_http_server, Gauge, Counter

# Define metrics
USERS_ONLINE = Gauge('legacy_users_online', 'Number of users currently online')
API_ERRORS = Counter('legacy_api_errors_total', 'Number of legacy API errors')

def scrape_legacy_api():
    try:
        # Call legacy API
        response = requests.get('http://legacy-service/stats', timeout=5)
        # Check the status before parsing the body
        if not response.ok:
            API_ERRORS.inc()
            return
        data = response.json()
        # Update Prometheus metrics
        USERS_ONLINE.set(data['active_users'])
    except Exception:
        API_ERRORS.inc()

# Start server
start_http_server(8000)

# Main loop
while True:
    scrape_legacy_api()
    time.sleep(15)  # Scrape every 15 seconds
Advanced Custom Metrics
Let's look at some more advanced metrics concepts:
Multi-process Metrics Collection
When an application runs as multiple processes (for example, Gunicorn or uWSGI workers), each process holds its own metric values, so the Python client provides a multiprocess mode that aggregates them via files on disk:
# PROMETHEUS_MULTIPROC_DIR must be set in the environment *before*
# prometheus_client is imported, e.g.:
#   export PROMETHEUS_MULTIPROC_DIR=/tmp/prom_metrics
from prometheus_client import Counter, CollectorRegistry, multiprocess, start_http_server

# Worker processes create and update metrics as usual; samples are
# written to files in PROMETHEUS_MULTIPROC_DIR
c = Counter('my_counter', 'My counter help')
c.inc()

# The process serving /metrics aggregates all workers' files
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
start_http_server(8000, registry=registry)
Metrics with Timestamps
Prometheus normally stamps every sample with the scrape time, and direct instrumentation in the Python client does not accept custom timestamps. A gauge can, however, be set to the current Unix time, which is useful for "last successful run" metrics:
from prometheus_client import Gauge

g = Gauge('my_job_last_success_unixtime', 'Last time the job succeeded')
g.set_to_current_time()

For the rare cases where a sample genuinely carries its own timestamp (for example, when proxying metrics from another system), a custom collector can attach one explicitly:
from prometheus_client.core import GaugeMetricFamily

g = GaugeMetricFamily('my_custom_gauge', 'Description', labels=['label'])
g.add_metric(['another_value'], 15, timestamp=1623185425)  # Unix seconds
Summary
Custom metric collection in Prometheus provides a powerful way to gain deep insights into your applications and infrastructure. In this guide, we've covered:
- The four types of Prometheus metrics: Counter, Gauge, Histogram, and Summary
- How to implement custom metrics in various programming languages
- Best practices for naming, labeling, and documenting metrics
- Real-world examples of custom metrics for business and technical monitoring
- Advanced topics like the Push Gateway, custom collectors, and exporters
By implementing custom metrics, you can:
- Track business-relevant indicators
- Measure technical performance
- Create comprehensive monitoring dashboards
- Set up meaningful alerts
Additional Resources
- Official Prometheus Documentation on Client Libraries
- Prometheus Best Practices Guide
- PromQL Cheat Sheet
Exercises
1. Basic Instrumentation: Add custom metrics to an existing application to track:
- Number of API requests
- Response times
- Error rates
2. Business Metrics: Design and implement metrics for a fictional e-commerce site that would help answer these questions:
- What's the conversion rate?
- Which products are most viewed?
- When do we experience the most traffic?
3. Grafana Dashboard: Create a Grafana dashboard showing your custom metrics with:
- A graph of request rates over time
- A heatmap of request durations
- A gauge showing current active users
- A table of top endpoints by request count
4. Custom Exporter: Build a simple exporter that collects system information not available through the node exporter and exposes it as Prometheus metrics.