Custom Metric Collection
Introduction
In the world of monitoring, predefined metrics can take you only so far. To gain deeper insights into your specific applications and services, you'll often need to create and collect custom metrics tailored to your unique use cases.
Custom metric collection in Prometheus allows you to instrument your code with specific measurements that matter to your application's performance, health, and business logic. Whether you're tracking user logins, order processing times, or memory usage patterns specific to your application, custom metrics provide the visibility you need.
In this guide, we'll explore how to define, implement, and collect custom metrics using Prometheus client libraries, understand best practices, and see real-world applications of custom metric collection.
Understanding Prometheus Metric Types
Before diving into creating custom metrics, let's understand the four fundamental metric types in Prometheus:
Counter
A counter is a cumulative metric that represents a single monotonically increasing value. Counters can only increase or be reset to zero (usually when the process restarts).
Use cases:
- Number of requests processed
- Number of errors
- Total tasks completed
Gauge
A gauge represents a single numerical value that can arbitrarily go up and down.
Use cases:
- Memory usage
- Current temperature
- Number of active connections
Histogram
A histogram samples observations and counts them in configurable buckets. It also provides a sum of all observed values.
Use cases:
- Request durations
- Response sizes
- Latency measurements
Summary
Similar to a histogram, a summary samples observations. Instead of buckets, it calculates configurable quantiles over a sliding time window.
Use cases:
- Request durations with quantile calculations
- When you need precise percentile measurements
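To make these types concrete, here is a minimal sketch using the Python client (the metric name and buckets are illustrative): observing values on a histogram updates its cumulative bucket counts along with a running count and sum.

```python
from prometheus_client import CollectorRegistry, Histogram

# A dedicated registry keeps the example self-contained
registry = CollectorRegistry()
h = Histogram('demo_request_duration_seconds', 'Demo request duration',
              buckets=[0.1, 0.5, 1.0], registry=registry)

h.observe(0.3)
h.observe(0.7)

# _count is the number of observations; _sum is their total
count = registry.get_sample_value('demo_request_duration_seconds_count')
total = registry.get_sample_value('demo_request_duration_seconds_sum')
# Buckets are cumulative: le="0.5" counts all observations <= 0.5
bucket_05 = registry.get_sample_value('demo_request_duration_seconds_bucket', {'le': '0.5'})
```

Here count is 2, total is 1.0, and the 0.5 bucket holds only the 0.3 observation. This cumulative-bucket layout is what PromQL's histogram_quantile() operates on.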
Creating Custom Metrics with Client Libraries
Prometheus offers client libraries for many programming languages. Let's explore how to create custom metrics in some popular ones:
Go
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Counter example
    requestCounter := promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "The total number of processed requests",
    })

    // Gauge example
    connectionGauge := promauto.NewGauge(prometheus.GaugeOpts{
        Name: "myapp_active_connections",
        Help: "The current number of active connections",
    })

    // Histogram example
    durationHistogram := promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_request_duration_seconds",
        Help:    "Request duration distribution",
        Buckets: prometheus.LinearBuckets(0.01, 0.05, 10), // 10 buckets, from 0.01 to 0.46 seconds
    })

    // Simulate some metrics
    go func() {
        for {
            requestCounter.Inc()
            connectionGauge.Set(float64(100 + time.Now().Second()))
            durationHistogram.Observe(0.1 + float64(time.Now().Nanosecond())/1e9)
            time.Sleep(1 * time.Second)
        }
    }()

    // Expose metrics on /metrics endpoint
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":2112", nil)
}
Python
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import random
import time

# Create metrics
REQUEST_COUNT = Counter('myapp_requests_total', 'Total app requests')
ACTIVE_CONNECTIONS = Gauge('myapp_active_connections', 'Number of active connections')
REQUEST_DURATION = Histogram('myapp_request_duration_seconds',
                             'Request duration in seconds',
                             buckets=[0.01, 0.05, 0.1, 0.5, 1, 5])

# Start server
start_http_server(8000)

# Generate some metrics
while True:
    # Increment counter
    REQUEST_COUNT.inc()
    # Set gauge to random value
    connection_count = random.randint(80, 120)
    ACTIVE_CONNECTIONS.set(connection_count)
    # Observe histogram value
    duration = random.random() * 0.5
    REQUEST_DURATION.observe(duration)
    time.sleep(1)
Java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;
import java.util.Random;

public class CustomMetricsExample {
    static final Counter requestCounter = Counter.build()
        .name("myapp_requests_total")
        .help("Total requests processed")
        .register();

    static final Gauge connectionGauge = Gauge.build()
        .name("myapp_active_connections")
        .help("Current number of active connections")
        .register();

    static final Histogram requestDuration = Histogram.build()
        .name("myapp_request_duration_seconds")
        .help("Request duration distribution")
        .buckets(0.01, 0.05, 0.1, 0.5, 1, 5)
        .register();

    public static void main(String[] args) throws IOException, InterruptedException {
        HTTPServer server = new HTTPServer(8000);
        Random random = new Random();
        while (true) {
            // Increment counter
            requestCounter.inc();
            // Update gauge
            connectionGauge.set(80 + random.nextInt(41));
            // Record histogram value
            requestDuration.observe(random.nextDouble() * 0.5);
            Thread.sleep(1000);
        }
    }
}
Node.js
const express = require('express');
const client = require('prom-client');

const app = express();

// Create a Registry to register the metrics
const register = new client.Registry();
client.collectDefaultMetrics({ register });

// Create custom metrics
const requestCounter = new client.Counter({
  name: 'myapp_requests_total',
  help: 'Total number of requests',
  registers: [register]
});

const connectionGauge = new client.Gauge({
  name: 'myapp_active_connections',
  help: 'Number of active connections',
  registers: [register]
});

const requestDuration = new client.Histogram({
  name: 'myapp_request_duration_seconds',
  help: 'Request duration distribution',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register]
});

// Simulate metrics
setInterval(() => {
  requestCounter.inc();
  connectionGauge.set(80 + Math.floor(Math.random() * 41));
  requestDuration.observe(Math.random() * 0.5);
}, 1000);

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8000, () => {
  console.log('Server is running on http://localhost:8000');
});
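These endpoints are only useful once Prometheus scrapes them. A minimal scrape configuration for the examples above might look like this (the job name and targets are assumptions matching the ports used in the examples):

```yaml
scrape_configs:
  - job_name: 'myapp'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:2112', 'localhost:8000']
```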
Best Practices for Custom Metric Collection
1. Naming Conventions
Follow a consistent naming pattern for your metrics:
<namespace>_<subsystem>_<name>_<unit>
Use base units (seconds, bytes) as the unit suffix; counter names additionally end in _total. For example:
http_requests_total
node_memory_usage_bytes
api_request_duration_seconds
2. Labels and Cardinality
Use labels to add dimensions to your metrics, but be cautious about cardinality explosion:
# Good - Low cardinality
HTTP_REQUESTS = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status_code', 'endpoint']
)

# Bad - High cardinality (user_id could have millions of values)
HTTP_REQUESTS = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status_code', 'user_id']
)
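Once labels are declared, you record values through .labels(); each distinct label combination becomes its own time series, which is exactly why cardinality matters. A small sketch using the low-cardinality counter above, registered on its own registry so it runs standalone:

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
HTTP_REQUESTS = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status_code', 'endpoint'],
    registry=registry,
)

# Each distinct label combination is tracked as a separate series
HTTP_REQUESTS.labels(method='GET', status_code='200', endpoint='/api/items').inc()
HTTP_REQUESTS.labels(method='GET', status_code='200', endpoint='/api/items').inc()
HTTP_REQUESTS.labels(method='POST', status_code='500', endpoint='/api/orders').inc()

# Read back one series by its full label set
value = registry.get_sample_value(
    'http_requests_total',
    {'method': 'GET', 'status_code': '200', 'endpoint': '/api/items'},
)
```

The GET series ends up at 2 while the POST series sits at 1; multiply the possible values of each label together and you get the series count Prometheus must store.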
3. Choosing the Right Metric Type
Select the appropriate metric type based on what you're measuring:
- Use counters for events or totals
- Use gauges for current values
- Use histograms for distributions of values, especially latencies
4. Documentation
Always include comprehensive help text for each metric:
requestCounter := promauto.NewCounter(prometheus.CounterOpts{
    Name: "myapp_requests_total",
    Help: "The total number of HTTP requests processed, labeled by method, status code, and endpoint",
})
Real-World Custom Metric Collection Example
Let's build a more comprehensive example that monitors a fictional e-commerce application:
E-commerce Application Metrics
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import random
import time

# Business metrics
CHECKOUT_COUNTER = Counter('ecommerce_checkouts_total',
                           'Total number of completed checkouts')
CART_ABANDONMENT = Counter('ecommerce_cart_abandonments_total',
                           'Total number of abandoned shopping carts')
PRODUCT_VIEWS = Counter('ecommerce_product_views_total',
                        'Product views',
                        ['product_category'])

# Technical metrics
API_REQUEST_DURATION = Histogram('ecommerce_api_request_duration_seconds',
                                 'API request duration in seconds',
                                 ['endpoint'],
                                 buckets=[0.01, 0.05, 0.1, 0.5, 1, 5])
DB_CONNECTION_POOL = Gauge('ecommerce_db_connections_active',
                           'Number of active database connections')
PAYMENT_PROCESSING_TIME = Summary('ecommerce_payment_processing_seconds',
                                  'Time spent processing payments')

# Start Prometheus HTTP server
start_http_server(8000)
print("Metrics available at http://localhost:8000/metrics")

# Simulate application activity
product_categories = ['electronics', 'clothing', 'home', 'food', 'toys']
api_endpoints = ['products', 'cart', 'checkout', 'user', 'search']

while True:
    # Simulate product views
    category = random.choice(product_categories)
    PRODUCT_VIEWS.labels(product_category=category).inc()
    # Simulate checkouts and cart abandonments
    if random.random() < 0.1:  # 10% checkout
        CHECKOUT_COUNTER.inc()
    if random.random() < 0.2:  # 20% abandon cart
        CART_ABANDONMENT.inc()
    # Simulate API requests with different durations
    endpoint = random.choice(api_endpoints)
    duration = 0.05 + (random.random() * 0.3)  # Between 0.05s and 0.35s
    API_REQUEST_DURATION.labels(endpoint=endpoint).observe(duration)
    # Simulate DB connection pool fluctuations
    connections = random.randint(5, 20)
    DB_CONNECTION_POOL.set(connections)
    # Simulate payment processing time
    if random.random() < 0.05:  # 5% of iterations process a payment
        payment_time = 0.5 + (random.random() * 2.0)  # Between 0.5s and 2.5s
        PAYMENT_PROCESSING_TIME.observe(payment_time)
    time.sleep(0.1)  # Generate metrics quickly for demonstration
Resulting Metrics
This example would generate metrics such as:
# HELP ecommerce_checkouts_total Total number of completed checkouts
# TYPE ecommerce_checkouts_total counter
ecommerce_checkouts_total 42
# HELP ecommerce_cart_abandonments_total Total number of abandoned shopping carts
# TYPE ecommerce_cart_abandonments_total counter
ecommerce_cart_abandonments_total 87
# HELP ecommerce_product_views_total Product views
# TYPE ecommerce_product_views_total counter
ecommerce_product_views_total{product_category="electronics"} 132
ecommerce_product_views_total{product_category="clothing"} 98
ecommerce_product_views_total{product_category="home"} 65
ecommerce_product_views_total{product_category="food"} 43
ecommerce_product_views_total{product_category="toys"} 54
# HELP ecommerce_api_request_duration_seconds API request duration in seconds
# TYPE ecommerce_api_request_duration_seconds histogram
...
# HELP ecommerce_db_connections_active Number of active database connections
# TYPE ecommerce_db_connections_active gauge
ecommerce_db_connections_active 12
# HELP ecommerce_payment_processing_seconds Time spent processing payments
# TYPE ecommerce_payment_processing_seconds summary
...
Visualizing Custom Metrics
Once you've collected custom metrics, you can create meaningful dashboards in Grafana. Here's a simple example of PromQL queries for our e-commerce metrics:
- Checkout Conversion Rate:
sum(rate(ecommerce_checkouts_total[5m])) / sum(rate(ecommerce_product_views_total[5m]))
- Cart Abandonment Rate:
sum(rate(ecommerce_cart_abandonments_total[5m])) / (sum(rate(ecommerce_cart_abandonments_total[5m])) + sum(rate(ecommerce_checkouts_total[5m])))
- API Latency by Endpoint (95th Percentile):
histogram_quantile(0.95, sum(rate(ecommerce_api_request_duration_seconds_bucket[5m])) by (endpoint, le))
- Top Product Categories by Views:
topk(3, sum(rate(ecommerce_product_views_total[1h])) by (product_category))
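If you chart these queries frequently, Prometheus recording rules can precompute them on each evaluation cycle. A hedged sketch of a rules file for the conversion-rate query (the group and rule names are assumptions):

```yaml
groups:
  - name: ecommerce_rules
    rules:
      - record: job:ecommerce_checkout_conversion_rate:5m
        expr: sum(rate(ecommerce_checkouts_total[5m])) / sum(rate(ecommerce_product_views_total[5m]))
```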
Push vs. Pull for Custom Metrics
Prometheus primarily uses a pull model where the Prometheus server scrapes metrics endpoints. However, sometimes you need to push metrics:
When to Use Push Gateway
- Short-lived jobs that may complete before scraping
- Batch jobs
- Systems behind firewalls without direct access
Example using the Push Gateway:
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

# Use a dedicated registry so only this job's metrics are pushed
registry = CollectorRegistry()
job_completion = Counter('batch_job_completions_total',
                         'Number of completed batch jobs',
                         registry=registry)

# Do some work
job_completion.inc()

# Push to Pushgateway
push_to_gateway('localhost:9091', job='batch_processor', registry=registry)
Implementing a Custom Collector
Sometimes you need to collect metrics from systems that don't support Prometheus directly. You can implement a custom collector:
from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily, REGISTRY

class CustomCollector(object):
    def collect(self):
        # Yield metrics
        c = CounterMetricFamily('my_custom_counter', 'Description of counter', labels=['label1'])
        c.add_metric(['value1'], 15)
        yield c
        g = GaugeMetricFamily('my_custom_gauge', 'Description of gauge', labels=['label1'])
        g.add_metric(['value1'], 12.3)
        yield g

# Register the custom collector
REGISTRY.register(CustomCollector())
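To sanity-check a collector like this without standing up a server, you can register it on a fresh CollectorRegistry and read the values back (the class is repeated here so the snippet runs on its own; note that the Python client exposes counter samples with a _total suffix):

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

class CustomCollector:
    def collect(self):
        c = CounterMetricFamily('my_custom_counter', 'Description of counter', labels=['label1'])
        c.add_metric(['value1'], 15)
        yield c
        g = GaugeMetricFamily('my_custom_gauge', 'Description of gauge', labels=['label1'])
        g.add_metric(['value1'], 12.3)
        yield g

registry = CollectorRegistry()
registry.register(CustomCollector())

# Counter samples carry the _total suffix in the exposition format
counter_value = registry.get_sample_value('my_custom_counter_total', {'label1': 'value1'})
gauge_value = registry.get_sample_value('my_custom_gauge', {'label1': 'value1'})
```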
Working with Exporters
When you can't instrument your application directly, use or build an exporter that converts metrics from one format to Prometheus format.
Here's a simple example of a custom exporter for a legacy API:
import requests
import time
from prometheus_client import start_http_server, Gauge, Counter

# Define metrics
USERS_ONLINE = Gauge('legacy_users_online', 'Number of users currently online')
API_ERRORS = Counter('legacy_api_errors_total', 'Number of legacy API errors')

def scrape_legacy_api():
    try:
        # Call legacy API
        response = requests.get('http://legacy-service/stats', timeout=5)
        # Check the status before parsing the body
        if not response.ok:
            API_ERRORS.inc()
            return
        data = response.json()
        # Update Prometheus metrics
        USERS_ONLINE.set(data['active_users'])
    except Exception:
        API_ERRORS.inc()

# Start server
start_http_server(8000)

# Main loop
while True:
    scrape_legacy_api()
    time.sleep(15)  # Scrape every 15 seconds
Advanced Custom Metrics
Let's look at some more advanced metrics concepts:
Multi-process Metrics Collection
When an application runs as multiple processes (for example, Gunicorn or uWSGI workers), each process holds its own metric values, so the Python client provides a multiprocess mode that aggregates them via files on disk:
# PROMETHEUS_MULTIPROC_DIR must be set in the environment *before*
# prometheus_client is imported, e.g.:
#   export PROMETHEUS_MULTIPROC_DIR=/tmp/prom_metrics
from prometheus_client import Counter, CollectorRegistry, multiprocess, start_http_server

# Worker processes create and update metrics as usual; samples are
# written to files in PROMETHEUS_MULTIPROC_DIR
c = Counter('my_counter', 'My counter help')
c.inc()

# The process serving /metrics aggregates all workers' files
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
start_http_server(8000, registry=registry)
Metrics with Timestamps
Prometheus normally stamps every sample with the scrape time, and direct instrumentation in the Python client does not accept custom timestamps. A gauge can, however, be set to the current Unix time, which is useful for "last successful run" metrics:
from prometheus_client import Gauge

g = Gauge('my_job_last_success_unixtime', 'Last time the job succeeded')
g.set_to_current_time()

For the rare cases where a sample genuinely carries its own timestamp (for example, when proxying metrics from another system), a custom collector can attach one explicitly:
from prometheus_client.core import GaugeMetricFamily

g = GaugeMetricFamily('my_custom_gauge', 'Description', labels=['label'])
g.add_metric(['another_value'], 15, timestamp=1623185425)  # Unix seconds
Summary
Custom metric collection in Prometheus provides a powerful way to gain deep insights into your applications and infrastructure. In this guide, we've covered:
- The four types of Prometheus metrics: Counter, Gauge, Histogram, and Summary
- How to implement custom metrics in various programming languages
- Best practices for naming, labeling, and documenting metrics
- Real-world examples of custom metrics for business and technical monitoring
- Advanced topics like the Push Gateway, custom collectors, and exporters
By implementing custom metrics, you can:
- Track business-relevant indicators
- Measure technical performance
- Create comprehensive monitoring dashboards
- Set up meaningful alerts
Additional Resources
- Official Prometheus Documentation on Client Libraries
- Prometheus Best Practices Guide
- PromQL Cheat Sheet
Exercises
1. Basic Instrumentation: Add custom metrics to an existing application to track:
- Number of API requests
- Response times
- Error rates
2. Business Metrics: Design and implement metrics for a fictional e-commerce site that would help answer these questions:
- What's the conversion rate?
- Which products are most viewed?
- When do we experience the most traffic?
3. Grafana Dashboard: Create a Grafana dashboard showing your custom metrics with:
- A graph of request rates over time
- A heatmap of request durations
- A gauge showing current active users
- A table of top endpoints by request count
4. Custom Exporter: Build a simple exporter that collects system information not available through the node exporter and exposes it as Prometheus metrics.