Prometheus Custom Collectors

Introduction

In your monitoring journey with Prometheus, you'll eventually encounter scenarios where the default exporters don't provide the specific metrics you need. This is where Custom Collectors come into play. Custom collectors allow you to define and implement your own metrics collection logic, enabling you to monitor virtually any aspect of your applications or systems.

In this guide, we'll explore how to create custom collectors in Prometheus, understand their internal workings, and implement practical examples that demonstrate their real-world applications.

What Are Custom Collectors?

A collector in Prometheus is a component responsible for gathering specific metrics. While Prometheus provides many ready-to-use exporters (like the Node Exporter for hardware and OS metrics), custom collectors let you define exactly what and how to measure.

Custom collectors implement the Collector interface, which requires methods to:

Describe the metrics being collected
Collect the current values of those metrics

When to Use Custom Collectors

You might need custom collectors when:

You need to monitor a system without an existing exporter
You want to instrument your application with business-specific metrics
You need to collect metrics from multiple sources and present them in a unified way
The default exporters don't provide the granularity or specific metrics you need

Creating Basic Custom Collectors

Let's start by understanding how to implement a custom collector in Go, using the official client_golang library that all examples in this guide rely on.

The Collector Interface

In the Go client library (client_golang), the Collector interface consists of two methods:

type Collector interface {
    // Describe sends the super-set of all possible descriptors of metrics
    Describe(chan<- *Desc)
    
    // Collect is called by the Prometheus registry when collecting metrics
    Collect(chan<- Metric)
}
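To make the contract concrete, here is a toy model of what happens on each scrape. It is a sketch only: toyMetric, toyCollector, constCollector, and gather are hypothetical stand-ins written for this illustration, not client_golang types, and the real registry is more involved (for example, it also validates what collectors send).

```go
package main

import "fmt"

// toyMetric and toyCollector are simplified stand-ins for illustration only;
// they are NOT the client_golang types.
type toyMetric struct {
	Name  string
	Value float64
}

type toyCollector interface {
	Collect(ch chan<- toyMetric)
}

// constCollector emits one fixed sample per scrape.
type constCollector struct {
	name  string
	value float64
}

func (c constCollector) Collect(ch chan<- toyMetric) {
	ch <- toyMetric{Name: c.name, Value: c.value}
}

// gather mimics what a registry does on each scrape: it calls Collect on
// every registered collector and drains the channel they write to.
func gather(collectors []toyCollector) []toyMetric {
	ch := make(chan toyMetric)
	go func() {
		defer close(ch)
		for _, c := range collectors {
			c.Collect(ch)
		}
	}()
	var out []toyMetric
	for m := range ch {
		out = append(out, m)
	}
	return out
}

func main() {
	metrics := gather([]toyCollector{
		constCollector{"queue_depth", 3},
		constCollector{"workers_busy", 2},
	})
	for _, m := range metrics {
		fmt.Printf("%s %g\n", m.Name, m.Value)
	}
}
```

The point to take away is that Collect does not return values; it pushes them into a channel owned by the caller, and it runs again on every scrape.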

A Simple Example: System Uptime Collector

Let's create a custom collector that reports system uptime (which might not be available in your environment through standard exporters):

package main

import (
    "fmt"
    "log"
    "net/http"
    "os/exec"
    "strconv"
    "strings"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// UptimeCollector implements the Collector interface
type UptimeCollector struct {
    uptimeMetric *prometheus.Desc
}

// NewUptimeCollector creates a new UptimeCollector
func NewUptimeCollector() *UptimeCollector {
    return &UptimeCollector{
        uptimeMetric: prometheus.NewDesc(
            "system_uptime_seconds",
            "Current system uptime in seconds",
            nil, nil,
        ),
    }
}

// Describe implements the prometheus.Collector interface
func (c *UptimeCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.uptimeMetric
}

// Collect implements the prometheus.Collector interface
func (c *UptimeCollector) Collect(ch chan<- prometheus.Metric) {
    // Read /proc/uptime via cat (Linux only); the first field is the uptime in seconds
    cmd := exec.Command("cat", "/proc/uptime")
    output, err := cmd.Output()
    if err != nil {
        log.Printf("Error reading /proc/uptime: %v", err)
        return
    }
    
    // Parse the output to get uptime in seconds
    uptimeString := strings.Split(string(output), " ")[0]
    uptime, err := strconv.ParseFloat(uptimeString, 64)
    if err != nil {
        log.Printf("Error parsing uptime: %v", err)
        return
    }
    
    // Create a metric with the uptime value
    ch <- prometheus.MustNewConstMetric(
        c.uptimeMetric,
        prometheus.GaugeValue,
        uptime,
    )
}

func main() {
    // Create a new registry
    reg := prometheus.NewRegistry()
    
    // Create and register our custom collector
    collector := NewUptimeCollector()
    reg.MustRegister(collector)
    
    // Expose metrics on /metrics endpoint
    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    fmt.Println("Starting server on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}

When you run this code and access http://localhost:8080/metrics, you'll see output similar to:

# HELP system_uptime_seconds Current system uptime in seconds
# TYPE system_uptime_seconds gauge
system_uptime_seconds 345678.45
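Starting a cat process on every scrape works, but the same value can be read without a subprocess. The following is a minimal sketch of that variation, assuming a Linux host; parseUptime and readUptime are hypothetical helper names, and the parsing is split out so it can be tested without /proc.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseUptime extracts the first field of /proc/uptime content,
// which is the uptime in seconds.
func parseUptime(content string) (float64, error) {
	fields := strings.Fields(content)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty uptime content")
	}
	return strconv.ParseFloat(fields[0], 64)
}

// readUptime reads /proc/uptime directly (Linux only).
func readUptime() (float64, error) {
	data, err := os.ReadFile("/proc/uptime")
	if err != nil {
		return 0, err
	}
	return parseUptime(string(data))
}

func main() {
	uptime, err := readUptime()
	if err != nil {
		fmt.Println("could not read uptime:", err)
		return
	}
	fmt.Printf("uptime: %.2f seconds\n", uptime)
}
```

Inside a collector, the value returned by readUptime would take the place of the parsed command output before it is passed to MustNewConstMetric.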

Understanding Metric Types in Custom Collectors

When creating custom collectors, you'll need to choose the appropriate metric type for each measurement. Prometheus supports four main metric types:

Counter: A value that only increases (e.g., number of requests processed)
Gauge: A value that can go up and down (e.g., current memory usage)
Histogram: Samples observations and counts them in configurable buckets (e.g., request durations)
Summary: Similar to a histogram, but computes configurable quantiles on the client side (e.g., 95th percentile of request durations)

Let's explore a more complex example that uses different metric types.

Advanced Example: Database Connection Pool Collector

Monitoring a database connection pool is a common requirement. Let's create a custom collector for a fictional database pool:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// DBPoolCollector collects metrics about a database connection pool
type DBPoolCollector struct {
    activeConnections  *prometheus.Desc
    maxConnections     *prometheus.Desc
    connectionsCreated *prometheus.Desc
    queryDuration      *prometheus.Desc
}

// NewDBPoolCollector creates a new DBPoolCollector
func NewDBPoolCollector() *DBPoolCollector {
    return &DBPoolCollector{
        activeConnections: prometheus.NewDesc(
            "db_pool_connections_active",
            "The number of active connections in the database pool",
            []string{"db_name"}, nil,
        ),
        maxConnections: prometheus.NewDesc(
            "db_pool_connections_max",
            "The maximum number of connections allowed",
            []string{"db_name"}, nil,
        ),
        connectionsCreated: prometheus.NewDesc(
            "db_pool_connections_created_total",
            "The total number of connections created",
            []string{"db_name"}, nil,
        ),
        queryDuration: prometheus.NewDesc(
            "db_pool_query_duration_seconds",
            "The duration of database queries in seconds",
            []string{"db_name", "query_type"}, nil,
        ),
    }
}

// Describe implements the prometheus.Collector interface
func (c *DBPoolCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.activeConnections
    ch <- c.maxConnections
    ch <- c.connectionsCreated
    ch <- c.queryDuration
}

// Collect implements the prometheus.Collector interface
func (c *DBPoolCollector) Collect(ch chan<- prometheus.Metric) {
    // In a real scenario, these would come from your actual DB pool
    // For this example, we'll simulate the values
    
    // Simulate active connections (gauge)
    activeConns := float64(rand.Intn(100))
    ch <- prometheus.MustNewConstMetric(
        c.activeConnections,
        prometheus.GaugeValue,
        activeConns,
        "production_db", // label value for db_name
    )
    
    // Maximum connections (gauge)
    ch <- prometheus.MustNewConstMetric(
        c.maxConnections,
        prometheus.GaugeValue,
        200,
        "production_db",
    )
    
    // Total connections created (counter)
    // In a real implementation, this would be a cumulative value
    ch <- prometheus.MustNewConstMetric(
        c.connectionsCreated,
        prometheus.CounterValue,
        1000 + float64(rand.Intn(100)),
        "production_db",
    )
    
    // Query durations for different query types (histogram data)
    // In a real scenario, you would have actual timing data
    queryTypes := []string{"select", "insert", "update", "delete"}
    
    for _, queryType := range queryTypes {
        var baseDuration float64
        switch queryType {
        case "select":
            baseDuration = 0.05
        case "insert":
            baseDuration = 0.1
        case "update":
            baseDuration = 0.15
        case "delete":
            baseDuration = 0.12
        }
        
        // Add some randomness to the durations
        duration := baseDuration + rand.Float64()*0.1
        
        ch <- prometheus.MustNewConstMetric(
            c.queryDuration,
            prometheus.GaugeValue, // In a real implementation, you might use a Histogram
            duration,
            "production_db", queryType,
        )
    }
}

func main() {
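    // Seed the random number generator. Since Go 1.20 the global generator is
    // seeded automatically and rand.Seed is deprecated, so this call is optional.
    rand.Seed(time.Now().UnixNano())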
    
    // Create a new registry
    reg := prometheus.NewRegistry()
    
    // Create and register our custom collector
    collector := NewDBPoolCollector()
    reg.MustRegister(collector)
    
    // Expose metrics on /metrics endpoint
    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    fmt.Println("Starting server on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}

When this code runs, it will expose metrics like:

# HELP db_pool_connections_active The number of active connections in the database pool
# TYPE db_pool_connections_active gauge
db_pool_connections_active{db_name="production_db"} 87

# HELP db_pool_connections_max The maximum number of connections allowed
# TYPE db_pool_connections_max gauge
db_pool_connections_max{db_name="production_db"} 200

# HELP db_pool_connections_created_total The total number of connections created
# TYPE db_pool_connections_created_total counter
db_pool_connections_created_total{db_name="production_db"} 1042

# HELP db_pool_query_duration_seconds The duration of database queries in seconds
# TYPE db_pool_query_duration_seconds gauge
db_pool_query_duration_seconds{db_name="production_db",query_type="select"} 0.123
db_pool_query_duration_seconds{db_name="production_db",query_type="insert"} 0.187
db_pool_query_duration_seconds{db_name="production_db",query_type="update"} 0.226
db_pool_query_duration_seconds{db_name="production_db",query_type="delete"} 0.144
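The example reports query duration as a gauge, which only shows the latest value. A histogram instead reports a count, a sum, and cumulative bucket counts. The sketch below shows how raw timings map onto those three parts, using only the standard library; bucketize is a hypothetical helper and the timings and bucket bounds are made up. The returned values have the shapes that client_golang's MustNewConstHistogram expects (a uint64 count, a float64 sum, and a map[float64]uint64 of buckets).

```go
package main

import (
	"fmt"
	"sort"
)

// bucketize turns raw observations into the three parts of a Prometheus
// histogram: total count, sum, and cumulative counts per upper bound.
func bucketize(observations, upperBounds []float64) (count uint64, sum float64, buckets map[float64]uint64) {
	buckets = make(map[float64]uint64, len(upperBounds))
	for _, ub := range upperBounds {
		buckets[ub] = 0
	}
	for _, v := range observations {
		count++
		sum += v
		// Buckets are cumulative: an observation counts toward every
		// bucket whose upper bound is >= the observed value.
		for _, ub := range upperBounds {
			if v <= ub {
				buckets[ub]++
			}
		}
	}
	return count, sum, buckets
}

func main() {
	durations := []float64{0.05, 0.3, 0.7, 2.0} // hypothetical query timings in seconds
	bounds := []float64{0.1, 0.5, 1.0}          // hypothetical bucket upper bounds
	count, sum, buckets := bucketize(durations, bounds)

	sort.Float64s(bounds)
	for _, ub := range bounds {
		fmt.Printf("le=%g: %d\n", ub, buckets[ub])
	}
	fmt.Printf("count=%d sum=%g\n", count, sum)
}
```

The 2.0-second observation falls outside every listed bound; it is still included in the count and sum, which is how the implicit +Inf bucket is represented.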

Using Labels in Custom Collectors

As you've seen in the previous example, labels provide a powerful way to add dimensions to your metrics. They allow you to:

Categorize metrics (e.g., by database name, server instance, or query type)
Query and filter metrics in Prometheus expressions
Create more targeted alerts and dashboards

When designing your custom collectors, carefully consider which labels to include:

Use labels for dimensions that are important for querying and alerting
Avoid high-cardinality labels (e.g., user IDs or timestamps) as they can impact Prometheus performance
Keep label names and values consistent across related metrics
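To see why cardinality matters, it helps to estimate how many time series a label set can produce. The following is a rough sketch, assuming every combination of label values can occur; seriesCount is a hypothetical helper and the label values are illustrative.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values each label can take.
func seriesCount(labelValues map[string][]string) int {
	total := 1
	for _, values := range labelValues {
		total *= len(values)
	}
	return total
}

func main() {
	bounded := map[string][]string{
		"db_name":    {"production_db", "staging_db"},
		"query_type": {"select", "insert", "update", "delete"},
	}
	fmt.Println(seriesCount(bounded)) // 2 * 4 = 8
}
```

Adding a label such as a user ID with 10,000 distinct values would multiply that figure to 80,000 series for a single metric, which is the kind of growth the guidance above is meant to prevent.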

Registering Custom Collectors

There are two main ways to register your custom collectors:

1. Register directly with a registry:

reg := prometheus.NewRegistry()
collector := NewMyCustomCollector()
reg.MustRegister(collector)

2. Use the default registry:

collector := NewMyCustomCollector()
prometheus.MustRegister(collector)

Using a custom registry is useful when you want to expose different sets of metrics on different endpoints or when you want to control exactly which metrics are exposed.

Real-World Applications

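Let's look at some practical scenarios where custom collectors are valuable. The snippets below show only the struct and the Collect method; the constructor, the Describe method, and the imports follow the same pattern as the earlier examples.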

1. External API Monitoring

// APIHealthCollector monitors external API health and response times
type APIHealthCollector struct {
    apiHealth       *prometheus.Desc
    apiResponseTime *prometheus.Desc
}

func (c *APIHealthCollector) Collect(ch chan<- prometheus.Metric) {
    // Check multiple APIs
    apis := map[string]string{
        "payment_gateway": "https://payment.example.com/health",
        "auth_service":    "https://auth.example.com/health",
        "data_service":    "https://data.example.com/health",
    }
    
    for name, url := range apis {
        // Measure response time
        startTime := time.Now()
        resp, err := http.Get(url)
        duration := time.Since(startTime).Seconds()
        
        // Record response time
        ch <- prometheus.MustNewConstMetric(
            c.apiResponseTime,
            prometheus.GaugeValue,
            duration,
            name,
        )
        
        // Record health status (1 = healthy, 0 = unhealthy)
        var health float64 = 0
        if err == nil {
            if resp.StatusCode == http.StatusOK {
                health = 1
            }
            resp.Body.Close() // release the connection
        }
        
        ch <- prometheus.MustNewConstMetric(
            c.apiHealth,
            prometheus.GaugeValue,
            health,
            name,
        )
    }
}
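Because a scrape waits for Collect to return, one slow API can stall the whole endpoint. A minimal sketch of a bounded health check follows, assuming a 2-second limit is acceptable; probeClient, checkAPI, and newStubServer are hypothetical names, and the stub servers exist only so the example runs without real endpoints.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeClient bounds every health check; the 2-second limit is an assumed value.
var probeClient = &http.Client{Timeout: 2 * time.Second}

// checkAPI reports whether url answered with 200 OK and how long the
// request took in seconds. A failed or timed-out request counts as unhealthy.
func checkAPI(url string) (healthy bool, seconds float64) {
	start := time.Now()
	resp, err := probeClient.Get(url)
	seconds = time.Since(start).Seconds()
	if err != nil {
		return false, seconds
	}
	// Drain and close the body so the connection can be reused.
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.StatusCode == http.StatusOK, seconds
}

// newStubServer starts a local test server that always replies with status.
func newStubServer(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	up := newStubServer(http.StatusOK)
	defer up.Close()
	down := newStubServer(http.StatusServiceUnavailable)
	defer down.Close()

	for name, url := range map[string]string{"up": up.URL, "down": down.URL} {
		healthy, seconds := checkAPI(url)
		fmt.Printf("%s healthy=%t duration=%.4fs\n", name, healthy, seconds)
	}
}
```

In a collector, the two return values would feed the health and response-time gauges. The timeout should stay comfortably below the scrape timeout configured in Prometheus, since the checks here run one after another.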

2. File System Monitoring

// FileSystemCollector monitors specific directories
type FileSystemCollector struct {
    directorySize *prometheus.Desc
    fileCount     *prometheus.Desc
}

func (c *FileSystemCollector) Collect(ch chan<- prometheus.Metric) {
    // Monitor critical directories
    directories := []string{"/var/log", "/tmp", "/var/lib/mysql"}
    
    for _, dir := range directories {
        // Get directory size and file count
        var size int64 = 0
        var count int64 = 0
        
        err := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if !info.IsDir() {
                size += info.Size()
                count++
            }
            return nil
        })
        
        if err == nil {
            ch <- prometheus.MustNewConstMetric(
                c.directorySize,
                prometheus.GaugeValue,
                float64(size),
                dir,
            )
            
            ch <- prometheus.MustNewConstMetric(
                c.fileCount,
                prometheus.GaugeValue,
                float64(count),
                dir,
            )
        }
    }
}

3. Business Metrics Collector

Business metrics are often overlooked but can provide valuable insights:

// BusinessMetricsCollector collects business-related metrics
type BusinessMetricsCollector struct {
    activeUsers       *prometheus.Desc
    conversionRate    *prometheus.Desc
    averageOrderValue *prometheus.Desc
}

func (c *BusinessMetricsCollector) Collect(ch chan<- prometheus.Metric) {
    // In a real application, these would come from your database or analytics service
    
    // Simulate active users count from different regions
    regions := []string{"north_america", "europe", "asia", "other"}
    for _, region := range regions {
        var baseUsers float64
        switch region {
        case "north_america":
            baseUsers = 5000
        case "europe":
            baseUsers = 3500
        case "asia":
            baseUsers = 4200
        case "other":
            baseUsers = 1800
        }
        
        activeUsers := baseUsers + rand.Float64()*500
        
        ch <- prometheus.MustNewConstMetric(
            c.activeUsers,
            prometheus.GaugeValue,
            activeUsers,
            region,
        )
    }
    
    // Conversion rates for different product categories
    categories := []string{"electronics", "clothing", "home_goods", "food"}
    for _, category := range categories {
        var baseRate float64
        switch category {
        case "electronics":
            baseRate = 0.032
        case "clothing":
            baseRate = 0.045
        case "home_goods":
            baseRate = 0.028
        case "food":
            baseRate = 0.067
        }
        
        conversionRate := baseRate + (rand.Float64()-0.5)*0.01
        
        ch <- prometheus.MustNewConstMetric(
            c.conversionRate,
            prometheus.GaugeValue,
            conversionRate,
            category,
        )
        
        // Average order values
        var baseOrderValue float64
        switch category {
        case "electronics":
            baseOrderValue = 250
        case "clothing":
            baseOrderValue = 85
        case "home_goods":
            baseOrderValue = 120
        case "food":
            baseOrderValue = 45
        }
        
        orderValue := baseOrderValue * (1 + (rand.Float64()-0.5)*0.2)
        
        ch <- prometheus.MustNewConstMetric(
            c.averageOrderValue,
            prometheus.GaugeValue,
            orderValue,
            category,
        )
    }
}

Best Practices for Custom Collectors

When implementing custom collectors, follow these best practices:

Naming Conventions: Follow Prometheus naming conventions
- Use lowercase with underscores (snake_case)
- Include base units as a suffix (e.g., _seconds, _bytes) and end counter names with _total
- Use consistent prefixes for related metrics
Performance Considerations:
- Keep metric collection lightweight; heavy operations can impact your application
- Implement caching for expensive operations
- Consider timeouts for external dependencies
Error Handling:
- Handle errors gracefully in the Collect method
- Log issues but don't block metric collection if one metric fails
- Provide fallback values when appropriate
Documentation:
- Add helpful descriptions to your metrics
- Document the meaning of labels
- Include unit information in the metric name or description
Testing:
- Write unit tests for your collectors
- Simulate edge cases and error conditions
- Test the performance impact of your collectors
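The caching advice above can be sketched with a small time-to-live wrapper around an expensive lookup. This is one possible approach, not the only one; cachedValue, newCachedValue, and defaultTTL are hypothetical names, and the 30-second window is an assumed value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultTTL is an assumed freshness window; tune it to your scrape interval.
const defaultTTL = 30 * time.Second

// cachedValue remembers the result of an expensive fetch for ttl, so that
// frequent scrapes do not repeat the work each time.
type cachedValue struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   func() (float64, error)
	value   float64
	expires time.Time
}

func newCachedValue(ttl time.Duration, fetch func() (float64, error)) *cachedValue {
	return &cachedValue{ttl: ttl, fetch: fetch}
}

// Get returns the cached value while it is fresh and refetches otherwise.
// On a fetch error the stale value is kept and the error is returned.
func (c *cachedValue) Get() (float64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expires) {
		return c.value, nil
	}
	v, err := c.fetch()
	if err != nil {
		return c.value, err
	}
	c.value = v
	c.expires = time.Now().Add(c.ttl)
	return v, nil
}

func main() {
	calls := 0
	rowCount := newCachedValue(defaultTTL, func() (float64, error) {
		calls++ // stands in for a slow query
		return 1234, nil
	})
	for i := 0; i < 3; i++ {
		v, _ := rowCount.Get()
		fmt.Println(v)
	}
	fmt.Println("fetches:", calls) // prints "fetches: 1"
}
```

A Collect method would call Get instead of running the expensive operation directly. The mutex matters because scrapes can overlap, so Collect may be called concurrently.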

Flow Diagram of a Custom Collector

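Custom collectors fit into the Prometheus ecosystem as follows: the Prometheus server scrapes the application's /metrics endpoint, the HTTP handler asks the registry to gather metrics, the registry calls Collect on each registered collector, and the resulting samples are returned to the server in the text exposition format.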

Summary

Custom collectors are a powerful feature of Prometheus that allow you to extend its monitoring capabilities to virtually any system or application. By implementing the Collector interface, you can create metrics tailored to your specific needs, whether they're technical metrics like system performance or business metrics like conversion rates.

In this guide, we've explored:

What custom collectors are and when to use them
How to implement the Collector interface
Different metric types and their appropriate uses
Using labels to add dimensions to your metrics
Real-world examples of custom collectors
Best practices for designing and implementing collectors

Custom collectors provide the flexibility you need to build a comprehensive monitoring solution that covers not just standard system metrics, but also application-specific and business-relevant metrics that can give you deeper insights into your systems.

Exercises

Create a custom collector that monitors the number of files in a specific directory and their total size.
Extend the database connection pool collector to include metrics about query errors and slow queries.
Implement a custom collector for a third-party API that your application depends on, tracking response times and error rates.
Create a collector that provides business metrics relevant to your application domain (e.g., user registrations, active sessions, or transaction values).
Design a custom collector that combines data from multiple sources into a single coherent set of metrics.

