Grafana Tempo
Introduction
Grafana Tempo is a high-scale, minimal-dependency distributed tracing backend developed by Grafana Labs. It's designed to be cost-effective and easy to operate, making it an excellent choice for organizations of all sizes looking to adopt distributed tracing. Tempo integrates seamlessly with Prometheus metrics and Loki logs, rounding out the Prometheus ecosystem into a complete observability solution.
Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. Traces help you follow a request as it travels through different services, making it easier to identify performance bottlenecks and troubleshoot issues.
Key Features of Grafana Tempo
- Seamless Integration: Works with the broader Prometheus ecosystem
- Cost-Efficient Storage: Uses object storage (S3, GCS, Azure Blob Storage) for trace data
- Trace Discovery: Allows finding traces by service name, duration, and other attributes
- Compatibility: Supports multiple tracing protocols (Jaeger, Zipkin, OpenTelemetry, OpenCensus)
- Exemplars Support: Connects metrics and logs to corresponding traces
How Tempo Fits in the Prometheus Ecosystem
In the Prometheus ecosystem, Tempo complements the existing tools:
- Prometheus collects and stores metrics data
- Loki handles logs collection and storage
- Tempo manages distributed traces
- Grafana provides a unified dashboard for all three data types
This combination is often referred to as the "Three Pillars of Observability" (metrics, logs, and traces).
Getting Started with Grafana Tempo
Prerequisites
- Docker and Docker Compose installed
- Basic understanding of distributed systems
- Familiarity with Prometheus concepts
Setting Up Tempo Locally
Let's set up a basic Tempo instance using Docker Compose. Create a file named docker-compose.yml with the following content:

```yaml
version: '3'
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"   # tempo
      - "4317:4317"   # otlp grpc
      - "4318:4318"   # otlp http
      - "9411:9411"   # zipkin
  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    ports:
      - "3000:3000"
```
Now, create a basic Tempo configuration file named tempo.yaml:

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
    zipkin:
      endpoint: 0.0.0.0:9411
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

compactor:
  compaction:
    block_retention: 48h
```
Finally, create a Grafana datasource configuration file grafana-datasources.yaml:

```yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    orgId: 1
    url: http://tempo:3200
    basicAuth: false
    isDefault: true
    version: 1
    editable: false
```
Start the services with:
docker-compose up -d
Now you can access Grafana at http://localhost:3000 (default credentials: admin/admin) with Tempo already configured as a data source.
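To verify the stack is up, Tempo exposes a /ready endpoint on its HTTP port; this is a quick smoke test, not a required step:

docker-compose ps
curl http://localhost:3200/ready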
Instrumenting Applications for Tracing
For Tempo to be useful, your applications need to be instrumented to emit traces. Let's look at how to instrument a simple Node.js application using OpenTelemetry:
First, install Express and the required OpenTelemetry packages:

npm install express @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
Create a file named tracing.js:

```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  // Export spans to Tempo's OTLP HTTP receiver (port 4318 from tempo.yaml)
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  }),
  // Automatically instrument common libraries (http, express, and so on)
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
```
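One optional addition, if you want buffered spans flushed before the process exits: the NodeSDK provides a shutdown() method you can hook to a termination signal. A minimal sketch to append to tracing.js:

```javascript
// Flush buffered spans and release exporter resources on shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((err) => console.error('Error terminating tracing', err))
    .finally(() => process.exit(0));
});
```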
Create a simple Express application in app.js:

```javascript
// Load tracing first so instrumentation hooks into modules as they load
require('./tracing');

const express = require('express');
const app = express();
const port = 3001;

app.get('/', async (req, res) => {
  // Simulate work
  await new Promise(resolve => setTimeout(resolve, 100));
  res.send('Hello World!');
});

app.get('/api', async (req, res) => {
  // Simulate a database query
  await new Promise(resolve => setTimeout(resolve, 200));
  // Simulate an external API call (global fetch requires Node.js 18+)
  await fetch('https://jsonplaceholder.typicode.com/todos/1');
  res.json({ message: 'API response' });
});

app.listen(port, () => {
  console.log(`Example app listening at http://localhost:${port}`);
});
```
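Start the app and generate some traffic. The OTEL_SERVICE_NAME environment variable is the standard OpenTelemetry way to set the service name; the queries below assume "express-app":

OTEL_SERVICE_NAME=express-app node app.js
curl http://localhost:3001/
curl http://localhost:3001/api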
Once the application is running and has served a few requests, the traces will be visible in Grafana.
Querying Traces in Grafana
- Open Grafana at http://localhost:3000
- Navigate to Explore (compass icon in the left sidebar)
- Select "Tempo" as the data source
- Search for traces using:
  - Service name
  - Operation name
  - Duration
  - Tags/attributes
An example TraceQL query to find traces longer than 500ms for the "express-app" service:

{ resource.service.name = "express-app" && duration > 500ms }
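Assuming your instrumentation marks failed spans with an error status, a similar query surfaces failing requests:

{ resource.service.name = "express-app" && status = error }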
Connecting Metrics and Traces with Exemplars
One of the most powerful features of the Prometheus ecosystem is the ability to connect metrics, logs, and traces. This is done through "exemplars".
Exemplars are sample trace IDs embedded within metrics that allow you to jump from a metric spike directly to the traces that were recorded during that time period.
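In the OpenMetrics exposition format, for example, an exemplar rides along after a # on a histogram bucket sample (the values and trace ID here are illustrative):

```text
http_request_duration_seconds_bucket{le="0.5"} 1437 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.421
```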
Here's how to set up Prometheus to record exemplars. Exemplar storage is a feature flag: start Prometheus with --enable-feature=exemplar-storage, and make sure the instrumented application exposes its metrics with exemplars attached (as in the OpenMetrics sample above). The scrape configuration itself needs nothing exemplar-specific:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'instrumented-app'
    static_configs:
      - targets: ['app:3001']
```
When viewing metrics in Grafana, you'll see small diamonds on your metric lines indicating exemplars. Clicking on these will take you directly to the associated trace in Tempo.
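For that click-through to work, the Prometheus data source in Grafana must point exemplars at Tempo. A provisioning sketch, assuming the exemplar label is named trace_id (it must match whatever label your instrumentation attaches) and the Tempo data source uid from earlier:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        # "trace_id" must match the exemplar label emitted by your app
        - name: trace_id
          datasourceUid: tempo   # the uid set in grafana-datasources.yaml
```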
Practical Use Cases
1. Performance Troubleshooting
When users report slow responses:
- Check metrics dashboards to identify when slowdowns occur
- Use exemplars to jump to traces during those periods
- Analyze traces to see which services or operations are taking too long
- Optimize the identified bottlenecks
2. Error Investigation
When errors occur:
- Search for traces with error status
- Examine the full request path to see where errors originated
- View associated logs for more context about the error
- Fix the root cause based on the comprehensive view
3. Capacity Planning
For capacity planning:
- Analyze trace data to understand service dependencies
- Identify frequently called services that might need scaling
- Find underutilized services that could be scaled down
- Optimize resource allocation based on actual usage patterns
Best Practices
- Use Service and Span Names Consistently: Adopt a consistent naming convention for services and spans to make querying easier.
- Add Relevant Attributes: Include meaningful attributes (tags) on your spans to make filtering more effective.
- Set Appropriate Sampling Rates: In high-volume systems, use sampling to reduce costs while still capturing representative traces (see the sketch after this list).
- Integrate with Metrics and Logs: Connect your tracing data with metrics and logs for a complete view of your system.
- Propagate Trace Context: Ensure trace context is properly propagated between services, especially across different technologies.
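A sketch of the sampling practice, using OpenTelemetry's built-in samplers (the 10% ratio is an arbitrary example; parent-based sampling keeps traces intact by honoring the caller's decision):

```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  // Keep ~10% of root traces; child spans follow their parent's decision
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1)
  }),
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
```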
Common Challenges and Solutions
| Challenge | Solution |
| --- | --- |
| Too many traces to store | Implement tail-based sampling to store only interesting traces |
| Missing spans in traces | Check that all services propagate trace context correctly |
| High cardinality issues | Be careful with high-cardinality attributes in span tags |
| Integration with legacy systems | Use OpenTelemetry Collectors to bridge trace data from legacy systems |
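Tail-based sampling from the first row is usually implemented in an OpenTelemetry Collector sitting between your services and Tempo. A sketch using the contrib distribution's tail_sampling processor (policy names and thresholds are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans this long before deciding
    policies:
      - name: keep-errors         # always keep traces that contain errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow           # always keep traces slower than 500 ms
        type: latency
        latency: { threshold_ms: 500 }
```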
Summary
Grafana Tempo is a powerful distributed tracing system that integrates smoothly with the Prometheus ecosystem. With its minimal dependencies and cost-effective storage approach, it provides an accessible entry point into the world of distributed tracing.
By combining Tempo traces with Prometheus metrics and Loki logs, you can achieve a comprehensive observability solution that helps you understand, troubleshoot, and optimize your distributed systems.
Additional Resources
- Grafana Tempo Documentation
- OpenTelemetry Documentation
- Distributed Tracing with OpenTelemetry
- Grafana Tempo GitHub Repository
Exercises
- Set up Tempo, Prometheus, and Loki using Docker Compose for a complete observability stack.
- Instrument a simple application using OpenTelemetry and send traces to Tempo.
- Create a Grafana dashboard that combines metrics, logs, and traces.
- Implement sampling strategies for high-volume applications.
- Practice troubleshooting by intentionally introducing latency and errors, then use Tempo to diagnose the issues.