
Distributed Tracing

Why Tracing Matters in Microservices

In a monolithic application, a stack trace tells you exactly what happened when a request failed. In a microservices architecture, a single user request may flow through 5, 10, or even 50 services. When something goes wrong, you need a way to follow that request across service boundaries.

Monolith: One stack trace tells the whole story
┌────────────────────────────────┐
│ → handleRequest() │
│ → validateInput() │
│ → queryDatabase() │
│ → processPayment() ← ERROR │
│ → sendNotification() │
└────────────────────────────────┘
Microservices: Request crosses multiple processes
┌──────────┐ ┌──────────┐ ┌──────────┐
│ API │──▶│ Order │──▶│ Payment │ ← ERROR
│ Gateway │ │ Service │ │ Service │
└──────────┘ └──────────┘ └──────────┘
│ │
│ ┌──────────┐
└─────────────────────▶│Inventory │
│ Service │
└──────────┘
Without tracing, which service caused the problem?
How long did each service take?
What was the call sequence?

Distributed tracing answers three critical questions:

  1. Where did the request go? (Which services did it touch?)
  2. How long did each service take? (Where is the bottleneck?)
  3. What went wrong? (Which service or operation failed?)

The Trace-Span Model

A trace represents the entire journey of a request through a distributed system. It consists of spans — each span represents a unit of work within a single service.

Trace Structure:
Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
┌──────────────────────────────────────────────────────────┐
│ │
│ Span A: API Gateway [450ms] │
│ ┌──────────────────────────────────────────────────┐ │
│ │ trace_id: 4bf92f... │ │
│ │ span_id: a1b2c3 │ │
│ │ parent: none (root span) │ │
│ │ service: api-gateway │ │
│ │ operation: GET /api/orders/123 │ │
│ │ duration: 450ms │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ├──▶ Span B: Auth Service [80ms] │
│ │ ┌────────────────────────────────────────┐ │
│ │ │ span_id: d4e5f6 │ │
│ │ │ parent: a1b2c3 │ │
│ │ │ service: auth-service │ │
│ │ │ operation: validateToken │ │
│ │ └────────────────────────────────────────┘ │
│ │ │
│ └──▶ Span C: Order Service [320ms] │
│ ┌────────────────────────────────────────┐ │
│ │ span_id: g7h8i9 │ │
│ │ parent: a1b2c3 │ │
│ │ service: order-service │ │
│ └────────────────────────────────────────┘ │
│ │ │
│ ├──▶ Span D: DB Query [200ms] │
│ │ span_id: j0k1l2 │
│ │ parent: g7h8i9 │
│ │ │
│ └──▶ Span E: Cache [15ms] │
│ span_id: m3n4o5 │
│ parent: g7h8i9 │
└──────────────────────────────────────────────────────────┘

Span Anatomy

Each span carries detailed information about the work it represents.

| Field          | Description                              | Example                                |
|----------------|------------------------------------------|----------------------------------------|
| Trace ID       | Unique identifier for the entire trace   | 4bf92f3577b34da6...                    |
| Span ID        | Unique identifier for this span          | a1b2c3d4e5f6                           |
| Parent Span ID | ID of the calling span                   | none (root) or parent ID               |
| Operation Name | The specific operation                   | GET /api/orders/123                    |
| Service Name   | Which service executed it                | order-service                          |
| Start Time     | When the span started                    | 2024-03-15T14:32:01.234Z               |
| Duration       | How long it took                         | 320ms                                  |
| Status         | Success or error                         | OK, ERROR                              |
| Attributes     | Key-value metadata                       | http.method=GET, db.statement=SELECT...|
| Events         | Timestamped annotations within the span  | Exceptions, log messages               |
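
To make the table concrete, the same fields can be modeled as a plain record. This is an illustrative sketch, not the real OpenTelemetry span type (the field names are simplified), but it shows how parent span IDs link spans into a tree:

```python
from dataclasses import dataclass, field
from typing import Optional

# Simplified model of a span's fields; the real OpenTelemetry span
# type is richer, but it carries the same core data.
@dataclass
class Span:
    trace_id: str                   # shared by every span in the trace
    span_id: str                    # unique to this span
    parent_span_id: Optional[str]   # None for the root span
    service: str
    operation: str
    start_time: str
    duration_ms: int
    status: str = "OK"
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

root = Span(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="a1b2c3",
    parent_span_id=None,            # root span has no parent
    service="api-gateway",
    operation="GET /api/orders/123",
    start_time="2024-03-15T14:32:01.234Z",
    duration_ms=450,
)
child = Span(
    trace_id=root.trace_id,         # same trace ID as the root
    span_id="d4e5f6",
    parent_span_id=root.span_id,    # parent link forms the tree
    service="auth-service",
    operation="validateToken",
    start_time="2024-03-15T14:32:01.250Z",
    duration_ms=80,
)
```

Every span in a trace shares one trace ID; the parent links are what let a backend like Jaeger reassemble the tree shown above.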

OpenTelemetry

OpenTelemetry (OTel) is the industry-standard, vendor-neutral framework for collecting traces, metrics, and logs. It provides APIs, SDKs, and tools for instrumenting applications.

OpenTelemetry Architecture:
┌──────────────────────────────────────────────────────┐
│ Your Application │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ OpenTelemetry SDK │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ │
│ │ │ Tracer │ │ Meter │ │ Log Provider │ │ │
│ │ │ Provider │ │ Provider │ │ │ │ │
│ │ └─────┬────┘ └─────┬────┘ └──────┬───────┘ │ │
│ │ │ │ │ │ │
│ │ ┌─────▼────────────▼─────────────▼───────┐ │ │
│ │ │ Exporters │ │ │
│ │ │ (OTLP, Jaeger, Prometheus, Console) │ │ │
│ │ └────────────────┬───────────────────────┘ │ │
│ └───────────────────┼───────────────────────────┘ │
└──────────────────────┼──────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ OTel Collector (optional) │
│ │
│ Receive → Process → Export │
│ (batch, filter, sample, transform) │
└──────────┬──────────┬──────────┬─────────────────────┘
│ │ │
▼ ▼ ▼
Jaeger Prometheus Loki
(Traces) (Metrics) (Logs)
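
The Collector's receive → process → export pipeline is defined in a YAML file. The sketch below is a minimal illustrative configuration, assuming a local setup where a Jaeger instance accepts OTLP at `jaeger:4317`; endpoint values and the exporter name are assumptions, not a canonical config:

```yaml
# otel-collector-config.yaml (illustrative sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:                       # batch spans before export
  probabilistic_sampler:
    sampling_percentage: 10    # keep ~10% of traces

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317      # assumed Jaeger OTLP endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler]
      exporters: [otlp/jaeger]
```

Running sampling and batching in the Collector rather than in each application keeps instrumentation overhead low and centralizes the configuration.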

Instrumenting Your Application

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

# Configure the tracer
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)

# Export traces to Jaeger via OTLP
otlp_exporter = OTLPSpanExporter(endpoint="http://jaeger-collector:4317")
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Auto-instrument frameworks and libraries
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()

# Manual instrumentation for custom logic
def process_order(order_id: str):
    with tracer.start_as_current_span(
        "process_order",
        attributes={
            "order.id": order_id,
            "order.source": "web",
        },
    ) as span:
        # Validate order
        with tracer.start_as_current_span("validate_order"):
            validate(order_id)

        # Check inventory
        with tracer.start_as_current_span("check_inventory") as inv_span:
            available = check_inventory(order_id)
            inv_span.set_attribute("inventory.available", available)
            if not available:
                span.set_status(trace.StatusCode.ERROR, "Out of stock")
                span.add_event("inventory_check_failed", {
                    "order_id": order_id,
                    "reason": "out_of_stock",
                })
                return None

        # Process payment
        with tracer.start_as_current_span("process_payment"):
            payment_result = charge_payment(order_id)
            span.set_attribute("payment.status", payment_result.status)

        return order_id

Context Propagation

For tracing to work across service boundaries, the trace context (trace ID, span ID, flags) must be propagated from service to service. This typically happens via HTTP headers.

Context Propagation via HTTP Headers:
Service A Service B
┌─────────────────────┐ ┌─────────────────────┐
│ Span: processOrder │ │ Span: chargePayment │
│ trace_id: abc123 │ │ trace_id: abc123 │
│ span_id: span_A │ │ span_id: span_B │
│ │ HTTP │ parent: span_A │
│ │──────────▶│ │
│ │ Headers: │ │
│ │ traceparent: 00-abc123-span_A-01│
└─────────────────────┘ └─────────────────────┘
W3C Trace Context Header Format:
traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
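
A quick way to internalize the format is to parse it. This is a minimal sketch, not a full implementation of the W3C spec (which also covers the tracestate header and version negotiation):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                 # "00" is the current version
        "trace_id": trace_id,               # 16 bytes, hex-encoded (32 chars)
        "parent_span_id": parent_span_id,   # 8 bytes, hex-encoded (16 chars)
        "sampled": int(flags, 16) & 0x01 == 0x01,  # bit 0 = sampled flag
    }

ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
# ctx["trace_id"] is "4bf92f3577b34da6a3ce929d0e0e4736"
# ctx["sampled"] is True (trace-flags 01 means "sampled")
```

A receiving service uses the trace ID to join the same trace and the parent span ID as the parent of the spans it creates.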

Propagation Formats

| Format            | Standard                   | Header Name                              |
|-------------------|----------------------------|------------------------------------------|
| W3C Trace Context | W3C standard (recommended) | traceparent, tracestate                  |
| B3                | Zipkin format              | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled  |
| Jaeger            | Jaeger native              | uber-trace-id                            |
| AWS X-Ray         | AWS format                 | X-Amzn-Trace-Id                          |

Trace Sampling Strategies

In high-traffic systems, tracing every single request would generate enormous volumes of data. Sampling controls which requests are traced.

Sampling Decision Point:
All Requests
┌────────────────────────────────┐
│ ● ● ● ● ● ● ● ● ● ● ● ● ● │
│ ● ● ● ● ● ● ● ● ● ● ● ● ● │ 1,000 req/s
│ ● ● ● ● ● ● ● ● ● ● ● ● ● │
└────────────┬───────────────────┘
▼ Sampling (10%)
┌────────────────────────────────┐
│ ● ● ● ● ● │
│ ● ● ● │ 100 req/s stored
│ ● ● ● ● │
└────────────────────────────────┘
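
Head-based probabilistic sampling can be sketched as a pure function of the trace ID: treat the ID as a number and keep the trace when it falls below rate × max. This mirrors the idea behind OpenTelemetry's TraceIdRatioBased sampler, though the real implementation differs in detail:

```python
import random

MAX_64 = 2 ** 64

def should_sample(trace_id: int, rate: float) -> bool:
    # Compare the low 64 bits of the trace ID against the rate.
    # Because the decision depends only on the trace ID, every
    # service reaches the same verdict for the same trace, so a
    # trace is kept or dropped as a whole.
    return (trace_id & (MAX_64 - 1)) < rate * MAX_64

# Sanity check: roughly 10% of random trace IDs are kept
rng = random.Random(42)  # seeded for reproducibility
ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.10) for t in ids)
# kept / len(ids) is close to 0.10
```

Deriving the decision from the trace ID (rather than from a fresh random draw per service) is what makes head-based sampling consistent across service boundaries.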

Sampling Strategies Compared

| Strategy                   | Description                  | Pros                 | Cons                      |
|----------------------------|------------------------------|----------------------|---------------------------|
| Head-based (probabilistic) | Decide at start of request   | Simple, low overhead | May miss important traces |
| Tail-based                 | Decide after trace completes | Can keep all errors  | Higher resource usage     |
| Rate-limited               | Keep N traces per second     | Predictable costs    | May miss spikes           |
| Always-on                  | Trace everything             | Complete data        | Very expensive at scale   |
| Parent-based               | Follow parent's decision     | Consistent traces    | Depends on upstream       |

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    TraceIdRatioBased,
    ParentBasedTraceIdRatio,
)

# Head-based: sample 10% of traces
sampler = TraceIdRatioBased(0.1)

# Parent-based: respect the parent's sampling decision,
# use the ratio only for root spans
sampler = ParentBasedTraceIdRatio(0.1)

provider = TracerProvider(sampler=sampler)

# Custom sampler: always sample business-critical operations.
# (Head-based samplers cannot see errors or durations, which are
# only known after the fact; keeping all errors needs tail-based
# sampling in a collector.)
from opentelemetry.sdk.trace.sampling import (
    Sampler,
    SamplingResult,
    Decision,
)

class SmartSampler(Sampler):
    def __init__(self, base_rate=0.1):
        self.base_rate = base_rate
        self.base_sampler = TraceIdRatioBased(base_rate)

    def should_sample(
        self, parent_context, trace_id, name,
        kind=None, attributes=None, links=None, trace_state=None,
    ):
        # Always sample specific operations
        always_sample = [
            "process_payment",
            "user_signup",
            "admin_action",
        ]
        if name in always_sample:
            return SamplingResult(
                Decision.RECORD_AND_SAMPLE,
                attributes,
            )
        # Use the base rate for everything else
        return self.base_sampler.should_sample(
            parent_context, trace_id, name,
            kind, attributes, links, trace_state,
        )

    def get_description(self):
        return f"SmartSampler(base_rate={self.base_rate})"

provider = TracerProvider(sampler=SmartSampler(0.1))

Correlating Logs with Traces

The true power of observability emerges when you can jump from an alert (metric) to a trace to the specific log line that shows what went wrong.

Correlation Flow:
  1. Alert fires: "Error rate > 5%"
  2. Find trace IDs from error spans
  3. Query logs with the trace ID
  4. See the exact error: "CardDeclinedException for user 789 at line 142 of payment.py"
import json
import logging

from opentelemetry import trace

class TraceContextFormatter(logging.Formatter):
    """Add trace context to every log line."""

    def format(self, record):
        # Get the current span context
        span = trace.get_current_span()
        ctx = span.get_span_context()
        log_entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "logger": record.name,
            # Trace correlation fields
            "trace_id": format(ctx.trace_id, "032x")
                if ctx.trace_id else None,
            "span_id": format(ctx.span_id, "016x")
                if ctx.span_id else None,
            "service": "order-service",
        }
        # Add any extra fields
        for key in ["user_id", "order_id", "duration_ms"]:
            if hasattr(record, key):
                log_entry[key] = getattr(record, key)
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(log_entry)

# Configure the logger
logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(TraceContextFormatter())
logger.addHandler(handler)

# Now every log automatically includes trace_id and span_id
def process_order(order_id):
    with tracer.start_as_current_span("process_order"):
        logger.info("Processing order", extra={"order_id": order_id})

# Log output includes trace_id automatically:
# {"trace_id": "4bf92f...", "span_id": "a1b2c3...",
#  "message": "Processing order", "order_id": "order_123"}
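
Once every log line carries a trace_id, jumping from a trace to its logs is a single query. As a sketch, if the JSON output above were shipped to Grafana Loki with a `service` label, the LogQL query might look like this (label and backend are assumptions):

```logql
{service="order-service"}
  | json
  | trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
```

Jaeger and Grafana can be configured to generate this kind of link automatically, so clicking a span opens the matching log lines.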

Jaeger: Trace Visualization

Jaeger is an open-source distributed tracing platform that provides trace storage, querying, and visualization.

Jaeger Architecture:
Applications Jaeger Backend
┌─────────┐ ┌──────────────────────┐
│Service A│──┐ │ │
├─────────┤ │ OTLP │ ┌────────────────┐ │
│Service B│──┼──────────▶│ │ Collector │ │
├─────────┤ │ │ │ (receive, │ │
│Service C│──┘ │ │ process, │ │
└─────────┘ │ │ store) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ Storage │ │
│ │ (Elasticsearch,│ │
│ │ Cassandra, │ │
│ │ or in-memory) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ Query + UI │ │
│ │ (search, │ │
│ │ visualize) │ │
│ └────────────────┘ │
└──────────────────────┘

Setting Up Jaeger

# docker-compose.yaml for local Jaeger
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      # Elasticsearch storage assumes an elasticsearch service is
      # reachable; drop these two lines to use in-memory storage.
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200

Tracing Best Practices

| Practice                 | Description                                                  |
|--------------------------|--------------------------------------------------------------|
| Name spans meaningfully  | Use HTTP GET /api/orders, not just HTTP                      |
| Add semantic attributes  | Follow OpenTelemetry semantic conventions                    |
| Record errors properly   | Use span.record_exception() and set the error status         |
| Propagate context        | Ensure trace context flows through all service calls         |
| Use baggage sparingly    | Baggage travels with every request, so keep it small         |
| Sample intelligently     | Always keep errors, sample routine traffic                   |
| Set resource attributes  | Service name, version, environment on every span             |
| Instrument at boundaries | HTTP handlers, database calls, queue operations              |

Summary

| Concept             | Key Takeaway                                                              |
|---------------------|---------------------------------------------------------------------------|
| Distributed Tracing | Follow a single request across all services in a microservices architecture |
| Traces and Spans    | A trace is a tree of spans; each span represents one operation            |
| OpenTelemetry       | Vendor-neutral standard for instrumenting applications                    |
| Context Propagation | W3C Trace Context headers carry trace IDs across service boundaries       |
| Sampling            | Control trace volume: head-based, tail-based, or smart sampling           |
| Log Correlation     | Include trace IDs in logs to connect logs, traces, and metrics            |
| Jaeger              | Open-source tracing backend for storage and visualization                 |