
Logging & Metrics

Structured Logging

Traditional unstructured logs are human-readable but machine-hostile. Structured logging outputs logs in a parseable format (typically JSON), making them searchable, filterable, and aggregatable at scale.

Unstructured Log:

```text
2024-03-15 14:32:01 ERROR PaymentService - Payment failed for user 789, amount $99.99, card ending 4242
```

Structured Log (JSON):

```json
{
  "timestamp": "2024-03-15T14:32:01.234Z",
  "level": "error",
  "service": "payment-service",
  "instance": "payment-svc-pod-3",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error_type": "CardDeclinedException",
  "amount": 99.99,
  "currency": "USD",
  "card_last_four": "4242",
  "duration_ms": 1250,
  "retry_count": 2
}
```

Why Structured Logging Matters

| Aspect | Unstructured | Structured |
|---|---|---|
| Searchability | Regex-based, fragile | Field-based queries |
| Filtering | Manual parsing | `WHERE level = "error" AND service = "payment"` |
| Aggregation | Nearly impossible | `COUNT(*) GROUP BY error_type` |
| Alerting | Pattern matching on text | Precise conditions on fields |
| Correlation | Copy-paste trace IDs | Join on `trace_id` across services |
| Dashboards | Cannot build | Build from any field |
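The aggregation advantage is easy to see in miniature: once every log line is a JSON object, a `GROUP BY` over any field is a few lines of code. A minimal sketch (the log lines and field names here are illustrative):

```python
import json
from collections import Counter

# Hypothetical structured log lines, one JSON object per line
log_lines = [
    '{"level": "error", "error_type": "CardDeclinedException"}',
    '{"level": "error", "error_type": "TimeoutError"}',
    '{"level": "info", "message": "Order placed"}',
    '{"level": "error", "error_type": "CardDeclinedException"}',
]

# The equivalent of: SELECT error_type, COUNT(*) WHERE level = 'error' GROUP BY error_type
errors = Counter(
    entry["error_type"]
    for entry in map(json.loads, log_lines)
    if entry["level"] == "error"
)
print(errors.most_common())
```

Doing the same over free-text log lines would require a regex per message format, which breaks the moment someone rewords a message.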

Log Levels

Use log levels consistently across all services.

| Level | When to Use | Example |
|---|---|---|
| TRACE | Very detailed diagnostic info | "Entering function processPayment with args…" |
| DEBUG | Diagnostic information for developers | "Cache miss for key user:789" |
| INFO | Normal operational events | "Order 456 placed successfully" |
| WARN | Unexpected but recoverable situations | "Retrying payment after timeout (attempt 2/3)" |
| ERROR | Errors that need attention | "Payment failed: CardDeclinedException" |
| FATAL | System cannot continue | "Database connection pool exhausted, shutting down" |
A JSON formatter for Python's standard `logging` module:

```python
import logging
import json
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Format log records as JSON for structured logging."""

    def format(self, record):
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
        }
        # Add extra fields passed via the logger's `extra` argument
        for field in ("trace_id", "user_id", "duration_ms"):
            if hasattr(record, field):
                log_entry[field] = getattr(record, field)
        # Add exception info
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(log_entry)


# Configure structured logging
logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage with context
logger.info(
    "Payment processed successfully",
    extra={"trace_id": "abc123", "user_id": "user_789", "duration_ms": 150},
)
```

For more ergonomic structured logging, `structlog` lets you pass fields directly as keyword arguments:

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger("payment-service")
log.info(
    "payment_processed",
    user_id="user_789",
    amount=99.99,
    duration_ms=150,
)
```

Log Aggregation

In distributed systems, logs are scattered across dozens or hundreds of service instances. Log aggregation collects, processes, and centralizes logs for unified querying.

ELK Stack (Elasticsearch, Logstash, Kibana)

ELK Stack Architecture:

```text
Services             Collection         Storage              UI
┌─────────┐         ┌──────────┐       ┌─────────────┐      ┌─────────┐
│Service A│──┐      │          │       │             │      │         │
├─────────┤  │      │ Logstash │       │Elasticsearch│      │ Kibana  │
│Service B│──┼─────▶│ (Parse,  │──────▶│  (Index,    │─────▶│ (Search,│
├─────────┤  │      │  Filter, │       │   Store,    │      │  Visual)│
│Service C│──┘      │  Enrich) │       │   Query)    │      │         │
└─────────┘ Beats   └──────────┘       └─────────────┘      └─────────┘
           (Filebeat)
```

Grafana Loki (Lightweight Alternative)

Loki Stack:

```text
Services             Collection        Storage           UI
┌─────────┐         ┌──────────┐      ┌──────────┐      ┌─────────┐
│Service A│──┐      │          │      │          │      │         │
├─────────┤  │      │ Promtail │      │   Loki   │      │ Grafana │
│Service B│──┼─────▶│ (Collect,│─────▶│ (Label-  │─────▶│ (Unified│
├─────────┤  │      │  Label)  │      │  based   │      │  logs + │
│Service C│──┘      │          │      │  index)  │      │ metrics)│
└─────────┘         └──────────┘      └──────────┘      └─────────┘
```

Loki indexes only labels (not the full log text), making it much cheaper to operate than Elasticsearch.
| Feature | ELK Stack | Grafana Loki |
|---|---|---|
| Indexing | Full-text indexing | Label-based indexing only |
| Storage cost | High (indexes everything) | Low (stores compressed logs) |
| Query speed | Very fast (indexed) | Slower (scans log content) |
| Complexity | High (cluster management) | Lower |
| Best for | Large-scale log analytics | Cost-effective log aggregation |
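Loki's label-based model shows up directly in its query language, LogQL. A couple of illustrative queries (the `service` label is hypothetical; substitute your own labels):

```logql
# All logs from the payment service containing "error"
{service="payment-service"} |= "error"

# Parse JSON logs and count entries per level over 5 minutes
sum by (level) (
  count_over_time({service="payment-service"} | json [5m])
)
```

The label selector (`{service="payment-service"}`) is the only indexed part; the filters and parsers after it scan log content at query time, which is the cost/speed trade-off in the table above.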

Metrics

Metrics are numerical measurements collected at regular intervals. They provide a quantitative view of system health and behavior over time.

Metric Types

Counter (monotonically increasing):

```text
│                        ●
│                     ●●●
│                  ●●●
│              ●●●●
│         ●●●●●
│     ●●●●
│  ●●●●
│●●●
└──────────────────────────── Time
```

Example: `total_requests`, `errors_total`. Only goes up; resets to zero on restart.

Gauge (goes up and down):

```text
│      ●●          ●●
│    ●●  ●●      ●●  ●
│   ●      ●    ●     ●●
│  ●        ●  ●        ●●
│ ●          ●●           ●
│●                         ●
└──────────────────────────────── Time
```

Example: `memory_usage_bytes`, `active_connections`. The current value: a snapshot at a point in time.

Histogram (distribution):

```text
│                ████
│            ████████
│        ████████████
│    ████████████████
│████████████████████
└──────────────────────── Latency buckets (ms)
  0-10  10-50  50-100  100-500  500+
```

Example: `request_duration_seconds`. Counts observations into configurable buckets, and tracks sum and count for averages.
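To make the histogram type concrete, here is a minimal sketch of bucketed observation. This is not the `prometheus_client` implementation (which exports cumulative `le` buckets); the bounds are illustrative and the counts here are kept per-bucket for simplicity:

```python
import bisect

# Bucket upper bounds in ms, following Prometheus's "le" (less-or-equal) convention.
BUCKETS = [10, 50, 100, 500, float("inf")]

counts = [0] * len(BUCKETS)  # per-bucket observation counts
total_sum = 0.0
total_count = 0

def observe(value_ms):
    """Record one observation in the first bucket whose bound covers it."""
    global total_sum, total_count
    counts[bisect.bisect_left(BUCKETS, value_ms)] += 1
    total_sum += value_ms
    total_count += 1

for latency in [5, 12, 48, 75, 230, 999]:
    observe(latency)

print(counts)                   # how observations distribute across buckets
print(total_sum / total_count)  # average latency, from sum and count
```

Because only bucket counts, the sum, and the count are stored, a histogram stays cheap no matter how many observations it records; percentiles are then estimated from the bucket counts (this is what PromQL's `histogram_quantile` does).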

Prometheus

Prometheus is the de facto standard for metrics collection in cloud-native systems. It uses a pull model — scraping metrics from application endpoints.

Instrumenting a service with `prometheus_client`:

```python
import time
import random

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status'],
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections',
)
IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'Number of HTTP requests in progress',
)

# Use metrics in your application
def handle_request(method, endpoint):
    ACTIVE_CONNECTIONS.inc()
    IN_PROGRESS.inc()
    start_time = time.time()
    try:
        # Process request...
        status = "200"
        time.sleep(random.uniform(0.01, 0.5))
    except Exception:
        status = "500"
    finally:
        duration = time.time() - start_time
        REQUEST_COUNT.labels(
            method=method, endpoint=endpoint, status=status
        ).inc()
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(duration)
        IN_PROGRESS.dec()
        ACTIVE_CONNECTIONS.dec()

# Expose a /metrics endpoint for Prometheus to scrape
start_http_server(8000)
# Prometheus scrapes http://your-app:8000/metrics
```

PromQL (Prometheus Query Language)

Common PromQL queries:

```promql
# Request rate (requests per second over 5 minutes)
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# 99th percentile latency
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)

# Active connections by instance
active_connections{job="api-server"}

# Memory usage trend
process_resident_memory_bytes{job="api-server"}

# Top 5 endpoints by request count
topk(5, sum by (endpoint) (
  rate(http_requests_total[5m])
))
```

Grafana Dashboards

Grafana visualizes metrics from Prometheus (and other data sources) as interactive dashboards. A well-designed dashboard tells a story about service health.

Dashboard Design Principles

Effective Dashboard Layout:

```text
┌──────────────────────────────────────────────────────┐
│ Service Health Overview (RED Method)                 │
│                                                      │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐          │
│  │ Request  │   │  Error   │   │ Duration │          │
│  │  Rate    │   │  Rate    │   │ (p50/p99)│          │
│  │ 1,234/s  │   │  0.5%    │   │ 45/210ms │          │
│  └──────────┘   └──────────┘   └──────────┘          │
│                                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │ Request Rate Over Time (line chart)            │  │
│  └────────────────────────────────────────────────┘  │
│                                                      │
│  ┌─────────────────────┐  ┌────────────────────────┐ │
│  │ Latency Distribution│  │ Error Rate by Endpoint │ │
│  │ (heatmap)           │  │ (stacked bar)          │ │
│  └─────────────────────┘  └────────────────────────┘ │
│                                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │ Resource Usage: CPU, Memory, Disk              │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
```

The RED Method (for request-driven services)

| Signal | Metric | Purpose |
|---|---|---|
| Rate | Requests per second | Traffic volume |
| Errors | Error percentage | Service reliability |
| Duration | Latency percentiles | User experience |
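The three RED signals map directly onto PromQL. One possible set of panel queries, assuming the metric names from the instrumentation example earlier on this page:

```promql
# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests failing
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Duration: p50 and p99 latency
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```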

The USE Method (for resources)

| Signal | Metric | Purpose |
|---|---|---|
| Utilization | Percentage of resource busy | Capacity planning |
| Saturation | Queue depth, waiting work | Bottleneck detection |
| Errors | Error events on the resource | Hardware/software issues |

Alerting Best Practices

Alerts should be actionable, timely, and not overwhelming. Bad alerting leads to alert fatigue — when engineers start ignoring alerts because too many are false positives.

Alert Design Principles

Good Alert:

```text
┌───────────────────────────────────────────────┐
│ FIRING: High Error Rate on Payment Service    │
│                                               │
│ Severity:   critical                          │
│ Service:    payment-service                   │
│ Error rate: 5.2% (threshold: 1%)              │
│ Duration:   10 minutes                        │
│ Dashboard:  https://grafana/d/payments        │
│ Runbook:    https://wiki/runbooks/payment-err │
│ On-call:    @jane-doe (primary)               │
│                                               │
│ What to do:                                   │
│ 1. Check the dashboard for error patterns     │
│ 2. Check recent deployments                   │
│ 3. Follow the runbook                         │
└───────────────────────────────────────────────┘
```

Bad Alert:

```text
┌───────────────────────────────────────────────┐
│ FIRING: CPU > 80%                             │
│                                               │
│ (No context. No runbook. Is this a problem    │
│ or normal during peak traffic? What should    │
│ the on-call engineer do?)                     │
└───────────────────────────────────────────────┘
```
| Principle | Description |
|---|---|
| Alert on symptoms, not causes | Alert on "error rate is high", not "CPU is high" |
| Include context | Dashboard links, runbook links, recent changes |
| Set appropriate thresholds | Based on SLOs, not arbitrary numbers |
| Use multi-window alerting | Avoid alerting on brief spikes (require sustained violations) |
| Severity levels | Critical = pages on-call, Warning = next business day |
| Reduce noise | Group related alerts, suppress duplicates |
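The "reduce noise" and severity-routing principles are typically implemented in Alertmanager's routing tree. A sketch (receiver names, channels, and timings are illustrative, not a recommended configuration):

```yaml
# alertmanager.yaml (sketch)
route:
  group_by: [alertname, service]  # batch related alerts into one notification
  group_wait: 30s                 # wait briefly to collect alerts for a new group
  group_interval: 5m              # batch follow-ups for an existing group
  repeat_interval: 4h             # re-notify unresolved alerts at most every 4h
  receiver: slack-warnings        # default: warnings go to chat
  routes:
    - matchers: ["severity=critical"]
      receiver: pagerduty-oncall  # critical alerts page the on-call engineer

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <secret>
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-warning"
```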
```yaml
# Prometheus alerting rules (prometheus-rules.yaml)
groups:
  - name: payment-service
    rules:
      # Alert on high error rate (symptom-based)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payment", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="payment"}[5m]))
            > 0.01
        for: 5m  # must persist for 5 minutes
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "High error rate on payment service"
          description: >
            Error rate is {{ $value | humanizePercentage }}
            (threshold: 1%).
          dashboard: "https://grafana.example/d/payments"
          runbook: "https://wiki.example/runbooks/payment-errors"

      # Alert on high latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket{service="payment"}[5m])
          ) > 2.0
        for: 10m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "High p99 latency on payment service"
          description: >
            p99 latency is {{ $value | humanizeDuration }}.

      # Alert on error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          1 - (
            sum(rate(http_requests_total{service="payment", status!~"5.."}[1h]))
              /
            sum(rate(http_requests_total{service="payment"}[1h]))
          ) > 14.4 * (1 - 0.999)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
```

Logging and Metrics Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Logging everything | Storage costs explode, signal lost in noise | Log meaningful events; use sampling for high-volume paths |
| Unstructured logs | Cannot search, filter, or aggregate | Use JSON structured logging from day one |
| High-cardinality labels | Prometheus runs out of memory | Never use user IDs, request IDs, or IP addresses as metric labels |
| Alerting on causes | "CPU high" is not actionable | Alert on user-facing symptoms: error rate, latency |
| No runbooks | On-call has no idea what to do | Every alert must link to a runbook |
| Alert fatigue | Engineers ignore all alerts | Reduce noise, fix flapping alerts, tune thresholds |
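For the sampling fix, one approach is a `logging.Filter` that keeps every WARNING-and-above record but passes only a fraction of lower-level records. A minimal sketch (the 10% rate is arbitrary):

```python
import logging
import random

class SampleFilter(logging.Filter):
    """Drop a fraction of records below WARNING to tame high-volume log paths."""

    def __init__(self, sample_rate=0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        # Always keep warnings and errors; sample everything below
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate

logger = logging.getLogger("high-volume-path")
logger.addFilter(SampleFilter(sample_rate=0.1))
```

Random sampling keeps aggregate statistics roughly representative while cutting volume; for correlated debugging you may instead want trace-based sampling, so that all logs for a sampled request are kept together.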

Summary

| Concept | Key Takeaway |
|---|---|
| Structured Logging | JSON logs are searchable, filterable, and machine-parseable |
| Log Levels | Use consistently: DEBUG, INFO, WARN, ERROR, FATAL |
| Log Aggregation | Centralize logs with ELK or Loki for unified querying |
| Metric Types | Counters (totals), Gauges (current), Histograms (distribution) |
| Prometheus | Pull-based metrics collection with powerful query language |
| Grafana | Unified visualization for metrics, logs, and traces |
| RED Method | Rate, Errors, Duration — for request-driven services |
| Alerting | Alert on symptoms, include context, link to runbooks |