
Observability & SRE

Observability vs Monitoring

Monitoring tells you when something is wrong. Observability tells you why it is wrong. While monitoring is about collecting predefined metrics and setting alerts, observability is the ability to understand the internal state of a system by examining its outputs.

Monitoring:                    Observability:
┌────────────────────────┐     ┌────────────────────────┐
│ "CPU is at 95%"        │     │ "CPU is at 95% because │
│ "Error rate spiked"    │     │ user X's query caused  │
│ "Latency is high"      │     │ a full table scan on   │
│                        │     │ the orders table due   │
│ You know WHAT          │     │ to a missing index     │
│ is happening.          │     │ deployed 10 min ago."  │
│                        │     │                        │
│ Pre-defined dashboards │     │ You understand WHY     │
│ and alerts.            │     │ it is happening.       │
│                        │     │                        │
│ Works for KNOWN        │     │ Works for UNKNOWN      │
│ failure modes.         │     │ failure modes.         │
└────────────────────────┘     └────────────────────────┘
Aspect           Monitoring                          Observability
─────────────────────────────────────────────────────────────────────────────
Approach         Track known metrics                 Explore unknown problems
Questions        "Is the system healthy?"            "Why is the system unhealthy?"
Data             Predefined metrics and thresholds   Rich, high-cardinality data
Investigation    Dashboard-driven                    Query-driven, ad-hoc exploration
Failure modes    Handles known unknowns              Handles unknown unknowns

The Three Pillars of Observability

Modern observability rests on three complementary data types, each providing a different lens into system behavior.

┌─────────────────────────────────────────────────────────┐
│                      Three Pillars                      │
│                                                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────────────┐     │
│  │          │   │          │   │                  │     │
│  │   LOGS   │   │  METRICS │   │      TRACES      │     │
│  │          │   │          │   │                  │     │
│  │   What   │   │  How the │   │  How a request   │     │
│  │ happened │   │  system  │   │  flows through   │     │
│  │ (events) │   │  behaves │   │     services     │     │
│  │          │   │ (numbers)│   │   (causality)    │     │
│  └──────────┘   └──────────┘   └──────────────────┘     │
│                                                         │
│   Detailed      Aggregated       Causal                 │
│   per-event     over time        per-request            │
│   data          data             data                   │
└─────────────────────────────────────────────────────────┘

Logs

Logs are discrete events with timestamps and context. They record what happened in a system at a specific point in time.

{
  "timestamp": "2024-03-15T14:32:01.234Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error": "CardDeclinedException",
  "amount": 99.99,
  "currency": "USD",
  "duration_ms": 1250
}

Best for: Debugging specific issues, audit trails, security events, error details.
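A structured log line like the example above can be produced with Python's standard logging module and a small JSON formatter. A minimal sketch — the JsonFormatter class and the "context" field are illustrative conventions, not a built-in logging feature:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line (illustrative formatter)."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context attached via logging's `extra` mechanism.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"context": {"trace_id": "abc123def456",
                                "error": "CardDeclinedException",
                                "amount": 99.99,
                                "currency": "USD"}})
```

Emitting one JSON object per line keeps logs machine-parseable, which is what lets a log backend index fields like trace_id for later queries.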

Metrics

Metrics are numerical measurements collected over time. They show trends, patterns, and anomalies in system behavior.

HTTP Request Rate (requests/second)
60 │ ●●
50 │ ● ●●
40 │ ●●●●● ●●
30 │ ●●●● ●●●
20 │ ●●●●●● ●●●●●●
10 │●●●● ●●●●●
0 └──────────────────────────────────────── Time
08:00 09:00 10:00 11:00 12:00

Key metric types:
- Counter: Total requests served (monotonically increasing)
- Gauge: Current memory usage (goes up and down)
- Histogram: Request latency distribution (buckets)
- Summary: Similar to histogram with pre-calculated quantiles

Best for: Alerting, dashboards, capacity planning, trend analysis.
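The semantics of these metric types can be sketched in a few lines of plain Python. This is a toy illustration of the behavior, not the Prometheus client library; summaries are omitted since they differ from histograms mainly in where quantiles are computed:

```python
import bisect

class Counter:
    """Monotonically increasing total, e.g. requests served."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Point-in-time value that moves both ways, e.g. memory in use."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets by upper bound."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)           # upper bounds, e.g. ms
        self.counts = [0] * (len(buckets) + 1)   # final slot = +Inf
    def observe(self, value):
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
```

For example, `Histogram([50, 100, 250, 500]).observe(120)` increments the bucket with upper bound 250, and any observation above 500 lands in the +Inf slot — which is why histograms support latency questions like "what fraction of requests took over 500ms?"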

Traces

Traces follow a single request as it flows through multiple services in a distributed system. They show causality and timing.

Trace ID: abc-123-def
┌──────────────────────────────────────────────────────────────┐
│ Span: API Gateway                                            │
│ ██████████████████████████████████████████████████  450ms    │
│ │                                                            │
│ ├─ Span: Auth Service                                        │
│ │  ████████  80ms                                            │
│ │                                                            │
│ ├─ Span: Order Service                                       │
│ │  ██████████████████████████████████  320ms                 │
│ │  │                                                         │
│ │  ├─ Span: Database Query                                   │
│ │  │  ██████████████████████  200ms  ← SLOW!                 │
│ │  │                                                         │
│ │  └─ Span: Cache Lookup                                     │
│ │     ██  15ms                                               │
│ │                                                            │
│ └─ Span: Notification Service                                │
│    ████  30ms                                                │
└──────────────────────────────────────────────────────────────┘

This trace reveals that the database query is the bottleneck.

Best for: Understanding request flow, finding bottlenecks, debugging latency issues in microservices.
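At its core, a tracer just records each span's name, parent, and duration. A minimal sketch — the `span` helper and `SPANS` list are illustrative; real systems instrument with an SDK such as OpenTelemetry and export spans to a backend:

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # collected spans; a real tracer exports these to a backend

@contextmanager
def span(name, trace_id, parent=None):
    """Time one unit of work and record its place in the trace tree."""
    span_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

trace_id = uuid.uuid4().hex
with span("API Gateway", trace_id) as root:
    with span("Auth Service", trace_id, parent=root):
        time.sleep(0.01)                      # stand-in for real work
    with span("Order Service", trace_id, parent=root) as order:
        with span("Database Query", trace_id, parent=order):
            time.sleep(0.02)                  # the slow call from the diagram
```

Because every span carries the same trace_id and a parent_id, a trace UI can reassemble the waterfall above and point straight at the slowest child span.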

How the Three Pillars Work Together

1. Alert fires:    "Error rate > 5%"              (METRICS)
2. Check traces:   "Errors in payment-svc"        (TRACES)
3. Find the error: "CardDeclined for user_789     (LOGS)
                    at 14:32:01, amount $99.99"
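This workflow hinges on a shared trace_id linking all three data types. A toy illustration of the traces-to-logs join — the in-memory records are hypothetical; real systems run this query against tracing and log backends:

```python
# Hypothetical in-memory telemetry; real systems query tracing/log backends.
traces = [
    {"trace_id": "abc123", "service": "payment-svc", "status": "error"},
    {"trace_id": "xyz789", "service": "payment-svc", "status": "ok"},
]
logs = [
    {"trace_id": "abc123", "level": "error",
     "message": "CardDeclined for user_789 at 14:32:01, amount $99.99"},
    {"trace_id": "xyz789", "level": "info", "message": "Payment captured"},
]

def logs_for_failed_traces(traces, logs):
    """Step 2 -> step 3: jump from failing traces to their log lines."""
    failed = {t["trace_id"] for t in traces if t["status"] == "error"}
    return [entry for entry in logs if entry["trace_id"] in failed]
```

Without a propagated trace_id this join is impossible, which is why consistent trace-context propagation is the first instrumentation step in most observability rollouts.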

SRE Principles

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operations problems. Originating at Google, it provides a framework for building and running reliable, scalable systems.

The Google SRE Philosophy

SRE Core Tenets:
┌───────────────────────────────────────────────────────┐
│ 1. Embracing Risk                                     │
│    100% reliability is the wrong target.              │
│    Define acceptable risk levels.                     │
│                                                       │
│ 2. Service Level Objectives (SLOs)                    │
│    Quantify reliability targets.                      │
│    Use error budgets to balance velocity and          │
│    reliability.                                       │
│                                                       │
│ 3. Eliminating Toil                                   │
│    Automate repetitive operational work.              │
│    Engineers should spend > 50% on engineering,       │
│    not operations.                                    │
│                                                       │
│ 4. Monitoring                                         │
│    Use the four golden signals:                       │
│    Latency, Traffic, Errors, Saturation.              │
│                                                       │
│ 5. Simplicity                                         │
│    Simple systems are more reliable.                  │
│    Resist unnecessary complexity.                     │
│                                                       │
│ 6. Release Engineering                                │
│    Make deployments boring and repeatable.            │
└───────────────────────────────────────────────────────┘
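The error budget in tenet 2 is simple arithmetic: the budget is the fraction of a rolling window that the SLO allows to be bad. A sketch — the 30-day window and the release-freeze rule are common conventions, not fixed requirements:

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unreliability in the window: (1 - SLO) of it."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, bad_minutes, window_days=30):
    """Positive: keep shipping. Negative: budget spent, slow releases down."""
    return error_budget_minutes(slo, window_days) - bad_minutes
```

A 99.9% SLO over 30 days allows about 43.2 bad minutes; 30 bad minutes so far leaves budget to keep releasing, while 50 would argue for freezing risky changes until the window rolls forward.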

The Four Golden Signals

Every service should be monitored using these four signals at minimum.

Signal        What It Measures            Example
──────────────────────────────────────────────────────────────
Latency       Time to serve a request     p50 = 50ms, p99 = 200ms
Traffic       Demand on the system        1,000 requests/second
Errors        Rate of failed requests     0.5% error rate
Saturation    How "full" the system is    CPU at 75%, disk 80% full
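Latency, traffic, and errors can all be derived from raw request records; saturation usually comes from resource metrics (CPU, disk) instead. A sketch, where the record shape and the nearest-rank percentile are illustrative choices:

```python
import math

def golden_signals(requests, window_seconds):
    """Derive three golden signals from request records.

    Each record is assumed to look like {"latency_ms": float, "ok": bool}.
    Saturation is measured from resources, not requests, so it is omitted.
    """
    latencies = sorted(r["latency_ms"] for r in requests)
    n = len(latencies)

    def pct(p):
        # Nearest-rank percentile over the sorted latencies.
        return latencies[max(0, math.ceil(p * n / 100) - 1)]

    return {
        "latency_p50_ms": pct(50),
        "latency_p99_ms": pct(99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(not r["ok"] for r in requests) / len(requests),
    }
```

Reporting latency as percentiles rather than an average matters: a handful of very slow requests can hide behind a healthy-looking mean but will show up immediately at p99.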

Embracing Risk

Pursuing 100% reliability is almost always the wrong goal. The incremental cost of each additional “nine” of reliability increases exponentially, while the user-perceived benefit diminishes.

Availability   Downtime/Year   Cost to Achieve
─────────────────────────────────────────────
99%            3.65 days       $
99.9%          8.76 hours      $$
99.95%         4.38 hours      $$$
99.99%         52.6 minutes    $$$$
99.999%        5.26 minutes    $$$$$$$$

The cost grows exponentially, but users
often cannot tell the difference between
99.99% and 99.999%.
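The downtime column is direct arithmetic on the availability target:

```python
def downtime_per_year(availability_pct):
    """Minutes per year a service may be down at a given availability."""
    return (1 - availability_pct / 100) * 365 * 24 * 60
```

At 99% this yields 5,256 minutes (3.65 days), and at 99.99% about 52.6 minutes, matching the table — each extra nine cuts allowed downtime by 10x while roughly multiplying the engineering cost.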

Eliminating Toil

Toil is manual, repetitive, automatable work that scales linearly with service growth. SRE aims to spend no more than 50% of time on toil, with the remainder spent on engineering work.

Toil                                               Engineering
─────────────────────────────────────────────────────────────────────────────────────────────
Manually restarting crashed services               Building auto-restart and self-healing
Manually scaling up for traffic spikes             Implementing auto-scaling
Manually checking logs for errors                  Building automated alerting
Manually running database migrations               Automating migration pipelines
Manually responding to the same alert repeatedly   Writing a runbook, then automating the fix
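As a concrete example of turning toil into engineering, the first row — manual restarts — might be replaced by a small watchdog loop. A sketch; real deployments would lean on systemd, Kubernetes, or a process supervisor instead:

```python
import subprocess
import time

def keep_alive(cmd, max_restarts=5, backoff_seconds=1.0):
    """Watchdog loop replacing manual restarts of a crashing service.

    Re-runs the command when it exits non-zero, with linear backoff,
    and gives up after max_restarts so a persistent failure still
    pages a human instead of crash-looping forever.
    """
    for attempt in range(max_restarts):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True                 # clean exit: nothing left to do
        time.sleep(backoff_seconds * (attempt + 1))
    return False                        # persistent crash: escalate to a human
```

The restart cap is the important design choice: automation should absorb transient failures, not silently mask a bug that needs engineering attention.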

Observability Stack

A typical modern observability stack combines specialized tools for each pillar.

┌────────────────────────────────────────────────────────────────┐
│                      Observability Stack                       │
│                                                                │
│  Logs:                                                         │
│  ┌──────────┐  ┌───────────────┐  ┌────────────────────────┐   │
│  │ Fluentd  │─▶│ Elasticsearch │─▶│ Kibana (Visualization) │   │
│  │ Promtail │  │ or Loki       │  │ or Grafana             │   │
│  └──────────┘  └───────────────┘  └────────────────────────┘   │
│                                                                │
│  Metrics:                                                      │
│  ┌────────────┐  ┌────────────┐  ┌─────────────────┐           │
│  │ Prometheus │─▶│ Prometheus │─▶│ Grafana         │           │
│  │ Exporters  │  │ Server     │  │ Dashboards      │           │
│  └────────────┘  └────────────┘  └─────────────────┘           │
│                                                                │
│  Traces:                                                       │
│  ┌────────────────┐  ┌──────────┐  ┌──────────────┐            │
│  │ OpenTelemetry  │─▶│ Jaeger   │─▶│ Trace UI     │            │
│  │ SDK            │  │ or Tempo │  │              │            │
│  └────────────────┘  └──────────┘  └──────────────┘            │
│                                                                │
│  Unified:                                                      │
│  ┌──────────────────────────────────────────────────┐          │
│  │ Grafana (Unified dashboards, alerts, explore)    │          │
│  │ or Datadog / New Relic / Honeycomb               │          │
│  └──────────────────────────────────────────────────┘          │
└────────────────────────────────────────────────────────────────┘
Tool             Category          Description
───────────────────────────────────────────────────────────────────────────
Prometheus       Metrics           Pull-based metrics collection and storage
Grafana          Visualization     Dashboards for metrics, logs, and traces
Elasticsearch    Log storage       Full-text search engine for log data
Loki             Log storage       Lightweight log aggregation by Grafana Labs
Jaeger           Tracing           Distributed tracing backend
Tempo            Tracing           Scalable tracing backend by Grafana Labs
OpenTelemetry    Instrumentation   Vendor-neutral telemetry collection framework
Datadog          All-in-one        Commercial observability platform
Honeycomb        Observability     Query-driven observability platform

What You Will Learn in This Section