
Monitoring, Observability & IaC

Monitoring vs Observability

Monitoring tells you when something is wrong. Observability helps you understand why.

| Aspect   | Monitoring                     | Observability                          |
| -------- | ------------------------------ | -------------------------------------- |
| Focus    | Known failure modes            | Unknown unknowns                       |
| Approach | Dashboards, alerts, thresholds | Exploration, correlation, querying     |
| Question | "Is the system healthy?"       | "Why is the system behaving this way?" |
| Data     | Predefined metrics and checks  | Rich, high-cardinality telemetry data  |
| Analogy  | Car dashboard warning lights   | A mechanic's full diagnostic toolkit   |

Monitoring is a subset of observability. A truly observable system emits enough data that you can diagnose novel problems without deploying new instrumentation.

The Three Pillars of Observability

1. Metrics

Metrics are numerical measurements collected over time. They are compact, aggregatable, and ideal for dashboards and alerting.

Time Series: http_requests_total

```
Requests
per sec
120 │                    ╱─╲
100 │               ╱───╱   ╲
 80 │          ╱───╱         ╲───╲
 60 │     ╱───╱                   ╲───╲
 40 │    ╱                             ╲───
 20 │╱───╱
  0 └──────────────────────────────────────────▶
     00:00   04:00   08:00   12:00   16:00   20:00   Time
```

Metric Types

| Type      | Description                                                       | Example                                    |
| --------- | ----------------------------------------------------------------- | ------------------------------------------ |
| Counter   | Monotonically increasing value; can only go up (or reset to zero) | http_requests_total, errors_total          |
| Gauge     | Value that can go up and down                                     | cpu_usage_percent, active_connections      |
| Histogram | Samples observations and counts them in configurable buckets      | request_duration_seconds (p50, p95, p99)   |
| Summary   | Like a histogram, but calculates quantiles client-side            | request_duration_summary                   |
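To make the histogram row concrete, here is a minimal Python sketch of how a Prometheus-style histogram records observations: each value is counted into every *cumulative* bucket whose upper bound it fits under, plus a +Inf bucket, a running sum, and a total count. The class name and bucket bounds are illustrative, not a real client library.

```python
import math

class MiniHistogram:
    """Illustrative Prometheus-style histogram with cumulative buckets."""
    def __init__(self, buckets=(0.1, 0.25, 0.5, 1.0)):
        self.bounds = list(buckets) + [math.inf]  # +Inf bucket catches everything
        self.counts = [0] * len(self.bounds)
        self.total = 0     # becomes *_count
        self.sum = 0.0     # becomes *_sum

    def observe(self, value):
        self.total += 1
        self.sum += value
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1  # cumulative: every bucket at or above the value

h = MiniHistogram()
for latency in (0.05, 0.2, 0.3, 2.0):
    h.observe(latency)
# Cumulative counts per bucket (le=0.1, 0.25, 0.5, 1.0, +Inf): [1, 2, 3, 3, 4]
```

The cumulative layout is what lets PromQL's histogram_quantile() estimate percentiles later without the raw samples.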

2. Logs

Logs are timestamped, immutable records of discrete events. They provide the richest context for debugging specific issues.

Structured Logging

Always use structured (JSON) logging in production. It is machine-parseable, searchable, and far more useful than plain text:

```jsonc
// Good: Structured log
{
  "timestamp": "2025-03-15T10:23:45.123Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "user_id": "user-42",
  "message": "Payment processing failed",
  "error": "CardDeclined",
  "amount": 49.99,
  "currency": "USD",
  "duration_ms": 234
}

// Bad: Unstructured log
ERROR 2025-03-15 10:23:45 - Payment failed for user 42, card declined, amount $49.99
```

Logging Best Practices

  • Use log levels consistently: DEBUG for development detail, INFO for normal operations, WARN for recoverable issues, ERROR for failures requiring attention.
  • Include correlation IDs: Trace IDs and request IDs let you follow a single request across services.
  • Do not log sensitive data: Never log passwords, tokens, credit card numbers, or personal data.
  • Centralize logs: Use tools like the ELK Stack (Elasticsearch, Logstash, Kibana), Loki, or Datadog to aggregate logs from all services.
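A structured logger like the one above can be sketched with Python's standard-library logging module. The JsonFormatter class and the "fields" convention here are illustrative assumptions, not a standard API; production services typically use a library such as structlog or their framework's JSON formatter.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Minimal sketch: render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "payment-api",          # illustrative service name
            "message": record.getMessage(),
        }
        # Merge structured fields attached via logger.error(..., extra={"fields": ...})
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"fields": {"error": "CardDeclined", "amount": 49.99}})
```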

3. Traces

Distributed traces track the journey of a single request as it flows through multiple services, revealing where time is spent and where failures occur:

Request: GET /api/orders/123

```
┌─────────────────────────────────────────────────────────────┐
│ API Gateway (12ms)                                          │
│ ┌───────────────────────────────────────────────────────┐   │
│ │ Order Service (45ms)                                  │   │
│ │ ┌───────────────────┐  ┌────────────────────────────┐ │   │
│ │ │ Auth Service (8ms)│  │ Database Query (22ms)      │ │   │
│ │ └───────────────────┘  └────────────────────────────┘ │   │
│ │ ┌──────────────────────────────────────┐              │   │
│ │ │ Inventory Service (30ms)             │              │   │
│ │ │ ┌─────────────────────────┐          │              │   │
│ │ │ │ Cache Lookup (2ms)      │          │              │   │
│ │ │ └─────────────────────────┘          │              │   │
│ │ └──────────────────────────────────────┘              │   │
│ └───────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
0ms       10ms      20ms      30ms      40ms      50ms      60ms
```

Key concepts:

  • Trace: The entire journey of a request through the system.
  • Span: A single operation within a trace (one service call, one database query).
  • Trace Context Propagation: Passing trace IDs between services via HTTP headers (e.g., traceparent header in the W3C Trace Context standard).

OpenTelemetry (OTel) is the industry-standard framework for collecting traces, metrics, and logs. It provides vendor-neutral SDKs and a collector that can export to any backend.
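Trace context propagation can be sketched in a few lines. The traceparent value below is the example from the W3C Trace Context specification; the helper function names are hypothetical:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: version-traceid-spanid-flags."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

def child_traceparent(parent: dict, new_span_id: str) -> str:
    """A downstream service keeps the trace ID but supplies its own span ID."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
# The next hop's outgoing header reuses the trace ID with a fresh span ID:
outgoing = child_traceparent(ctx, "00f067aa0ba902b7")
```

This is exactly what OTel SDKs do for you automatically on every outbound HTTP call.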

Prometheus

Prometheus is the most widely used open-source monitoring system in the cloud-native ecosystem. It uses a pull model, scraping metrics from targets at regular intervals.
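When Prometheus scrapes a target, the target's /metrics endpoint returns plain text in the Prometheus exposition format. A rough sketch of rendering that format (the render_metrics helper and sample values are hypothetical; real services use an official client library):

```python
def render_metrics(metrics: dict) -> str:
    """Render {name: (type, help, [(labels, value), ...])} as exposition text."""
    lines = []
    for name, (mtype, help_text, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            if labels:
                label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

text = render_metrics({
    "http_requests_total": ("counter", "Total HTTP requests",
                            [({"method": "GET", "status": "200"}, 1027)]),
})
```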

Prometheus Architecture

```
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ App + /metrics│   │ App + /metrics│   │ Node Exporter │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │ scrape (pull)
                    ┌───────▼───────┐
                    │  Prometheus   │
                    │    Server     │
                    │               │
                    │  ┌─────────┐  │     ┌──────────────┐
                    │  │  TSDB   │  │────▶│   Grafana    │
                    │  │(storage)│  │     │ (dashboards) │
                    │  └─────────┘  │     └──────────────┘
                    │               │
                    │  ┌─────────┐  │     ┌──────────────┐
                    │  │  Alert  │  │────▶│ Alertmanager │──▶ Slack/PagerDuty
                    │  │  Rules  │  │     └──────────────┘
                    │  └─────────┘  │
                    └───────────────┘
```

PromQL Basics

PromQL (Prometheus Query Language) lets you query and aggregate time-series data:

```promql
# Per-second HTTP request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage above 80%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100 > 80
```
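The histogram_quantile() call above is worth demystifying: Prometheus estimates the quantile by finding the cumulative bucket that contains the target rank and interpolating linearly inside it. A sketch of that idea (bucket data illustrative, and simplified relative to Prometheus's actual implementation):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count), ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # rank falls in the open-ended bucket
            # Linear interpolation within the containing bucket
            return prev_bound + (bound - prev_bound) * \
                   (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests; 95 of them completed within 0.5s, so p95 lands at the
# upper edge of the 0.25-0.5s bucket:
p95 = histogram_quantile(0.95, [(0.1, 50), (0.25, 80), (0.5, 95),
                                (1.0, 99), (float("inf"), 100)])
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.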

Alerting Rules

prometheus-alerts.yaml

```yaml
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency"
          description: "95th percentile latency is {{ $value }}s"

      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
```
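The `for:` clause in these rules has precise semantics: the alert only transitions to firing after the expression has been continuously true for the hold duration, and a single false evaluation resets the timer. A small sketch of that state machine (one evaluation per interval; the function is illustrative, not Prometheus's actual code):

```python
def alert_state(evaluations, hold=5):
    """evaluations: booleans, one per evaluation interval (e.g. 1 min each).
    Returns the alert's state after the last evaluation."""
    pending = 0
    state = "inactive"
    for expr_true in evaluations:
        if expr_true:
            pending += 1
            state = "firing" if pending >= hold else "pending"
        else:
            pending, state = 0, "inactive"  # any false reading resets the clock
    return state

# True for only 4 of 5 required intervals: still pending, not firing.
```

This is what makes `for: 5m` an effective flap filter: a 30-second error spike never pages anyone.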

Grafana Dashboards

Grafana connects to Prometheus (and many other data sources) to create rich, interactive dashboards. A well-designed dashboard answers key operational questions at a glance:

The RED Method for Services

For request-driven services, monitor these three signals:

  • Rate — Requests per second
  • Errors — Failed requests per second
  • Duration — Latency distribution (p50, p95, p99)
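The three RED signals can be computed from a window of raw request records. A sketch, assuming each record is a (status_code, duration_seconds) pair over an illustrative 60-second window, with a nearest-rank p95:

```python
import math

def red_signals(requests, window_seconds=60):
    """requests: list of (status_code, duration_seconds) in the window."""
    durations = sorted(d for _, d in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank 95th percentile
    p95_index = min(len(durations) - 1, math.ceil(0.95 * len(durations)) - 1)
    return {
        "rate": len(requests) / window_seconds,        # requests per second
        "error_rate": errors / window_seconds,          # failed requests per second
        "p95": durations[p95_index],                    # latency distribution
    }

signals = red_signals([(200, 0.12), (200, 0.30), (500, 0.80), (200, 0.15)])
```

In practice these come from PromQL (rate(), histogram_quantile()) rather than raw records, but the arithmetic is the same.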

The USE Method for Resources

For infrastructure resources, monitor:

  • Utilization — How full is the resource? (CPU at 80%)
  • Saturation — How much queued work? (Disk I/O queue length)
  • Errors — How many errors? (Network packet drops)

SLIs, SLOs, SLAs, and Error Budgets

These concepts from Site Reliability Engineering (SRE) provide a framework for defining and measuring reliability:

| Concept                       | Definition                                          | Example                                      |
| ----------------------------- | --------------------------------------------------- | -------------------------------------------- |
| SLI (Service Level Indicator) | A quantitative measure of service behavior          | 99.2% of requests complete in under 300ms    |
| SLO (Service Level Objective) | A target value or range for an SLI                  | 99.9% of requests should succeed             |
| SLA (Service Level Agreement) | A contract with consequences for missing objectives | 99.95% uptime or customer gets credits       |
| Error Budget                  | The allowed amount of unreliability (100% - SLO)    | If SLO is 99.9%, error budget is 0.1%        |

Error Budget in Practice

SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Error budget (0.1%): 43.2 minutes of allowed downtime
Week 1: 10 min downtime → 33.2 min remaining ████████████░░░░ 77%
Week 2: 5 min downtime → 28.2 min remaining ██████████░░░░░░ 65%
Week 3: 0 min downtime → 28.2 min remaining ██████████░░░░░░ 65%
Week 4: 2 min downtime → 26.2 min remaining █████████░░░░░░░ 61%
Budget consumed: 39.4% → Safe to deploy new features

When the error budget is nearly exhausted, teams should slow down feature work and focus on reliability improvements.
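The arithmetic in the worked example above is simple enough to sketch directly (function names are illustrative):

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in a rolling window: total minutes times (1 - SLO)."""
    total_minutes = days * 24 * 60       # 43,200 for a 30-day window
    return total_minutes * (1 - slo)

def budget_remaining(slo, downtimes, days=30):
    """Subtract each incident's downtime (in minutes) from the budget."""
    return error_budget_minutes(slo, days) - sum(downtimes)

budget = error_budget_minutes(0.999)                 # ≈ 43.2 minutes
remaining = budget_remaining(0.999, [10, 5, 0, 2])   # ≈ 26.2 minutes left
```

Note how unforgiving the scale is: each extra nine divides the budget by ten, so a 99.99% SLO over 30 days allows only about 4.3 minutes of downtime.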

Alerting Strategies

Effective alerting requires careful design to avoid alert fatigue while catching real issues:

  • Alert on symptoms, not causes — Alert on “users are seeing errors” rather than “CPU is at 90%.”
  • Set meaningful thresholds — Use error budgets and SLOs to define when alerts fire.
  • Include runbooks — Every alert should link to a document explaining what to check and how to respond.
  • Tier your alerts — Critical alerts page on-call; warnings go to a channel for next-business-day review.
  • Reduce noise — Deduplicate, group related alerts, and suppress flapping alerts.

Infrastructure as Code (IaC)

Why IaC?

Managing infrastructure manually (clicking through cloud consoles, running ad-hoc commands) leads to:

  • Configuration drift — Servers diverge over time and nobody knows the exact state.
  • Knowledge silos — Only one person knows how the infrastructure was set up.
  • Unreproducible environments — “It works in staging” because staging was set up differently.
  • Slow provisioning — Setting up new environments takes days of manual work.

Infrastructure as Code solves these problems by defining infrastructure in version-controlled configuration files that are reviewed, tested, and applied automatically.

Declarative vs Imperative

| Approach    | Description                                                          | Example Tools                                            |
| ----------- | -------------------------------------------------------------------- | -------------------------------------------------------- |
| Declarative | You describe the desired end state; the tool figures out how to achieve it | Terraform, Pulumi, CloudFormation, Kubernetes manifests |
| Imperative  | You describe the exact steps to execute in order                     | Shell scripts, Ansible playbooks (partially), AWS CLI    |

Declarative is preferred for infrastructure because it is idempotent — applying the same configuration twice produces the same result.
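The mechanism behind that idempotency can be sketched in a few lines: a declarative tool diffs desired state against current state and acts only on the difference, so applying the same configuration a second time produces an empty plan. (The plan/apply functions below are a toy model, not how any real tool is implemented.)

```python
def plan(desired: dict, current: dict) -> dict:
    """Diff desired state against current state, Terraform-plan style."""
    return {
        "create": sorted(set(desired) - set(current)),
        "destroy": sorted(set(current) - set(desired)),
        "update": sorted(k for k in desired.keys() & current.keys()
                         if desired[k] != current[k]),
    }

def apply(desired: dict, current: dict) -> dict:
    """Converge: after apply, current state equals desired state."""
    plan(desired, current)   # a real tool would execute each planned action
    return dict(desired)

desired = {"vpc": {"cidr": "10.0.0.0/16"}, "db": {"size": "medium"}}
state = apply(desired, {})           # first apply creates everything
second_plan = plan(desired, state)   # second apply: nothing to do
```

An imperative script ("create a VPC, then create a database") has no such diff step, which is why running it twice creates duplicates or fails.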

Terraform

Terraform by HashiCorp is the most widely adopted IaC tool. It uses HashiCorp Configuration Language (HCL) to define infrastructure across any cloud provider.

Core Concepts

Terraform Workflow

```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌───────────┐
│  Write   │────▶│   Plan   │────▶│  Apply   │────▶│  Manage   │
│  (.tf)   │     │          │     │          │     │           │
│  Define  │     │ Preview  │     │ Create/  │     │  State    │
│ resources│     │ changes  │     │  update  │     │ tracked in│
│          │     │  safely  │     │  infra   │     │ .tfstate  │
└──────────┘     └──────────┘     └──────────┘     └───────────┘
```
| Concept  | Description                                                                             |
| -------- | --------------------------------------------------------------------------------------- |
| Provider | A plugin that interfaces with a cloud platform or service (AWS, GCP, Azure, Kubernetes, etc.) |
| Resource | A single piece of infrastructure (a server, database, DNS record, etc.)                 |
| State    | A file that tracks what Terraform has created, so it knows what to update or destroy    |
| Module   | A reusable, encapsulated group of resources (like a function for infrastructure)        |
| Plan     | A preview of what Terraform will do before making any changes                           |
| Apply    | Execute the planned changes to create, update, or destroy resources                     |

Terraform Examples

```hcl
# main.tf - AWS infrastructure with Terraform
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Store state remotely for team collaboration
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region
}

# Variables
variable "aws_region" {
  description = "AWS region to deploy resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Public subnets
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index + 1}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.environment}-public-${count.index + 1}"
  }
}

data "aws_availability_zones" "available" {
  state = "available"
}

# Security group for the application
resource "aws_security_group" "app" {
  name        = "${var.environment}-app-sg"
  description = "Security group for application servers"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-app-sg"
  }
}

# RDS PostgreSQL database
resource "aws_db_instance" "main" {
  identifier                = "${var.environment}-db"
  engine                    = "postgres"
  engine_version            = "16.1"
  instance_class            = "db.t3.medium"
  allocated_storage         = 20
  max_allocated_storage     = 100
  storage_encrypted         = true
  db_name                   = "myapp"
  username                  = "admin"
  password                  = var.db_password
  multi_az                  = true
  db_subnet_group_name      = aws_db_subnet_group.main.name
  vpc_security_group_ids    = [aws_security_group.app.id]
  backup_retention_period   = 7
  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.environment}-db-final"

  tags = {
    Name        = "${var.environment}-db"
    Environment = var.environment
  }
}

variable "db_password" {
  description = "Database master password"
  type        = string
  sensitive   = true
}

# Outputs
output "vpc_id" {
  value = aws_vpc.main.id
}

output "db_endpoint" {
  value = aws_db_instance.main.endpoint
}
```

Terraform Workflow

```bash
# Initialize the working directory (download providers, set up backend)
terraform init

# Preview what changes Terraform will make
terraform plan

# Apply the changes (create/update/destroy resources)
terraform apply

# Show current state
terraform show

# Destroy all managed resources
terraform destroy
```

Terraform Best Practices

  • Use remote state — Store state in S3, GCS, or Terraform Cloud with locking to prevent concurrent modifications.
  • Use modules — Encapsulate reusable infrastructure patterns (VPC module, database module, etc.).
  • Pin provider versions — Avoid unexpected breaking changes from provider updates.
  • Use workspaces or separate state files for different environments (dev, staging, production).
  • Never store secrets in state unencrypted — Enable encryption for your state backend.
  • Run terraform plan in CI — Show the plan in pull requests so reviewers can see infrastructure changes.

Ansible for Configuration Management

While Terraform excels at provisioning infrastructure, Ansible excels at configuring the software on that infrastructure. Ansible uses SSH to connect to servers and execute tasks defined in YAML playbooks:

```yaml
# playbook.yml - Configure a web server
---
- name: Configure web application server
  hosts: webservers
  become: yes
  vars:
    app_user: appuser
    app_dir: /opt/myapp
    node_version: "20"

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name:
          - nginx
          - curl
          - git
        state: present

    - name: Create application user
      user:
        name: "{{ app_user }}"
        shell: /bin/bash
        create_home: yes

    - name: Deploy application configuration
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/myapp
      notify: Restart Nginx

    - name: Enable site
      file:
        src: /etc/nginx/sites-available/myapp
        dest: /etc/nginx/sites-enabled/myapp
        state: link
      notify: Restart Nginx

  handlers:
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
```

Terraform + Ansible Together

A common pattern is to use both tools together:

```
       Terraform                              Ansible
┌─────────────────────┐            ┌──────────────────────────┐
│ Provision infra:    │            │ Configure servers:       │
│  - VPCs, subnets    │──creates──▶│  - Install packages      │
│  - EC2 instances    │   servers  │  - Deploy app config     │
│  - RDS databases    │            │  - Set up monitoring     │
│  - Load balancers   │            │  - Configure firewalls   │
└─────────────────────┘            └──────────────────────────┘
```

Putting It All Together

A mature DevOps observability and IaC stack typically looks like this:

```
┌────────────────────────────────────────────────────────────┐
│                        Applications                        │
│            Instrumented with OpenTelemetry SDKs            │
│               Emitting: metrics, logs, traces              │
└──────────────┬───────────────┬───────────────┬─────────────┘
               │               │               │
          ┌────▼─────┐    ┌────▼────┐    ┌─────▼─────┐
          │Prometheus│    │  Loki   │    │  Tempo /  │
          │(metrics) │    │ (logs)  │    │  Jaeger   │
          │          │    │         │    │ (traces)  │
          └────┬─────┘    └────┬────┘    └─────┬─────┘
               │               │               │
               └───────────────┼───────────────┘
                        ┌──────▼──────┐
                        │   Grafana   │
                        │ (dashboards │
                        │  & alerts)  │
                        └──────┬──────┘
                        ┌──────▼──────┐
                        │Alertmanager │──▶ Slack / PagerDuty / Email
                        └─────────────┘
```

Infrastructure managed by:
Terraform (provisioning) + Ansible (configuration)
All version-controlled, reviewed, and applied via CI/CD

Next Steps