
Incident Response

Incident Severity Levels

Not all incidents are created equal. A well-defined severity classification ensures the right level of response for each situation.

Severity Levels:

SEV-1 (Critical) ███████████████████████████████████
  Major service outage affecting all or most users.
  Revenue impact. Data loss risk.
  Response: All hands on deck. Immediate.
  Example: Payment system completely down.

SEV-2 (High) █████████████████████████████
  Significant degradation affecting many users.
  Partial functionality loss.
  Response: On-call team escalation. Within 15 min.
  Example: Checkout latency 10x normal, 30% failures.

SEV-3 (Medium) ███████████████████████
  Moderate impact on a subset of users.
  Workarounds available.
  Response: On-call acknowledges. Within 1 hour.
  Example: Search results degraded for some regions.

SEV-4 (Low) █████████████████
  Minor issue. Limited user impact.
  Response: Next business day.
  Example: Admin dashboard loading slowly.
Severity | User Impact             | Response Time     | Who's Involved          | Communication
SEV-1    | All/most users affected | Immediate         | IC + all relevant teams | Status page, exec updates
SEV-2    | Many users degraded     | 15 minutes        | On-call + escalation    | Status page update
SEV-3    | Subset of users         | 1 hour            | On-call team            | Internal notification
SEV-4    | Minimal                 | Next business day | Individual engineer     | Ticket only
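Severity definitions are most useful when tooling can act on them. A minimal sketch of the table as a lookup; the field names and the `response_deadline` helper are illustrative, not from any particular incident tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Severity:
    level: str
    user_impact: str
    response_time: timedelta   # how quickly someone must respond
    status_page: bool          # whether to post a public update

# Mirrors the severity table above; "Immediate" and "Next business
# day" are approximated here as zero and one day.
SEVERITIES = {
    1: Severity("SEV-1", "All/most users affected", timedelta(0), True),
    2: Severity("SEV-2", "Many users degraded", timedelta(minutes=15), True),
    3: Severity("SEV-3", "Subset of users", timedelta(hours=1), False),
    4: Severity("SEV-4", "Minimal", timedelta(days=1), False),
}

def response_deadline(sev: int, detected_at: datetime) -> datetime:
    """Latest acceptable first-response time for an incident."""
    return detected_at + SEVERITIES[sev].response_time
```

Encoding the table this way lets paging tools compute deadlines and status-page requirements instead of relying on responders to remember them.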

On-Call Practices

On-call is a critical part of running reliable systems. Done well, it empowers engineers to keep systems healthy. Done poorly, it leads to burnout, alert fatigue, and high turnover.

On-Call Structure

On-Call Rotation:
Week 1        Week 2        Week 3        Week 4
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Alice     │ │ Bob       │ │ Carol     │ │ Dave      │
│ (Primary) │ │ (Primary) │ │ (Primary) │ │ (Primary) │
│ Bob       │ │ Carol     │ │ Dave      │ │ Alice     │
│(Secondary)│ │(Secondary)│ │(Secondary)│ │(Secondary)│
└───────────┘ └───────────┘ └───────────┘ └───────────┘

Primary:    First to be paged. Expected to respond.
Secondary:  Backup if the primary doesn't respond within 10 minutes.
Escalation: If both fail → engineering manager.
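A rotation like the one above is simple to compute rather than maintain by hand. A sketch using the names from the diagram, where each week's secondary is the next engineer in the list:

```python
# Engineer list from the rotation diagram above.
ENGINEERS = ["Alice", "Bob", "Carol", "Dave"]

def rotation(engineers: list, week: int) -> tuple:
    """Return (primary, secondary) for a 1-based week number.

    The secondary is the next engineer in the list, wrapping around,
    which reproduces the pairing shown in the diagram.
    """
    i = (week - 1) % len(engineers)
    return engineers[i], engineers[(i + 1) % len(engineers)]
```

For example, `rotation(ENGINEERS, 4)` pairs Dave with Alice, wrapping the list exactly as Week 4 does above.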

On-Call Best Practices

Practice               | Description
Compensate fairly      | Pay for on-call time; provide time off after incidents
Limit on-call duration | No more than one week at a time, with at least one week off
Maintain runbooks      | Document every alert with clear response steps
Conduct handoffs       | Brief the incoming on-call engineer about recent issues
Track on-call load     | Monitor pages per shift and incident severity distribution
Improve the system     | Every painful on-call shift should result in improvements
Shadow shifts          | New team members shadow experienced on-call engineers before going solo
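For "track on-call load", even a small script over a page log makes trends visible. A sketch with a hypothetical page log of (on-call engineer, severity) pairs; the data and function names are illustrative:

```python
from collections import Counter

# Hypothetical page log: (on-call engineer, severity) per page.
pages = [
    ("Alice", "SEV-3"), ("Alice", "SEV-2"), ("Bob", "SEV-3"),
    ("Bob", "SEV-3"), ("Bob", "SEV-1"), ("Carol", "SEV-4"),
]

def pages_per_shift(pages: list) -> Counter:
    """Pages by on-call engineer -- a rough proxy for on-call load."""
    return Counter(owner for owner, _ in pages)

def severity_distribution(pages: list) -> Counter:
    """Pages by severity -- a skew toward low severities suggests alert noise."""
    return Counter(sev for _, sev in pages)
```

Reviewing these counts each week surfaces both overloaded shifts and noisy low-severity alerts that should be tuned or deleted.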

On-Call Handoff Template

On-Call Handoff: March 11-17, 2024
ONGOING ISSUES:
- Database connection pool intermittently full (SEV-3)
  Ticket: INFRA-4521
  Status: Fix deployed, monitoring
RECENT CHANGES:
- March 10: Deployed payment-service v2.4.0
- March 9: Increased Redis cache from 4GB to 8GB
- March 8: Updated SSL certificates
THINGS TO WATCH:
- New Kafka cluster migration starting Wednesday
- Marketing email blast Thursday (expect 2x traffic)
ALERT UPDATES:
- Silenced: disk-space-warning on staging (planned)
- New: Added latency alert for /api/checkout (p99 > 2s)

The Incident Commander Role

The Incident Commander (IC) is the person who coordinates the response during a significant incident. They are responsible for managing the process, not necessarily fixing the problem.

Incident Response Structure:
┌───────────────────────────────────────────────────────┐
│                Incident Commander (IC)                │
│ - Coordinates overall response                        │
│ - Makes decisions on next steps                       │
│ - Manages communication                               │
│ - Delegates tasks                                     │
│ - Decides when to escalate/de-escalate                │
└─────────────────┬─────────────────────────────────────┘
     ┌────────────┼──────────────┐
     │            │              │
     ▼            ▼              ▼
┌─────────┐ ┌───────────┐ ┌──────────────┐
│Technical│ │   Comms   │ │   Subject    │
│  Lead   │ │   Lead    │ │Matter Expert │
│         │ │           │ │              │
│Drives   │ │Updates    │ │Deep expertise│
│debugging│ │status page│ │in specific   │
│and fixes│ │and stake- │ │component     │
│         │ │holders    │ │              │
└─────────┘ └───────────┘ └──────────────┘

IC Responsibilities

Phase        | IC Actions
Declaration  | Declare the incident, assign severity, open an incident channel
Assessment   | Understand scope, identify affected systems, set priorities
Delegation   | Assign roles: technical lead, comms lead, SMEs
Coordination | Give regular status updates, manage parallel workstreams
Escalation   | Bring in additional teams or leadership as needed
Resolution   | Confirm the fix, verify recovery, update stakeholders
Closure      | Declare the incident resolved, schedule the postmortem

Communication During Incidents

Clear, structured communication prevents confusion and keeps stakeholders informed during outages.

Incident Communication Template

INCIDENT UPDATE - SEV-1 - Payment Processing Down
Time: 2024-03-15 14:45 UTC (13 min since detection)
STATUS: Investigating
IMPACT:
- All credit card payments failing since 14:32 UTC
- Approximately 1,200 users affected
- Estimated revenue impact: $15K/hour
WHAT WE KNOW:
- Payment gateway returning 503 errors
- Issue started after deployment of billing-svc v3.2.1
- Rollback initiated at 14:40 UTC
CURRENT ACTIONS:
- [Alice] Rolling back billing-svc to v3.2.0
- [Bob] Investigating payment gateway health
- [Carol] Monitoring error rates during rollback
NEXT UPDATE: 15:00 UTC (or sooner if status changes)
INCIDENT CHANNEL: #inc-20240315-payments
IC: Dave (dave@example.com)
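Updates in this format are easier to keep consistent when they are generated. A sketch of a hypothetical `IncidentUpdate` helper that renders the same fields; the class and its layout are illustrative, not from any specific incident tool:

```python
from dataclasses import dataclass

@dataclass
class IncidentUpdate:
    severity: str
    title: str
    status: str
    impact: list       # bullet points describing user impact
    actions: list      # (owner, task) pairs
    next_update: str
    channel: str
    ic: str

    def render(self) -> str:
        """Render the update in the template format shown above."""
        lines = [
            f"INCIDENT UPDATE - {self.severity} - {self.title}",
            f"STATUS: {self.status}",
            "IMPACT:",
        ]
        lines += [f"- {item}" for item in self.impact]
        lines.append("CURRENT ACTIONS:")
        lines += [f"- [{owner}] {task}" for owner, task in self.actions]
        lines += [
            f"NEXT UPDATE: {self.next_update}",
            f"INCIDENT CHANNEL: {self.channel}",
            f"IC: {self.ic}",
        ]
        return "\n".join(lines)
```

Generating updates from structured fields ensures every update names an owner per action and always states a next-update time, even under pressure.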

Status Page Updates

Status Page Communication Principles:
DO:                        DO NOT:
┌──────────────────────┐   ┌──────────────────────┐
│ Be honest and clear  │   │ Use technical jargon │
│ Provide timeline     │   │ Blame individuals    │
│ State user impact    │   │ Speculate on cause   │
│ Give next update time│   │ Ignore the issue     │
│ Apologize sincerely  │   │ Over-promise fixes   │
└──────────────────────┘   └──────────────────────┘

Good: "Some users may experience failed payments.
       We identified the issue and a fix is being
       deployed. Next update in 15 minutes."

Bad:  "The billing microservice deployment caused
       a null pointer exception in the payment
       gateway adapter module."

Blameless Postmortems

A postmortem (also called a retrospective or incident review) is a structured analysis conducted after an incident to understand what happened, why it happened, and how to prevent it from happening again.

The Blameless Culture

Blame Culture:              Blameless Culture:
┌─────────────────────────┐ ┌─────────────────────────┐
│ "Who caused this?"      │ │ "What caused this?"     │
│ "Bob pushed bad code"   │ │ "The system allowed a   │
│ "Fire Bob"              │ │  bad config to deploy"  │
│                         │ │                         │
│ Result:                 │ │ Result:                 │
│ - People hide mistakes  │ │ - People report issues  │
│ - Root causes hidden    │ │ - Root causes found     │
│ - Same issues recur     │ │ - Systemic fixes made   │
│ - Fear-based culture    │ │ - Learning culture      │
└─────────────────────────┘ └─────────────────────────┘

Postmortem Template

POSTMORTEM: Payment Processing Outage
Date: 2024-03-15
Duration: 43 minutes (14:32 - 15:15 UTC)
Severity: SEV-1
Author: Dave (IC), reviewed by team
============================================
SUMMARY:
A deployment of billing-svc v3.2.1 introduced a
regression that caused all credit card payments
to fail for 43 minutes, affecting ~3,600 users.
IMPACT:
- 3,600 users could not complete purchases
- Estimated revenue loss: $11,200
- 247 support tickets filed
TIMELINE:
14:30 - billing-svc v3.2.1 deployed to production
14:32 - First payment failures detected by monitoring
14:35 - On-call paged (Alice)
14:37 - Alice acknowledges, opens incident channel
14:38 - SEV-1 declared, Dave takes IC role
14:40 - Root cause identified: new payment adapter
breaks when currency code is empty
14:42 - Decision to rollback
14:50 - Rollback deployed
14:55 - Error rate declining
15:15 - Confirmed resolved, incident closed
ROOT CAUSE:
The new payment adapter in v3.2.1 did not handle
the case where the currency code field was empty.
~8% of legacy orders have empty currency codes.
The unit tests only covered orders with currency
codes set.
CONTRIBUTING FACTORS:
1. No integration test for legacy order formats
2. Canary deployment was only 2% of traffic
(not enough to catch 8% edge case)
3. Payment error alert had a 5-minute delay
WHAT WENT WELL:
- Alert fired within 3 minutes
- Rollback process was smooth (8 minutes)
- Communication was clear and timely
- IC assignment was immediate
WHAT WENT POORLY:
- Empty currency code edge case was not caught
- Canary percentage too low for this service
- Alert delay meant 3 minutes of silent failure
ACTION ITEMS:
[ ] Add integration tests for legacy order formats
    Owner: Bob, Due: March 22
[ ] Increase canary percentage for payment services to 10%
    Owner: Alice, Due: March 25
[ ] Reduce payment error alert delay to 1 minute
    Owner: Carol, Due: March 20
[ ] Add pre-deploy data validation for payment schema changes
    Owner: Dave, Due: April 1
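Timelines like the one above also yield the basic incident metrics reviewers usually ask for. A quick check using the timestamps from this postmortem:

```python
from datetime import datetime, timedelta

FMT = "%H:%M"
deployed = datetime.strptime("14:30", FMT)   # billing-svc v3.2.1 goes out
detected = datetime.strptime("14:32", FMT)   # monitoring detects failures
resolved = datetime.strptime("15:15", FMT)   # incident confirmed resolved

time_to_detect = detected - deployed
time_to_resolve = resolved - detected   # matches the 43-minute duration above
```

Tracking these intervals across postmortems shows whether detection and recovery are actually improving, which is the point of the action items.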

Runbooks

A runbook is a documented procedure for responding to a specific alert or incident type. Good runbooks turn any on-call engineer into an effective responder, even for systems they do not own.

Runbook Structure

Runbook: High Payment Error Rate
═══════════════════════════════════
ALERT: payment_error_rate > 1% for 5 minutes
SEVERITY: SEV-2 (auto-escalates to SEV-1 if > 5%)
OWNER: Payments Team (#team-payments)
──────────────────────────────────────────────
STEP 1: Assess the situation
──────────────────────────────────────────────
□ Check the payment dashboard:
https://grafana.example/d/payments
□ Determine scope:
- Is it all payment types or specific ones?
- Is it all regions or specific regions?
- When did it start?
──────────────────────────────────────────────
STEP 2: Check for recent changes
──────────────────────────────────────────────
□ Check recent deployments:
kubectl rollout history deployment/payment-svc
□ Check recent config changes:
git log --oneline -10 payment-config/
□ If a recent deployment caused the issue:
→ Run: kubectl rollout undo deployment/payment-svc
→ Notify #team-payments
──────────────────────────────────────────────
STEP 3: Check external dependencies
──────────────────────────────────────────────
□ Check payment gateway status:
https://status.stripe.com
□ Check database connectivity:
psql -h payments-db -c "SELECT 1"
□ Check Redis cache:
redis-cli -h payments-cache ping
──────────────────────────────────────────────
STEP 4: Escalation
──────────────────────────────────────────────
If none of the above resolves the issue:
□ Page the Payments Team lead:
pd trigger --service payments --urgency high
□ Open incident channel: /incident create
□ Update status page if user-facing
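The dependency checks in Step 3 are a natural candidate for a small script, so the on-call engineer runs one command instead of three. A sketch reusing the runbook's own commands; the hosts are the ones listed above, and the `runner` parameter is an illustrative addition that makes the function testable:

```python
import subprocess

# The same dependency checks as Step 3 of the runbook above.
CHECKS = {
    "database": ["psql", "-h", "payments-db", "-c", "SELECT 1"],
    "redis": ["redis-cli", "-h", "payments-cache", "ping"],
}

def run_checks(checks: dict, runner=subprocess.run) -> dict:
    """Run each check command and report which dependencies look healthy."""
    results = {}
    for name, cmd in checks.items():
        try:
            proc = runner(cmd, capture_output=True, timeout=10)
            results[name] = proc.returncode == 0
        except Exception:
            # Missing binary, timeout, unreachable host: all count as unhealthy.
            results[name] = False
    return results

# Usage during an incident: run_checks(CHECKS)
```

Bundling the checks keeps the runbook's commands and the script from drifting apart: the script is the runbook step, executed.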

Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into a system to discover weaknesses before they cause real outages. The goal is to build confidence that the system can withstand turbulent conditions.

Chaos Engineering Principles

The Chaos Engineering Process:

1. Define steady state
   "Normal" system behavior
   (request rate, error rate, latency)

2. Form hypothesis
   "If we kill one database replica,
   the system will fail over in < 30s"

3. Introduce failure
   Kill the database replica
   in a controlled manner

4. Observe behavior
   Monitor metrics, logs, traces
   during the experiment

5. Analyze results
   Did the system behave as expected?
   Were there unexpected side effects?

6. Fix weaknesses
   Address any issues discovered.
   Improve resilience.

Common Chaos Experiments

Experiment                | What It Tests                                | Tools
Kill a service instance   | Auto-restart, load balancing                 | Chaos Monkey, kill command
Network latency injection | Timeout handling, circuit breakers           | tc (traffic control), Toxiproxy
Network partition         | Split-brain handling, consensus              | iptables, Chaos Mesh
Disk fill                 | Disk space monitoring, graceful degradation  | dd, stress-ng
CPU stress                | Autoscaling, performance under load          | stress-ng, Chaos Mesh
Kill availability zone    | Multi-AZ redundancy                          | AWS FIS, Chaos Mesh
DNS failure               | Caching, fallback resolution                 | Manipulate /etc/resolv.conf
Clock skew                | Time-dependent logic, certificate validation | ntpdate, Chaos Mesh
# Simple chaos experiment framework
import subprocess
import time
from dataclasses import dataclass
from typing import Callable

import requests

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    steady_state_check: Callable[[], bool]
    inject_failure: Callable[[], None]
    rollback: Callable[[], None]
    duration_seconds: int = 300

def run_experiment(experiment: ChaosExperiment) -> bool:
    print(f"\n{'=' * 60}")
    print(f"CHAOS EXPERIMENT: {experiment.name}")
    print(f"HYPOTHESIS: {experiment.hypothesis}")
    print(f"{'=' * 60}\n")

    # Step 1: Verify steady state before touching anything
    print("[1/5] Verifying steady state...")
    if not experiment.steady_state_check():
        print("ABORT: System not in steady state")
        return False

    # Step 2: Inject the failure
    print("[2/5] Injecting failure...")
    try:
        experiment.inject_failure()
    except Exception as e:
        print(f"ABORT: Failed to inject: {e}")
        return False

    # Step 3: Observe system health during the experiment
    print(f"[3/5] Observing for {experiment.duration_seconds}s...")
    start = time.time()
    observations = []
    while time.time() - start < experiment.duration_seconds:
        observations.append({
            "time": time.time() - start,
            "healthy": experiment.steady_state_check(),
        })
        time.sleep(5)

    # Step 4: Roll back the injected failure
    print("[4/5] Rolling back failure injection...")
    experiment.rollback()

    # Step 5: Analyze the observations
    print("[5/5] Analyzing results...")
    healthy_pct = (
        sum(1 for o in observations if o["healthy"]) / len(observations)
    ) * 100
    print("\nRESULTS:")
    print(f"  Healthy observations: {healthy_pct:.1f}%")
    print(f"  Hypothesis confirmed: {'YES' if healthy_pct > 95 else 'NO'}")
    return healthy_pct > 95

# Example: kill a pod and verify the API stays healthy
def check_api_health() -> bool:
    try:
        r = requests.get("http://api.example.com/health", timeout=5)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

experiment = ChaosExperiment(
    name="Kill Payment Service Pod",
    hypothesis="System recovers within 30 seconds "
               "when one payment pod is killed",
    steady_state_check=check_api_health,
    inject_failure=lambda: subprocess.run([
        "kubectl", "delete", "pod", "payment-svc-pod-1",
        "--grace-period=0", "--force",
    ]),
    rollback=lambda: None,  # Kubernetes restarts the pod automatically
    duration_seconds=120,
)

run_experiment(experiment)

Chaos Engineering Maturity

Level         | Practices
1. Ad hoc     | Manual failure injection in staging
2. Structured | Documented experiments with hypotheses, run in staging
3. Automated  | Automated experiments in CI/CD, some in production
4. Continuous | Continuous chaos in production with automated rollback

Summary

Concept               | Key Takeaway
Severity Levels       | Define by user impact, not technical cause
On-Call               | Compensate fairly, maintain runbooks, limit duration
Incident Commander    | Coordinates response, manages communication, delegates tasks
Communication         | Be honest, state impact, provide the next update time
Blameless Postmortems | Focus on systems and processes, not individuals
Runbooks              | Documented procedures turn any engineer into an effective responder
Chaos Engineering     | Proactively find weaknesses before they cause real outages