Profiling & Benchmarking
You cannot optimize what you do not measure. Profiling identifies where your application spends time and memory; benchmarking measures how fast it performs under controlled conditions; load testing reveals how it behaves under real-world traffic. Together, these tools transform performance optimization from guesswork into science.
Profiling Fundamentals
Profiling is the process of measuring where a program spends its resources — CPU cycles, memory allocations, I/O waits, or lock contention.
Types of Profiling
| Type | What It Measures | Tools |
|---|---|---|
| CPU profiling | Time spent executing functions | cProfile, perf, async-profiler, Chrome DevTools |
| Memory profiling | Heap usage, allocations, leaks | tracemalloc, heapdump, VisualVM, Valgrind |
| I/O profiling | Disk reads/writes, network calls | strace, dtrace, blktrace |
| Lock/contention profiling | Time waiting on locks | async-profiler, Java Flight Recorder |
| Wall-clock profiling | Total elapsed time including waits | Most profilers in “wall” mode |
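The wall-clock vs CPU distinction in the last row is easy to demonstrate with Python's standard-library clocks: `time.process_time` advances only while the process is actually executing, so a sleep (standing in for an I/O wait) shows up only in the wall-clock figure.

```python
import time

def do_work():
    # CPU-bound portion: advances both clocks
    total = sum(i * i for i in range(500_000))
    # Simulated I/O wait: advances wall-clock time only
    time.sleep(0.2)
    return total

wall_start = time.perf_counter()
cpu_start = time.process_time()
do_work()
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall-clock: {wall_elapsed:.3f}s")  # includes the 0.2s sleep
print(f"cpu time:   {cpu_elapsed:.3f}s")   # excludes it
```

A CPU profiler in "wall" mode would attribute the sleep to `do_work`; in CPU mode it would not.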
Sampling vs Instrumentation
Sampling Profiler: Takes periodic snapshots of the call stack (e.g., every 1ms). Low overhead (~2-5%), safe for production.
```
Time ──────────────────────────────────►
      ↑    ↑    ↑    ↑    ↑    ↑    ↑
      Sample points (stack snapshots)
```
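The sampling approach can be sketched in a few lines of Python: a background thread periodically snapshots the main thread's call stack via `sys._current_frames` (a debugging-only API) and tallies the frames it sees. This is a toy illustration, not a production profiler:

```python
import collections
import sys
import threading
import time

def sample_stacks(thread_id, interval, counts, stop):
    """Periodically snapshot one thread's call stack and tally frames."""
    while not stop.is_set():
        frame = sys._current_frames().get(thread_id)  # debugging-only API
        while frame is not None:
            counts[frame.f_code.co_name] += 1  # leaf first, then callers
            frame = frame.f_back
        time.sleep(interval)

def busy():
    # CPU-bound work for the sampler to observe
    total = 0
    for i in range(5_000_000):
        total += i * i
    return total

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.main_thread().ident, 0.001, counts, stop),
)
sampler.start()
busy()
stop.set()
sampler.join()

# Functions that dominate CPU time accumulate the most samples
print(counts.most_common(5))
```

Real sampling profilers (py-spy, perf, async-profiler) read stacks from outside the process, which is what keeps their overhead low.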
Instrumentation Profiler: Wraps every function entry/exit with timing code. High accuracy, high overhead (~10-50%), development only.
```
func_a() {
    START_TIMER("func_a")
    func_b() {
        START_TIMER("func_b")
        STOP_TIMER("func_b")
    }
    STOP_TIMER("func_a")
}
```

CPU Profiling
CPU profiling reveals which functions consume the most processing time.
```python
# cProfile: built-in CPU profiler
import cProfile
import pstats

def expensive_computation():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total

def process_data():
    results = []
    for _ in range(10):
        results.append(expensive_computation())
    return results

# Profile the function
profiler = cProfile.Profile()
profiler.enable()
process_data()
profiler.disable()

# Print sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions

# Output:
#   ncalls  tottime  percall  cumtime  percall  filename
#       10    8.234    0.823    8.234    0.823  script.py:4(expensive_computation)
#        1    0.001    0.001    8.235    8.235  script.py:10(process_data)
```
```python
# py-spy: sampling profiler for production
# Install: pip install py-spy
# Record a flame graph (attach to running process):
#   py-spy record -o profile.svg --pid 12345
# Top-like view:
#   py-spy top --pid 12345
```

```javascript
// Node.js built-in profiler (V8)
// Run: node --prof app.js
// Process: node --prof-process isolate-*.log > profile.txt
```
```javascript
// Using console.time for quick measurements
console.time('processing');
processData();
console.timeEnd('processing');
// Output: processing: 823.456ms

// Using perf_hooks for precise measurement
const { performance, PerformanceObserver } = require('perf_hooks');

const obs = new PerformanceObserver((items) => {
  items.getEntries().forEach((entry) => {
    console.log(`${entry.name}: ${entry.duration.toFixed(2)}ms`);
  });
});
obs.observe({ entryTypes: ['measure'] });

performance.mark('start');
processData();
performance.mark('end');
performance.measure('processData', 'start', 'end');
```
```javascript
// Chrome DevTools profiling (browser)
// 1. Open DevTools -> Performance tab
// 2. Click Record
// 3. Perform the action
// 4. Click Stop
// 5. Analyze the flame chart
```
```javascript
// clinic.js: comprehensive Node.js profiling
// Install: npm install -g clinic
// Doctor (general):  clinic doctor -- node app.js
// Flame (CPU):       clinic flame -- node app.js
// Bubbleprof (I/O):  clinic bubbleprof -- node app.js
```

```java
// Java Flight Recorder (built into JDK 11+)
// Start recording:
//   java -XX:StartFlightRecording=duration=60s,filename=profile.jfr MyApp

// Attach to running process:
//   jcmd <pid> JFR.start duration=60s filename=profile.jfr

// async-profiler: low-overhead sampling profiler
// Attach to running JVM:
//   ./asprof -d 30 -f profile.html <pid>
// Generates an interactive flame graph
```
```java
// JMH (Java Microbenchmark Harness)
import org.openjdk.jmh.annotations.*;
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class SortBenchmark {

    private int[] data;

    @Setup
    public void setup() {
        data = new Random().ints(10_000).toArray();
    }

    // Return the result so JMH consumes it via its blackhole,
    // preventing dead-code elimination of the sort.
    @Benchmark
    public int[] arraySort() {
        int[] copy = data.clone();
        Arrays.sort(copy);
        return copy;
    }

    @Benchmark
    public int[] parallelSort() {
        int[] copy = data.clone();
        Arrays.parallelSort(copy);
        return copy;
    }
}
// Run: java -jar benchmarks.jar
```

Memory Profiling
Memory profiling identifies memory leaks, excessive allocations, and opportunities to reduce memory footprint.
Common Memory Issues
| Issue | Symptom | Cause |
|---|---|---|
| Memory leak | Memory usage grows continuously | References held to unused objects |
| Excessive allocation | High GC pressure, pauses | Creating too many short-lived objects |
| Large object retention | High baseline memory | Caches or collections that grow unbounded |
| Fragmentation | High virtual memory, low utilization | Many small allocations and deallocations |
```python
# tracemalloc: built-in memory profiler
import tracemalloc

tracemalloc.start()

# -- Your code here --
data = [list(range(1000)) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 5 memory allocations:")
for stat in top_stats[:5]:
    print(f"  {stat}")

# Output:
# script.py:6: size=7.7 MiB, count=1001, average=7.9 KiB
```
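To distinguish a leak from a one-off allocation, take two snapshots some time apart and diff them with `Snapshot.compare_to`; allocations whose size keeps growing between snapshots are the suspects:

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# Simulated leak: a list that keeps growing between snapshots
leaky = []
for _ in range(100):
    leaky.append(bytearray(10_000))

snapshot2 = tracemalloc.take_snapshot()

# Positive size_diff = memory allocated since the first snapshot;
# results are sorted with the biggest differences first
top_diffs = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_diffs[:3]:
    print(stat)
```

In a real service, take the snapshots minutes apart under steady traffic so transient allocations cancel out.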
```python
# objgraph: visualize object references (find leaks)
# Install: pip install objgraph
import objgraph

# Show most common object types
objgraph.show_most_common_types(limit=10)

# Find what is holding a reference to leaked objects
objgraph.show_backrefs(
    objgraph.by_type('MyClass')[:3],
    filename='refs.png'
)
```
```python
# memory_profiler: line-by-line memory usage
# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def my_function():
    a = [1] * 1_000_000   # ~8 MB
    b = [2] * 2_000_000   # ~16 MB
    del b                 # Free 16 MB
    return a
```

```javascript
// Node.js heap snapshot
// Generate: node --heapsnapshot-signal=SIGUSR2 app.js
// Then: kill -USR2 <pid>
// Load the .heapsnapshot file in Chrome DevTools -> Memory tab
```
```javascript
// v8.getHeapStatistics() for quick overview
const v8 = require('v8');
const stats = v8.getHeapStatistics();
console.log({
  totalHeapSize: `${(stats.total_heap_size / 1024 / 1024).toFixed(1)} MB`,
  usedHeapSize: `${(stats.used_heap_size / 1024 / 1024).toFixed(1)} MB`,
  heapSizeLimit: `${(stats.heap_size_limit / 1024 / 1024).toFixed(1)} MB`,
});

// Detecting memory leaks with process.memoryUsage()
function monitorMemory(label) {
  const usage = process.memoryUsage();
  console.log(
    `[${label}] RSS: ${(usage.rss / 1024 / 1024).toFixed(1)} MB, ` +
    `Heap: ${(usage.heapUsed / 1024 / 1024).toFixed(1)} MB`
  );
}

setInterval(() => monitorMemory('periodic'), 5000);

// Common leak pattern: unbounded event listeners
const EventEmitter = require('events');
const emitter = new EventEmitter();
// Warning emitted if >10 listeners on one event
emitter.setMaxListeners(20); // Increase only if intentional
```

Flame Graphs
A flame graph is a visualization of profiled stack traces. It shows which functions consume the most CPU time and how they were called.
Reading a flame graph:
```
┌─────────────────────────────────────────────────────────┐
│ main()                                                  │  Total time
├───────────────────────────────────┬─────────────────────┤
│ processData()                     │ handleRequest()     │
├─────────────────┬─────────────────┼─────────────────────┤
│ parseJSON()     │ transformData() │ queryDB()           │
├─────────────────┼─────────┬───────┼─────────────────────┤
│                 │ sort()  │ map() │                     │
└─────────────────┴─────────┴───────┴─────────────────────┘
 ◄─────────── x-axis = proportion of total time ──────────►
 ▲ y-axis = call stack depth (callers below, callees above)

 Width = time spent in that function (wider = more time)
 Color = arbitrary (often random or based on package)
```

How to Read Flame Graphs
- Look for wide bars — These functions consume the most time.
- Look at the top — Functions at the top of the stack are where CPU time is actually spent (leaf functions).
- Look for plateaus — A wide plateau at the top means one function dominates CPU time.
- Ignore narrow spikes — They represent functions that execute briefly.
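Most generators, including Brendan Gregg's flamegraph.pl, consume a simple "folded stacks" text format: one line per unique stack, frames joined root-to-leaf with semicolons, followed by the sample count. Collapsing raw samples into that format takes only a few lines (the sample stacks below are made up for illustration):

```python
from collections import Counter

# Raw stack samples, root frame first (hypothetical data)
samples = [
    ("main", "processData", "parseJSON"),
    ("main", "processData", "parseJSON"),
    ("main", "processData", "transformData", "sort"),
    ("main", "handleRequest", "queryDB"),
]

# Fold identical stacks into "frame;frame;frame count" lines
folded = Counter(";".join(stack) for stack in samples)
for stack, count in folded.most_common():
    print(f"{stack} {count}")
# "main;processData;parseJSON 2" renders as the widest frame at its depth
```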
Generating Flame Graphs
| Language | Tool | Command |
|---|---|---|
| Python | py-spy | py-spy record -o flame.svg --pid PID |
| Node.js | clinic flame | clinic flame -- node app.js |
| Java | async-profiler | ./asprof -d 30 -f flame.html PID |
| Go | pprof | go tool pprof -http=:8080 profile.pb.gz |
| System-wide | perf + FlameGraph | perf record -g -- ./app && perf script \| stackcollapse-perf.pl \| flamegraph.pl > flame.svg |
Benchmark Methodology
Benchmarking requires discipline. Poor methodology leads to misleading results.
Rules of Benchmarking
- Warm up first — JIT compilers, caches, and OS page caches need time to reach steady state. Discard the first several iterations.
- Measure multiple iterations — A single run is meaningless. Run 10-30 iterations (more for microbenchmarks) and report the median plus percentiles.
- Isolate the variable — Change only one thing at a time. Control for hardware, OS state, and background processes.
- Use stable environments — Disable CPU frequency scaling, close other applications, and pin to specific CPU cores if possible.
- Report the right metric — Mean can be skewed by outliers. Report the median (p50), p95, and p99.
- Avoid dead code elimination — Compilers may optimize away code whose result is unused. Consume the result (print it, write it to a volatile variable).
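The rules above can be folded into a small harness sketch: warm-up runs are discarded, many timed iterations are kept, each result is consumed so it cannot be optimized away, and percentiles are reported rather than the mean. (Names and iteration counts here are illustrative; CPython does not JIT, but warm-up still primes caches.)

```python
import statistics
import time

def benchmark(fn, *, warmup=5, iterations=30):
    """Time fn over many iterations and report percentiles in ms."""
    sink = 0  # consume results to defeat dead-code elimination

    for _ in range(warmup):          # rule 1: warm up, then discard
        sink ^= hash(fn())

    timings = []
    for _ in range(iterations):      # rule 2: many iterations
        start = time.perf_counter()
        sink ^= hash(fn())           # rule 6: consume the result
        timings.append((time.perf_counter() - start) * 1000)

    timings.sort()
    return {
        "p50_ms": statistics.median(timings),
        "p95_ms": timings[int(0.95 * (len(timings) - 1))],
        "min_ms": timings[0],
        "max_ms": timings[-1],
        "_sink": sink,  # keep the consumed value alive
    }

result = benchmark(lambda: sum(i * i for i in range(50_000)))
print({k: round(v, 3) if isinstance(v, float) else v for k, v in result.items()})
```

Frameworks like JMH, pytest-benchmark, and Criterion implement the same loop with far more statistical rigor; prefer them over hand-rolled harnesses for anything that matters.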
Statistical Measures
Given latencies (ms): 2, 3, 3, 4, 5, 5, 5, 8, 12, 45
```
Mean (average):  9.2 ms  -- skewed by the 45 ms outlier
Median (p50):    5.0 ms  -- middle value, more representative
p95:            28.5 ms  -- 95% of requests are faster than this
p99:            45.0 ms  -- the tail; important for user experience
Std deviation:  12.8 ms  -- high variance indicates inconsistency
```
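The mean and median above can be reproduced with the standard library. Tail percentiles depend on the interpolation method, which is why this nearest-rank sketch reports p95 = 45 ms for the same data while interpolating tools report values in between:

```python
import math
import statistics

latencies = [2, 3, 3, 4, 5, 5, 5, 8, 12, 45]

def percentile(sorted_data, p):
    """Nearest-rank percentile (other tools may interpolate instead)."""
    rank = math.ceil(p / 100 * len(sorted_data))
    return sorted_data[rank - 1]

data = sorted(latencies)
print(f"mean:   {statistics.mean(data):.1f} ms")    # 9.2 -- dragged up by the outlier
print(f"median: {statistics.median(data):.1f} ms")  # 5.0
print(f"p95:    {percentile(data, 95)} ms")         # 45 (nearest-rank)
print(f"p99:    {percentile(data, 99)} ms")         # 45
```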
Always report percentiles, not just averages!

Load Testing Tools
Load testing simulates real-world traffic to measure how your system performs under stress.
Tool Comparison
| Tool | Language | Protocol Support | Scripting | Best For |
|---|---|---|---|---|
| k6 | Go (JS scripts) | HTTP, WebSocket, gRPC | JavaScript | Developer-friendly load testing |
| wrk | C | HTTP only | Lua | Quick HTTP benchmarks |
| JMeter | Java | HTTP, JDBC, JMS, LDAP, FTP | GUI + XML | Enterprise-grade testing |
| Locust | Python | HTTP (extensible) | Python | Python-centric teams |
| Gatling | Scala | HTTP, WebSocket | Scala/Java DSL | High-performance simulations |
| Artillery | Node.js | HTTP, WebSocket, Socket.IO | YAML + JS | Quick cloud-native tests |
| vegeta | Go | HTTP | CLI | Constant-rate HTTP load |
```javascript
// k6 load test script
import http from 'k6/http';
import { check, sleep } from 'k6';

// Test configuration
export const options = {
  stages: [
    { duration: '30s', target: 50 },   // Ramp up to 50 users
    { duration: '1m', target: 50 },    // Hold at 50 users
    { duration: '30s', target: 200 },  // Ramp up to 200 users
    { duration: '1m', target: 200 },   // Hold at 200 users
    { duration: '30s', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    http_req_failed: ['rate<0.01'],    // Less than 1% errors
  },
};

export default function () {
  // GET request
  const res = http.get('https://api.example.com/users');

  // Assertions
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time OK': (r) => r.timings.duration < 500,
    'body has users': (r) => r.json().length > 0,
  });

  // Simulate user think time
  sleep(Math.random() * 3 + 1); // 1-4 seconds
}

// Run: k6 run loadtest.js
// Output includes p50, p90, p95, p99, error rate, throughput
```

```python
# Locust load test
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    # Wait 1-3 seconds between tasks
    wait_time = between(1, 3)

    @task(3)  # Weight: 3x more likely than other tasks
    def get_users(self):
        with self.client.get(
            "/api/users",
            catch_response=True
        ) as response:
            if response.status_code == 200:
                if len(response.json()) == 0:
                    response.failure("Empty user list")
            else:
                response.failure(
                    f"Status {response.status_code}"
                )

    @task(1)
    def get_user_detail(self):
        user_id = 42
        self.client.get(f"/api/users/{user_id}")

    def on_start(self):
        """Called when a simulated user starts."""
        self.client.post("/api/login", json={
            "username": "testuser",
            "password": "testpass"
        })

# Run: locust -f locustfile.py --host=https://api.example.com
# Open http://localhost:8089 for the web UI
```

Load Testing Patterns
```
1. Smoke Test: low load to verify the system works
   Users: 1-5           Duration: 1-2 min

2. Load Test: normal expected traffic
   Users: typical       Duration: 5-30 min

3. Stress Test: beyond normal capacity
   Users: 2-5x normal   Duration: 5-15 min

4. Spike Test: sudden burst of traffic
   Users: ──────╱╲──────
                ↑ Spike

5. Soak Test: sustained load over hours
   Users: normal        Duration: 4-24 hours
   (Finds memory leaks, connection pool exhaustion)

6. Breakpoint Test: increase until failure
   Users: ramp up continuously until the system breaks
```

Performance Budgets
A performance budget is a set of limits on metrics that affect user experience. If a change causes a metric to exceed its budget, the build or deployment fails.
Defining Budgets
| Metric | Budget | Why It Matters |
|---|---|---|
| Largest Contentful Paint (LCP) | less than 2.5 seconds | Core Web Vital: perceived load time |
| First Input Delay (FID) | less than 100ms | Core Web Vital: interactivity |
| Cumulative Layout Shift (CLS) | less than 0.1 | Core Web Vital: visual stability |
| Total JS bundle size | less than 200 KB (gzipped) | Affects parse and execution time |
| API response time (p95) | less than 500ms | Backend SLO |
| Time to First Byte (TTFB) | less than 200ms | Server processing time |
Enforcing Budgets in CI/CD
CI/CD Pipeline with Performance Budget:
```
┌──────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────┐
│  Build   │───►│   Test   │───►│  Performance  │───►│  Deploy  │
│          │    │          │    │  Budget Check │    │          │
└──────────┘    └──────────┘    └───────┬───────┘    └──────────┘
                                        │ PASS / FAIL
                                   ┌────┴────┐
                                   │ Bundle  │
                                   │ > 200KB?│──► FAIL build
                                   │ p95     │
                                   │ > 500ms?│──► FAIL build
                                   └─────────┘
```

```javascript
// webpack.config.js -- bundle size budget
module.exports = {
  performance: {
    maxAssetSize: 200 * 1024,       // 200 KB per asset
    maxEntrypointSize: 400 * 1024,  // 400 KB per entry
    hints: 'error',                 // Fail the build
  },
};
```
```javascript
// Lighthouse CI budget (lighthouserc.js)
module.exports = {
  ci: {
    assert: {
      assertions: {
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        // FID cannot be measured in the lab; max-potential-fid is the proxy
        'max-potential-fid': ['error', { maxNumericValue: 100 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-byte-weight': ['warning', { maxNumericValue: 500000 }],
      },
    },
  },
};
```

```python
# Performance budget check script for CI/CD
import sys

BUDGETS = {
    'p50_ms': 100,
    'p95_ms': 500,
    'p99_ms': 1000,
    'error_rate': 0.01,  # 1%
}

def check_performance_budget(results: dict) -> bool:
    """Check if performance results meet budget."""
    passed = True

    for metric, limit in BUDGETS.items():
        actual = results.get(metric, float('inf'))
        status = "PASS" if actual <= limit else "FAIL"

        if status == "FAIL":
            passed = False

        print(f"  {metric}: {actual} (limit: {limit}) [{status}]")

    return passed

# Example results from a load test
results = {
    'p50_ms': 45,
    'p95_ms': 320,
    'p99_ms': 890,
    'error_rate': 0.002,
}

print("Performance Budget Check:")
if not check_performance_budget(results):
    print("\nBudget exceeded! Blocking deployment.")
    sys.exit(1)
else:
    print("\nAll budgets met. Proceeding with deployment.")
```

Profiling Workflow
Summary
| Concept | Key Takeaway |
|---|---|
| Sampling profiler | Low overhead, safe for production; take periodic stack snapshots |
| Instrumentation profiler | High accuracy, high overhead; development only |
| Flame graph | Width shows time; look for wide bars at the top of the stack |
| Memory profiling | Find leaks by comparing heap snapshots over time |
| Benchmark methodology | Warm up, run many iterations, report percentiles, isolate variables |
| Load testing | Simulate real traffic with tools like k6, Locust, or wrk |
| Performance budgets | Set limits on metrics and enforce them in CI/CD |