
Profiling & Benchmarking

You cannot optimize what you do not measure. Profiling identifies where your application spends time and memory; benchmarking measures how fast it performs under controlled conditions; load testing reveals how it behaves under real-world traffic. Together, these tools transform performance optimization from guesswork into science.


Profiling Fundamentals

Profiling is the process of measuring where a program spends its resources — CPU cycles, memory allocations, I/O waits, or lock contention.

Types of Profiling

| Type | What It Measures | Tools |
|---|---|---|
| CPU profiling | Time spent executing functions | cProfile, perf, async-profiler, Chrome DevTools |
| Memory profiling | Heap usage, allocations, leaks | tracemalloc, heapdump, VisualVM, Valgrind |
| I/O profiling | Disk reads/writes, network calls | strace, dtrace, blktrace |
| Lock/contention profiling | Time waiting on locks | async-profiler, Java Flight Recorder |
| Wall-clock profiling | Total elapsed time, including waits | Most profilers in "wall" mode |

Sampling vs Instrumentation

Sampling Profiler:
Takes periodic snapshots of the call stack (e.g., every 1 ms).
Low overhead (~2-5%), safe for production.

Time ──────────────────────────────────►
      ↑    ↑    ↑    ↑    ↑    ↑    ↑
      Sample points (stack snapshots)

Instrumentation Profiler:
Wraps every function entry/exit with timing code.
High accuracy, high overhead (~10-50%), development only.

func_a() {
    START_TIMER("func_a")
    func_b()
        START_TIMER("func_b")
        STOP_TIMER("func_b")
    STOP_TIMER("func_a")
}
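The sampling approach can be sketched in plain Python: a background thread periodically grabs another thread's call stack via sys._current_frames. That function is a CPython implementation detail, so treat this as a toy illustration, not a production profiler:

```python
import collections
import sys
import threading
import time

def sample_stacks(thread_id, interval, samples, stop_event):
    """Periodically snapshot one thread's call stack (root -> leaf)."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples.append(tuple(reversed(stack)))
        time.sleep(interval)

def busy_work():
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total

samples = []
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.get_ident(), 0.001, samples, stop),  # sample *this* thread
    daemon=True,
)
sampler.start()
busy_work()
stop.set()
sampler.join()

# Count which function sat on top of the stack (the leaf) in each sample
leaf_counts = collections.Counter(s[-1] for s in samples)
print(leaf_counts.most_common(3))
```

Because busy_work dominates the run, it dominates the leaf counts — the same signal a real sampling profiler aggregates into a flame graph.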

CPU Profiling

CPU profiling reveals which functions consume the most processing time.

# cProfile: built-in CPU profiler
import cProfile
import pstats

def expensive_computation():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total

def process_data():
    results = []
    for _ in range(10):
        results.append(expensive_computation())
    return results

# Profile the function
profiler = cProfile.Profile()
profiler.enable()
process_data()
profiler.disable()

# Print sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions

# Output:
#   ncalls  tottime  percall  cumtime  percall filename
#       10    8.234    0.823    8.234    0.823 script.py:4(expensive_computation)
#        1    0.001    0.001    8.235    8.235 script.py:10(process_data)

# py-spy: sampling profiler for production
# Install: pip install py-spy
# Record a flame graph (attach to running process):
#   py-spy record -o profile.svg --pid 12345
# Top-like view:
#   py-spy top --pid 12345

Memory Profiling

Memory profiling identifies memory leaks, excessive allocations, and opportunities to reduce memory footprint.

Common Memory Issues

| Issue | Symptom | Cause |
|---|---|---|
| Memory leak | Memory usage grows continuously | References held to unused objects |
| Excessive allocation | High GC pressure, pauses | Creating too many short-lived objects |
| Large object retention | High baseline memory | Caches or collections that grow unbounded |
| Fragmentation | High virtual memory, low utilization | Many small allocations and deallocations |
# tracemalloc: built-in memory profiler
import tracemalloc

tracemalloc.start()

# -- Your code here --
data = [list(range(1000)) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 5 memory allocations:")
for stat in top_stats[:5]:
    print(f"  {stat}")

# Output:
#   script.py:6: size=7.7 MiB, count=1001, average=7.9 KiB

# objgraph: visualize object references (find leaks)
# Install: pip install objgraph
import objgraph

# Show most common object types
objgraph.show_most_common_types(limit=10)

# Find what is holding a reference to leaked objects
objgraph.show_backrefs(
    objgraph.by_type('MyClass')[:3],
    filename='refs.png'
)

# memory_profiler: line-by-line memory usage
# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def my_function():
    a = [1] * 1_000_000  # ~8 MB
    b = [2] * 2_000_000  # ~16 MB
    del b                # Free 16 MB
    return a
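The leak-hunting workflow — compare heap snapshots taken over time — looks like this with tracemalloc's compare_to:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate a leak: allocations that stay referenced between snapshots
leak = [bytearray(1024) for _ in range(1000)]  # roughly 1 MiB retained

after = tracemalloc.take_snapshot()
growth = after.compare_to(before, 'lineno')  # per-line allocation growth

print("Top allocation growth between snapshots:")
for stat in growth[:3]:
    print(f"  {stat}")
```

In a real service you would take snapshots minutes apart under steady traffic; the lines whose size_diff keeps climbing are your leak candidates.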

Flame Graphs

A flame graph is a visualization of profiled stack traces. It shows which functions consume the most CPU time and how they were called.

Reading a flame graph (root at the bottom, leaves at the top):

┌─────────────────┬─────────┬───────┬─────────────────────┐
│                 │ sort()  │ map() │                     │
├─────────────────┼─────────┴───────┼─────────────────────┤
│   parseJSON()   │ transformData() │      queryDB()      │
├─────────────────┴─────────────────┼─────────────────────┤
│           processData()           │   handleRequest()   │
├───────────────────────────────────┴─────────────────────┤
│                         main()                          │  Total time
└─────────────────────────────────────────────────────────┘
◄──────────── x-axis = proportion of total time ─────────►
  y-axis = call stack depth (callers below, callees above)

Width = time spent in that function (wider = more time)
Color = arbitrary (often random or based on package)

How to Read Flame Graphs

  1. Look for wide bars — These functions consume the most time.
  2. Look at the top — Functions at the top of the stack are where CPU time is actually spent (leaf functions).
  3. Look for plateaus — A wide plateau at the top means one function dominates CPU time.
  4. Ignore narrow spikes — They represent functions that execute briefly.

Generating Flame Graphs

| Language | Tool | Command |
|---|---|---|
| Python | py-spy | `py-spy record -o flame.svg --pid PID` |
| Node.js | clinic flame | `clinic flame -- node app.js` |
| Java | async-profiler | `./asprof -d 30 -f flame.html PID` |
| Go | pprof | `go tool pprof -http=:8080 profile.pb.gz` |
| System-wide | perf + FlameGraph | `perf record -g -- ./app && perf script \| stackcollapse-perf.pl \| flamegraph.pl > flame.svg` |
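The folded-stack format consumed by flamegraph.pl is simple to produce from raw samples. A minimal sketch, with hypothetical stacks borrowed from the diagram above:

```python
import collections

# Hypothetical sampled call stacks (root -> leaf), e.g. from a sampling profiler
samples = [
    ("main", "processData", "parseJSON"),
    ("main", "processData", "parseJSON"),
    ("main", "processData", "transformData", "sort"),
    ("main", "handleRequest", "queryDB"),
]

# Fold identical stacks into "root;...;leaf count" lines -- the input
# format that flamegraph.pl turns into an SVG
folded = collections.Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
```

Each output line's count becomes the width of that stack's bar in the rendered flame graph.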

Benchmark Methodology

Benchmarking requires discipline. Poor methodology leads to misleading results.

Rules of Benchmarking

  1. Warm up first — JIT compilers, caches, and OS page caches need time to reach steady state. Discard the first several iterations.
  2. Measure multiple iterations — A single run is meaningless. Run at least 10-30 iterations and report the median plus percentiles.
  3. Isolate the variable — Change only one thing at a time. Control for hardware, OS state, and background processes.
  4. Use stable environments — Disable CPU frequency scaling, close other applications, and pin to specific CPU cores if possible.
  5. Report the right metric — Mean can be skewed by outliers. Report the median (p50), p95, and p99.
  6. Avoid dead code elimination — Compilers may optimize away code whose result is unused. Consume the result (print it, write it to a volatile variable).
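Several of these rules fit into a small harness. A sketch (the workload and iteration counts are illustrative):

```python
import statistics
import time

def benchmark(fn, *, warmup=5, iterations=30):
    """Warm up, time many iterations, and report percentiles."""
    for _ in range(warmup):              # rule 1: reach steady state first
        fn()
    times_ms = []
    result = None
    for _ in range(iterations):          # rule 2: many iterations
        start = time.perf_counter()
        result = fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    times_ms.sort()
    return {
        "p50_ms": statistics.median(times_ms),           # rule 5: percentiles
        "p95_ms": times_ms[int(0.95 * len(times_ms)) - 1],
        "result": result,                # rule 6: consume the result
    }

stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"p50: {stats['p50_ms']:.3f} ms  p95: {stats['p95_ms']:.3f} ms")
```

Returning the workload's result forces the interpreter (or a compiler, in other languages) to actually perform the computation being timed.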

Statistical Measures

Given latencies (ms): 2, 3, 3, 4, 5, 5, 5, 8, 12, 45

Mean (average):   9.2 ms  -- skewed by the 45 ms outlier
Median (p50):     5.0 ms  -- middle value, more representative
p95:             28.5 ms  -- 95% of requests are faster than this
p99:             45.0 ms  -- the tail; important for user experience
Std deviation:   12.8 ms  -- high variance indicates inconsistency

Always report percentiles, not just averages!
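The figures above can be reproduced with a short script. Percentile definitions vary between tools; this sketch interpolates on the fractional rank, which matches the 28.5 ms p95 quoted above:

```python
import statistics

latencies = [2, 3, 3, 4, 5, 5, 5, 8, 12, 45]

def percentile(values, p):
    """Linear interpolation on the fractional 1-based rank p/100 * n."""
    s = sorted(values)
    rank = p / 100 * len(s)        # e.g. p95 of 10 samples -> rank 9.5
    if rank <= 1:
        return float(s[0])
    if rank >= len(s):
        return float(s[-1])
    lo = int(rank)                 # floor of the 1-based rank
    frac = rank - lo
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

print(f"Mean:   {statistics.mean(latencies):.1f} ms")
print(f"Median: {statistics.median(latencies):.1f} ms")
print(f"p95:    {percentile(latencies, 95):.1f} ms")
# With only ten samples, p99 is dominated by the single worst value (45 ms).
```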

Load Testing Tools

Load testing simulates real-world traffic to measure how your system performs under stress.

Tool Comparison

| Tool | Language | Protocol Support | Scripting | Best For |
|---|---|---|---|---|
| k6 | Go (JS scripts) | HTTP, WebSocket, gRPC | JavaScript | Developer-friendly load testing |
| wrk | C | HTTP only | Lua | Quick HTTP benchmarks |
| JMeter | Java | HTTP, JDBC, JMS, LDAP, FTP | GUI + XML | Enterprise-grade testing |
| Locust | Python | HTTP (extensible) | Python | Python-centric teams |
| Gatling | Scala | HTTP, WebSocket | Scala/Java DSL | High-performance simulations |
| Artillery | Node.js | HTTP, WebSocket, Socket.IO | YAML + JS | Quick cloud-native tests |
| vegeta | Go | HTTP | CLI | Constant-rate HTTP load |
// k6 load test script
import http from 'k6/http';
import { check, sleep } from 'k6';

// Test configuration
export const options = {
  stages: [
    { duration: '30s', target: 50 },   // Ramp up to 50 users
    { duration: '1m', target: 50 },    // Hold at 50 users
    { duration: '30s', target: 200 },  // Ramp up to 200 users
    { duration: '1m', target: 200 },   // Hold at 200 users
    { duration: '30s', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    http_req_failed: ['rate<0.01'],    // Less than 1% errors
  },
};

export default function () {
  // GET request
  const res = http.get('https://api.example.com/users');

  // Assertions
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time OK': (r) => r.timings.duration < 500,
    'body has users': (r) => r.json().length > 0,
  });

  // Simulate user think time
  sleep(Math.random() * 3 + 1); // 1-4 seconds
}

// Run: k6 run loadtest.js
// Output includes p50, p90, p95, p99, error rate, throughput

Load Testing Patterns

1. Smoke Test: low load to verify the system works
   Users: 1-5          Duration: 1-2 min
2. Load Test: normal expected traffic
   Users: typical      Duration: 5-30 min
3. Stress Test: beyond normal capacity
   Users: 2-5x normal  Duration: 5-15 min
4. Spike Test: sudden burst of traffic
   Users: ──────╱╲──────
                ↑ Spike
5. Soak Test: sustained load over hours
   Users: normal       Duration: 4-24 hours
   (Finds memory leaks, connection pool exhaustion)
6. Breakpoint Test: increase load until failure
   Users: ramp up continuously until the system breaks
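The closed-loop idea behind these tools — N concurrent virtual users issuing requests and recording latencies — can be sketched with a thread pool. The fake_request stand-in is hypothetical; a real test would call an HTTP client:

```python
import concurrent.futures
import random
import statistics
import time

def fake_request():
    """Hypothetical stand-in for an HTTP call; swap in a real client."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # simulated service latency
    return (time.perf_counter() - start) * 1000  # latency in ms

def run_load(users, requests_per_user):
    """Closed-loop load: `users` workers each issue requests back to back."""
    latencies = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(fake_request)
            for _ in range(users * requests_per_user)
        ]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())
    return sorted(latencies)

lat = run_load(users=10, requests_per_user=20)
print(f"requests: {len(lat)}")
print(f"p50: {statistics.median(lat):.2f} ms")
print(f"p95: {lat[int(0.95 * len(lat)) - 1]:.2f} ms")
```

Real tools add the pieces this sketch omits: ramp stages, open-loop (constant arrival rate) modes, and per-endpoint reporting.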

Performance Budgets

A performance budget is a set of limits on metrics that affect user experience. If a change causes a metric to exceed its budget, the build or deployment fails.

Defining Budgets

| Metric | Budget | Why It Matters |
|---|---|---|
| Largest Contentful Paint (LCP) | < 2.5 s | Core Web Vital: perceived load time |
| First Input Delay (FID) | < 100 ms | Core Web Vital: interactivity |
| Cumulative Layout Shift (CLS) | < 0.1 | Core Web Vital: visual stability |
| Total JS bundle size | < 200 KB (gzipped) | Affects parse and execution time |
| API response time (p95) | < 500 ms | Backend SLO |
| Time to First Byte (TTFB) | < 200 ms | Server processing time |
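A budget gate can be as simple as comparing measured metrics against limits like those in the table. A sketch with hypothetical metric names and measurements:

```python
# Hypothetical CI gate: flag any metric that exceeds its budget.
BUDGETS = {
    "lcp_ms": 2500,        # Largest Contentful Paint
    "js_bundle_kb": 200,   # gzipped bundle size
    "api_p95_ms": 500,     # backend SLO
}

def check_budgets(measured, budgets):
    """Return a list of (metric, measured, budget) violations."""
    return [
        (name, measured[name], limit)
        for name, limit in budgets.items()
        if measured.get(name, 0) > limit
    ]

measured = {"lcp_ms": 2300, "js_bundle_kb": 240, "api_p95_ms": 480}
violations = check_budgets(measured, BUDGETS)
for name, value, limit in violations:
    print(f"BUDGET EXCEEDED: {name} = {value} (budget {limit})")
# In a CI script, exit nonzero on violations to fail the build:
#   sys.exit(1 if violations else 0)
```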

Enforcing Budgets in CI/CD

CI/CD Pipeline with Performance Budget:

┌──────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────┐
│  Build   │───►│   Test   │───►│  Performance  │───►│  Deploy  │
│          │    │          │    │ Budget Check  │    │          │
└──────────┘    └──────────┘    └───────┬───────┘    └──────────┘
                                   PASS / FAIL
                                   ┌────┴────┐
                                   │ Bundle  │
                                   │ > 200KB?│──► FAIL build
                                   │ p95     │
                                   │ > 500ms?│──► FAIL build
                                   └─────────┘
// webpack.config.js -- bundle size budget
module.exports = {
  performance: {
    maxAssetSize: 200 * 1024,      // 200 KB per asset
    maxEntrypointSize: 400 * 1024, // 400 KB per entry
    hints: 'error',                // Fail the build
  },
};

// Lighthouse CI budget (lighthouserc.js)
module.exports = {
  ci: {
    assert: {
      assertions: {
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        // FID is a field-only metric; max-potential-fid is the lab proxy
        'max-potential-fid': ['error', { maxNumericValue: 100 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-byte-weight': ['warning', { maxNumericValue: 500000 }],
      },
    },
  },
};

Profiling Workflow

  1. Measure a baseline under realistic conditions.
  2. Profile to find the single biggest hotspot.
  3. Optimize that one hotspot.
  4. Measure again and compare against the baseline.
  5. Repeat until performance targets are met.

Summary

| Concept | Key Takeaway |
|---|---|
| Sampling profiler | Low overhead, safe for production; takes periodic stack snapshots |
| Instrumentation profiler | High accuracy, high overhead; development only |
| Flame graph | Width shows time; look for wide bars at the top of the stack |
| Memory profiling | Find leaks by comparing heap snapshots over time |
| Benchmark methodology | Warm up, run many iterations, report percentiles, isolate variables |
| Load testing | Simulate real traffic with tools like k6, Locust, or wrk |
| Performance budgets | Set limits on metrics and enforce them in CI/CD |