Scalability is a system's ability to handle increasing amounts of work by adding resources. A scalable system can grow from serving 100 users to 100 million users without a complete redesign. This page covers the core patterns and strategies that make this possible.
Vertical Scaling (Scale Up)

Add more resources (CPU, RAM, disk) to a single machine.
```
Before:                        After:
┌──────────────────┐           ┌──────────────────────┐
│ Server           │           │ Server (upgraded)    │
│ 4 CPU cores      │           │ 32 CPU cores         │
│ 16 GB RAM        │   ──►     │ 256 GB RAM           │
│ 500 GB SSD       │           │ 4 TB NVMe SSD        │
│                  │           │                      │
│ Handles 1K rps   │           │ Handles 10K rps      │
└──────────────────┘           └──────────────────────┘
```

Pros: Simple (no code changes), no distributed-system complexity, strong consistency
Cons: Hardware limits (you cannot buy a server with 10,000 cores), single point of failure, expensive at the high end, downtime during upgrades
Horizontal Scaling (Scale Out)

Add more machines to distribute the load.
```
Before:                        After:
┌──────────────────┐           ┌──────────────┐
│ Server           │           │ Server 1     │ 3K rps
│ Handles 1K rps   │   ──►     ├──────────────┤
└──────────────────┘           │ Server 2     │ 3K rps
                               ├──────────────┤
                               │ Server 3     │ 3K rps
                               └──────────────┘
                               Total: 9K rps (add more as needed)
```

Pros: Near-linear capacity growth, no single point of failure, cost-effective (commodity hardware), no downtime for scaling
Cons: Application must be stateless or use shared state, distributed-system complexity (consistency, coordination), operational overhead
Choosing Between Vertical and Horizontal

| Scenario | Recommended | Reason |
|---|---|---|
| Early-stage startup | Vertical | Simpler, cheaper for small scale |
| Database server | Vertical first, then replicas | Databases are harder to distribute |
| Stateless API servers | Horizontal | Easy to add/remove instances |
| Cache layer | Horizontal | Redis Cluster, Memcached pools |
| Reaching hardware limits | Horizontal | Only option beyond largest machines |
| Predictable, gradual growth | Vertical | Scale the single server incrementally |
| Unpredictable traffic spikes | Horizontal (auto-scaling) | Add/remove instances dynamically |
Load Balancing

A load balancer distributes incoming requests across multiple servers so that no single server is overwhelmed.
```
                ┌──────────┐      ┌─────────────┐
┌─────────┐     │   Load   │─────►│  Server 1   │
│ Clients │────►│ Balancer │      │  (healthy)  │
└─────────┘     │   (L7)   │      └─────────────┘
                │          │      ┌─────────────┐
                │          │─────►│  Server 2   │
                │          │      │  (healthy)  │
                └──────────┘      └─────────────┘
                      ✕           ┌─────────────┐
               removed from pool  │  Server 3   │
                                  │ (unhealthy) │
                                  └─────────────┘
```

Load Balancing Algorithms

| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Distribute requests sequentially (1β2β3β1β2β3) | Equal-capacity servers, uniform requests |
| Weighted Round Robin | Like round robin but servers get proportional share | Servers with different capacities |
| Least Connections | Send to server with fewest active connections | Varying request durations |
| Weighted Least Connections | Least connections weighted by server capacity | Mixed server fleet |
| IP Hash | Hash client IP to determine server | Session affinity without cookies |
| Least Response Time | Send to server with fastest response + fewest connections | Performance-sensitive applications |
| Random | Random server selection | Simple, surprisingly effective at scale |
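The first two algorithms in the table can be sketched in a few lines. This is a minimal in-memory illustration under assumed names (`Server`, `RoundRobinBalancer`, `LeastConnectionsBalancer` are hypothetical, not any load balancer's API):

```python
import itertools
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_connections: int = 0


class RoundRobinBalancer:
    """Hand out servers in order: 1 -> 2 -> 3 -> 1 -> 2 -> 3."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Pick the server with the fewest active connections."""

    def __init__(self, servers):
        self.servers = servers

    def pick(self):
        return min(self.servers, key=lambda s: s.active_connections)
```

A real load balancer also tracks health-check results and removes failed servers from the pool before picking.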
Layer 4 vs. Layer 7 Load Balancing

| Feature | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Operates on | IP addresses, TCP/UDP ports | HTTP headers, URLs, cookies |
| Speed | Very fast (no payload inspection) | Slower (inspects content) |
| Routing decisions | Based on IP and port | Based on URL path, headers, content |
| SSL termination | No (passes through) | Yes (decrypts and re-encrypts) |
| Use case | Raw TCP/UDP traffic, gaming | HTTP APIs, microservices, A/B testing |
| Examples | AWS NLB, HAProxy (TCP mode) | AWS ALB, Nginx, HAProxy (HTTP mode) |
Health Checks

Load balancers must know which servers are healthy. A common pattern is a dedicated health endpoint:
```python
# Simple HTTP health check endpoint
@app.get("/health")
def health_check():
    checks = {
        "database": check_db_connection(),
        "cache": check_redis_connection(),
        "disk_space": check_disk_space(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        content={"status": "healthy" if all_healthy else "unhealthy", "checks": checks},
        status_code=status_code,
    )
```

A production deployment often layers several load balancers:

```
Internet
    │
    ▼
┌────────────┐
│  DNS-based │  Round-robin across geographic regions
│  Load Bal. │
└─────┬──────┘
      │
  ┌───┴────┐
  ▼        ▼
┌──────┐ ┌──────┐
│ CDN  │ │ CDN  │  Edge nodes serve static content
│ Edge │ │ Edge │
└───┬──┘ └───┬──┘
    ▼        ▼
┌──────────────┐
│   L7 Load    │  Route by URL path (/api → API, /ws → WebSocket)
│   Balancer   │
│  (ALB/Nginx) │
└──┬────┬───┬──┘
   ▼    ▼   ▼
┌─────┐┌─────┐┌─────┐
│ API ││ API ││ API │  Stateless application servers
│  1  ││  2  ││  3  │
└──┬──┘└──┬──┘└──┬──┘
   ▼      ▼      ▼
┌─────────────────┐
│ Internal L4 LB  │  Route to database/cache cluster
└───┬─────────┬───┘
    ▼         ▼
┌───────┐ ┌───────┐
│  DB   │ │ Redis │
│Primary│ │Cluster│
└───────┘ └───────┘
```

Caching

Caching stores copies of frequently accessed data in faster storage to reduce latency and backend load. A well-implemented cache can reduce database queries by 90% or more.
Request Flow (fastest to slowest):
```
┌──────────────────────────────────────────────────────┐
│ 1. Browser Cache     (~0ms)   - HTTP cache headers   │
│ 2. CDN Cache         (~10ms)  - Edge servers         │
│ 3. API Gateway Cache (~1ms)   - Reverse proxy cache  │
│ 4. Application Cache (~1ms)   - In-memory (Redis)    │
│ 5. Database Cache    (~5ms)   - Query result cache   │
│ 6. Database Disk     (~10ms)  - Actual data read     │
└──────────────────────────────────────────────────────┘
```

Cache-Aside (Lazy Loading)

The application manages the cache explicitly. This is the most common strategy.
```
Read Path:
1. App checks cache
2. Cache HIT  → return cached data
3. Cache MISS → query database → store in cache → return
```
```
┌─────┐  1. GET key       ┌────────┐
│     │──────────────────►│ Cache  │
│     │◄──────────────────│(Redis) │
│     │  2. HIT/MISS      └────────┘
│     │
│ App │  3. (on MISS)     ┌────────┐
│     │──────────────────►│   DB   │
│     │◄──────────────────│        │
│     │  4. Return data   └────────┘
│     │
│     │  5. SET key       ┌────────┐
│     │──────────────────►│ Cache  │
└─────┘                   └────────┘
```

```python
import json

def get_user(user_id: str) -> dict:
    # 1. Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # 2. Cache miss → query database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # 3. Store in cache with a one-hour TTL
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))

    return user
```

Pros: Only caches data that is actually requested, resilient to cache failures
Cons: Cache-miss penalty (three round trips), potential for stale data
Write-Through

Every write goes to the cache AND the database simultaneously.
```
Write Path:
1. App writes to cache
2. Cache writes to database
3. Both are updated before returning
```
```
┌─────┐ 1. Write ┌───────┐ 2. Write ┌─────┐
│ App │─────────►│ Cache │─────────►│ DB  │
│     │◄─────────│       │◄─────────│     │
└─────┘ 4. ACK   └───────┘ 3. ACK   └─────┘
```

Pros: Cache always has the latest data, strong consistency
Cons: Higher write latency (two writes), caches data that may never be read
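The flow fits in a few lines. A minimal in-memory sketch (the `WriteThroughCache` class and dict-backed store are illustrative assumptions, not a real library):

```python
class WriteThroughCache:
    """Write-through: every write updates the cache and the backing
    store before returning, so reads can always trust the cache."""

    def __init__(self, store):
        self.cache = {}
        self.store = store  # any dict-like backing store (the "database")

    def write(self, key, value):
        self.cache[key] = value   # 1. update the cache
        self.store[key] = value   # 2. update the database synchronously
        # only now does the write "complete" — hence the latency cost

    def read(self, key):
        # Reads can be served straight from the cache.
        return self.cache.get(key, self.store.get(key))
```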
Write-Back (Write-Behind)

Write to the cache immediately, then asynchronously write to the database.
```
Write Path:
1. App writes to cache (returns immediately)
2. Cache asynchronously flushes to database (batched)
```
```
┌─────┐ 1. Write ┌─────────┐            ┌─────┐
│ App │─────────►│  Cache  │──(async)──►│ DB  │
│     │◄─────────│ batched │            │     │
└─────┘ 2. ACK   └─────────┘            └─────┘
        (fast!)
```

Pros: Very fast writes, can batch database writes for efficiency
Cons: Risk of data loss if the cache fails before flushing, complex implementation
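A minimal sketch of the batched flush, assuming a size-based trigger (real implementations typically flush on a timer or on eviction as well; `WriteBackCache` is a hypothetical name):

```python
class WriteBackCache:
    """Write-back: writes hit only the cache and return immediately;
    dirty entries are flushed to the backing store in batches."""

    def __init__(self, store, flush_threshold=3):
        self.cache = {}
        self.dirty = set()          # keys not yet persisted
        self.store = store          # dict-like backing store
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)         # fast path: no store round trip
        if len(self.dirty) >= self.flush_threshold:
            self.flush()            # batch-write once enough keys are dirty

    def flush(self):
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

Everything in `self.dirty` is lost if the process dies before `flush()` runs, which is exactly the data-loss risk noted above.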
Eviction Policies

When the cache is full, which entries should be removed?
| Policy | How It Works | Best For |
|---|---|---|
| LRU (Least Recently Used) | Evict the entry accessed longest ago | General purpose, most common |
| LFU (Least Frequently Used) | Evict the entry accessed least often | Data with varying popularity |
| FIFO (First In, First Out) | Evict the oldest entry | Simple, predictable |
| TTL (Time to Live) | Expire entries after a fixed time | Data with known staleness tolerance |
| Random | Evict a random entry | Simple, good enough in many cases |
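LRU, the most common policy, is easy to sketch with an ordered map. This toy `LRUCache` is illustrative only; production systems such as Redis use an approximated LRU rather than exact bookkeeping:

```python
from collections import OrderedDict


class LRUCache:
    """LRU eviction: when full, drop the entry accessed longest ago."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # oldest access first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```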
Cache Invalidation

Cache invalidation is famously one of the hardest problems in computer science. Common approaches:
| Strategy | How It Works | Trade-off |
|---|---|---|
| TTL-based | Data expires after a set time | Simple, but stale data until expiry |
| Event-driven | Invalidate on write events | Fresh data, but complex to implement |
| Version-based | Each entry has a version number | Precise, but requires version tracking |
| Purge all | Clear entire cache | Simple, but cache stampede risk |
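The event-driven row can be sketched as a cache that deletes an entry whenever the source of truth changes, so the next read refetches fresh data. The `InvalidatingCache` class and dict-backed stores are illustrative assumptions:

```python
class InvalidatingCache:
    """Event-driven invalidation: writes to the source of truth emit an
    event that deletes the matching cache entry."""

    def __init__(self):
        self.cache = {}
        self.db = {}   # stand-in for the source of truth

    def read(self, key):
        if key in self.cache:
            return self.cache[key]       # HIT
        value = self.db.get(key)         # MISS: fall back to the DB
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value
        self.on_write_event(key)         # emit the invalidation "event"

    def on_write_event(self, key):
        # In a real system this handler would be a subscriber on a
        # message bus, not an in-process call.
        self.cache.pop(key, None)
```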
Content Delivery Networks (CDN)

A CDN caches static content (images, CSS, JavaScript, videos) at edge servers distributed globally, serving content from the server closest to the user.
```
Without CDN:
User (Tokyo) ────── 200ms ──────► Origin (US East)
```
```
With CDN:
User (Tokyo) ── 20ms ──► CDN Edge (Tokyo)
                             │ (Cache MISS only)
                             └── 200ms ──► Origin (US East)
```

What to cache on CDN: images, CSS and JavaScript bundles, fonts, videos, and other versioned static assets.
What NOT to cache on CDN: personalized pages, authenticated API responses, and rapidly changing data.
Popular CDNs: CloudFront (AWS), Cloudflare, Fastly, Akamai
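In practice the cache/no-cache split is expressed through `Cache-Control` response headers, which CDNs honor. A minimal sketch (the path prefixes and helper name are assumptions, not any framework's API):

```python
def cache_headers(path: str) -> dict:
    """Pick Cache-Control headers based on what kind of content a path serves."""
    if path.startswith("/static/"):
        # Fingerprinted static assets never change under the same URL:
        # safe to cache anywhere (browser + CDN edge) for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Personalized or dynamic responses: the user's browser may hold them
    # briefly at most, but shared caches (CDN edges) must never store them.
    return {"Cache-Control": "private, no-store"}
```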
Message Queues

Message queues enable asynchronous communication between services. A producer sends messages to a queue, and consumers process them independently.
```
Synchronous (without queue):
┌────────┐   ┌────────────┐   ┌───────────┐   ┌───────────┐
│ Client │──►│ API Server │──►│ Email Svc │──►│ Analytics │
└────────┘   │  (waits)   │   │  (slow)   │   │  (slow)   │
             └────────────┘   └───────────┘   └───────────┘
Total latency: API + Email + Analytics = 500ms
```
```
Asynchronous (with queue):
┌────────┐   ┌────────────┐   ┌───────┐   ┌───────────┐
│ Client │──►│ API Server │──►│ Queue │──►│ Email Svc │ (processes later)
└────────┘   │ (returns)  │   └───┬───┘   └───────────┘
             └────────────┘       │       ┌───────────┐
                                  └──────►│ Analytics │ (processes later)
                                          └───────────┘
Total latency: API only = 50ms
```

Point-to-Point (Work Queue)

Each message is consumed by exactly one consumer.
```
Producer ──► [MSG3][MSG2][MSG1] ──► Consumer A (processes MSG1)
                                    Consumer B (processes MSG2)
```

Use case: Task distribution (email sending, image processing, order fulfillment).
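The delivery guarantee — each message goes to exactly one consumer — can be sketched with a single shared queue that consumers poll (the `PointToPointQueue` class is illustrative; real brokers add acknowledgements and visibility timeouts):

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one consumer: whichever
    consumer polls first removes the message from the queue."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def poll(self):
        # popleft() removes the message, so no other consumer sees it.
        return self._messages.popleft() if self._messages else None
```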
Publish-Subscribe (Pub/Sub)

Each message is delivered to all subscribers.
```
Publisher ──► [Topic: "orders"] ──► Subscriber A (Inventory)
                                ──► Subscriber B (Analytics)
                                ──► Subscriber C (Notifications)
```

Use case: Event broadcasting (order placed, user signed up, payment processed).
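The fan-out behavior can be sketched with an in-memory broker (the `PubSubBroker` class is hypothetical; real systems persist messages and track per-subscriber offsets):

```python
from collections import defaultdict


class PubSubBroker:
    """Each message published to a topic is delivered to ALL subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Unlike a work queue, every registered handler gets a copy.
        for handler in self._subscribers[topic]:
            handler(message)
```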
Choosing a Message Queue Technology

| Technology | Type | Ordering | Throughput | Best For |
|---|---|---|---|---|
| RabbitMQ | Broker-based | Per-queue | Moderate (~50K msg/s) | Task queues, RPC, complex routing |
| Apache Kafka | Log-based | Per-partition | Very high (~1M msg/s) | Event streaming, log aggregation, analytics |
| Amazon SQS | Cloud queue | Best-effort (FIFO available) | High | Simple task queues, serverless |
| Redis Streams | Log-based | Per-stream | High | Lightweight streaming, real-time |
| Apache Pulsar | Log-based | Per-partition | Very high | Multi-tenancy, geo-replication |
Kafka's log-based model splits each topic into partitions, which consumer groups read independently:

```
Producers          Kafka Cluster                Consumers
            ┌─────────────────────────┐
┌─────────┐ │ Topic: "orders"         │   ┌──────────────┐
│ Order   │─┼──► Partition 0          │──►│ Consumer     │
│ Service │ │    [5][4][3][2][1]      │   │ Group A      │
└─────────┘ │                         │   │ (Inventory)  │
┌─────────┐ │                         │   └──────────────┘
│ Payment │─┼──► Partition 1          │   ┌──────────────┐
│ Service │ │    [4][3][2][1]         │──►│ Consumer     │
└─────────┘ │                         │   │ Group B      │
            │    Partition 2          │   │ (Analytics)  │
            │    [3][2][1]            │   └──────────────┘
            └─────────────────────────┘
```

Retries and Dead Letter Queues

```python
# Retry with exponential backoff
import random
import time

def process_message(message, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = handle(message)
            acknowledge(message)
            return result
        except TransientError:
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)  # 1s, 2s, 4s + jitter
        except PermanentError:
            send_to_dead_letter_queue(message)
            return

    # All retries exhausted
    send_to_dead_letter_queue(message)
```

Dead Letter Queue (DLQ): A separate queue for messages that could not be processed after all retries. Engineers can inspect and reprocess them manually or with automated tooling.
Rate Limiting

Rate limiting controls the number of requests a client can make in a given time window, protecting systems from abuse, DDoS attacks, and cascading failures.
Token Bucket

A bucket holds tokens. Each request consumes a token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.
```
Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0: [██████████] 10 tokens (full)
        5 requests arrive → 5 tokens consumed
Time 0: [█████░░░░░] 5 tokens remaining

Time 1: [███████░░░] 7 tokens (5 + 2 refilled)
        8 requests arrive → 7 allowed, 1 rejected
Time 1: [░░░░░░░░░░] 0 tokens

Time 2: [██░░░░░░░░] 2 tokens (0 + 2 refilled)
```

Pros: Allows bursts up to bucket size, smooth rate limiting
Cons: Requires per-user state
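The refill-and-consume logic above fits in one small class. This `TokenBucket` is a single-process sketch with an injectable clock for testability; a distributed limiter would keep the bucket state in shared storage such as Redis:

```python
import time


class TokenBucket:
    """Token bucket: capacity caps bursts; tokens refill at a fixed rate."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate     # tokens per second
        self.tokens = float(capacity)      # start full
        self.clock = clock
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        # Add tokens for the time elapsed since the last call, up to capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity 10 and refill rate 2/s this reproduces the timeline above: a burst of 12 requests at t=0 admits 10, and one second later two more tokens are available.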
Sliding Window Log

Track timestamps of all requests. Count requests in the current window.
Window: 1 minute, Limit: 5 requests
```
Request log for User A:
[10:00:02, 10:00:22, 10:00:45, 10:00:51, 10:00:58]

10:01:05 → Window = [10:00:05, 10:01:05]
           Requests in window: 4 (10:00:22, 10:00:45, 10:00:51, 10:00:58)
           → ALLOWED (4 < 5); 10:01:05 is added to the log

10:01:10 → Window = [10:00:10, 10:01:10]
           Requests in window: 5 (the four above plus 10:01:05)
           → REJECTED (5 >= 5)
```

Pros: Very accurate, no boundary issues
Cons: Memory-intensive (stores all timestamps)
Fixed Window Counter

Divide time into fixed windows and count requests per window.
Limit: 100 requests per minute
```
Window [10:00 - 10:01]: 85 requests  → all allowed
Window [10:01 - 10:02]: 100 requests → all allowed
Window [10:01 - 10:02]: request 101  → REJECTED

Problem: 85 requests at 10:00:59 + 100 at 10:01:01
         = 185 requests in 2 seconds! (boundary issue)
```

Pros: Simple, low memory
Cons: Boundary issue allows a 2x burst at window edges
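The counter itself is trivial, which is exactly why the boundary problem survives in so many systems. A minimal sketch (the `FixedWindowLimiter` class is illustrative):

```python
class FixedWindowLimiter:
    """Fixed window counter: cheap, but allows up to 2x the limit
    across a window boundary (the burst problem described above)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}   # window start time -> request count

    def allow(self, now):
        # Bucket the timestamp into its window: e.g. 59.9 -> window 0,
        # 60.1 -> window 60. Counts reset the instant a new window starts.
        window_start = int(now // self.window) * self.window
        count = self.counts.get(window_start, 0)
        if count >= self.limit:
            return False
        self.counts[window_start] = count + 1
        return True
```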
Sliding Window with Redis

```python
import time

import redis

r = redis.Redis()

def sliding_window_rate_limit(redis_client, user_id, limit=100, window=60):
    """Rate limit using sliding window with atomic Lua script."""
    key = f"rate_limit:{user_id}"
    now = time.time()

    # Lua script for atomic rate limiting
    lua_script = """
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])

    redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
    local count = redis.call('ZCARD', key)

    if count < limit then
        redis.call('ZADD', key, now, now .. '-' .. math.random())
        redis.call('EXPIRE', key, math.ceil(window))
        return 1
    else
        return 0
    end
    """

    allowed = redis_client.eval(lua_script, 1, key, now, window, limit)
    return bool(allowed)

# Usage (inside a request handler):
# if not sliding_window_rate_limit(r, "user_123"):
#     return HttpResponse(status=429, body="Too Many Requests")
```

A rate-limited response should tell the client when to retry:

```
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1700000060
Retry-After: 30
```

Database Scaling

Read Replicas

```
             Application
             │         │
      Write  │         │  Read
     queries │         │ queries
             ▼         ▼
        ┌─────────┐   ┌───────────┐
        │ Primary │──►│ Replica 1 │
        │   DB    │──►│ Replica 2 │
        │ (single │──►│ Replica 3 │
        │ writer) │   └───────────┘
        └─────────┘  (replication)
```

CQRS (Command Query Responsibility Segregation)

Separate the read model from the write model entirely.
```
Commands (Writes)            Queries (Reads)
       │                           │
       ▼                           ▼
┌────────────┐              ┌───────────────┐
│  Command   │              │    Query      │
│  Handler   │              │   Handler     │
└─────┬──────┘              └──────┬────────┘
      │                           │
      ▼                           ▼
┌────────────┐  Sync/Async  ┌───────────────┐
│  Write DB  │─────────────►│    Read DB    │
│ (Postgres) │(event stream)│(Elasticsearch │
│            │              │   or Redis)   │
└────────────┘              └───────────────┘
```

When to use CQRS: When read and write patterns are very different (e.g., simple writes but complex search queries), or when you need to scale reads and writes independently.
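A toy illustration of the split, with the command side projecting synchronously into a denormalized read model (all class names are hypothetical; real systems usually project asynchronously from an event stream, as in the diagram):

```python
class OrderCommandHandler:
    """Write side: persists to the write store, then updates a
    read-optimized projection."""

    def __init__(self, write_db, read_model):
        self.write_db = write_db      # normalized store (dict stand-in)
        self.read_model = read_model  # denormalized per-user view

    def place_order(self, order_id, user, total):
        order = {"id": order_id, "user": user, "total": total}
        self.write_db[order_id] = order
        # Projection: keep a view shaped exactly for the read path.
        self.read_model.setdefault(user, []).append(order)


class OrderQueryHandler:
    """Read side: serves queries straight from the read model,
    never touching the write store."""

    def __init__(self, read_model):
        self.read_model = read_model

    def orders_for_user(self, user):
        return self.read_model.get(user, [])
```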
Database per Service

```
┌──────────────┐  ┌───────────────┐  ┌────────────────┐
│ User Service │  │ Order Service │  │ Product Service│
│              │  │               │  │                │
│ ┌──────────┐ │  │ ┌──────────┐  │  │ ┌──────────┐   │
│ │PostgreSQL│ │  │ │  MySQL   │  │  │ │ MongoDB  │   │
│ │ (users)  │ │  │ │ (orders) │  │  │ │(products)│   │
│ └──────────┘ │  │ └──────────┘  │  │ └──────────┘   │
└──────────────┘  └───────────────┘  └────────────────┘
```

Each microservice owns its own database. Services communicate through APIs or events, never by sharing databases directly.
Benefits: Independent scaling, technology freedom, fault isolation
Challenges: Distributed transactions, data consistency across services, join queries across services
API Gateway

A single entry point that routes requests to the appropriate microservice.
```
                     ┌─────────────────────┐
Mobile App ─────────►│     API Gateway     │
                     │                     │
Web App ────────────►│ - Authentication    │
                     │ - Rate Limiting     │
3rd Party ──────────►│ - Routing           │
                     │ - SSL Termination   │
                     │ - Request/Response  │
                     │   Transformation    │
                     └──┬────┬────┬────┬───┘
                        ▼    ▼    ▼    ▼
                    ┌─────┐┌─────┐┌─────┐┌─────┐
                    │User ││Order││Prod.││Pay. │
                    │ Svc ││ Svc ││ Svc ││ Svc │
                    └─────┘└─────┘└─────┘└─────┘
```

Circuit Breaker

Prevents cascading failures by stopping requests to a failing service.
```
States:
CLOSED ──(failures exceed threshold)──► OPEN
   ▲                                     │
   │                                 (timeout)
   │                                     ▼
   └────────(success)──────────── HALF-OPEN
                          (allow limited requests)
```
```
CLOSED:    Requests flow normally. Failures are counted.
OPEN:      All requests fail immediately. No calls to downstream.
HALF-OPEN: Allow a few test requests. If they succeed, go to CLOSED.
           If they fail, go back to OPEN.
```

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call fails fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is OPEN, failing fast")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
```

Service Discovery

How do services find each other in a dynamic environment?
Client-Side Discovery:
```
┌─────────┐  1. Query      ┌───────────────┐
│ Service │───────────────►│   Service     │
│    A    │◄───────────────│   Registry    │
└────┬────┘  2. Addresses  │ (Consul/etcd) │
     │                     └───────────────┘
     │  3. Direct call to chosen instance
     ▼
┌────────────┐
│ Service B  │
│ Instance 2 │
└────────────┘
```
Server-Side Discovery:
```
┌─────────┐      ┌──────────────┐      ┌────────────┐
│ Service │─────►│     Load     │─────►│ Service B  │
│    A    │      │   Balancer   │      │ Instance 1 │
└─────────┘      │              │      └────────────┘
                 │  (queries    │      ┌────────────┐
                 │   registry   │─────►│ Service B  │
                 │  internally) │      │ Instance 2 │
                 └──────────────┘      └────────────┘
```

From One Server to Global Scale

A typical architecture evolves through stages:

1. Single server: the application and database share one machine.
2. Separate database server: the app server and the database scale independently.
3. Load balancer and multiple app servers: add Redis for caching; the database is still a single server.
4. Single region at scale: a CDN serves static content; an L7 load balancer routes dynamic requests to stateless app servers; a Redis cluster and a database primary with read replicas serve data; a Kafka queue feeds analytics, notifications, and search indexing.
5. Multi-region: GeoDNS routes users to the nearest region; each region runs its own CDN edge, load balancer, app tier, and database shards, with cross-region replication between regions.

Key Takeaways

Scaling Strategy
Start with vertical scaling. Add a load balancer and horizontal scaling when you outgrow a single server. Use auto-scaling for unpredictable traffic. Scale your database with read replicas first, then sharding.
Caching
Cache at every layer: browser, CDN, application, database. Use cache-aside for most cases. Set appropriate TTLs. Plan for cache invalidation from day one. Monitor hit rates.
Message Queues
Use queues to decouple services, handle traffic spikes, and enable async processing. Kafka for event streaming, RabbitMQ for task queues, SQS for simple cloud queues. Always implement dead letter queues.
Resilience
Rate limiting protects your system from abuse. Circuit breakers prevent cascading failures. Health checks enable automatic recovery. Design for failure: it is not a question of if, but when.