
Scalability

Scalability is a system's ability to handle increasing amounts of work by adding resources. A scalable system can grow from serving 100 users to 100 million users without a complete redesign. This page covers the core patterns and strategies that make this possible.


Horizontal vs Vertical Scaling

Vertical Scaling (Scale Up)

Add more resources (CPU, RAM, disk) to a single machine.

Before:                      After:
┌────────────────┐           ┌────────────────────────┐
│ Server         │           │ Server (upgraded)      │
│ 4 CPU cores    │           │ 32 CPU cores           │
│ 16 GB RAM      │    ──►    │ 256 GB RAM             │
│ 500 GB SSD     │           │ 4 TB NVMe SSD          │
│                │           │                        │
│ Handles 1K rps │           │ Handles 10K rps        │
└────────────────┘           └────────────────────────┘

Pros: Simple (no code changes), no distributed-system complexity, strong consistency.
Cons: Hardware limits (you cannot buy a server with 10,000 cores), single point of failure, expensive at the high end, downtime during upgrades.

Horizontal Scaling (Scale Out)

Add more machines to distribute the load.

Before:                  After:
┌────────────────┐       ┌────────────────┐
│ Server         │       │ Server 1       │ ← 3K rps
│ Handles 1K rps │       └────────────────┘
└────────────────┘       ┌────────────────┐
            ──►          │ Server 2       │ ← 3K rps
                         └────────────────┘
                         ┌────────────────┐
                         │ Server 3       │ ← 3K rps
                         └────────────────┘
                         Total: 9K rps
                         (add more as needed)

Pros: Near-linear capacity growth, no single point of failure, cost-effective (commodity hardware), no downtime for scaling.
Cons: Application must be stateless or use shared state, distributed-system complexity (consistency, coordination), operational overhead.

When to Use Each

| Scenario | Recommended | Reason |
| --- | --- | --- |
| Early-stage startup | Vertical | Simpler, cheaper at small scale |
| Database server | Vertical first, then replicas | Databases are harder to distribute |
| Stateless API servers | Horizontal | Easy to add/remove instances |
| Cache layer | Horizontal | Redis Cluster, Memcached pools |
| Reaching hardware limits | Horizontal | Only option beyond the largest machines |
| Predictable, gradual growth | Vertical | Scale the single server incrementally |
| Unpredictable traffic spikes | Horizontal (auto-scaling) | Add/remove instances dynamically |

Load Balancing

A load balancer distributes incoming requests across multiple servers to ensure no single server is overwhelmed.

Architecture

                              ┌─────────────┐
                         ┌───►│  Server 1   │
                         │    │  (healthy)  │
┌──────────┐   ┌─────────┴─┐  └─────────────┘
│          │   │           │  ┌─────────────┐
│ Clients  │──►│   Load    │─►│  Server 2   │
│          │   │ Balancer  │  │  (healthy)  │
└──────────┘   │   (L7)    │  └─────────────┘
               └─────────┬─┘  ┌─────────────┐
                         │    │  Server 3   │
                         └─✗──│ (unhealthy) │ ← removed from pool
                              └─────────────┘

Load Balancing Algorithms

| Algorithm | How It Works | Best For |
| --- | --- | --- |
| Round Robin | Distribute requests sequentially (1 → 2 → 3 → 1 → 2 → 3) | Equal-capacity servers, uniform requests |
| Weighted Round Robin | Like round robin, but each server gets a proportional share | Servers with different capacities |
| Least Connections | Send to the server with the fewest active connections | Varying request durations |
| Weighted Least Connections | Least connections weighted by server capacity | Mixed server fleet |
| IP Hash | Hash the client IP to determine the server | Session affinity without cookies |
| Least Response Time | Send to the server with the fastest response and fewest connections | Performance-sensitive applications |
| Random | Random server selection | Simple, surprisingly effective at scale |
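The selection rules above are simple enough to sketch in a few lines. A toy illustration of round robin and least connections, with `Server` as a hypothetical in-memory stand-in for a real backend:

```python
import itertools

class Server:
    """Hypothetical stand-in for a backend server."""
    def __init__(self, name):
        self.name = name
        self.active_connections = 0

def round_robin(servers):
    """Round robin: cycle through servers in order (1 -> 2 -> 3 -> 1 -> ...)."""
    return itertools.cycle(servers)

def least_connections(servers):
    """Least connections: pick the server with the fewest in-flight requests."""
    return min(servers, key=lambda s: s.active_connections)

servers = [Server("s1"), Server("s2"), Server("s3")]
rr = round_robin(servers)
picks = [next(rr).name for _ in range(4)]  # wraps back to s1 on the 4th pick

servers[0].active_connections = 5
servers[1].active_connections = 2
servers[2].active_connections = 7
```

Real load balancers add health checks and atomic connection counting on top of this, but the core selection logic is exactly this small.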

Layer 4 vs Layer 7 Load Balancing

| Feature | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Operates on | IP addresses, TCP/UDP ports | HTTP headers, URLs, cookies |
| Speed | Very fast (no payload inspection) | Slower (inspects content) |
| Routing decisions | Based on IP and port | Based on URL path, headers, content |
| SSL termination | No (passes through) | Yes (decrypts and re-encrypts) |
| Use case | Raw TCP/UDP traffic, gaming | HTTP APIs, microservices, A/B testing |
| Examples | AWS NLB, HAProxy (TCP mode) | AWS ALB, Nginx, HAProxy (HTTP mode) |

Health Checks

Load balancers must know which servers are healthy. Common health check patterns:

# Simple HTTP health check endpoint (FastAPI-style)
from fastapi.responses import JSONResponse

@app.get("/health")
def health_check():
    checks = {
        "database": check_db_connection(),
        "cache": check_redis_connection(),
        "disk_space": check_disk_space(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        content={"status": "healthy" if all_healthy else "unhealthy", "checks": checks},
        status_code=status_code,
    )

Real-World Example: Multi-Tier Load Balancing

Internet
    │
    ▼
┌────────────┐
│ DNS-based  │   Round-robin across geographic regions
│ Load Bal.  │
└─────┬──────┘
      │
  ┌───┴───┐
  ▼       ▼
┌────┐  ┌────┐
│CDN │  │CDN │   Edge nodes serve static content
│Edge│  │Edge│
└──┬─┘  └──┬─┘
   │       │
   ▼       ▼
┌───────────────┐
│   L7 Load     │   Route by URL path (/api → API, /ws → WebSocket)
│   Balancer    │
│  (ALB/Nginx)  │
└──┬───┬───┬────┘
   │   │   │
   ▼   ▼   ▼
┌────┐┌────┐┌────┐
│API ││API ││API │   Stateless application servers
│ 1  ││ 2  ││ 3  │
└──┬─┘└──┬─┘└──┬─┘
   │     │     │
   ▼     ▼     ▼
┌─────────────────┐
│ Internal L4 LB  │   Route to database/cache cluster
└──┬─────────┬────┘
   ▼         ▼
┌───────┐ ┌───────┐
│  DB   │ │ Redis │
│Primary│ │Cluster│
└───────┘ └───────┘

Caching

Caching stores copies of frequently accessed data in faster storage to reduce latency and backend load. A well-implemented cache can reduce database queries by 90% or more.

Cache Hierarchy

Request flow (checked in order; typical latencies):
┌─────────────────────────────────────────────────────┐
│ 1. Browser Cache     (~0ms)   - HTTP cache headers  │
│ 2. CDN Cache         (~10ms)  - Edge servers        │
│ 3. API Gateway Cache (~1ms)   - Reverse proxy cache │
│ 4. Application Cache (~1ms)   - In-memory (Redis)   │
│ 5. Database Cache    (~5ms)   - Query result cache  │
│ 6. Database Disk     (~10ms)  - Actual data read    │
└─────────────────────────────────────────────────────┘

Caching Strategies

Cache-Aside (Lazy Loading)

The application manages the cache explicitly; this is the most common strategy.

Read Path:
1. App checks cache
2. Cache HIT → return cached data
3. Cache MISS → query database → store in cache → return

┌─────┐  1. GET key    ┌───────┐
│ App │───────────────►│ Cache │
│     │◄───────────────│(Redis)│
│     │  2. HIT/MISS   └───────┘
│     │
│     │  3. (on MISS)  ┌────┐
│     │───────────────►│ DB │
│     │◄───────────────│    │
│     │  4. Return data└────┘
│     │
│     │  5. SET key    ┌───────┐
│     │───────────────►│ Cache │
└─────┘                └───────┘
import json

def get_user(user_id: str) -> dict:
    # 1. Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # 2. Cache miss - query database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    # 3. Store in cache with a TTL of one hour
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

Pros: Only caches data that is actually requested, resilient to cache failures.
Cons: Cache miss penalty (three round trips), potential for stale data.

Write-Through

Every write goes to the cache AND the database simultaneously.

Write Path:
1. App writes to cache
2. Cache writes to database
3. Both are updated before returning
┌─────┐  1. Write   ┌───────┐  2. Write   ┌────┐
│ App │────────────►│ Cache │────────────►│ DB │
│     │◄────────────│       │◄────────────│    │
└─────┘  4. ACK     └───────┘  3. ACK     └────┘

Pros: Cache always has the latest data, strong consistency.
Cons: Higher write latency (two writes), caches data that may never be read.
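In-memory stand-ins make the write path easy to see. A minimal sketch where plain dicts play the roles of the cache and the database:

```python
cache = {}   # stand-in for Redis
db = {}      # stand-in for the database

def write_through(key, value):
    """Update cache and database before acknowledging the write."""
    cache[key] = value   # 1. write to cache
    db[key] = value      # 2. write to database
    return "ACK"         # 3. both updated before returning

def read(key):
    """Reads hit the cache, which is always current under write-through."""
    return cache.get(key)
```

In a real system the two writes should be made atomic (or at least retried on partial failure), which is where most of the implementation complexity lives.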

Write-Behind (Write-Back)

Write to the cache immediately, then asynchronously write to the database.

Write Path:
1. App writes to cache (returns immediately)
2. Cache asynchronously flushes to database (batched)
┌─────┐  1. Write   ┌───────┐              ┌────┐
│ App │────────────►│ Cache │──(async)────►│ DB │
│     │◄────────────│       │   batched    │    │
└─────┘  2. ACK     └───────┘              └────┘
         (fast!)

Pros: Very fast writes, can batch database writes for efficiency.
Cons: Risk of data loss if the cache fails before flushing, complex implementation.

Cache Eviction Policies

When the cache is full, which entries should be removed?

| Policy | How It Works | Best For |
| --- | --- | --- |
| LRU (Least Recently Used) | Evict the entry accessed longest ago | General purpose, most common |
| LFU (Least Frequently Used) | Evict the entry accessed least often | Data with varying popularity |
| FIFO (First In, First Out) | Evict the oldest entry | Simple, predictable |
| TTL (Time to Live) | Expire entries after a fixed time | Data with known staleness tolerance |
| Random | Evict a random entry | Simple, good enough in many cases |
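LRU, the most common policy, is often implemented with an ordered map. A minimal sketch using Python's `collections.OrderedDict`:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now most recently used
cache.put("c", 3)  # capacity exceeded: evicts "b"
```

Production caches (Redis, Memcached) use approximations of this idea to avoid the bookkeeping cost of exact recency tracking.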

Cache Invalidation

Cache invalidation is one of the hardest problems in computer science. Common approaches:

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| TTL-based | Data expires after a set time | Simple, but stale data until expiry |
| Event-driven | Invalidate on write events | Fresh data, but complex to implement |
| Version-based | Each entry has a version number | Precise, but requires version tracking |
| Purge all | Clear the entire cache | Simple, but cache stampede risk |
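Version-based invalidation can be sketched by folding a version number into the cache key, so a write only bumps the version and stale entries simply stop being read (the `store` dict is a stand-in for a real cache like Redis; the `user:` key format is illustrative):

```python
store = {}      # stand-in for a cache like Redis
versions = {}   # current version per entity

def cache_key(entity_id):
    return f"user:{entity_id}:v{versions.get(entity_id, 0)}"

def write_user(entity_id, data):
    """Bump the version; entries cached under the old key become unreachable."""
    versions[entity_id] = versions.get(entity_id, 0) + 1
    store[cache_key(entity_id)] = data

def read_user(entity_id):
    return store.get(cache_key(entity_id))
```

Old versions are never explicitly deleted here; in a real cache they age out via TTL or eviction, which is what makes this approach simple to operate.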

CDN (Content Delivery Network)

A CDN caches static content (images, CSS, JavaScript, videos) at edge servers distributed globally, serving content from the server closest to the user.

Without CDN:
User (Tokyo) ───────────── 200ms ───────────── Origin (US East)

With CDN:
User (Tokyo) ──── 20ms ──── CDN Edge (Tokyo)
                                │
                        (Cache MISS only)
                                │
                                └── 200ms ──── Origin (US East)

What to cache on CDN:

  • Static assets (images, CSS, JS, fonts)
  • API responses that change infrequently
  • HTML pages (for static sites)
  • Video and audio streams

What NOT to cache on CDN:

  • Personalized content (user-specific data)
  • Real-time data (stock prices, live scores)
  • POST/PUT/DELETE requests

Popular CDNs: CloudFront (AWS), Cloudflare, Fastly, Akamai


Message Queues

Message queues enable asynchronous communication between services. A producer sends messages to a queue, and consumers process them independently.

Why Use Message Queues

Synchronous (without queue):
┌────────┐   ┌────────────┐   ┌───────────┐   ┌────────────┐
│ Client │──►│ API Server │──►│ Email Svc │──►│ Analytics  │
│        │◄──│  (waits)   │◄──│  (slow)   │◄──│  (slow)    │
└────────┘   └────────────┘   └───────────┘   └────────────┘
Total latency: API + Email + Analytics = 500ms

Asynchronous (with queue):
┌────────┐   ┌────────────┐   ┌───────┐   ┌────────────┐
│ Client │──►│ API Server │──►│ Queue │──►│ Email Svc  │ (processes later)
│        │◄──│ (returns)  │   │       │──►│ Analytics  │ (processes later)
└────────┘   └────────────┘   └───────┘   └────────────┘
Total latency: API only = 50ms

Message Queue Patterns

Point-to-Point (Queue)

Each message is consumed by exactly one consumer.

Producer ──► [MSG3][MSG2][MSG1] ──► Consumer A (processes MSG1)
                                ──► Consumer B (processes MSG2)

Use case: Task distribution (email sending, image processing, order fulfillment).
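In-process, the pattern reduces to a shared queue drained by competing workers. A toy sketch where `queue.Queue` stands in for the broker and each consumer takes a bounded batch:

```python
import queue

tasks = queue.Queue()   # stand-in for the broker's queue
processed = {"A": [], "B": []}

for i in range(1, 5):
    tasks.put(f"MSG{i}")

def consume(consumer_name, max_messages):
    """Each message is taken by exactly one consumer, in FIFO order."""
    for _ in range(max_messages):
        try:
            msg = tasks.get_nowait()
        except queue.Empty:
            return
        processed[consumer_name].append(msg)

consume("A", 2)
consume("B", 2)
```

With real brokers the consumers run concurrently and must acknowledge each message, but the exactly-one-consumer delivery semantics are the same.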

Publish-Subscribe (Topic)

Each message is delivered to all subscribers.

Publisher ──► [Topic: "orders"] ──► Subscriber A (Inventory)
                                ──► Subscriber B (Analytics)
                                ──► Subscriber C (Notifications)

Use case: Event broadcasting (order placed, user signed up, payment processed).
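A pub/sub broker can be sketched as a topic-to-subscribers map, where publishing fans one message out to every subscriber (subscriber names like `inventory` are illustrative):

```python
from collections import defaultdict

subscribers = defaultdict(list)   # topic -> subscriber names
received = defaultdict(list)      # subscriber name -> messages seen

def subscribe(topic, name):
    subscribers[topic].append(name)

def publish(topic, message):
    """Deliver a copy of the message to every subscriber of the topic."""
    for name in subscribers[topic]:
        received[name].append(message)

subscribe("orders", "inventory")
subscribe("orders", "analytics")
subscribe("orders", "notifications")
publish("orders", {"order_id": 42})
```

The key contrast with point-to-point: publishing once produces three deliveries, and adding a new subscriber requires no change to the publisher.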

Message Queue Technologies

| Technology | Type | Ordering | Throughput | Best For |
| --- | --- | --- | --- | --- |
| RabbitMQ | Broker-based | Per-queue | Moderate (~50K msg/s) | Task queues, RPC, complex routing |
| Apache Kafka | Log-based | Per-partition | Very high (~1M msg/s) | Event streaming, log aggregation, analytics |
| Amazon SQS | Cloud queue | Best-effort (FIFO available) | High | Simple task queues, serverless |
| Redis Streams | Log-based | Per-stream | High | Lightweight streaming, real-time |
| Apache Pulsar | Log-based | Per-partition | Very high | Multi-tenancy, geo-replication |

Kafka Architecture Example

Producers           Kafka Cluster              Consumers
              ┌────────────────────────┐
┌──────────┐  │  Topic: "orders"       │       ┌───────────────┐
│  Order   │─►│  ┌──────────────────┐  │──────►│  Consumer     │
│ Service  │  │  │ Partition 0      │  │       │  Group A      │
└──────────┘  │  │ [5][4][3][2][1]  │  │       │  (Inventory)  │
              │  └──────────────────┘  │       └───────────────┘
┌──────────┐  │  ┌──────────────────┐  │       ┌───────────────┐
│ Payment  │─►│  │ Partition 1      │  │──────►│  Consumer     │
│ Service  │  │  │ [4][3][2][1]     │  │       │  Group B      │
└──────────┘  │  └──────────────────┘  │       │  (Analytics)  │
              │  ┌──────────────────┐  │       └───────────────┘
              │  │ Partition 2      │  │
              │  │ [3][2][1]        │  │
              │  └──────────────────┘  │
              └────────────────────────┘

Handling Failures in Message Processing

# Retry with exponential backoff; handle(), acknowledge(), and
# send_to_dead_letter_queue() are application-defined helpers
import random
import time

def process_message(message, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = handle(message)
            acknowledge(message)
            return result
        except TransientError:
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)  # 1s, 2s, 4s (+ jitter)
        except PermanentError:
            send_to_dead_letter_queue(message)
            return
    # All retries exhausted
    send_to_dead_letter_queue(message)

Dead Letter Queue (DLQ): A separate queue for messages that could not be processed after all retries. Engineers can inspect and reprocess them manually or with automated tooling.


Rate Limiting

Rate limiting controls the number of requests a client can make in a given time window, protecting systems from abuse, DDoS attacks, and cascading failures.

Common Algorithms

1. Token Bucket

A bucket holds tokens. Each request consumes a token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.

Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0: [●●●●●●●●●●] 10 tokens (full)
        5 requests arrive → 5 tokens consumed
Time 0: [●●●●●○○○○○] 5 tokens remaining
Time 1: [●●●●●●●○○○] 7 tokens (5 + 2 refilled)
        8 requests arrive → 7 allowed, 1 rejected
Time 1: [○○○○○○○○○○] 0 tokens
Time 2: [●●○○○○○○○○] 2 tokens (0 + 2 refilled)

Pros: Allows bursts up to the bucket size, smooth rate limiting.
Cons: Requires per-user state.
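The trace above maps directly to a small class. A common trick, used here, is to refill lazily from elapsed time instead of running a background timer; the clock is injectable so the example is deterministic:

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = float(capacity)   # start full
        self.clock = clock              # injectable time source, for testing
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        # Refill lazily based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Simulated clock reproducing the trace: capacity 10, refill 2 tokens/s
t = [0.0]
bucket = TokenBucket(capacity=10, refill_rate=2, clock=lambda: t[0])
burst = [bucket.allow() for _ in range(5)]   # 5 requests at t=0, all allowed
t[0] = 1.0                                    # one second later: 5 + 2 = 7 tokens
second = [bucket.allow() for _ in range(8)]  # 7 allowed, 1 rejected
```

In production, `clock` would simply be `time.monotonic`, and the per-user state would live in a shared store such as Redis.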

2. Sliding Window Log

Track timestamps of all requests. Count requests in the current window.

Window: 1 minute, Limit: 5 requests

Request log for User A:
[10:00:15, 10:00:22, 10:00:45, 10:00:51, 10:00:58]

10:01:20 → Window = [10:00:20, 10:01:20]
           Requests in window: 4 (10:00:22, 10:00:45, 10:00:51, 10:00:58)
           → ALLOWED (4 < 5), request logged at 10:01:20

10:01:21 → Window = [10:00:21, 10:01:21]
           Requests in window: 5 (the four above plus 10:01:20)
           → REJECTED (5 >= 5)

Pros: Very accurate, no boundary issues.
Cons: Memory-intensive (stores all timestamps).

3. Fixed Window Counter

Divide time into fixed windows and count requests per window.

Limit: 100 requests per minute

Window [10:00 - 10:01]: 85 requests  → all allowed
Window [10:01 - 10:02]: 100 requests → all allowed
Window [10:01 - 10:02]: request 101  → REJECTED

Problem: 85 requests at 10:00:59 + 100 at 10:01:01
         = 185 requests in ~2 seconds (boundary issue)

Pros: Simple, low memory.
Cons: Boundary issue allows up to a 2x burst at window edges.
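A fixed window counter is one dictionary lookup per request, and the boundary burst from the example above falls out of the sketch directly:

```python
from collections import defaultdict

counters = defaultdict(int)   # window start time -> request count

def fixed_window_allow(timestamp, limit=100, window=60):
    """Count requests per fixed window; reject once the limit is reached."""
    window_start = int(timestamp // window) * window
    if counters[window_start] >= limit:
        return False
    counters[window_start] += 1
    return True

# Boundary issue: a full burst at the end of one window plus a full
# burst at the start of the next are both allowed, ~2 seconds apart.
end_burst = [fixed_window_allow(59.0) for _ in range(100)]
start_burst = [fixed_window_allow(61.0) for _ in range(100)]
```

This is why fixed windows are usually acceptable only when an occasional double-rate burst is tolerable; otherwise use a sliding window or token bucket.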

Rate Limiting Implementation with Redis

import time

import redis

r = redis.Redis()

def sliding_window_rate_limit(redis_client, user_id, limit=100, window=60):
    """Rate limit using a sliding window, made atomic with a Lua script."""
    key = f"rate_limit:{user_id}"
    now = time.time()
    # Lua script: drop expired entries, count the rest, add if under the limit
    lua_script = """
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])
    redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
    local count = redis.call('ZCARD', key)
    if count < limit then
        redis.call('ZADD', key, now, now .. '-' .. math.random())
        redis.call('EXPIRE', key, math.ceil(window))
        return 1
    else
        return 0
    end
    """
    allowed = redis_client.eval(lua_script, 1, key, now, window, limit)
    return bool(allowed)

# Usage (inside a request handler):
# if not sliding_window_rate_limit(r, "user_123"):
#     return HttpResponse(status=429, body="Too Many Requests")

Rate Limiting Response Headers

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1700000060
Retry-After: 30

Database Scaling Patterns

Pattern 1: Read Replicas with Write/Read Splitting

┌─────────────────────────────────────────────────────────┐
│                      Application                        │
│                                                         │
│  Write queries ──────────►  Primary DB                  │
│                          (single writer)                │
│                                │                        │
│                    ┌───────────┼───────────┐            │
│                    ▼           ▼           ▼            │
│  Read queries ──► Replica 1  Replica 2  Replica 3       │
│                                                         │
└─────────────────────────────────────────────────────────┘
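Write/read splitting is often implemented as a thin router in the application: statements that modify data go to the primary, reads rotate across replicas. A minimal sketch with placeholder connection names standing in for real connection pools:

```python
import itertools

class Router:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)  # round-robin across replicas

    def connection_for(self, sql):
        """Route writes to the primary, reads to the next replica."""
        verb = sql.lstrip().split()[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE"):
            return self.primary
        return next(self.replicas)

router = Router("primary", ["replica1", "replica2"])
```

One caveat: because replication is asynchronous, a client may not read its own just-committed write from a replica. Many systems therefore pin a session to the primary for a short period after a write.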

Pattern 2: CQRS (Command Query Responsibility Segregation)

Separate the read model from the write model entirely.

Commands (Writes)            Queries (Reads)
      │                            │
      ▼                            ▼
┌──────────┐                ┌──────────────┐
│ Command  │                │    Query     │
│ Handler  │                │   Handler    │
└────┬─────┘                └──────┬───────┘
     │                             │
     ▼                             ▼
┌──────────┐   Sync/Async   ┌──────────────┐
│ Write DB │───────────────►│   Read DB    │
│(Postgres)│ (event stream) │(Elasticsearch│
│          │                │  or Redis)   │
└──────────┘                └──────────────┘

When to use CQRS: When read and write patterns are very different (e.g., simple writes but complex search queries), or when you need to scale reads and writes independently.

Pattern 3: Database per Service (Microservices)

┌──────────────┐  ┌──────────────┐  ┌───────────────┐
│ User Service │  │ Order Service│  │Product Service│
│              │  │              │  │               │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐  │
│ │PostgreSQL│ │  │ │  MySQL   │ │  │ │ MongoDB  │  │
│ │ (users)  │ │  │ │ (orders) │ │  │ │(products)│  │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘  │
└──────────────┘  └──────────────┘  └───────────────┘

Each microservice owns its own database. Services communicate through APIs or events, never by sharing databases directly.

Benefits: Independent scaling, technology freedom, fault isolation.
Challenges: Distributed transactions, data consistency across services, join queries across services.


Microservices Patterns

API Gateway

A single entry point that routes requests to the appropriate microservice.

                    ┌───────────────────┐
                    │    API Gateway    │
Mobile App ────────►│                   │
                    │ - Authentication  │
Web App ───────────►│ - Rate Limiting   │
                    │ - Routing         │
3rd Party ─────────►│ - SSL Termination │
                    │ - Request/Response│
                    │   Transformation  │
                    └──┬───┬───┬───┬────┘
                       │   │   │   └──────────┐
          ┌────────────┘   │   └───┐          │
          ▼                ▼       ▼          ▼
      ┌───────┐      ┌───────┐ ┌───────┐  ┌───────┐
      │ User  │      │ Order │ │Product│  │Payment│
      │  Svc  │      │  Svc  │ │  Svc  │  │  Svc  │
      └───────┘      └───────┘ └───────┘  └───────┘

Circuit Breaker

Prevents cascading failures by stopping requests to a failing service.

States:
  CLOSED ──(failures exceed threshold)──► OPEN
    ▲                                       │
    │                                   (timeout)
    │                                       │
    └──(success)── HALF-OPEN ◄──────────────┘
                   (allow limited requests)

CLOSED:    Requests flow normally. Failures are counted.
OPEN:      All requests fail immediately. No calls to downstream.
HALF-OPEN: Allow a few test requests. If they succeed, go to CLOSED.
           If they fail, go back to OPEN.
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN  # allow a test request
            else:
                raise CircuitOpenError("Circuit is OPEN, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # A failed test request in HALF_OPEN reopens the circuit immediately
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN

Service Discovery

How do services find each other in a dynamic environment?

Client-Side Discovery:
┌─────────┐  1. Query     ┌──────────────┐
│ Service │──────────────►│   Service    │
│    A    │               │   Registry   │
│         │◄──────────────│ (Consul/etcd)│
│         │  2. Addresses └──────────────┘
│         │
│         │  3. Direct call to chosen instance
│         │───────────────►┌──────────┐
│         │                │Service B │
└─────────┘                │Instance 2│
                           └──────────┘

Server-Side Discovery:
┌─────────┐               ┌────────────┐         ┌──────────┐
│ Service │──────────────►│    Load    │────────►│Service B │
│    A    │               │  Balancer  │         │Instance 1│
└─────────┘               │            │         └──────────┘
                          │  (queries  │         ┌──────────┐
                          │  registry  │────────►│Service B │
                          │ internally)│         │Instance 2│
                          └────────────┘         └──────────┘
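Client-side discovery reduces to two steps: ask the registry for healthy instances, then pick one. A sketch with an in-memory dict standing in for Consul or etcd (the service name and addresses are illustrative):

```python
import random

registry = {   # stand-in for Consul/etcd
    "service-b": [
        {"addr": "10.0.0.1:8080", "healthy": True},
        {"addr": "10.0.0.2:8080", "healthy": False},
        {"addr": "10.0.0.3:8080", "healthy": True},
    ]
}

def discover(service_name):
    """Return addresses of healthy instances only."""
    return [i["addr"] for i in registry[service_name] if i["healthy"]]

def choose(service_name):
    """Client-side load balancing: pick one healthy instance at random."""
    return random.choice(discover(service_name))
```

A real client would also cache the instance list and refresh it on a TTL or via registry watches, so every call does not hit the registry.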

Real-World Scaling Example: From 0 to Millions

Stage 1: Single Server (0-1K users)

┌──────────────────────┐
│    Single Server     │
│                      │
│     ┌────────┐       │
│     │  App   │       │
│     └───┬────┘       │
│         │            │
│     ┌───▼────┐       │
│     │   DB   │       │
│     └────────┘       │
└──────────────────────┘

Stage 2: Separate DB (1K-10K users)

┌────────────┐      ┌────────────┐
│ App Server │─────►│  Database  │
│            │      │   Server   │
└────────────┘      └────────────┘

Stage 3: Add Cache and Load Balancer (10K-100K users)

                  ┌────────────┐
             ┌───►│ App Srv 1  │───┐
┌────┐       │    └────────────┘   │     ┌───────┐
│ LB │───────┼───►┌────────────┐   ├────►│ Redis │
│    │       │    │ App Srv 2  │───┤     └───────┘
└────┘       │    └────────────┘   │     ┌───────┐
             └───►┌────────────┐   ├────►│  DB   │
                  │ App Srv 3  │───┘     └───────┘
                  └────────────┘

Stage 4: Read Replicas, CDN, MQ (100K-1M users)

             ┌─────┐
Users ──────►│ CDN │  (static content)
             └──┬──┘
                │  (dynamic requests)
                ▼
            ┌────────┐
            │ L7 LB  │
            └───┬────┘
          ┌─────┼─────┐
          ▼     ▼     ▼
        ┌────┐┌────┐┌────┐
        │App ││App ││App │
        │ 1  ││ 2  ││ 3  │
        └─┬──┘└─┬──┘└─┬──┘
          │     │     │
     ┌────┴─────┼─────┴────┐
     ▼          │          ▼
┌───────┐       │   ┌───────────┐     ┌──────────────┐
│ Redis │       │   │ DB Primary│────►│ DB Replica 1 │
│Cluster│       │   │           │────►│ DB Replica 2 │
└───────┘       │   └───────────┘     └──────────────┘
                ▼
           ┌─────────┐
           │  Kafka  │──► Analytics
           │  Queue  │──► Notifications
           └─────────┘──► Search indexing

Stage 5: Sharding, Multi-Region (1M+ users)

              Global DNS (GeoDNS)
        ┌───────────┴───────────┐
        ▼                       ▼
┌────────────────┐      ┌────────────────┐
│   US Region    │      │   EU Region    │
│                │      │                │
│  ┌──────────┐  │      │  ┌──────────┐  │
│  │ CDN Edge │  │      │  │ CDN Edge │  │
│  └────┬─────┘  │      │  └────┬─────┘  │
│       ▼        │      │       ▼        │
│  ┌──────────┐  │      │  ┌──────────┐  │
│  │    LB    │  │      │  │    LB    │  │
│  └────┬─────┘  │      │  └────┬─────┘  │
│  ┌────┴─────┐  │      │  ┌────┴─────┐  │
│  │ App x N  │  │      │  │ App x N  │  │
│  └────┬─────┘  │      │  └────┬─────┘  │
│  ┌────┴─────┐  │      │  ┌────┴─────┐  │
│  │ Shard 1  │  │      │  │ Shard 3  │  │
│  │ Shard 2  │  │      │  │ Shard 4  │  │
│  └──────────┘  │      │  └──────────┘  │
└────────────────┘      └────────────────┘
        │                       │
        └───── Cross-region ────┘
               replication

Summary

Scaling Strategy

Start with vertical scaling. Add a load balancer and horizontal scaling when you outgrow a single server. Use auto-scaling for unpredictable traffic. Scale your database with read replicas first, then sharding.

Caching

Cache at every layer: browser, CDN, application, database. Use cache-aside for most cases. Set appropriate TTLs. Plan for cache invalidation from day one. Monitor hit rates.

Message Queues

Use queues to decouple services, handle traffic spikes, and enable async processing. Kafka for event streaming, RabbitMQ for task queues, SQS for simple cloud queues. Always implement dead letter queues.

Resilience

Rate limiting protects your system from abuse. Circuit breakers prevent cascading failures. Health checks enable automatic recovery. Design for failure β€” it is not a question of if, but when.