System Design Essentials: Redis, Rate Limiting, Circuit Breaker, and Scaling Strategies

Lou Chang included in SystemDesign SRE

2026-04-15 About 2400 words 11 minutes

Contents

This post covers core System Design concepts frequently appearing in SRE interviews: Redis data structures and use cases, four Rate Limiting algorithms, Circuit Breaker’s three states, and common design patterns for Scaling Reads and Scaling Writes — the building blocks behind any high-traffic deployment.

Redis

Memory-Speed Lookups Versus Disk-Latency Queries

Every database read involves disk I/O with latency around 1-10ms. Redis stores data in memory with latency around 0.1ms. For hot data (frequently read, infrequently changed), caching can reduce database pressure by orders of magnitude.

Redis is a key-value store, but not just a cache — its data structures enable it to serve many different roles.

Core Data Types

Structure	Analogy	Suitable Use Cases
String	Variable	cache, session, counter
Hash	Object/dict	user data, configuration
List	Double-ended queue	message queue, recent activity log
Set	Unordered collection	tags, friend list, deduplication
Sorted Set	Collection with scores	leaderboard, priority queue, time series

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


String:
  SET user:123:name "Alice"
  GET user:123:name  -> "Alice"
  INCR user:123:loginCount  -> atomic increment

Hash:
  HSET user:123 name "Alice" age 30 city "Toronto"
  HGET user:123 name  -> "Alice"
  HGETALL user:123    -> all fields

List:
  RPUSH queue:tasks "job1" "job2"
  LPOP queue:tasks  -> "job1"  (FIFO)

Set:
  SADD user:123:tags "golang" "devops"
  SMEMBERS user:123:tags  -> {"golang", "devops"}

Sorted Set:
  ZADD leaderboard 9500 "user:123"
  ZADD leaderboard 8200 "user:456"
  ZREVRANGE leaderboard 0 9 WITHSCORES  -> top 10 with scores

Typical Production Roles

Cache: Most common. After reading from the database, store in Redis. Next time, fetch directly from Redis. TTL controls expiration.

Distributed Lock: When multiple services compete for the same resource (e.g., ticket booking), use SET key value NX EX 300 to create a lock. NX = only set if key doesn’t exist (atomic operation), preventing two instances from acquiring the lock simultaneously. EX 300 = auto-release after 300 seconds, preventing the lock from never being released.

Leaderboard: Sorted Set natively supports leaderboards. ZADD adds scores, ZREVRANGE gets top N, time complexity O(log N).

Rate Limiting: Use INCR for counting, each key represents a time window. Multiple API Gateway instances share the same Redis for global rate limiting.

Proximity Search: GEOADD adds coordinates, GEODIST calculates distance, GEOSEARCH finds points within a range. Suitable for food delivery, ride-hailing, map search.

Streams: Message stream similar to Kafka, supporting consumer groups, message persistence, and replay. More reliable than Pub/Sub.

Pub/Sub: Publish/subscribe pattern for real-time push notifications. Characteristic is fire-and-forget — messages are sent with no guarantee the recipient received them, and messages during disconnection are lost. Suitable for real-time notifications, not for scenarios requiring guaranteed delivery.

Single-Entry Traffic Hotspot

A Hot Key is when a large number of requests concentrate on the same Redis key. For example, a celebrity posting causes millions of people to simultaneously read the same post’s cache key, overloading the Redis node where that key resides.

Three mitigations:

Local Cache: Store an in-memory cache on each application service instance, reducing the number of requests hitting Redis. The downside is that each instance’s cache may be inconsistent, and the total memory consumed across all instances is much larger.

Replica: Copy the Hot Key to multiple Redis nodes, e.g., post:viral:123:replica1, post:viral:123:replica2. When reading, randomly pick a replica to distribute load. The downside is needing extra management logic to maintain replica consistency.

Key Sharding: Add a random suffix to the key post:viral:123:{0~9}, writing to all suffixes simultaneously and reading from a random one. This essentially spreads the Hot Key across multiple physical keys, leveraging Redis Cluster’s different slots for load distribution. This is the most fundamental strategy because it directly distributes traffic to different nodes, not just reducing request count.

Standalone, Sentinel, and Cluster Topologies

Mode	Characteristics	Suitable For
Standalone	Single node, simple	Development environment
Sentinel	Primary-replica replication + automatic failover	Medium systems, HA requirements
Cluster	Automatic sharding, multiple primary nodes	Large systems, horizontal scaling

Traffic Shaping and Throttling

Fixed Window, Sliding Window, Token Bucket, Leaky Bucket

Fixed Window: Divide time into fixed windows (e.g., per minute), each window has a counter. When a request arrives, the counter increments. Requests exceeding the limit are rejected.

1
2
3
4
5
6


Window: 00:00 - 01:00
Counter: 95 requests
Limit: 100

request at 00:59 -> counter = 96 -> allowed
request at 01:00 -> new window, counter = 0 -> allowed

Problem: At window boundaries, you can instantly hit double the traffic — 100 requests at 00:59 + 100 requests at 01:00 = 200 requests in 2 seconds.

Sliding Window: Counts “requests in the past N seconds” rather than “requests in this window.” Solves Fixed Window’s boundary issue. The tradeoff is needing to record each request’s timestamp, resulting in higher storage cost.

Token Bucket: The bucket generates tokens at a fixed rate (e.g., 10 per second), with a capacity limit. Each request consumes one token. When the bucket is empty, requests are rejected. Allows brief traffic bursts as long as tokens are available. Suitable for APIs allowing some flexibility.

1
2
3
4
5
6
7


Bucket capacity: 100 tokens
Refill rate: 10 tokens/second

Burst scenario:
  - bucket is full (100 tokens)
  - 80 requests arrive at once -> allowed (80 tokens consumed)
  - next second: 10 tokens refilled

Leaky Bucket: Requests enter a fixed-capacity queue (bucket), flowing out at a fixed rate for processing. When the bucket is full, new requests are rejected. Output rate is perfectly smooth with no burst. Suitable for scenarios requiring strictly controlled traffic (e.g., paid APIs).

Algorithm	Burst Support	Implementation Complexity	Suitable For
Fixed Window	Yes (boundary)	Lowest	Simple APIs
Sliding Window	No	Medium	Precise rate limiting
Token Bucket	Yes (controllable)	Medium	General APIs
Leaky Bucket	No	Medium	Strictly smooth output

Cross-Instance Counting via Shared Store

Multiple API Gateway instances each counting independently means users can bypass rate limits — hitting three instances with 100 requests each is actually 300 requests.

Fix: All instances share Redis for counting.

1
2
3


Gateway A --\
Gateway B ----> Redis (shared counter) -> enforce global limit
Gateway C --/

1
2


INCR rate:user:123:minute:1744840800   # key = user + current minute timestamp
EXPIRE rate:user:123:minute:1744840800 60

INCR is an atomic operation — Redis guarantees only one operation can modify a key at a time. There’s no problem of two instances reading the same number and each incrementing it.

Cascading Failure Prevention

Upstream Drag-Down from Slow Dependents

Service A calls Service B. B starts responding slowly (e.g., database overloaded). A’s requests pile up waiting for B’s response, A’s threads are all stuck, and A also starts failing to handle new requests. This is Cascading Failure — one service’s problem spreads upstream.

Circuit Breaker’s design: when downstream failure rate exceeds a threshold, short-circuit directly (stop attempting to call B), returning an error or fallback immediately. This keeps A available and prevents it from being dragged down by B.

Closed, Open, and Half-Open Cycle

1
2
3
4


Closed -> (error rate > threshold) -> Open
Open   -> (after timeout)          -> Half-Open
Half-Open -> (test request fails)  -> Open
Half-Open -> (test request ok)     -> Closed

Closed (normal): All requests are sent normally. Failure rate is tracked. When it exceeds the threshold, switch to Open.

Open (tripped): No longer attempts to call downstream. Returns error or fallback result directly. Gives downstream time to recover.

Half-Open (probing): After waiting a period, allows a small number of test requests through. If successful, switch back to Closed. If still failing, return to Open and continue waiting.

Fallback can be: returning cached stale data, returning default values, or degrading functionality (e.g., e-commerce showing popular products instead of personalized recommendations).

Read-Path Capacity Growth

The standard path for read traffic scaling: Index → Denormalization → Read Replica → Sharding → Caching Layer → CDN.

Keeping Cached Data Fresh

Cache’s core question: after data is updated, when does the stale data in cache expire?

TTL: Simplest. Set an expiration time, auto-expires when time is up. Downside is between data update and TTL expiry, users see stale data.

Write-through: Every database write synchronously updates the cache. Highest data consistency, but increases write latency.

Write-behind (Write-back): Write cache first, asynchronously batch-write to the database. Fast writes, but data may be lost if cache crashes.

Tagged Invalidation: Tag cache keys. When updating data, invalidate all cache keys under that tag. For example, “all product caches with category:electronics.”

Versioned Keys: Add a version number to the key, e.g., product:123:v5. When updating data, increment the version number. The old key naturally expires (nobody reads it, waits for TTL auto-cleanup). Benefit is no need to actively delete cache — old keys continue to exist but won’t be read. Downside is needing a place to store “what the current latest version number is.”

TTL and Write-through are commonly used together — Write-through ensures data updates are immediately reflected, TTL acts as a safety net.

Geographic Periphery Replication

CDN (Content Delivery Network) stores copies of static resources at multiple nodes (edges) worldwide. Users read from the nearest edge node instead of hitting the origin server every time.

Dynamic content can also be cached at CDN — e.g., API responses cached at CDN edge for 30 seconds. Be mindful of cache invalidation: after data updates, proactively purge CDN cache.

Herd Behavior on TTL Expiration

At the moment cache expires, a large number of requests simultaneously penetrate to the database, causing instant database overload. This is called Cache Stampede.

Three mitigations:

Request Coalescing: The first request hitting a cache miss queries the database, other requests wait. When the database returns, all waiting requests receive the result together. The challenge is needing a coordination mechanism to determine “who is first” — typically implemented with Distributed Lock. The one acquiring the lock queries the DB, others read cache after the lock is released.

Stale-While-Revalidate: First return the expired stale data to users (don’t make users wait), while asynchronously updating the cache. Users experience no latency, but there’s a brief data inconsistency window. Suitable for scenarios where timeliness isn’t critical.

Probabilistic Early Refresh: Before cache expires, use probability to decide whether to refresh early. The closer to expiration time, the higher the refresh probability. This way caches don’t all expire at the same time, distributing traffic across different time points. Suitable for high-traffic systems, doesn’t need Distributed Lock, relatively simple to implement.

In practice, Request Coalescing and Stale-While-Revalidate are often used together, with Probabilistic Early Refresh as a more advanced approach.

Write-Path Capacity Growth

Write traffic scaling is more complex than read traffic because writes must guarantee consistency.

Choosing a Write-Optimized Storage Engine

Choosing a database is the first step in write scaling. Different DBs have vastly different support for high writes.

Database	Suitable For	Write Characteristics
PostgreSQL	General transactional systems	ACID, strongly consistent
Cassandra	High writes, time series	Write-optimized, eventual consistency
InfluxDB	IoT, metrics	Time-series data, high compression
Column Store	Analytical queries	Batch writes, not suitable for real-time

Partitioning Rows Across Multiple Instances

When data volume is too large, horizontally split data across multiple database instances.

Partition Key Selection: Determines which shard data falls on. Poor selection causes hot shard — large amounts of data concentrates on one shard.

1
2
3
4
5
6


user_id % 4 -> shard 0, 1, 2, 3

user:1 -> shard 1
user:2 -> shard 2
user:3 -> shard 3
user:4 -> shard 0

Problem: cross-shard queries are complex, JOINs cannot span shards. Aggregation must be done at the application layer.

Splitting Tables Into Separate Databases

Split different tables or columns into different databases.

1
2
3
4
5


Before: one DB with all tables
After:
  Users DB    -> user_id, name, email
  Orders DB   -> order_id, user_id, amount
  Products DB -> product_id, name, price

Difference from Foreign Keys: Vertical Partitioning physically places tables in different DB instances, not just logical associations. Cross-DB JOINs must be handled at the application layer.

Asynchronous Write Buffering

During peak traffic, put write requests into a Queue first, process asynchronously.

1

User request -> API -> Queue -> Consumer -> DB

Benefits: API responds immediately (without waiting for DB write), traffic spikes are absorbed by the Queue. Downside: data isn’t written immediately, users may briefly not see the results of their own actions.

Priority-Based Request Dropping Under Pressure

When the system is overloaded, proactively reject low-priority requests to protect core functionality.

1
2
3
4


Traffic spike:
  - Priority 1 (payment): always process
  - Priority 2 (read profile): process if capacity > 60%
  - Priority 3 (recommendation): drop if capacity < 40%

Difference from Rate Limiting: Rate Limiting limits per-user traffic, Load Shedding is system-level priority decision-making.

Multi-Tier Count Roll-Up

For high-frequency write count data (e.g., likes, play counts), if every event is written directly to the central database, the root will overload.

1
2
3
4
5
6
7
8
9


Without aggregation:
  Event 1: likes++ -> Central DB
  Event 2: likes++ -> Central DB
  ... (1M events/sec -> Central DB explodes)

With Hierarchical Aggregation:
  Level 1 (edge nodes): buffer locally, aggregate every 1s
  Level 2 (regional nodes): aggregate from L1 every 10s
  Level 3 (central): receive pre-aggregated data every 60s

L1 node aggregates once per second. What’s sent to L2 isn’t 1000 individual events but “1000 likes in this 1 second.” L2 then aggregates 10 seconds of data and sends to central. Central receives traffic reduced by orders of magnitude.

Note: what root receives is aggregated counts, not each individual event. If the business requirement is “know how many times each user individually liked,” more granular data must be retained at the edge layer. Hierarchical Aggregation is suitable for “total count statistics,” not “detail tracking.”