Design Distributed Cache — System Design Interview Guide
Design Distributed Cache is a system-design interview that asks you to build a distributed in-memory cache that fronts a slower backing store: clients fetch and write keys with sub-millisecond latency, the cache spans many nodes for capacity, and node failures are handled without losing user-visible availability. The hard part is consistent hashing, replication, and eviction at scale.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Meta
- Amazon
- Twitter/X
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- GET and SET operations on keys with arbitrary byte values up to ~1 MB per value
- Configurable TTL per key; expired keys are not returned
- Atomic increment/decrement on numeric values
- Optional: compare-and-swap (CAS) for optimistic concurrency
- Cache invalidation API for the backing store to push updates
- Optional: pub/sub for change notifications across nodes
Non-functional requirements
- Latency: <1ms p99 for GET and SET
- Throughput: ~1M+ ops/sec per node, ~100M+ ops/sec per cluster
- Capacity: many TB of cached data across the cluster, ~64-256 GB RAM per node
- Availability: 99.99%; cache misses on failure are acceptable but never serve stale data after TTL
- Hit rate: target 90%+ for the working set under steady-state traffic
Capacity estimation
Scale anchors for a large web platform's cache tier: ~10 TB of working-set data, ~100M ops/sec at peak across the cluster. With ~128 GB per node (assuming ~75% used for cached values, ~25% for metadata and overhead), the cluster needs ~100 nodes for capacity alone.
For throughput: each node serves ~1M ops/sec, so the cluster supports ~100M ops/sec with ~100 nodes — fits within the capacity envelope.
Network: at 1M ops/sec/node × 1 KB average value, each node moves ~1 GB/sec of cache traffic = ~8 Gbps. Each node needs 10 Gbps NIC minimum.
Key-value sizes: typical applications have a power-law distribution. Most values are small (<1 KB — e.g. user profile, session data). Some are large (10-1000 KB — e.g. cached HTML fragments, search results). Beyond ~1 MB, the cache becomes inefficient (one large value can dominate a node's memory) — large values are stored in object storage with cache holding only their references.
Metadata overhead per entry: ~64 bytes (key hash, TTL timestamp, LRU pointers, refcount). At 100M entries per node × 64 bytes = 6.4 GB metadata.
High-level design
Three architectural pillars: client routing, server storage, and replication.
Client routing via consistent hashing: clients hash the key and route the request to one specific server in the cluster. Consistent hashing places servers on a ring of N positions (typically a 32-bit or 128-bit hash space); each key's position on the ring determines its owner (the next server clockwise). When a node is added or removed, only the keys 'between' the changed node and its predecessor are rebalanced — not all keys.
Virtual nodes: each physical server is represented by many virtual positions on the ring (typically 100-500 virtual nodes per server). This evens out the load distribution (without virtual nodes, the ring positions cluster unevenly and some servers carry 2-3x the load of others). Mention virtual nodes — interviewers consistently probe this.
Server storage: each node holds an in-memory hashmap of (key → value, ttl, last_access_time). Operations are O(1). For eviction, the node maintains a separate LRU structure (typically a doubly-linked list of access order) so the least-recently-used keys can be evicted when memory pressure hits. TTL expiration runs lazily: on each GET, check TTL; if expired, return MISS and free the slot. A background scanner periodically sweeps for expired keys to reclaim memory faster than lazy alone.
Replication: in production, each key is replicated to 2-3 nodes for availability. Replication topology is determined by consistent hashing: the key's primary owner is the first node clockwise; replicas are the next N-1 nodes. Writes go to the primary, then propagate asynchronously to replicas. Reads can go to the primary (consistent, no stale data) or to replicas (lower latency in some topologies, can serve stale data on writer-replica lag).
Cache-aside pattern: applications typically use the cache via the cache-aside pattern: on read, check cache first; on miss, fetch from the backing store, then SET the value in cache with a TTL. On write, update the backing store, then optionally invalidate or update the cache entry. The cache layer is dumb — it doesn't know about the backing store; the application owns the consistency contract.
Deep dive — the hard problem
Two deep dives: hot-key handling and consistency model.
Hot keys: in a cache cluster, the load distribution across nodes is dictated by consistent hashing — keys are uniformly distributed across nodes, assuming the request distribution itself is uniform. But real-world traffic is not uniform: a single celebrity profile, a viral post, or a system config key can attract 1000x the QPS of an average key. The owner node of a hot key saturates while other nodes idle.
Two production tactics handle hot keys.
Client-side caching: clients hold a small local cache (a few hundred entries with short TTL, e.g. 1 second) for very-hot keys. The owner node detects hot keys (sliding-window request counters) and tells clients to cache them locally. This pushes hot-key load to clients, where it's effectively free. The tradeoff is the staleness window (up to 1 second) — acceptable for many use cases.
Key sharding: split a hot key's value across multiple replicas (e.g. 'user:42:profile' is replicated to 'user:42:profile:shard:0' through 'user:42:profile:shard:7' on 8 different nodes). Clients randomly pick a shard. Reads are spread; writes must fan out to all 8. Used for read-mostly hot keys where the value is essentially immutable.
Mention both — the senior signal is recognizing that consistent hashing alone doesn't solve hot keys.
Consistency model: cache consistency is harder than it looks. Three common patterns.
Write-through: writes go to cache AND backing store synchronously. Cache is always consistent with backing store. Tradeoff: writes are slower (must round-trip both); the cache must succeed for the write to succeed (introduces a dependency).
Write-back: writes go to cache first; backing store is updated asynchronously. Faster writes but data can be lost on cache failure between write and async flush. Used in some performance-critical paths but the durability tradeoff is significant.
Cache-aside (the standard): writes go to backing store; cache entry is either invalidated (delete on cache) or updated (refresh on cache). On invalidate, the next read repopulates from backing store. Most common pattern; tradeoff is a brief stale-cache window if invalidation is asynchronous, plus a thundering-herd risk on hot-key invalidation.
Thundering herd: when a hot key is invalidated, all client reads miss simultaneously and stampede the backing store. Defenses: per-key locks (only the first client fetches from backing store; others wait), probabilistic early refresh (clients refresh slightly before TTL expires so the refresh is staggered), or read-through caching (the cache fetches from backing store itself, naturally serializing).
Third tradeoff: node failure handling. When a node dies, its share of keys is no longer served from that node. Two options. (A) Wait for replicas to be promoted (low latency, requires N replicas everywhere, more memory cost). (B) Serve MISS for the missing keys and let the backing store repopulate via cache-aside (lower memory cost, more backing-store load during the failure window). Production typically does (B) with replicas for the hottest keys only.
Fourth: rebalancing during cluster changes. Adding a node moves a fraction of keys to it; consistent hashing limits the disruption to (1/N) of the keyspace. During the rebalance, the new node must populate its share — either by 'cold-start' (start empty, let cache-aside repopulate from backing store) or by 'warm-start' (replicate from the previous owner before serving). Cold-start increases backing-store load temporarily; warm-start makes the rebalance smoother but more complex. Mention the tradeoff.
Common mistakes
- Skipping consistent hashing and proposing 'each server holds 1/N of keys by simple modulo' — adding or removing a node rebalances all keys, melting the system
- Forgetting virtual nodes in consistent hashing — without them, load distribution is uneven across physical servers
- Treating cache as durable storage — cache nodes lose data on restart, the backing store must be the source of truth
- Ignoring hot keys — consistent hashing doesn't solve uneven access patterns
- Designing cache invalidation as best-effort fire-and-forget — without an explicit consistency contract, stale-cache bugs are inevitable
Likely follow-up questions
- How would your design handle a network partition where half the cluster can't talk to the other half?
- What changes if you have to support strong consistency (no stale reads ever)?
- How would you implement a global cache spanning multiple data centers with replication?
- How would you handle a cluster-wide cache flush (e.g. for a security incident requiring all stale credentials to be invalidated)?
- How would you support both random-access keys and range-scan operations (e.g. find all keys starting with 'user:42:')?
Practice Design Distributed Cache live with an AI interviewer
Free, no sign-up required. Get real-time feedback on your design.
Practice these liveFrequently asked questions
- How long is the Design Distributed Cache interview?
- 45-60 minutes typical. Often used as a focused infrastructure question after a higher-level system design. The expectation is consistent hashing, eviction, and at least one of (hot keys, consistency model, node failure).
- Do I need to know the consistent-hashing algorithm details?
- Yes — name 'consistent hashing with virtual nodes' and explain the ring intuition. The senior signal hinges on whether the candidate can explain WHY virtual nodes are needed, not just that they exist.
- Should I mention specific in-memory cache products?
- Saying 'an in-memory cache like the standard production options' or 'a distributed in-memory key-value cache' is fine. Listing product names without using their distinguishing properties is empty signal.
- What is the most important concept for Design Distributed Cache?
- Consistent hashing plus the hot-key problem. The senior signal hinges on (a) explaining consistent hashing with virtual nodes correctly and (b) proposing a tactic for hot keys that consistent hashing alone doesn't solve.