Skip to main content

System Design Questions

Design Distributed Key-Value Store — System Design Interview Guide

Design Distributed Key-Value Store is a system-design interview that asks you to build a durable, horizontally-scalable key-value store: PUT and GET operations across petabytes of data, sub-10ms latency, no central coordinator, and survival of any single node failure with no data loss. The hard part is sharding, replication, and the consistency contract.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Amazon
  • Google
  • Meta
  • Apple
  • LinkedIn

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • PUT and GET operations on keys with arbitrary value payloads up to a few MB
  • Atomic conditional writes (compare-and-swap on a version or value)
  • Range scan over keys with a prefix (optional, often required for indexes)
  • Durability: writes are not lost once acknowledged
  • Configurable replication factor (typically 3) and configurable consistency level per operation
  • Multi-region replication for global availability (optional, advanced)

Non-functional requirements

  • Latency: <10ms p99 for GET, <20ms p99 for PUT (single-region, quorum read/write)
  • Throughput: ~10K+ ops/sec per partition, scaling linearly with partition count
  • Capacity: petabytes of data across the cluster, ~100 TB+ per node common at scale
  • Availability: 99.99% (single region); higher with multi-region replication
  • Durability: no acknowledged write is ever lost; 11 nines of durability is industry-standard

Capacity estimation

Scale anchors for a large production key-value store: petabytes of data, billions of keys, millions of ops/sec.

A single partition typically holds 10-100 GB of data and serves 5K-50K ops/sec. For a 100 TB dataset with 50 GB partitions, the cluster has ~2,000 partitions. With replication factor 3, that's 6,000 partition replicas. Distributed across ~100-500 nodes, each node holds ~12-60 partition replicas.

Network: at 1M ops/sec cluster-wide × 1 KB average value, the cluster moves ~1 GB/sec = ~8 Gbps. Per node, ~10-100 Mbps depending on partition distribution. Modest by today's standards.

Write amplification: each PUT writes to the partition's primary plus 2 replicas, so the cross-node write rate is 3x the client write rate. Disk write amplification (from compaction in log-structured stores) adds another 5-30x — disk throughput is often the bottleneck.

Key distribution: production key-value stores assume keys are roughly uniformly distributed via the partitioning hash. Hot keys are a degenerate case requiring per-key remediation (see deep dive).

Versioning metadata: each value carries a version vector (or sequence number, or hybrid logical clock) — typically 16-32 bytes per value. At billions of keys this is negligible.

High-level design

Three architectural layers: partitioning, replication, and the read/write path.

Partitioning: keys are mapped to partitions via consistent hashing or a hash-range scheme. Consistent hashing places each partition on a position on a hash ring; each key's hash determines its partition. The advantage is incremental rebalancing — adding a partition only moves 1/N of keys. Hash-range partitioning splits the keyspace into ordered ranges (key K is in partition P if start_hash(P) ≤ hash(K) < end_hash(P)), which makes range scans cheaper but rebalancing harder. Production systems pick one.

Replication: each partition has N replicas (typically N=3) placed on different physical nodes, ideally different racks or availability zones. The replicas form a replication group; one is the leader for writes (leader-based replication) or any can accept writes (leaderless replication). The replication group reaches consensus on each write via a protocol like Paxos or Raft (leader-based) or via quorum reads/writes (leaderless).

Leader-based replication (Raft-style): one replica is leader, accepts all writes, replicates them to followers via an ordered log. Followers apply log entries in order. Failover: if the leader fails, followers elect a new leader from among themselves; the new leader takes over. Simple to reason about; well-understood.

Leaderless replication (Dynamo-style): all replicas accept writes. Each write goes to W of N replicas; each read goes to R of N replicas. With R + W > N, every read sees the most recent write (under no network partition). Tradeoff: leaderless tolerates more concurrent failures (no leader to lose), but requires conflict resolution when writes happen simultaneously (often via version vectors or last-write-wins).

Read/write path: a client request enters via any coordinator node (any node can serve as coordinator). The coordinator looks up the partition for the key (via a partition map cached locally), forwards the request to the partition's replicas, gathers responses, and replies to the client.

For leader-based: writes go to the leader; reads can go to the leader (consistent) or any follower (potentially stale).

For leaderless: writes go to W replicas in parallel; reads go to R replicas in parallel and the freshest value wins.

Finally: a partition map service tracks which partitions live on which nodes. Every node has a cached view of the map; on changes (node added, partition migrated), the map is gossiped or pushed to all nodes. The map is the membership contract for the cluster.

Deep dive — the hard problem

Two deep dives: consistency choices, and handling hot keys + node failure.

Consistency choices — the central design lever.

Strong consistency: every read returns the most recent write. Achieved with quorum reads + quorum writes (R + W > N), or with leader-based replication where reads go to the leader. Tradeoff: every write requires acknowledging from a quorum (waiting for the slowest of W replicas), so write latency is gated by the slowest replica. Doesn't tolerate split-brain — a network partition can halt writes on one side.

Eventual consistency: writes propagate asynchronously to all replicas; reads may see stale data. Achieved with R=1, W=1 or W=async. Tradeoff: high availability and low latency (writes return as soon as one replica acknowledges) but reads can see stale data for a brief window.

Quorum (tunable): per operation, choose R and W such that R + W > N for strong, or R + W ≤ N for eventual. Production systems often default to R=N/2+1 and W=N/2+1 — the 'simple majority' quorum, with N=3 meaning W=R=2, surviving one replica failure.

Last-write-wins vs version vectors: in eventual consistency, two concurrent writes can produce conflict. Last-write-wins (LWW): each value has a timestamp; the later write wins. Simple but loses data if clocks are skewed. Version vectors: each replica maintains a vector of (replica_id → sequence_number); concurrent writes produce conflicting versions that the client resolves. More accurate; more complex.

Mention the CAP theorem: under network partition, you must choose between consistency and availability. Production systems make this choice explicitly per operation (e.g., 'this read must be strongly consistent, this read can be stale').

Hot keys: when a single key gets 1000x the load of an average key, its partition saturates. Two production tactics.

Splitting hot keys: the partition map can be augmented with a 'hot key shard count' — for a known-hot key, the key's value is split across multiple sub-partitions, and operations spread across them. Reads scatter-gather. Writes split. Works for keys whose value can be partitioned (e.g., a counter sharded into multiple sub-counters that sum on read).

Client-side caching: the coordinator detects hot keys and tells clients to cache them locally with short TTL. Reduces backend load by 100x for the hottest keys.

Node failure: when a replica fails, its partitions lose one replica. The cluster must re-replicate to a fresh node to restore the replication factor. Two phases: (a) detect failure — typically via gossip-based failure detection with a tunable timeout; and (b) bootstrap a fresh replica — copy the partition from a surviving replica to the new node, while serving reads from the surviving replicas. The bootstrap can take hours for large partitions; during this window, the partition has reduced redundancy.

Graceful node operations: planned node removal first marks the node as 'leaving', triggers re-replication of its partitions to other nodes, then removes the node only after redundancy is restored. This avoids the reduced-redundancy window.

Third tradeoff: the storage engine on each node. Two common designs: B-tree (range queries are cheap, point reads are cheap, writes amplify into random I/O) and log-structured merge tree (LSM — writes are append-only and very fast, reads may touch multiple sorted runs and amplify, requires periodic compaction). Most modern distributed key-value stores use LSM-style engines (RocksDB-like) because writes dominate, and reads are accelerated by bloom filters and a block cache. Mention this if asked; not load-bearing in 45 minutes.

Common mistakes

  • Defaulting to a single consistency model without acknowledging the tradeoffs — production stores expose tunable consistency per operation
  • Skipping the partition map and assuming every node knows the cluster layout — without a propagated map, partition lookups fail during cluster changes
  • Confusing leader-based replication with consensus on every write — leader-based ordering is consensus on writes within a partition, not across partitions
  • Forgetting the CAP tradeoff under partition — a candidate who claims both strong consistency AND high availability under partition is wrong by definition
  • Treating hot keys as outside the design — they're a common production case and require per-key remediation

Likely follow-up questions

  • How would your design support multi-region replication for global low-latency reads while preserving strong consistency for writes?
  • What changes if you need to support transactions across multiple keys in different partitions?
  • How would you handle a network partition where two halves of the cluster can't communicate?
  • How would you support a secondary index (e.g. find all keys where user_id = 42)?
  • How would your design support adding a node to scale capacity by 50% without serving any errors during the rebalance?

Practice Design Distributed Key-Value Store live with an AI interviewer

Free, no sign-up required. Get real-time feedback on your design.

Practice these live

Frequently asked questions

How long is the Design Distributed Key-Value Store interview?
60 minutes typical. Senior+ rounds expect coverage of partitioning, replication, the consistency model, and at least one of (hot keys, failure handling, multi-region).
Do I need to know Paxos and Raft in detail?
Naming them and explaining the high-level idea (leader-based replicated log with consensus) is enough. Drawing the actual algorithm is overkill for a 60-minute interview.
Should I mention CAP theorem?
Yes — frame the consistency-vs-availability tradeoff explicitly. CAP under partition is the foundational result every distributed-systems candidate should be able to reason about.
What is the single most important concept for Design Distributed Key-Value Store?
Tunable consistency through quorum (R + W > N for strong, vs eventual otherwise). The senior signal hinges on whether the candidate can explain when each is appropriate and what the latency/availability costs are.