Why is erasure coding cheaper than replication if both achieve the same durability?

Replication stores the entire object N times (3-way replication = 200% overhead). EC stores N data chunks + K parity chunks (e.g., 10+4 = 40% overhead) and reconstructs from any N. The math works because the probability of losing more than K chunks in an EC group is astronomically small at modern drive failure rates with a sane repair window. EC trades a bit of compute (parity generation, reconstruction) for ~4-5x less storage cost.

Do I need to know specific erasure-coding algorithms?

Naming Reed-Solomon and explaining the (N, K) parameter is enough. Going deeper into Cauchy matrices or LDPC codes is beyond the senior bar; what matters is the durability math and the failure-domain placement discipline.

Is this the same as Design Dropbox or Design Google Drive?

Different layer. Dropbox/Google Drive are user-facing file-sync products; they're built on top of a blob store. This question is about the blob store itself — the durability primitive that file-sync, image hosting, video streaming, log archival, and database backups all sit on top of.

What is the most important concept for Design Distributed Blob Store?

Erasure coding plus failure-domain placement. The senior signal hinges on (a) recognizing that 3-way replication is too expensive at exabyte scale and proposing EC, and (b) understanding that EC's durability guarantee depends on placement across independent failure domains.

System Design Questions

Design Distributed Blob Store — System Design Interview Guide

Design Distributed Blob Store is a system-design interview that asks you to build an object storage service at internet scale: trillions of objects, exabytes of total capacity, 11-nines of durability, and high-throughput read and write. The hard part is the durability math (erasure coding vs replication tradeoff), efficient placement across racks and regions, and metadata indexing that doesn't become the bottleneck.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

Amazon
Google
Microsoft
Meta
Apple

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

PUT an object: bytes of arbitrary size (1 byte to ~5 TB), receive an immutable object key
GET an object: stream bytes back, with optional byte-range support
DELETE an object: tombstone immediately, background GC reclaims space
List objects in a bucket with prefix-based filtering and pagination
Object versioning: each PUT creates a new version, history preserved
Optional: server-side encryption, lifecycle policies (transition to colder tier, expiration)

Non-functional requirements

Durability: 11 nines (99.999999999%) — annual loss probability of less than 10^-11 per object
Availability: 99.99% for reads; 99.9% for writes (writes are more sensitive to placement)
Latency: GET first-byte <100ms p99 for warm objects; PUT acknowledged <200ms p99 for small objects
Throughput: ~1M PUTs/sec, ~10M GETs/sec at peak across the service
Capacity: ~exabytes of total stored data, ~trillions of objects
Read/write ratio: typically 10:1 reads to writes for general-purpose workloads

Capacity estimation

Scale anchors: ~1 trillion objects, ~10 exabytes total capacity. Average object size = 10 EB / 1T = ~10 MB (long tail of small objects + heavy tail of large objects). Median object is much smaller (~10 KB); a small fraction of multi-GB objects pull up the average.

Hardware: ~16 TB HDDs are the workhorse for capacity-tier storage. 10 EB / 16 TB = ~625K drives raw. With erasure-coded overhead (~1.4x for 10+4 EC), physical capacity needed is ~14 EB = ~875K drives. Spread across ~50K storage nodes (16-24 drives per node) = ~1500 nodes per zone × ~30 zones globally. This is in the ballpark of public cloud blob stores.

Throughput: 1M PUTs/sec × 10 MB average = ~10 TB/sec write ingest. Internally, with EC overhead and replicas across zones, the wire-level traffic is ~3-5x the user-visible write rate = ~30-50 TB/sec. Network is the constraint at this scale; storage nodes need 25-100 Gbps NICs and the backbone needs to handle the cross-zone replication fan-out.

Metadata: 1T objects × ~1 KB metadata per object (key, size, version, ACL, content-type, EC layout, storage location pointers, checksums) = ~1 PB of metadata. That metadata store is itself a distributed system at PB scale. Indexing requirements: lookup by object key (PUT, GET, DELETE), list by bucket + prefix (LIST operation). The list operation is more demanding because it needs sorted iteration.

Durability math: with 3-way replication across independent failure domains (e.g., 3 zones), the probability of all 3 copies failing during the repair window is the cube of the per-copy failure probability over that window. With ~1 hour annual cumulative repair-window risk per copy at modern drive AFR, 3-way replication delivers ~10-11 nines. EC (10+4) across the same failure domains delivers similar durability at ~40% the storage cost.

High-level design

Four layers: API/coordinator tier, metadata store, data tier, and durability layer.

API/coordinator tier: stateless front-end services that terminate client connections. Receives PUTs and GETs, authenticates the request, looks up metadata, orchestrates the actual data movement. The coordinator is responsible for the full lifecycle of a PUT — placement decision, parallel writes to the data tier, durability confirmation, metadata update.

Metadata store: a strongly-consistent distributed key-value store keyed by (bucket, object_key). Stores per-object metadata: size, version, content hash, encryption key reference, storage location pointers (which storage nodes/EC groups hold the data), ACLs, lifecycle tags. The metadata store is sharded by hash of bucket name (or by bucket-key range for LIST efficiency).

LIST operation requires sorted prefix iteration. The metadata store provides a separate range-indexed view, often via a secondary index sorted by (bucket, key) where each shard owns a key range. LIST queries hit the relevant range-shards and stream results.

Data tier: a fleet of storage nodes, each with many local drives. Each drive holds chunks of object data. Storage nodes are dumb: they receive 'write this chunk at this offset, return success when persisted' and 'read this chunk, return bytes'. No object-level intelligence at the storage node; intelligence lives in the coordinator.

Durability layer: enforces the replication/EC policy. For small objects (e.g., <100 KB), 3-way replication across 3 zones is simpler and cheaper (the metadata overhead of EC isn't worth it for tiny objects). For larger objects, erasure coding: split the object into N data chunks, compute K parity chunks (Reed-Solomon), distribute the N+K chunks across N+K failure domains. Loss of any K-or-fewer chunks is recoverable.

PUT flow: client streams bytes to a coordinator. Coordinator selects N+K storage nodes (one per failure domain) using a placement algorithm. As bytes arrive, coordinator computes EC chunks on the fly and forwards each chunk to its target node. When all N+K nodes ack, the coordinator commits the metadata entry (which holds the location pointers) and acks the client.

GET flow: client requests object_key. Coordinator looks up metadata, gets the storage location pointers, fetches N chunks (any N of the N+K) in parallel, reconstructs the object, streams bytes back. For low-latency reads, coordinators issue parallel reads to N+M nodes (M=2-3 extras) and use whichever N respond first — hedging against slow nodes.

Deep dive — the hard problem

Two deep dives: erasure coding economics, and cross-region replication for disaster recovery.

Erasure coding economics. Reed-Solomon EC with parameters (N, K) means N data chunks + K parity chunks = N+K total chunks. Any N of the N+K is sufficient to reconstruct. Common configurations: 6+3 (50% overhead), 10+4 (40% overhead), 12+4 (33% overhead). Compare to 3-way replication (200% overhead) — EC delivers similar durability at ~1/4 to 1/5 the storage cost.

The N+K choice is a multi-dimensional tradeoff. Larger N (more data chunks) = lower overhead and better economics, but every write must successfully fan out to more storage nodes, and every read must fetch from more nodes. Tail latency rises with N (the probability that at least one of N reads is slow grows with N). Larger K (more parity chunks) = higher durability and tolerance for more concurrent failures, but more compute cost per write.

Production tradeoffs. For warm tiers (frequently accessed), use smaller N (e.g., 6+3) — better tail latency on reads, slightly higher storage cost. For cold tiers (rarely accessed), use larger N (e.g., 12+4) — best storage economics, accept that reads are slower and tail-latency-bounded.

Durability math with EC: assuming uniform per-drive AFR (annualized failure rate) of ~2% and repair window of a few hours per failure, the probability of losing more than K chunks in an (N+K) group during a single repair window is astronomically small — millions of times smaller than 3-way replication's loss probability. This is why hyperscalers all moved from replication to EC for capacity tiers.

Failure-domain discipline is critical. The N+K chunks MUST be placed across N+K independent failure domains. Two chunks of a (10+4) group on the same rack means a single rack failure can lose 2 chunks, leaving you with 12 of 14 — still recoverable, but the effective failure-tolerance is degraded. Production placement places each chunk in a distinct rack within a zone, and the N+K chunks span enough racks that any single power-domain or top-of-rack-switch failure costs at most 1 chunk.

Cross-region replication for disaster recovery. Single-zone durability via EC handles drive and rack failures. But a regional disaster (a hurricane takes out an entire region, a fiber cut isolates a region) can lose the whole zone's data. For workloads where regional loss is unacceptable, replicate across regions.

Two patterns.

Sync cross-region replication: every PUT acks only after the data is durably stored in at least 2 regions. Pro: zero data loss on a region loss. Con: PUT latency now includes a cross-region round-trip, typically 50-150ms. Most general-purpose blob stores don't do sync cross-region by default; it's a per-bucket opt-in for workloads that demand it.

Async cross-region replication: PUTs ack as soon as durable in the local region. A background replicator copies new objects to the secondary region on an ongoing basis, typically with a lag of seconds to minutes. Pro: no PUT latency penalty. Con: a region loss during the replication lag window loses any objects not yet copied. Used as the default — the lag is bounded and the data-loss risk is bounded.

Third tradeoff: small-object overhead. EC has a per-group overhead (metadata, EC headers). For an object that's 1 KB, the overhead is a significant fraction of the object size. Production blob stores handle this by batching: many small objects are packed into a single EC group ('chunked' or 'aggregated' write), with the metadata layer tracking which sub-range of the group holds which object. Reading a small object becomes a sub-range read from the group.

Fourth: GC and tombstones. DELETE marks the object as tombstoned in metadata but doesn't immediately free the underlying chunks (other versions may share chunks via dedup; ongoing reads may hold references). A background GC scans for tombstoned objects past their grace period and reclaims the storage. GC must be careful: a half-completed GC that frees a chunk still referenced by another object loses data. The discipline is reference-counting on chunks plus a grace period (typically days) before any reclaim.

Fifth: integrity. Every chunk has a content hash stored alongside it. Reads verify the hash on the fly; mismatches trigger a repair from another chunk in the EC group and an alert. Background scrubbers periodically read every chunk to catch silent corruption (bit-rot) before it accumulates enough damage to cross the recovery threshold.

Common mistakes

Choosing 3-way replication everywhere — at exabyte scale, the storage cost is 5x worse than EC for the same durability; EC is the production default
Ignoring failure domains in placement — putting multiple chunks of the same EC group on the same rack defeats the EC durability guarantee
Forgetting small-object overhead — naive EC on 1KB objects spends more on metadata than on data; production batches small objects into shared EC groups
Treating LIST as cheap — listing trillions of objects under a prefix requires range-sorted metadata indexing; the simple hash-sharded metadata store can't do this efficiently
Skipping background scrubbing — bit-rot accumulates silently and can exhaust EC recovery capacity if you only repair on read

Likely follow-up questions

How would you support multipart upload for objects larger than the network can transmit in a single request?
How would you implement strong read-after-write consistency across regions (currently most blob stores offer it within a region only)?
What changes if you needed to support sub-millisecond GET latency for hot objects — does the architecture still hold?
How would you implement deduplication so two PUTs with identical content share underlying storage?
How would you build a 'cold-storage' tier with much cheaper hardware but much higher retrieval latency, and migrate objects to it based on access patterns?

Related system design scenarios

Frequently asked questions

Why is erasure coding cheaper than replication if both achieve the same durability?: Replication stores the entire object N times (3-way replication = 200% overhead). EC stores N data chunks + K parity chunks (e.g., 10+4 = 40% overhead) and reconstructs from any N. The math works because the probability of losing more than K chunks in an EC group is astronomically small at modern drive failure rates with a sane repair window. EC trades a bit of compute (parity generation, reconstruction) for ~4-5x less storage cost.
Do I need to know specific erasure-coding algorithms?: Naming Reed-Solomon and explaining the (N, K) parameter is enough. Going deeper into Cauchy matrices or LDPC codes is beyond the senior bar; what matters is the durability math and the failure-domain placement discipline.
Is this the same as Design Dropbox or Design Google Drive?: Different layer. Dropbox/Google Drive are user-facing file-sync products; they're built on top of a blob store. This question is about the blob store itself — the durability primitive that file-sync, image hosting, video streaming, log archival, and database backups all sit on top of.
What is the most important concept for Design Distributed Blob Store?: Erasure coding plus failure-domain placement. The senior signal hinges on (a) recognizing that 3-way replication is too expensive at exabyte scale and proposing EC, and (b) understanding that EC's durability guarantee depends on placement across independent failure domains.