Design Distributed Log — System Design Interview Guide

Design Distributed Log is a system-design interview that asks you to build a horizontally-scalable, append-only log service: producers append events to named topics, consumers read events in order from any offset, and the system replicates each event across multiple nodes for durability. The hard part is sharding a topic across many nodes while preserving per-partition order and surviving any single broker failure without data loss.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Reported in interviews at

  • Amazon
  • Google
  • Meta
  • LinkedIn
  • Netflix

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

  • Append events to a named topic; each event gets a monotonic offset within its partition
  • Read events from any offset in a partition, in order, either as a one-shot batch or a long-lived stream
  • Partition a topic across N brokers; producers choose the partition by key hash or round-robin
  • Multiple consumer groups can read the same topic independently, each tracking its own offset
  • Configurable retention: retain events for N days or until total size exceeds M bytes, then prune
  • At-least-once delivery from broker to consumer; exactly-once is opt-in, built on idempotent producers plus transactional offset commits
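
The partition-selection requirement above (key hash or round-robin) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Partitioner` class and `choose_partition` method are hypothetical names, and CRC32 is an assumed hash, since the requirement does not name one.

```python
import itertools
import zlib
from typing import Optional


class Partitioner:
    """Hypothetical producer-side partitioner: key hash if a key is given, else round-robin."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose_partition(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread load evenly; no ordering relationship is implied.
            return next(self._round_robin)
        # Keyed: a stable hash sends every event for this key to the same
        # partition, which is what preserves per-key ordering.
        # CRC32 is an assumption made for this sketch.
        return zlib.crc32(key) % self.num_partitions
```

Note that changing the partition count remaps keys under plain modulo hashing, which is one reason partition counts are usually chosen generously up front.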

Non-functional requirements

  • Throughput: ~1M events/sec write across the cluster, ~10M events/sec aggregate read (10:1 fanout typical)
  • Latency: <10ms p99 producer ack on append; <50ms p99 end-to-end for a consumer waiting on a new event
  • Durability: replication factor 3; tolerate any 1 broker failure with zero data loss; after 2 failures the one surviving replica keeps its partitions writable for ack=1 producers
  • Availability: 99.99%; producers must never block on leader election for more than a few seconds
  • Ordering: strict per-partition ordering; no global ordering guarantee across partitions
  • Storage: petabyte-scale per cluster; cheap per-byte (sequential disk writes, not random)

Capacity estimation

Anchor: a busy event-streaming cluster handles ~1M events/sec sustained, with peaks at ~3M/sec during traffic spikes. Average event size ~1 KB (mix of small click events at 100-500 bytes and larger structured logs at 2-5 KB). Write bandwidth: 1M × 1 KB = 1 GB/sec sustained.

With replication factor 3, the cluster writes 3 GB/sec of data across nodes (each event lands on three brokers). If a cluster has 30 brokers, each broker sustains ~100 MB/sec of write — comfortable for modern NVMe disks rated at 1-3 GB/sec sequential write.

Storage per day: 1 GB/sec × 86400 sec × 3 replicas = ~260 TB/day. Default 7-day retention = ~1.8 PB per cluster. Tiered storage to cheap object storage extends effective retention to 30-90 days without exploding disk cost (see deep dive).

Offset metadata: each partition is a sequence of segment files of ~1 GB each. Cluster-wide, 1.8 PB / 1 GB = ~1.8M segments, however they are spread across topics and partitions. Per-segment metadata (start offset, byte range, base timestamp) is ~200 bytes, so the total is ~360 MB, which is trivial.

Consumer offset tracking: each consumer group stores one offset per partition it consumes. A cluster with 1000 topics × 100 partitions × 50 consumer groups = 5M offset rows. Each row ~50 bytes = 250 MB. Stored in a compacted internal topic so the offset store itself reuses the log infrastructure.
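
To make the "compacted internal topic" idea concrete, here is a minimal sketch of compaction applied to offset commits. It assumes each commit is a `(key, offset)` record keyed by a hypothetical `(group, topic, partition)` tuple; the `compact` function is illustrative only.

```python
from typing import Dict, List, Tuple

OffsetKey = Tuple[str, str, int]          # (group, topic, partition), assumed layout
OffsetRecord = Tuple[OffsetKey, int]      # (key, committed offset)


def compact(records: List[OffsetRecord]) -> List[OffsetRecord]:
    """Keep only the most recent record per key, preserving log order."""
    latest_index: Dict[OffsetKey, int] = {}
    for i, (key, _offset) in enumerate(records):
        latest_index[key] = i  # later records overwrite earlier ones
    return [records[i] for i in sorted(latest_index.values())]
```

After compaction the topic holds at most one row per (group, topic, partition), which is why the steady-state size stays near the 5M-row estimate regardless of how often consumers commit.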

Network: producers and consumers are mostly inside the same data center, but cross-region replication is common. Cross-region bandwidth: ~1 GB/sec to ship one copy of the stream to another region, or ~200-500 MB/sec compressed (events compress 2-5x).
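
The arithmetic in this section can be checked with a short script. It uses only the figures stated above, assumes decimal units (1 KB = 1000 bytes), and the variable names are illustrative.

```python
# Inputs taken from the estimates above (decimal units assumed).
EVENTS_PER_SEC = 1_000_000
EVENT_BYTES = 1_000
REPLICATION_FACTOR = 3
BROKERS = 30
RETENTION_DAYS = 7
SEGMENT_BYTES = 1_000_000_000
SEGMENT_META_BYTES = 200

write_bps = EVENTS_PER_SEC * EVENT_BYTES             # 1 GB/sec ingest
replicated_bps = write_bps * REPLICATION_FACTOR      # 3 GB/sec written cluster-wide
per_broker_bps = replicated_bps // BROKERS           # 100 MB/sec per broker
bytes_per_day = replicated_bps * 86_400              # ~259 TB/day
retained_bytes = bytes_per_day * RETENTION_DAYS      # ~1.8 PB at 7-day retention
segments = retained_bytes // SEGMENT_BYTES           # ~1.8M segments
segment_meta_bytes = segments * SEGMENT_META_BYTES   # ~363 MB of segment metadata

offset_rows = 1000 * 100 * 50                        # topics x partitions x groups
offset_bytes = offset_rows * 50                      # 250 MB of offset rows

print(f"per-broker write: {per_broker_bps / 1e6:.0f} MB/sec")
print(f"storage per day:  {bytes_per_day / 1e12:.1f} TB")
print(f"retained:         {retained_bytes / 1e15:.2f} PB")
print(f"segments:         {segments:,}")
print(f"offset store:     {offset_bytes / 1e6:.0f} MB")
```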

High-level design

Three components: brokers (store partitions on disk), a metadata/control plane (tracks partition leaders, replicas, ISR membership), and producers/consumers (the clients).

Topics are split into N partitions. Each partition is an ordered, append-only log stored as a series of segment files on disk. Within a partition, every event has a monotonic 64-bit offset. Across partitions there is no ordering guarantee — applications that need order pick a partition key (user_id, order_id) so all events for that key land in the same partition.

Each partition has a leader broker and R-1 follower brokers (typically R=3). Producers send writes to the leader; the leader appends to its local log, followers replicate the new events, and the leader acks the producer once the configured ack mode (0, 1, or all) is satisfied. Followers pull from the leader in a tight loop, maintaining replicas slightly behind. The set of replicas that are caught up to the leader is the in-sync replica set (ISR); only ISR members are eligible for leader election on failure.

Consumers read from the leader (in most implementations — some allow follower reads for locality). A consumer remembers its current offset per partition and asks the broker for events starting from that offset. Long-poll: the consumer sends a fetch request; the broker either returns immediately if data is available or blocks for up to N ms waiting for new appends, then returns whatever arrived.
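
The long-poll fetch can be sketched with a condition variable: the fetch returns at once if data exists past the requested offset, otherwise it waits up to a deadline for an append. This is an in-memory illustration under stated assumptions; `LongPollLog`, `fetch`, and `max_wait_s` are hypothetical names, and a real broker would do this over the network rather than with threads.

```python
import threading
from typing import List


class LongPollLog:
    """In-memory single-partition log with a blocking fetch (illustrative only)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []
        self._cond = threading.Condition()

    def append(self, record: bytes) -> int:
        with self._cond:
            self._records.append(record)
            self._cond.notify_all()  # wake any fetchers parked at the log end
            return len(self._records) - 1

    def fetch(self, offset: int, max_wait_s: float) -> List[bytes]:
        with self._cond:
            # Returns immediately if data is already available; otherwise
            # blocks until an append arrives or the timeout expires.
            self._cond.wait_for(lambda: len(self._records) > offset,
                                timeout=max_wait_s)
            return list(self._records[offset:])
```

An empty list after the timeout is the normal "nothing new yet" response; the consumer simply re-issues the fetch with the same offset.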

Consumer groups partition consumption: if a topic has 100 partitions and a consumer group has 10 members, each member is assigned 10 partitions. The metadata service runs a rebalance protocol when members join or leave, reassigning partitions. Consumer offsets are stored in an internal compacted topic on the cluster itself — reuses the log infrastructure for offset durability.
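
A minimal sketch of the assignment step in a rebalance, assuming a simple round-robin strategy over sorted member IDs. The paragraph above does not specify a strategy, so this is one plausible choice, and `assign_partitions` is a hypothetical name.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, members: List[str]) -> Dict[str, List[int]]:
    """Deal partitions out to group members round-robin (assumed strategy)."""
    if not members:
        raise ValueError("consumer group has no members")
    ordered = sorted(members)  # deterministic order so every node computes the same result
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for partition in range(num_partitions):
        assignment[ordered[partition % len(ordered)]].append(partition)
    return assignment
```

With 100 partitions and 10 members each member receives exactly 10 partitions; if one member leaves, rerunning the function over the remaining 9 yields 11 or 12 partitions each.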

Segment files on disk: each segment is a few hundred MB to ~1 GB. New appends go to the active segment. When the active segment fills up, it's closed and a new one is opened. Sealed segments are immutable — eligible for compression, tiering to cheap storage, and serving zero-copy reads via sendfile() to consumers.
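
The segment lifecycle can be sketched as an in-memory structure: appends go to the active segment, a full segment is sealed and a new one opened, and reads locate the right segment by binary search on base offsets. This is a simplified model, assuming records are held in lists rather than files; `PartitionLog` and `Segment` are illustrative names.

```python
import bisect
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    base_offset: int
    records: List[bytes] = field(default_factory=list)
    size_bytes: int = 0
    sealed: bool = False


class PartitionLog:
    """Append-only partition made of segments (in-memory stand-in for files)."""

    def __init__(self, max_segment_bytes: int) -> None:
        self.max_segment_bytes = max_segment_bytes
        self.segments: List[Segment] = [Segment(base_offset=0)]
        self.next_offset = 0

    def append(self, record: bytes) -> int:
        active = self.segments[-1]
        if active.records and active.size_bytes + len(record) > self.max_segment_bytes:
            active.sealed = True  # immutable from here on
            active = Segment(base_offset=self.next_offset)
            self.segments.append(active)
        active.records.append(record)
        active.size_bytes += len(record)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset: int, max_records: int) -> List[bytes]:
        if not 0 <= offset < self.next_offset:
            return []
        bases = [s.base_offset for s in self.segments]
        idx = bisect.bisect_right(bases, offset) - 1  # segment containing offset
        out: List[bytes] = []
        while idx < len(self.segments) and len(out) < max_records:
            seg = self.segments[idx]
            start = offset - seg.base_offset
            chunk = seg.records[start:start + max_records - len(out)]
            out.extend(chunk)
            offset += len(chunk)
            idx += 1
        return out
```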

The control plane (a small odd-sized cluster of coordination nodes) holds metadata: which broker is leader for which partition, ISR membership, topic configuration. Broker failures trigger leader election from the ISR. The control plane uses a consensus algorithm (Raft-style) for its own state; the data plane (the actual log) does not run consensus per-write — it uses leader-follower replication with a configurable ack mode.

Deep dive — the hard problem

Three deep dives: the replication contract, the ack-mode tradeoffs, and tiered storage.

Replication contract: leader-follower with ISR. When a producer writes to the leader, the leader assigns the next offset, appends to local disk, then makes the event available to followers. Followers pull from the leader (pull-based, not push), which gives simpler flow control because the leader doesn't need to track follower buffers. A follower is 'in-sync' if it has fetched everything up to the leader's log end within a configured lag threshold; followers that fall further behind are removed from the ISR and are not eligible for leader election until they catch up.
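
One way to express the ISR rule is as a pure function over follower state. The sketch below assumes a time-based lag threshold (a follower stays in-sync if it reached the leader's log end recently enough); the text above does not fix the exact criterion, and `FollowerState`, `compute_isr`, and `max_lag_s` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class FollowerState:
    fetched_offset: int      # next offset the follower will request
    last_caught_up_s: float  # last time it had reached the leader's log end


def compute_isr(leader_id: str,
                leader_end_offset: int,
                followers: Dict[str, FollowerState],
                now_s: float,
                max_lag_s: float) -> Set[str]:
    """Return the in-sync replica set under an assumed time-based lag rule."""
    isr = {leader_id}  # the leader is always in its own ISR
    for broker_id, state in followers.items():
        caught_up_now = state.fetched_offset >= leader_end_offset
        caught_up_recently = now_s - state.last_caught_up_s <= max_lag_s
        if caught_up_now or caught_up_recently:
            isr.add(broker_id)
    return isr
```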

On leader failure, the control plane picks a new leader from the current ISR. If no ISR member is alive (the failed leader was the last in-sync replica), the operator must choose: wait for a former ISR member to come back (preserves data, may block writes for hours) or elect an out-of-sync replica as leader (resumes writes immediately, loses any events the dead leader had acked that the new leader never fetched). The default is wait-for-safe-election; the override is 'unclean leader election' and it's a documented data-loss risk.
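
The election choice described above reduces to a small decision function. This is a sketch under stated assumptions: `elect_leader` and `allow_unclean` are illustrative names, and preferring replicas in list order is an assumed tie-break.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[str],
                 isr: Set[str],
                 alive: Set[str],
                 allow_unclean: bool) -> Optional[str]:
    """Pick a new leader; None means the partition stays offline until an ISR member returns."""
    # Safe path: any live in-sync replica holds every acked (ack=all) event.
    for replica in replicas:
        if replica in isr and replica in alive:
            return replica
    # Unclean path: restores availability but may drop acked events.
    if allow_unclean:
        for replica in replicas:
            if replica in alive:
                return replica
    return None
```

Returning `None` models the wait-for-safe-election default: writes block, but nothing acknowledged is lost.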

Ack modes — three levels of producer ack.

ack=0: fire and forget. Producer doesn't wait for any acknowledgment. Highest throughput, lowest latency, but a broker crash before the write hits disk loses the event. Used only for non-critical metrics-style data.

ack=1: leader-only ack. Producer waits for the leader to append to its local log, then gets the ack. Followers replicate asynchronously. If the leader crashes after acking but before any follower has the event, the event is lost. Default for most workloads — good latency, durable enough for most use cases.

ack=all: wait for full ISR. Producer waits until every in-sync replica has the event. Highest durability: no event is acked unless it's on every replica currently in the ISR. Note that the ISR can shrink to just the leader, so ack=all on its own does not guarantee more than one copy. Latency is determined by the slowest in-sync replica. Used for financial/audit data where 'definitely committed' is more important than throughput.

The combination of ack=all plus a minimum-ISR config (e.g. 'fail the write if ISR drops below 2') gives a strong durability contract: every acked event survives any single broker failure with certainty.
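
The interaction between ack mode and the minimum-ISR setting can be sketched as a single check that the leader runs per write. The names `Ack`, `replicas_required`, and `NotEnoughReplicas` are hypothetical, chosen for this illustration.

```python
from enum import Enum


class Ack(Enum):
    NONE = "0"     # fire and forget
    LEADER = "1"   # leader's local append only
    ALL = "all"    # every current in-sync replica


class NotEnoughReplicas(Exception):
    """Raised when ack=all is requested but the ISR is below the configured minimum."""


def replicas_required(ack: Ack, isr_size: int, min_isr: int) -> int:
    """How many replicas must hold the event before the producer is acked."""
    if ack is Ack.NONE:
        return 0
    if ack is Ack.LEADER:
        return 1
    # ack=all: refuse the write rather than ack with too few copies.
    if isr_size < min_isr:
        raise NotEnoughReplicas(f"ISR size {isr_size} is below minimum {min_isr}")
    return isr_size
```

With `min_isr=2`, every event acked under ack=all is on at least two brokers, which is the property that lets it survive any single broker failure.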

Tiered storage — disk on the broker is expensive; the long tail of historical data doesn't need to live on broker NVMe.

Production pattern: hot segments (the active segment and the most recent few) live on broker local disk and serve the vast majority of reads. Sealed segments older than a configurable threshold (e.g. 6 hours, 1 day) are uploaded to cheap object storage in the same region and removed from broker disk. The broker keeps an index entry pointing at the remote object.

When a consumer requests a historical offset that maps to a tiered segment, the broker streams the segment from object storage directly to the consumer — typically via a redirect or proxy. Latency for tiered reads is higher (10-100 ms vs <10 ms for hot reads), but most consumers read recent data and rarely touch the tier.
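
The tiering decision and the read-path lookup can be sketched together. This is an illustration only: the object-store key format, `SegmentEntry`, `tier_old_segments`, and `locate` are all assumptions, and the actual upload and local delete are left as a comment.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentEntry:
    base_offset: int
    sealed_at_s: float     # when the segment was sealed (unused while active)
    active: bool = False
    remote_key: str = ""   # hypothetical object-store key; empty means local disk


def tier_old_segments(index: List[SegmentEntry], now_s: float,
                      tier_after_s: float) -> List[SegmentEntry]:
    """Mark sealed segments older than the threshold as tiered; return those moved."""
    moved = []
    for entry in index:
        old_enough = now_s - entry.sealed_at_s >= tier_after_s
        if not entry.active and not entry.remote_key and old_enough:
            # A real broker would upload the file here, then delete the local copy.
            entry.remote_key = f"segments/{entry.base_offset:020d}.log"
            moved.append(entry)
    return moved


def locate(index: List[SegmentEntry], offset: int) -> SegmentEntry:
    """Find the segment holding an offset; the caller checks remote_key to pick the read path."""
    bases = [e.base_offset for e in index]
    i = bisect.bisect_right(bases, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} is below the earliest retained segment")
    return index[i]
```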

Tiered storage cost: object storage is ~10-20x cheaper per GB than broker NVMe and has effectively infinite capacity. A cluster that previously held 7 days of retention on broker disk can hold 90 days at the same broker disk budget if old data is tiered. Senior interviewers reward the candidate who proposes this — it's the difference between a clever toy log service and a production-grade one.

A fourth topic worth raising: exactly-once. The producer-broker leg supports idempotent appends via a producer ID (PID) and per-partition sequence number; the broker dedupes retries with the same (PID, seq) pair. The broker-consumer leg supports exactly-once by committing consumed offsets and produced output in a single atomic transaction. Mention this as an optional layer; most consumers run at-least-once with idempotent downstream processing.
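
The (PID, seq) dedupe on the producer-broker leg can be sketched as follows. This is a simplified model, assuming the broker remembers only the last sequence number per producer and that sequences start at 0; `IdempotentPartition` is a hypothetical name.

```python
from typing import Dict, List, Tuple


class IdempotentPartition:
    """Partition log that dedupes producer retries by (PID, seq). Simplified sketch."""

    def __init__(self) -> None:
        self.log: List[bytes] = []
        # pid -> (last accepted seq, offset it was written at)
        self._last: Dict[int, Tuple[int, int]] = {}

    def append(self, pid: int, seq: int, record: bytes) -> int:
        last = self._last.get(pid)
        if last is not None and seq == last[0]:
            # Retry of the most recent write: ack again with the original offset.
            return last[1]
        expected = 0 if last is None else last[0] + 1
        if seq != expected:
            # A gap or an older sequence number; this sketch rejects both.
            raise ValueError(f"out-of-order sequence: got {seq}, expected {expected}")
        self.log.append(record)
        offset = len(self.log) - 1
        self._last[pid] = (seq, offset)
        return offset
```

A producer that times out waiting for an ack can resend the same (PID, seq) safely: the log gains no duplicate and the retry receives the original offset.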

Common mistakes

  • Designing a single global ordering — distributed logs preserve per-partition order only; insisting on global order forces the whole log onto one machine
  • Push-based replication from leader to followers — pull-based is cleaner because the leader doesn't have to track follower buffer state or back-pressure
  • Ignoring the ack-mode tradeoff — every senior interview drills on the durability/latency cost of ack=0 vs ack=1 vs ack=all
  • Forgetting tiered storage — without it, retention is bounded by broker disk and the system can't economically hold weeks of history
  • Treating consumer offset storage as a separate database — production systems reuse the log infrastructure (compacted topic) for offset durability

Likely follow-up questions

  • How would you support compaction (keep only the latest event per key) for changelog-style topics?
  • What changes if you need cross-region replication with sub-second lag?
  • How would you support exactly-once semantics end-to-end (producer through consumer's downstream write)?
  • How would your design handle a partition whose key distribution becomes hot (90% of writes hit one partition)?
  • How would you implement schema evolution so a consumer running an old schema can still read events written with a new schema?
  • How would you support a 'rewind to N hours ago' operation across thousands of consumers without overloading the brokers?

Frequently asked questions

Is Design Distributed Log the same as Design Message Queue?
They overlap but differ in semantics. A log preserves events durably and allows multiple consumer groups to replay from any offset. A message queue typically removes a message after one consumer acks it. Log designs emphasize ordering, replication, and retention; queue designs emphasize delivery guarantees and visibility timeouts.
Do I need to mention Raft or Paxos?
Yes for the control plane (the metadata that tracks leaders and ISR). No for the data plane: the actual log uses leader-follower replication with configurable acks, not full consensus per write. Conflating the two is a red flag at senior level.
How long is the Design Distributed Log interview?
45-60 minutes is typical at senior level. Expect deep questions on replication, ack modes, and at least one of (tiered storage, exactly-once, partition rebalance).
What is the most important concept for Design Distributed Log?
The ISR (in-sync replica set) and ack-mode tradeoff. Every senior interview hinges on whether the candidate can articulate exactly what survives a broker failure under ack=1 vs ack=all, and what the minimum-ISR knob protects against.