Why are leases alone insufficient for correct distributed locking?

Leases depend on clock synchronization. Under GC pauses, network delays, or clock drift, a holder may believe its lease is still valid after the cluster has already expired it and granted the lock to someone else. Fencing tokens close this gap by giving the protected resource the ability to reject stale holders.

Do I need to mention specific products like ZooKeeper or etcd by name?

Not required. The interview is about the patterns: consensus-backed state, fencing tokens, lease management, watch notifications, split-brain prevention. Reference the pattern, not the brand.

How long is the Design Distributed Lock Service interview?

45-60 minutes at senior level. The interviewer drills hard on fencing tokens and partition behavior. A candidate who proposes leases without fencing is signaling junior-level understanding; the strong signal comes from naming the failure modes explicitly.

What is the most important concept for Design Distributed Lock Service?

Fencing tokens plus the realization that the lock service alone cannot enforce mutual exclusion — the protected resource must participate by checking tokens. This split of responsibility (lock service grants and orders; resource enforces) is the senior-bar signal.

System Design Questions

Design Distributed Lock Service — System Design Interview Guide

Design Distributed Lock Service is a system-design interview that asks you to build a coordination service: clients acquire named locks across a cluster, only one client can hold a lock at a time, and the system must handle network partitions and client crashes without deadlocking forever. The hard part is the fencing-token contract — a lock holder must know if its lease expired before another client took the lock, and the protected resource must reject stale lock holders.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

Google
Amazon
Meta
Microsoft
Apple

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

Acquire a named lock with a configurable lease duration; the lock is exclusive and held until released or expired
Release a lock explicitly when work is done; release before lease expiry to allow other clients to acquire
Renew (extend) the lease while still holding the lock; lets a long-running client keep the lock without re-acquiring
Watch a lock for changes: client gets notified when a lock is released or its holder changes
Read the current holder of a lock without acquiring it (for diagnostics and monitoring)
Hierarchical locks with parent/child relationships for coordinating multi-step workflows

Non-functional requirements

Correctness: at most one client holds a given lock at any logical time; fencing tokens prevent double-execution under partition
Acquire latency: <50ms p99 for an uncontended lock; bounded by the consensus round-trip
Throughput: ~10K lock operations/sec per cluster; locks are coordination primitives, not high-volume data writes
Availability: 99.99% for lock operations; the cluster tolerates a minority of nodes failing
Lease duration: configurable 5 seconds to 1 hour; default 30 seconds
Notification latency: <1 second for watch fires from lock change to client notification

Capacity estimation

Lock services are coordination primitives, not bulk data stores. Scale anchors are throughput per operation rather than petabytes of data.

Typical platform: ~10K distinct named locks active at any moment, ~5K lock-acquire/release cycles per second across the cluster, ~50K lease renewals per second (long-running holders renew every few seconds).

State per lock: ~200 bytes (lock_name, current_holder_id, fencing_token, lease_expiry, watchers). At 10K locks × 200 bytes = 2 MB of total state — trivially fits in memory on every node.

Including inactive locks (created but currently not held) and historical state, total persisted state stays under 100 MB. Lock services are not bulk-data services; they're consensus services with small state.

The bottleneck isn't storage — it's the consensus round-trip per write. Every acquire/release/renew goes through the consensus protocol (write to leader, replicate to majority, ack). Each consensus write costs ~5-20ms in a same-datacenter cluster, ~20-50ms in a multi-AZ cluster.

Throughput: a 5-node cluster running a typical consensus protocol handles ~10-30K writes/sec on a single key, ~100K writes/sec across all keys (because writes can be batched). For a lock service this is plenty — production systems rarely need more.

Watch notifications: when a lock changes, the cluster notifies all watchers. A popular lock might have 100+ watchers; the notification fan-out is bounded by the watch list length. Notifications go over long-lived gRPC streams from clients to nodes.

High-availability requirements: the cluster typically runs 5 nodes (tolerates 2 failures) or 3 nodes (tolerates 1 failure). Larger clusters add latency without adding much availability — the marginal benefit drops off quickly.

High-level design

Architecture: an odd-sized cluster (3 or 5 nodes) running a consensus protocol like Raft. All writes go through the leader and are replicated to a majority before being acked. Reads can be served from any node (with linearizable reads requiring a leader confirmation) or from a follower (with possible staleness).

Lock state machine: each lock is a row in the cluster's state machine with fields (name, holder_id, fencing_token, lease_expiry, watchers). Operations are:

- Acquire(name, holder_id, lease_seconds): if the lock is free (no holder, or lease_expiry < now), set holder_id, increment fencing_token, set lease_expiry = now + lease_seconds, fire watcher notifications. If held, return failure. - Release(name, holder_id): if holder_id matches, clear holder, fire watcher notifications. - Renew(name, holder_id, lease_seconds): if holder_id matches, set lease_expiry = now + lease_seconds. Does NOT increment fencing_token (renewal is the same holder, same token). - Watch(name): subscribe to notifications for this lock. The cluster maintains a list of watchers per lock; on any state change, send a notification.

Every operation is a consensus write. The leader proposes, the majority acks, then the operation is applied to the state machine and a response is returned.

Fencing token: every successful acquire increments a monotonic counter (per lock or per cluster). The acquired lock object includes the current token. When the lock holder uses the lock to perform some action on an external resource, the holder includes the token in the request. The external resource tracks the highest token it has ever seen and rejects any request with a lower token.

Why fencing matters: leases are time-bounded but clocks drift. A holder believes its lease is valid for 10 more seconds but the cluster considers it expired and gave the lock to another client. Both clients send writes to the same external resource. Without fencing, both writes succeed and the resource is corrupted. With fencing, the resource sees token 7 from the new holder, then token 6 from the stale holder; token 6 < highest_seen_token, so the stale write is rejected.

Fencing tokens are the protection contract between the lock service and the protected resource. The lock service itself can't enforce mutual exclusion across all failure modes (clocks, partitions, GC pauses) — only the protected resource can.

Client library: typically provides a try-acquire-with-backoff loop, an automatic-renewal goroutine that keeps the lease alive, and a callback hook for 'you lost the lock' events. Clients should periodically check that their fencing token is still the highest seen by the resource — if not, their lock is gone even if their local lease still says valid.

Deep dive — the hard problem

Three deep dives: the fencing-token contract, lease expiry under network partitions, and split-brain prevention.

Fencing-token contract — the canonical correctness pattern.

A lease-based lock can never be fully correct on its own. The classic failure: client A holds a lock with a 30-second lease. Client A pauses for garbage collection for 60 seconds. The cluster expires A's lease and gives the lock to client B. A wakes up from GC, still believes it holds the lock, sends a write to the protected resource. B is also sending writes. Resource sees both — silent corruption.

Fix: every acquire returns a monotonically-increasing fencing token. A's token is 7. B's token is 8 (one higher because acquires are serialized through consensus). The protected resource tracks the highest token it has ever seen on each protected key. When A sends a write with token 7, the resource compares to its highest_seen (8) and rejects.

This works because consensus guarantees fencing tokens are monotonic — the cluster can never give out the same token to two different holders, and a later holder always has a higher token than any earlier holder.

The resource being protected must support this contract — it must store and check the fencing token. Lock services that don't expose tokens (or resources that don't check them) have a correctness hole that shows up under stress.

Lease expiry under network partition.

A lock holder is partitioned from the cluster. The cluster can't reach the holder, the lease expires, the cluster gives the lock to another client. Meanwhile the partitioned holder believes it still holds the lock (it can't hear the cluster's expiry decision).

Client-side safeguard: the holder's library checks 'is my lease still valid' before doing any operation that depends on the lock. The check is local — compare local_clock to lease_expiry — but it's only as good as the local clock. A clock that drifts forward (GC pause makes local time appear to skip) thinks the lease is still valid when it isn't.

Defense: the library checks against a monotonic clock (not wall clock) and uses a conservative safety margin (e.g. 'consider lease valid only if we have at least 5 seconds left'). Combined with fencing tokens, this means the worst case is the holder discovers it lost the lock at the next operation and the resource rejects the stale write.

Split-brain prevention.

What if the cluster itself partitions? Half the nodes can't see the other half. Naively, both halves might independently decide they're the leader and grant locks separately. This is the split-brain failure.

Consensus protocols (Raft) prevent this by requiring a majority for any write. In a 5-node cluster, a 3-2 partition means the 3-node side has majority and can write; the 2-node side has minority and cannot. If a network partition isolates the leader on the minority side, the minority side stops accepting writes; the majority side elects a new leader; the cluster proceeds with one logical leader.

When the partition heals, the minority side rejoins as followers and catches up on the writes it missed. No conflicting locks were granted because only one side could write during the partition.

The cost: a cluster that loses majority (e.g. 3 of 5 nodes fail) becomes read-only or completely unavailable until majority is restored. This is the consensus contract — availability is sacrificed for correctness during catastrophic failures.

Fourth tradeoff: read consistency. Reads can be linearizable (go through the leader, see all preceding writes), bounded-staleness (go to any node, may be slightly behind), or stale (any node, no freshness guarantee). Lock state lookups typically want linearizable — 'is this lock currently held?' requires a fresh answer. Monitoring queries can use bounded-staleness for less overhead.

Fifth tradeoff: watch reliability. When a lock changes, the cluster notifies watchers via a long-lived stream. If the stream breaks, the client reconnects and asks 'tell me about changes since my last seen revision N.' The cluster keeps a bounded change log; if the client is too far behind, it gets a 'too old, please re-read state' response. Production systems compact the change log so it doesn't grow unbounded.

Common mistakes

Designing a lock service without fencing tokens — leases alone don't survive clock drift or GC pauses
Running consensus on an even number of nodes — even sizes have the same fault tolerance as the next smaller odd size but with more election overhead
Allowing watch notifications to be best-effort with no replay — clients that briefly disconnect miss state changes and have no way to recover
Treating a lock service as a high-throughput key-value store — it's a low-volume coordination primitive, not a data store
Ignoring the protected-resource contract — fencing tokens only work if the resource being protected checks them

Likely follow-up questions

How would you support shared (multiple-reader, single-writer) locks in addition to exclusive locks?
What changes if you need cross-region locks where clients in different continents acquire the same lock?
How would your design handle a client that acquires a lock and then permanently disappears (network or process death)?
How would you implement leader-election-via-lock where N candidates race for a lock and the winner becomes the leader?
How would you support transactional multi-lock acquisition (acquire locks A and B atomically, or neither)?
How would you implement a 'lock with priority queue' where pending acquirers are served in FIFO order when the lock releases?

Related system design scenarios

Frequently asked questions

Why are leases alone insufficient for correct distributed locking?: Leases depend on clock synchronization. Under GC pauses, network delays, or clock drift, a holder may believe its lease is still valid after the cluster has already expired it and granted the lock to someone else. Fencing tokens close this gap by giving the protected resource the ability to reject stale holders.
Do I need to mention specific products like ZooKeeper or etcd by name?: Not required. The interview is about the patterns: consensus-backed state, fencing tokens, lease management, watch notifications, split-brain prevention. Reference the pattern, not the brand.
How long is the Design Distributed Lock Service interview?: 45-60 minutes at senior level. The interviewer drills hard on fencing tokens and partition behavior. A candidate who proposes leases without fencing is signaling junior-level understanding; the strong signal comes from naming the failure modes explicitly.
What is the most important concept for Design Distributed Lock Service?: Fencing tokens plus the realization that the lock service alone cannot enforce mutual exclusion — the protected resource must participate by checking tokens. This split of responsibility (lock service grants and orders; resource enforces) is the senior-bar signal.