Design a Collaborative Document Editor (Google-Docs style) — System Design Interview Guide
Design a Collaborative Document Editor is a system-design interview question that asks you to build the engine behind real-time multi-user editing: Google Docs, online code editors, design canvases. The hard part is conflict resolution: when two users edit the same character position simultaneously, the system must converge to a consistent state without losing anyone's intent. The canonical mechanisms are Operational Transformation (OT) and Conflict-Free Replicated Data Types (CRDTs).
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Microsoft
- Atlassian
- Figma
- Notion
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Multiple users edit the same document simultaneously and see each other's edits in near-real-time
- Cursor and selection presence: each user sees other users' cursor positions and selections
- Offline editing: a user disconnected from the network can keep editing, and edits reconcile on reconnect
- Document history: every edit is preserved, and the user can replay/revert to any past state
- Rich formatting: bold, italic, headings, lists, embedded images — not just plain text
- Permissions: viewer, commenter, editor roles; share via link or named user
Non-functional requirements
- Edit latency: <100ms p95 from a local keystroke to that edit rendering on a remote collaborator's screen
- Convergence: all clients of a document arrive at the same final state regardless of edit order
- Intention preservation: each user's edit produces the result they would expect (insert at the position they meant, not a shifted position)
- Scale: ~10M+ concurrent open documents, ~1B+ documents total, ~50K+ collaboration servers
- Document size: support documents up to ~50 MB rendered (millions of characters with formatting)
- Concurrent editors per document: support up to ~100 simultaneous editors on a single document
Capacity estimation
Public-scale assumption: ~10M+ concurrent open documents across the platform. Of those, ~5% have multiple users editing simultaneously (~500K actively collaborative documents). The remaining 95% are single-user sessions where the collaboration machinery is present but idle.
Edit-event throughput: an active typist generates ~5 keystrokes/sec. A document with 5 active editors generates ~25 events/sec. At 500K concurrent collaborative documents averaging 3 active editors each, that's ~7.5M events/sec across the platform. Most events are tiny (~50 bytes per inserted character, including operation metadata), so bandwidth is modest: ~400 MB/sec of aggregate inbound edit traffic. Each event fans out to the other collaborators on the document, which brings total egress to roughly 1-2 GB/sec.
Document storage: a 50-page document is ~1 MB of text + formatting. Average documents are ~10-50 KB. Across 1B documents, raw content storage is ~10-50 TB. The operation log (every keystroke ever made on every document) is far larger. At ~50 bytes per operation, ~5 keystrokes/sec/user × ~100M users produces ~90 TB of log per hour of active typing, so cumulative history passes ~1 PB after roughly a dozen hours of platform-wide typing time. Snapshot rollups + log compaction control this.
Concurrent connection count: each active editor maintains a persistent connection to a collaboration server. With ~10M concurrent users, you need ~10M open connections distributed across ~50K servers (~200 connections per server is comfortable for a websocket-style server with persistent state).
Metadata: per-document row (~500 bytes for owner, title, created_at, modified_at, permission list summary). 1B documents × 500 bytes = ~500 GB metadata. Sharded by document_id.
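The arithmetic behind these estimates can be reproduced in a few lines. This is a back-of-envelope sketch: every constant comes from the figures in this section, and the variable names are illustrative only. The computed egress (~750 MB/sec for an average of 3 editors) is a lower bound; reading the ~1-2 GB/sec figure as headroom for framing overhead and larger rooms is an assumption.

```python
# Back-of-envelope capacity math using the figures stated in this section.
CONCURRENT_DOCS = 10_000_000      # concurrent open documents
COLLAB_PERCENT = 5                # percent of open docs with multiple editors
EDITORS_PER_COLLAB_DOC = 3        # average active editors per collaborative doc
KEYSTROKES_PER_SEC = 5            # per active typist
BYTES_PER_EVENT = 50              # one insert op plus metadata
TOTAL_DOCS = 1_000_000_000
METADATA_BYTES_PER_DOC = 500
CONCURRENT_USERS = 10_000_000
COLLAB_SERVERS = 50_000

collab_docs = CONCURRENT_DOCS * COLLAB_PERCENT // 100
events_per_sec = collab_docs * EDITORS_PER_COLLAB_DOC * KEYSTROKES_PER_SEC
ingress_mb_per_sec = events_per_sec * BYTES_PER_EVENT / 1e6
# Each event is relayed to every *other* editor on the same document.
egress_mb_per_sec = ingress_mb_per_sec * (EDITORS_PER_COLLAB_DOC - 1)

content_tb_low = TOTAL_DOCS * 10_000 / 1e12    # 10 KB average document
content_tb_high = TOTAL_DOCS * 50_000 / 1e12   # 50 KB average document
metadata_gb = TOTAL_DOCS * METADATA_BYTES_PER_DOC / 1e9
connections_per_server = CONCURRENT_USERS // COLLAB_SERVERS

print(f"collaborative docs:     {collab_docs:,}")
print(f"edit events/sec:        {events_per_sec:,}")
print(f"ingress MB/sec:         {ingress_mb_per_sec:.0f}")
print(f"egress MB/sec (min):    {egress_mb_per_sec:.0f}")
print(f"content storage TB:     {content_tb_low:.0f}-{content_tb_high:.0f}")
print(f"metadata GB:            {metadata_gb:.0f}")
print(f"connections per server: {connections_per_server}")
```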
High-level design
Four core services: collaboration server (the live-document state machine), persistence (operation log + snapshots), presence (cursor and selection broadcast), and a permission/access layer.
Collaboration server: each active document is owned by exactly one collaboration server at a time. When a user opens a document, the client connects via websocket to the server that currently owns the document. If the document isn't loaded anywhere, a router assigns it to a server with capacity. The collaboration server holds the document's current state in memory plus a tail of recent operations. Edits from connected clients arrive as operations (insert at position P, delete range R, format range F). The server applies operations, broadcasts the result to all other connected clients, and appends to the persistent operation log.
Convergence is the hard part. Two clients can submit conflicting operations before either sees the other's edit. The collaboration server (or the clients, depending on architecture) transforms or merges these operations so all clients converge to the same final state. This is OT or CRDT mechanics — see deep dive.
Persistence: the operation log is the source of truth. Every operation is appended to a per-document append-only log in a durable store. Periodically (every N operations or T minutes), a snapshot of the document state is written so loading the document doesn't require replaying millions of operations from the beginning. Loading a document = load the most recent snapshot + replay the operations since that snapshot. Snapshot cadence trades disk usage against load latency.
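The snapshot-plus-replay load path can be sketched as follows. This is a minimal in-memory illustration, not a storage design: the `Op` and `DocumentStore` names, the plain-string document, and the `snapshot_every` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    seq: int          # position in the per-document log, starting at 1
    kind: str         # "insert" or "delete"
    pos: int
    text: str = ""    # inserted text (insert only)
    length: int = 0   # number of characters removed (delete only)


def apply_op(doc: str, op: Op) -> str:
    if op.kind == "insert":
        return doc[:op.pos] + op.text + doc[op.pos:]
    if op.kind == "delete":
        return doc[:op.pos] + doc[op.pos + op.length:]
    raise ValueError(f"unknown op kind: {op.kind}")


@dataclass
class DocumentStore:
    """Hypothetical store: append-only log plus a periodic snapshot."""
    snapshot_every: int = 3
    log: list[Op] = field(default_factory=list)
    snapshot_seq: int = 0     # last log entry folded into the snapshot
    snapshot_text: str = ""

    def append(self, kind: str, pos: int, text: str = "", length: int = 0) -> Op:
        op = Op(len(self.log) + 1, kind, pos, text, length)
        self.log.append(op)
        if op.seq - self.snapshot_seq >= self.snapshot_every:
            # Fold the tail into a new snapshot (every N operations).
            self.snapshot_text = self.load()
            self.snapshot_seq = op.seq
        return op

    def load(self) -> str:
        """Latest snapshot + replay of only the operations after it."""
        doc = self.snapshot_text
        for op in self.log[self.snapshot_seq:]:
            doc = apply_op(doc, op)
        return doc


store = DocumentStore(snapshot_every=3)
store.append("insert", 0, text="hello")
store.append("insert", 5, text=" world")
store.append("delete", 0, length=1)      # triggers a snapshot at seq 3
store.append("insert", 0, text="J")      # replayed on top of the snapshot
print(store.load())
```

A smaller `snapshot_every` shortens the replay tail at load time and writes snapshots more often, which is the disk-versus-latency trade described above.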
Presence: cursor positions, selections, and 'user X is currently typing' indicators are broadcast through a separate pub-sub channel (not the persistent operation log — presence is ephemeral and high-volume). Each client subscribes to the document's presence channel and publishes its own cursor position every ~500ms. Presence updates aren't durable; if the user disconnects, their cursor disappears.
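A presence channel along these lines might look like the sketch below. It is an in-memory stand-in for a pub-sub system; the `PresenceChannel` class and the 2-second expiry are assumptions for illustration (the text only specifies the ~500ms publish interval and that presence is not durable). Timestamps are passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Cursor:
    user_id: str
    position: int
    selection_end: int
    last_seen: float


class PresenceChannel:
    """Hypothetical ephemeral presence state for one document (never persisted)."""

    def __init__(self, ttl_seconds: float = 2.0):
        self.ttl = ttl_seconds          # assumed expiry; a few missed heartbeats
        self.cursors: dict[str, Cursor] = {}

    def publish(self, user_id: str, position: int, selection_end: int, now: float) -> None:
        # Latest cursor wins; older values are simply overwritten.
        self.cursors[user_id] = Cursor(user_id, position, selection_end, now)

    def visible_to(self, viewer_id: str, now: float) -> list[Cursor]:
        # Drop cursors whose owner stopped publishing (disconnect, tab closed).
        self.cursors = {
            uid: c for uid, c in self.cursors.items() if now - c.last_seen <= self.ttl
        }
        return [c for uid, c in sorted(self.cursors.items()) if uid != viewer_id]


channel = PresenceChannel(ttl_seconds=2.0)
channel.publish("alice", position=10, selection_end=10, now=0.0)
channel.publish("bob", position=42, selection_end=47, now=0.5)
print([c.user_id for c in channel.visible_to("alice", now=1.0)])
print([c.user_id for c in channel.visible_to("bob", now=2.4)])
```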
Permission layer: every websocket connection authenticates a (user, document) pair. The collaboration server checks the user's role (viewer, commenter, editor) on connect and on every operation. A viewer's operations are rejected; a commenter's operations are restricted to comment threads; an editor can submit any content or comment operation. Permission changes propagate via a permission-change event that forces re-evaluation on all open connections to the document.
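The per-operation role check could be sketched like this. The operation names (`add_comment`, `insert`, and so on) and the `PermissionGate` class are hypothetical; only the three roles and their restrictions come from the text.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    VIEWER = "viewer"
    COMMENTER = "commenter"
    EDITOR = "editor"


# Hypothetical operation names, used only for this illustration.
CONTENT_OPS = {"insert", "delete", "format"}
COMMENT_OPS = {"add_comment", "reply_comment", "resolve_comment"}


def is_allowed(role: Role, op_kind: str) -> bool:
    if role is Role.EDITOR:
        return op_kind in CONTENT_OPS or op_kind in COMMENT_OPS
    if role is Role.COMMENTER:
        return op_kind in COMMENT_OPS
    return False  # viewers may not submit operations


class PermissionGate:
    """Checked on connect and again on every incoming operation."""

    def __init__(self, acl: dict[str, Role]):
        self.acl = dict(acl)

    def check(self, user_id: str, op_kind: str) -> bool:
        role = self.acl.get(user_id)
        return role is not None and is_allowed(role, op_kind)

    def on_permission_change(self, user_id: str, new_role: Optional[Role]) -> None:
        # Because check() reads the ACL on every operation, updating it here
        # re-evaluates all open connections for that user immediately.
        if new_role is None:
            self.acl.pop(user_id, None)
        else:
            self.acl[user_id] = new_role


gate = PermissionGate({"ana": Role.EDITOR, "ben": Role.COMMENTER, "cy": Role.VIEWER})
print(gate.check("ana", "insert"), gate.check("ben", "insert"), gate.check("cy", "add_comment"))
gate.on_permission_change("ana", Role.VIEWER)
print(gate.check("ana", "insert"))
```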
Deep dive — the hard problem
Four deep dives: OT vs CRDT (including intention preservation), the offline-reconciliation story, document ownership and failover, and rich formatting and structured content.
OT vs CRDT: the central question. Both achieve convergence; they differ in mechanism and tradeoffs.
Operational Transformation (OT): when an operation arrives at the server, it's transformed against operations that have happened since the sender last synced. If user A inserts 'X' at position 5 and user B concurrently inserts 'Y' at position 3, B's operation arrives first and shifts subsequent positions; A's operation must be transformed (5 → 6) before applying. The transformation function is the heart of OT — for plain text it's simple (adjust position based on prior inserts/deletes), but for rich text with formatting and structured operations it gets complex fast. Google Docs uses OT and has invested heavily in correctness of the transformation function.
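For plain text, a transformation function of the kind described above can be sketched in a few lines. This is a minimal illustration for single-character inserts and deletes, not any product's algorithm; the `Insert`/`Delete` types and the site-ID tie-break for equal positions are assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    char: str
    site: str   # tie-breaker when two inserts target the same position


@dataclass(frozen=True)
class Delete:
    pos: int


Op = Union[Insert, Delete]


def transform(op: Op, applied: Op) -> Optional[Op]:
    """Rewrite `op` so it can run after the concurrent `applied`.

    Both operations were created against the same document state.
    Returns None when `op` has become a no-op.
    """
    if isinstance(applied, Insert):
        if isinstance(op, Insert):
            shift = applied.pos < op.pos or (
                applied.pos == op.pos and applied.site < op.site
            )
            return Insert(op.pos + 1, op.char, op.site) if shift else op
        return Delete(op.pos + 1) if applied.pos <= op.pos else op
    # `applied` is a Delete
    if isinstance(op, Insert):
        return Insert(op.pos - 1, op.char, op.site) if applied.pos < op.pos else op
    if applied.pos == op.pos:
        return None  # both users deleted the same character
    return Delete(op.pos - 1) if applied.pos < op.pos else op


def apply_op(doc: str, op: Optional[Op]) -> str:
    if op is None:
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.char + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


# The example from the text: A inserts 'X' at 5, B inserts 'Y' at 3.
a, b = Insert(5, "X", "A"), Insert(3, "Y", "B")
doc = "abcdefgh"
print(transform(a, b))                              # A's position becomes 6
print(apply_op(apply_op(doc, b), transform(a, b)))  # B first, then A
print(apply_op(apply_op(doc, a), transform(b, a)))  # A first, then B
```

Both application orders produce the same string, which is the convergence property the transformation function exists to guarantee.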
In practice, OT relies on a central server to define a canonical order of operations. The server receives each operation, transforms it against any operations that have happened since the sender's version, applies it, and broadcasts the transformed operation to other clients. In a server-authoritative design the server owns this canonical transformation; clients that apply their own edits optimistically, or that queue edits while offline, additionally need a local transformation layer to rebase incoming operations against their unacknowledged local ones (client-side OT).
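The server-side loop can be sketched as below, restricted to inserts to stay short. The `OTServer` class, the revision counter, and the client-ID tie-break are assumptions of the sketch; the flow (transform against everything since the sender's base revision, apply, append, broadcast the transformed operation) follows the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    client: str


def transform(op: Insert, applied: Insert) -> Insert:
    """Shift `op` right if `applied` landed before it (ties broken by client ID)."""
    before = applied.pos < op.pos or (
        applied.pos == op.pos and applied.client < op.client
    )
    return Insert(op.pos + len(applied.text), op.text, op.client) if before else op


class OTServer:
    """Hypothetical server-authoritative OT loop for one document."""

    def __init__(self, text: str = ""):
        self.text = text
        self.history: list[Insert] = []   # history[i] produced revision i + 1

    @property
    def revision(self) -> int:
        return len(self.history)

    def receive(self, op: Insert, base_revision: int) -> tuple[Insert, int]:
        if not 0 <= base_revision <= self.revision:
            raise ValueError("unknown base revision")
        # Transform against every operation the sender had not seen yet.
        for applied in self.history[base_revision:]:
            op = transform(op, applied)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.history.append(op)
        # The transformed op and new revision are what gets broadcast.
        return op, self.revision


server = OTServer("abcdefgh")
# Both clients edited revision 0; B's operation reaches the server first.
print(server.receive(Insert(3, "Y", "B"), base_revision=0))
print(server.receive(Insert(5, "X", "A"), base_revision=0))
print(server.text)
```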
Conflict-Free Replicated Data Types (CRDTs): each character (or unit) gets a unique identifier that doesn't depend on position. Inserts are 'insert character with ID X between IDs A and B.' Deletes mark by ID. Because operations reference stable IDs not shifting positions, operations commute — applying them in any order produces the same result. No transformation needed.
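A toy sequence CRDT makes the commutativity claim concrete. This sketch is an assumption-laden simplification, not Yjs or any production design: IDs are `(counter, site)` pairs, inserts are anchored "after ID A" rather than "between A and B", deletes are tombstones, and the visible text is recomputed from the full state on every read.

```python
from dataclasses import dataclass, field

ROOT = (0, "")  # sentinel ID standing for the start of the document


@dataclass
class SequenceCRDT:
    site: str
    counter: int = 0
    elements: dict = field(default_factory=dict)   # id -> (parent_id, char)
    tombstones: set = field(default_factory=set)   # ids of deleted characters

    def local_insert(self, after_id, char):
        self.counter += 1
        op = ("insert", (self.counter, self.site), after_id, char)
        self.apply(op)
        return op  # ship this to the other replicas

    def local_delete(self, target_id):
        op = ("delete", target_id)
        self.apply(op)
        return op

    def apply(self, op):
        if op[0] == "insert":
            _, elem_id, parent_id, char = op
            self.elements[elem_id] = (parent_id, char)
            self.counter = max(self.counter, elem_id[0])  # Lamport-style clock
        else:
            self.tombstones.add(op[1])

    def text(self) -> str:
        children = {}
        for elem_id, (parent_id, _) in self.elements.items():
            children.setdefault(parent_id, []).append(elem_id)
        out = []
        # Pre-order walk; siblings are visited in descending ID order, so
        # every replica orders concurrent inserts at the same anchor identically.
        stack = sorted(children.get(ROOT, []))
        while stack:
            elem_id = stack.pop()
            if elem_id not in self.tombstones:
                out.append(self.elements[elem_id][1])
            stack.extend(sorted(children.get(elem_id, [])))
        return "".join(out)


alice, bob, carol = SequenceCRDT("alice"), SequenceCRDT("bob"), SequenceCRDT("carol")
a1 = alice.local_insert(ROOT, "H")
a2 = alice.local_insert(a1[1], "i")
for op in (a1, a2):
    bob.apply(op)

# Concurrent edits: both append after 'i'; Bob also deletes 'H'.
a3 = alice.local_insert(a2[1], "!")
b3 = bob.local_insert(a2[1], "?")
b_del = bob.local_delete(a1[1])

for op in (b3, b_del):
    alice.apply(op)
bob.apply(a3)
for op in (b_del, b3, a3, a2, a1):   # a deliberately scrambled delivery order
    carol.apply(op)

print(alice.text(), bob.text(), carol.text())
```

Because the replica state is just a set of ID-tagged elements plus a set of tombstones, the delivery order cannot affect the result. The tombstone set and per-character IDs are also where the storage overhead of CRDTs comes from.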
CRDTs simplify the client-server contract: clients can sync peer-to-peer without a central authority, and the server's job is just to relay and persist. The cost: storage. Each character carries metadata (its unique ID, ordering metadata, tombstones for deleted characters). A document that has been heavily edited can have CRDT metadata 10-100x larger than its visible text. CRDT implementations use various techniques (Yjs-style efficient encodings, garbage collection of unreachable IDs) to keep this in check.
The modern trend leans CRDT for new systems: simpler client-server model, better offline support, no central transformation server. OT remains entrenched in established products (Google Docs) because the transformation function is heavily engineered and switching paradigms is risky. In the interview, name both, explain when you'd choose each, pick one for your design, and be ready to justify it.
Intention preservation: convergence isn't enough. If user A selects 'cat' and types 'dog' (a replace), and user B simultaneously inserts 'the ' before the word, the user-expected result is 'the dog,' not 'the cat' with 'dog' appearing somewhere else. The transformation function (OT) or the operation design (CRDT) must preserve intent: A meant to replace the word, even if its position shifted. Mention intention preservation explicitly; it's the senior signal that separates a candidate who has read about OT from one who has thought about the corner cases.
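One way to preserve the intent of a replace is to model it as a single operation instead of an independent delete plus insert, so the new text stays anchored to the range it replaces. The sketch below works through the 'cat'/'dog' example under that assumption. The `Replace` type is hypothetical, and the case where a concurrent insert lands inside the selected range is deliberately left out because it is a product policy decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Replace:
    start: int
    length: int   # number of characters being replaced
    text: str     # replacement text


def transform_replace(op: Replace, applied: Insert) -> Replace:
    """Rebase a replace over a concurrent insert."""
    if applied.pos <= op.start:
        # Text inserted before the word: the whole replace shifts right.
        return Replace(op.start + len(applied.text), op.length, op.text)
    if applied.pos >= op.start + op.length:
        return op
    raise NotImplementedError("insert inside the selection: policy not covered here")


def transform_insert(op: Insert, applied: Replace) -> Insert:
    """Rebase an insert over a concurrent replace."""
    if op.pos <= applied.start:
        return op
    if op.pos >= applied.start + applied.length:
        return Insert(op.pos + len(applied.text) - applied.length, op.text)
    raise NotImplementedError("insert inside the selection: policy not covered here")


def apply_insert(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def apply_replace(doc: str, op: Replace) -> str:
    return doc[:op.start] + op.text + doc[op.start + op.length:]


doc = "cat"
a = Replace(0, 3, "dog")   # user A selects 'cat' and types 'dog'
b = Insert(0, "the ")      # user B inserts 'the ' before the word

b_then_a = apply_replace(apply_insert(doc, b), transform_replace(a, b))
a_then_b = apply_insert(apply_replace(doc, a), transform_insert(b, a))
print(repr(b_then_a), repr(a_then_b))
```

If the replace were split into a delete and a free-standing insert at position 0, a position tie-break against B's insert could place 'dog' in front of 'the ', which converges but is not what A meant.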
Offline reconciliation: when a user edits offline, their operations queue up locally. On reconnect, the queued operations stream to the server, which transforms each against the operations that happened on the server while the user was offline. Long offline sessions accumulate large operation queues; the server may reject reconciliation if the queue is too large or the divergence too severe, and present the user with a manual conflict-resolution UI. Browser-based clients use IndexedDB or similar to persist the queue across browser restarts so a user closing their laptop offline doesn't lose work.
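The reconnect step can be sketched as rebasing the offline queue over the operations the client missed. This is an insert-only illustration; `reconcile`, `ReconcileRejected`, and the `max_queue` limit are hypothetical names, and the local persistence of the queue (IndexedDB or similar) is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    client: str


class ReconcileRejected(Exception):
    """Raised when the server refuses automatic reconciliation."""


def transform(op: Insert, applied: Insert) -> Insert:
    before = applied.pos < op.pos or (
        applied.pos == op.pos and applied.client < op.client
    )
    return Insert(op.pos + len(applied.text), op.text, op.client) if before else op


def rebase(queue: list[Insert], missed: list[Insert]) -> list[Insert]:
    """Rebase queued offline ops over the server ops the client never saw."""
    rebased = []
    for op in queue:
        shifted = []
        for s in missed:
            # Transform both directions from the same pair, so the next
            # queued op sees server ops expressed after this one.
            op, s = transform(op, s), transform(s, op)
            shifted.append(s)
        rebased.append(op)
        missed = shifted
    return rebased


def reconcile(server_text: str, missed: list[Insert], queue: list[Insert],
              max_queue: int = 1000) -> tuple[str, list[Insert]]:
    if len(queue) > max_queue:
        raise ReconcileRejected("offline queue too large; needs manual resolution")
    rebased = rebase(queue, missed)
    for op in rebased:
        server_text = server_text[:op.pos] + op.text + server_text[op.pos:]
    return server_text, rebased


# Client A went offline on "hello" and typed " world" and then "!".
offline_queue = [Insert(5, " world", "A"), Insert(11, "!", "A")]
# Meanwhile client B prepended ">> " on the server.
missed_ops = [Insert(0, ">> ", "B")]
text, rebased_ops = reconcile(">> hello", missed_ops, offline_queue)
print(text)
print([op.pos for op in rebased_ops])
```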
Document ownership and failover: each active document lives on exactly one collaboration server. If that server crashes, connected clients reconnect via the router, which assigns the document to a new server. The new server loads the most recent snapshot + operation log from durable storage and resumes. There's a brief gap (seconds) during failover where clients can't apply edits; offline-reconciliation mechanics make this graceful — clients queue operations locally during the gap and replay on reconnect.
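The router's side of ownership and failover might be sketched as follows. The `Router` class, the least-loaded placement rule, and the per-server capacity numbers are assumptions for illustration; the sketch also ignores leases or fencing, which a real deployment would need so that a slow-but-alive server cannot keep acting as owner.

```python
class Router:
    """Hypothetical router tracking which server owns each active document."""

    def __init__(self, capacity: dict[str, int]):
        self.capacity = dict(capacity)       # server -> max active documents
        self.owner: dict[str, str] = {}      # doc_id -> owning server

    def load(self, server: str) -> int:
        return sum(1 for s in self.owner.values() if s == server)

    def locate(self, doc_id: str) -> str:
        """Return the current owner, assigning one if the doc isn't loaded."""
        if doc_id in self.owner:
            return self.owner[doc_id]
        candidates = [s for s in self.capacity if self.load(s) < self.capacity[s]]
        if not candidates:
            raise RuntimeError("no collaboration server has capacity")
        server = min(candidates, key=lambda s: (self.load(s), s))
        self.owner[doc_id] = server
        return server

    def mark_failed(self, server: str) -> list[str]:
        """Drop a crashed server; its documents are reassigned on next locate()."""
        self.capacity.pop(server, None)
        orphaned = sorted(d for d, s in self.owner.items() if s == server)
        for doc_id in orphaned:
            del self.owner[doc_id]
        return orphaned


router = Router({"collab-1": 2, "collab-2": 2})
print(router.locate("doc-a"), router.locate("doc-b"), router.locate("doc-a"))
print(router.mark_failed("collab-1"))
# Reconnecting clients ask the router again; the new owner would then load
# the latest snapshot plus the operation log before accepting edits.
print(router.locate("doc-a"))
```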
For very high-collaboration documents (50+ active editors), a single collaboration server may bottleneck on broadcast fan-out. Production systems shard the broadcast layer: the server handles operation transformation centrally but delegates broadcast to a pub-sub layer with multiple subscribers per document. Or, with CRDT-based designs, clients can peer-mesh directly for low-latency broadcast with the central server handling persistence and late-joiner sync.
Fourth tradeoff: rich formatting and structured content. Plain-text OT/CRDT is well-understood. Rich text (bold, italic, links, embedded images, tables, nested lists) adds operation types and increases transformation-function complexity. Some systems represent rich text as a tree of nodes with CRDT-tracked tree operations (tree CRDTs); others use a flat character stream with formatting attribute spans. Each has tradeoffs around concurrent edits to overlapping formatting ranges.
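For the flat-stream representation, one simple way to make overlapping formatting edits converge is a last-writer-wins value per character per attribute. The sketch below assumes that scheme; the `FormatOp` type and `(counter, site)` stamps are hypothetical, and ranges are plain indices into a fixed string, whereas a real system would anchor them to stable character IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatOp:
    start: int
    end: int        # exclusive
    attr: str       # e.g. "bold"
    value: object
    stamp: tuple    # (counter, site): higher stamp wins on conflict


def apply_format(marks: list, op: FormatOp) -> None:
    """Apply a formatting op to per-character attribute maps, in place."""
    for i in range(op.start, min(op.end, len(marks))):
        current = marks[i].get(op.attr)
        if current is None or op.stamp > current[0]:
            marks[i][op.attr] = (op.stamp, op.value)


def render(text: str, marks: list) -> list:
    """Pair each character with its winning attribute values."""
    return [
        (ch, {attr: value for attr, (_, value) in sorted(m.items())})
        for ch, m in zip(text, marks)
    ]


text = "hello"
ops = [
    FormatOp(0, 3, "bold", True, (1, "A")),
    FormatOp(2, 5, "bold", False, (1, "B")),   # overlaps A's range on index 2
    FormatOp(1, 4, "italic", True, (2, "A")),
]

replica_1 = [{} for _ in text]
replica_2 = [{} for _ in text]
for op in ops:
    apply_format(replica_1, op)
for op in reversed(ops):
    apply_format(replica_2, op)

print(render(text, replica_1) == render(text, replica_2))
print(render(text, replica_1)[2])
```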
Common mistakes
- Not naming OT or CRDT — convergence under concurrent edits is the central problem and the candidate must name the standard mechanisms
- Confusing convergence with intention preservation — all clients reaching the same state isn't enough if the state isn't what users meant
- Skipping offline reconciliation — real users go offline mid-edit and the system must handle it
- Forgetting that presence is separate from edits — cursor positions are high-volume and ephemeral, the operation log is durable and lower-volume
- Treating document ownership as permanent — collaboration servers crash and documents must hand off to new servers
Likely follow-up questions
- How would you handle a 'time travel' feature where a user can scrub the document back to any past moment and edit from there?
- What changes if the document supports embedded comments threaded on specific text ranges?
- How would you support a document with 1000 concurrent viewers but only 10 concurrent editors?
- How would you implement end-to-end encryption — the server can't see the document content but collaboration still works?
- How would you design the system to merge two separate documents into one without losing edit history from either?
Frequently asked questions
- How long is the Design Collaborative Editor interview?
- 60-75 minutes. Expect deep questions on OT vs CRDT, offline reconciliation, and presence. This is a senior-bar question: junior candidates often pass by just naming OT, but senior candidates are expected to dig into the transformation-function corner cases.
- Do I need to know the OT transformation function in detail?
- Naming OT, explaining 'incoming operation gets transformed against ops that happened since sender's last sync,' and walking through one concrete example (concurrent inserts at the same position) is enough. Drawing the full insert-vs-delete-vs-format transformation table is overkill.
- Which is better, OT or CRDT?
- Neither dominates. OT is simpler for centralized servers and has more battle-tested implementations (Google Docs). CRDT is simpler for peer-to-peer and offline-heavy scenarios but has higher storage overhead. New systems often choose CRDTs or CRDT-inspired designs (Linear, Figma, and modern note apps are commonly cited, although Figma describes its multiplayer model as CRDT-inspired with a central server rather than a pure CRDT); legacy systems stay on OT. In the interview, name both, pick one with justification, and be ready to defend.
- What is the most important concept for Design Collaborative Editor?
- Convergence + intention preservation under concurrent edits, plus the OT/CRDT tradeoff. The senior signal is recognizing that 'just append operations to a log' is wrong because concurrent operations need to be reconciled, and naming the mechanism that does the reconciling.