Design a Content Moderation System — System Design Interview Guide
Design a Content Moderation System is a system-design interview question that asks you to build the pipeline that scans user-generated content (images, video, text) and removes or restricts violating content: CSAM, hate speech, harassment, spam, sexual content, terrorist content. The hard part is balancing accuracy, throughput, human-reviewer cost, and legal mandates that demand zero tolerance for certain content categories.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- Meta
- TikTok
- Microsoft
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Scan every user upload (post, comment, image, video, livestream) before or shortly after publish
- Classify content against multiple policy categories (CSAM, hate speech, sexual content, violence, spam, harassment)
- Take automated action: allow, restrict (age-gate, geo-block, demonetize), remove, escalate to human review
- Route ambiguous cases to a human-reviewer queue with policy-aware prioritization
- Maintain an appeal flow: user contests a takedown, escalated reviewer re-evaluates
- Report mandatory categories (CSAM) to law-enforcement databases per regional law
Non-functional requirements
- Throughput: ~100K uploads/sec at peak across a global platform
- Latency: <2s p95 for pre-publish text scan; <30s p95 for image scan; video can take minutes
- Recall on CSAM and terrorist content: ~100% — false negatives are unacceptable
- False-positive rate on benign content: <0.5% — over-aggressive removal drives users off the platform
- Reviewer cost: human-review tier must scale sub-linearly with content volume (most cases auto-resolve)
- Audit: every moderation decision must be replayable for legal discovery
Capacity estimation
Public-scale assumption: ~500M uploads/day across text, image, short-video. That's ~6K/sec average and ~100K/sec at peak (early-evening US + early-morning Asia overlap). Of that volume, ~80% is text (comments, posts), ~15% image, ~5% video.
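ML-classifier inference cost dominates: each image goes through a multi-task classifier (~50ms on a GPU shard). Taking the full 100K/sec peak as an upper bound on image+video traffic, that's 5,000 GPU-seconds of work per second, or ~5,000 GPUs kept continuously busy at one item per inference pass; getting down to ~500 GPUs requires roughly a 10x throughput gain per GPU, for example from batching. Using the 20% image+video share from the mix above (~20K/sec at peak), the unbatched figure is ~1,000 GPUs. Video is decoded into 1-fps keyframes plus an audio classifier, so a 1-minute video is ~60 image classifications + 1 audio classification.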
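The sizing arithmetic can be made explicit with a short script. This is a back-of-envelope sketch using the figures stated in this section; the function names and the `batch_size` parameter are illustrative assumptions, and it assumes a batch takes the same ~50ms as a single item, which real hardware will only approximate.

```python
SECONDS_PER_DAY = 86_400


def average_rate(uploads_per_day: int) -> float:
    """Average uploads per second over a full day."""
    return uploads_per_day / SECONDS_PER_DAY


def gpus_needed(items_per_sec: float, latency_sec: float, batch_size: int = 1) -> float:
    """GPUs kept fully busy if each GPU finishes batch_size items per latency_sec."""
    return items_per_sec * latency_sec / batch_size


avg = average_rate(500_000_000)                   # roughly 6K uploads/sec on average
upper_bound = gpus_needed(100_000, 0.050)         # whole peak treated as images
media_only = gpus_needed(100_000 * 0.20, 0.050)   # 15% image + 5% video of the peak
keyframes_per_minute = 60 * 1                     # 1-fps sampling of a 1-minute video

print(f"average rate: {avg:,.0f}/sec")
print(f"unbatched GPUs, upper bound: {upper_bound:,.0f}")
print(f"unbatched GPUs, 20% media share: {media_only:,.0f}")
```

Calling `gpus_needed(100_000, 0.050, batch_size=10)` shows how a 10x per-GPU throughput gain changes the estimate by an order of magnitude.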
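Hash-matching against known-bad content (CSAM hash database, terrorist content shared-hash database) runs at ~1M comparisons/sec on a single host using perceptual hashes (PhotoDNA-style, pHash, dHash). With a database of ~10M known-bad hashes, a Bloom filter pre-check followed by exact match is sub-millisecond. A Bloom filter and exact match only catch bit-identical hashes, so matching re-encoded or slightly altered copies additionally requires a near-match (Hamming-distance) lookup.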
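A minimal sketch of the lookup path, assuming 64-bit perceptual hashes represented as Python integers. The `BloomFilter` and `KnownBadIndex` classes, their sizes, and the `max_distance` default are hypothetical and untuned; the near-match check is a linear scan for readability, which would not meet the throughput figures above on a 10M-entry database.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over 64-bit integer hashes (illustrative, not tuned)."""

    def __init__(self, num_bits: int = 1 << 20, num_probes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_probes = num_probes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: int):
        data = value.to_bytes(8, "big")
        for seed in range(self.num_probes):
            # A different salt per probe gives independent bit positions.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


class KnownBadIndex:
    def __init__(self, hashes) -> None:
        self.exact = set(hashes)
        self.bloom = BloomFilter()
        for h in self.exact:
            self.bloom.add(h)

    def exact_hit(self, h: int) -> bool:
        # The Bloom filter rejects most clean content before the set lookup.
        return self.bloom.might_contain(h) and h in self.exact

    def near_hit(self, h: int, max_distance: int = 4) -> bool:
        # Linear scan for clarity only; a real index needs sub-linear search.
        return any(hamming_distance(h, bad) <= max_distance for bad in self.exact)
```

An upload whose hash differs from a known-bad hash by one bit fails `exact_hit` but passes `near_hit`, which is why the two checks are kept separate here.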
Human-reviewer queue depth: if 0.5% of content is escalated to human review, that's ~2.5M items/day. At ~30 seconds per routine review and ~10 minutes per deep review of an ambiguous case, you need ~1500 reviewers per shift × 3 shifts = ~4,500 reviewers globally, call it ~5,000. Staffing scales linearly with the escalation rate, so cutting it by 0.1 percentage points (from 0.5% to 0.4%) removes a fifth of the queue and saves roughly 1,000 reviewers.
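The staffing estimate can be reproduced with a small model. This is a sketch, not a workforce-planning tool: the text does not say what share of escalations need a deep review, so `deep_fraction=0.05` and the 8-hour shift are assumptions chosen here because they land near the headcount quoted above.

```python
def reviewers_needed(
    uploads_per_day: int,
    escalation_rate: float,
    quick_sec: float = 30.0,       # routine review time from the text
    deep_sec: float = 600.0,       # 10-minute deep review from the text
    deep_fraction: float = 0.05,   # assumption: share of items needing deep review
    shift_hours: float = 8.0,      # assumption: one reviewer works one 8h shift/day
) -> float:
    """Reviewers required per day across all shifts, at full utilization."""
    items = uploads_per_day * escalation_rate
    avg_sec = (1 - deep_fraction) * quick_sec + deep_fraction * deep_sec
    review_hours = items * avg_sec / 3600
    return review_hours / shift_hours


baseline = reviewers_needed(500_000_000, 0.005)
reduced = reviewers_needed(500_000_000, 0.004)

print(f"reviewers at 0.5% escalation: {baseline:,.0f}")
print(f"reviewers at 0.4% escalation: {reduced:,.0f}")
print(f"reviewers saved: {baseline - reduced:,.0f}")
```

Because headcount is linear in the escalation rate in this model, moving from 0.5% to 0.4% removes one fifth of the review hours regardless of the other parameters.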
Storage: every moderation decision is logged for audit (decision, model_version, confidence, policy_category, reviewer_id_if_any, timestamp). At 500M decisions/day × ~500 bytes = ~250 GB/day, ~90 TB/year. Manageable in a partitioned columnar store with 1-year hot retention and cold archive thereafter.
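One way to picture the audit record and the storage arithmetic, as a sketch only. The field list follows the one in the paragraph above; `content_id`, the sample values, and the JSON encoding are assumptions added for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ModerationDecision:
    content_id: str              # assumption: not in the field list above
    decision: str
    model_version: str
    confidence: float
    policy_category: str
    reviewer_id: Optional[str]   # None when the decision was fully automated
    timestamp: str               # ISO-8601, UTC


def daily_log_bytes(decisions_per_day: int, bytes_per_record: int) -> int:
    """Raw audit-log volume per day, before compression or indexing."""
    return decisions_per_day * bytes_per_record


record = ModerationDecision(
    content_id="c-123",
    decision="escalate",
    model_version="v42",
    confidence=0.71,
    policy_category="hate_speech",
    reviewer_id=None,
    timestamp="2024-01-01T00:00:00Z",
)
encoded = json.dumps(asdict(record)).encode("utf-8")

per_day_gb = daily_log_bytes(500_000_000, 500) / 1e9
per_year_tb = per_day_gb * 365 / 1000
print(f"{len(encoded)} bytes for the sample record")
print(f"{per_day_gb:.0f} GB/day, {per_year_tb:.0f} TB/year")
```

Storing `model_version` alongside the score is what makes a decision replayable: the same input can be rerun against the model that produced the original verdict.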
High-level design
Five-layer pipeline: ingest, hash match, ML classifier, action, human review. Each layer reduces the volume that flows to the next, so the expensive layers see a fraction of total upload traffic.
Ingest layer: every upload publishes an event to a moderation topic before the content is publicly visible. For text-only content, the scan is synchronous on the post-write path (publish blocks on the 1-2 second scan). For images and video, the upload completes immediately and the content is marked 'pending review'; the public-feed system filters pending content out of public feeds until cleared. Optimistic-publish (show to the author immediately, hold from public until cleared) is the common UX.
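The optimistic-publish rule described above can be sketched as a small visibility check. The state names and function signatures are hypothetical, and the sketch simplifies the text path to pass or fail (it ignores the escalate-to-review outcome).

```python
from enum import Enum


class ReviewState(Enum):
    PENDING = "pending"
    CLEARED = "cleared"
    REMOVED = "removed"


def initial_state(content_type: str, text_scan_passed: bool = True) -> ReviewState:
    """Text is scanned synchronously on the write path; media starts pending."""
    if content_type == "text":
        return ReviewState.CLEARED if text_scan_passed else ReviewState.REMOVED
    return ReviewState.PENDING


def is_visible(state: ReviewState, author_id: str, viewer_id: str) -> bool:
    """Cleared content is public; pending content is shown only to its author."""
    if state is ReviewState.CLEARED:
        return True
    if state is ReviewState.PENDING:
        return viewer_id == author_id
    return False
```

The public-feed system would apply `is_visible` (or an equivalent index filter) at read time, so clearing an item requires only a state change, not a republish.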
Hash-match layer: every image and video keyframe is perceptually hashed and looked up against the known-bad hash database (CSAM hashes from NCMEC, terrorist content from GIFCT, platform-specific repeat-violator hashes). A hit is an automatic remove and report — no ML classifier needed. This catches the most legally-sensitive content cheaply.
ML classifier layer: content that didn't hash-match goes through a multi-task abuse classifier. Image content runs through a vision model that outputs scores for ~20 policy categories (nudity, violence, hate symbols, weapons, drugs). Text runs through a language model that scores hate speech, harassment, spam, threats. Video samples 1 keyframe per second plus an audio track. The classifier outputs (category, score) tuples; downstream rules map score ranges to actions.
Action layer: based on classifier output and policy rules, the system takes one of four actions. (1) Clean — no action, content goes live. (2) Restrict — age-gate, demonetize, or geo-block; content stays up but with limits. (3) Auto-remove — high-confidence violation, content removed and creator notified. (4) Escalate to human review — score is in the ambiguous band, route to a reviewer queue.
Human review layer: a queue of escalated items prioritized by (a) policy severity — CSAM and terrorist content jump the queue, (b) recency, (c) creator's prior-violation history, (d) audience size of the content (a post with 1M views needs faster review than one with 10). Reviewers see the content with policy context and pick from the same action menu. Their decisions feed back as labeled training data for the next ML classifier version.
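The four prioritization signals can be combined into a single score feeding a heap. This is a sketch under loud assumptions: the severity table, the weights, and the reading of "recency" as favoring newer items are all illustrative choices, not values from any production system.

```python
import heapq
import itertools
import math
from dataclasses import dataclass

# Hypothetical severity weights; only the relative ordering matters here.
SEVERITY = {"csam": 100, "terrorism": 100, "violence": 50, "hate_speech": 40, "spam": 10}


@dataclass
class Escalation:
    content_id: str
    category: str
    age_hours: float
    prior_violations: int
    views: int


def priority(item: Escalation) -> float:
    severity = SEVERITY.get(item.category, 20)
    reach = math.log10(item.views + 1)        # 1M views counts more than 10, sub-linearly
    history = min(item.prior_violations, 5)   # capped so history cannot outrank severity
    recency = max(0.0, 10.0 - item.age_hours) # assumption: newer items score higher
    return severity * 10 + reach * 5 + history * 2 + recency


class ReviewQueue:
    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so items are never compared

    def push(self, item: Escalation) -> None:
        # heapq is a min-heap, so negate the score to pop the highest priority first.
        heapq.heappush(self._heap, (-priority(item), next(self._counter), item))

    def pop(self) -> Escalation:
        return heapq.heappop(self._heap)[2]
```

With these weights a CSAM escalation with 10 views still outranks a spam item with 1M views, which matches the rule that the most severe categories jump the queue.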
Deep dive — the hard problem
Five topics: the multi-stage classifier with policy thresholds, the human-reviewer system, the appeal flow, the pre-publish vs post-publish tradeoff, and adversarial content.
Multi-stage classifier with policy thresholds: a single confidence score isn't enough — different policy categories tolerate different false-positive rates. CSAM detection runs with a very low threshold (auto-remove on score >0.3) because false-positives are recoverable and false-negatives are unacceptable. Hate-speech detection runs with a higher threshold (auto-remove on score >0.9, escalate on 0.5-0.9, allow below 0.5) because the cost of removing satire or quoted criticism is high. Each policy category has its own (auto-allow, escalate, auto-remove) thresholds tuned on a held-out evaluation set with quarterly recalibration.
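A minimal sketch of per-category thresholds, using the numbers from the paragraph above where it gives them. The CSAM escalate threshold of 0.1 is an assumption (the text only states the 0.3 auto-remove point), and the names `Thresholds`, `POLICY`, and `decide` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Thresholds:
    escalate_at: float  # scores at or above this go to human review
    remove_at: float    # scores strictly above this are auto-removed


POLICY: Dict[str, Thresholds] = {
    "csam": Thresholds(escalate_at=0.1, remove_at=0.3),         # 0.1 is an assumption
    "hate_speech": Thresholds(escalate_at=0.5, remove_at=0.9),  # both from the text
}

ACTION_RANK = {"allow": 0, "escalate": 1, "remove": 2}


def decide(category: str, score: float) -> str:
    """Map one (category, score) tuple to an action using that category's band."""
    t = POLICY[category]
    if score > t.remove_at:
        return "remove"
    if score >= t.escalate_at:
        return "escalate"
    return "allow"


def decide_all(scores: Dict[str, float]) -> str:
    """The strictest action across all scored categories wins."""
    actions = [decide(category, score) for category, score in scores.items()]
    return max(actions, key=ACTION_RANK.__getitem__, default="allow")
```

The same score of 0.35 yields `remove` for CSAM and `allow` for hate speech, which is the point of tuning each category separately.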
The classifier ensemble is layered: a fast cheap model (~5ms) screens out clearly clean content, a heavier model (~50ms) handles the ambiguous middle, and a multimodal cross-checker (~200ms) handles the hardest 1% (e.g., text-on-image memes that combine an innocent image with a hateful caption). Cascade architecture keeps inference cost proportional to ambiguity.
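The cascade can be sketched as a loop that stops at the first stage confident enough to decide. The stub scorers stand in for real models, the per-stage bands are invented for illustration, and the 10% figure for traffic reaching the heavy model is an assumption (the text only gives the 1% for the multimodal stage).

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str], float]
Stage = Tuple[str, Scorer, float, float]  # (name, scorer, clean_below, violating_above)


def make_stub(scores: Dict[str, float], default: float = 0.5) -> Scorer:
    """Stand-in for a model: looks up a canned score, else returns 'unsure'."""
    return lambda content: scores.get(content, default)


def cascade(content: str, stages: List[Stage]) -> Tuple[str, float, str]:
    """Run stages in order; return (deciding_stage, score, verdict)."""
    if not stages:
        raise ValueError("cascade needs at least one stage")
    for name, scorer, clean_below, violating_above in stages:
        score = scorer(content)
        if score < clean_below:
            return name, score, "clean"
        if score > violating_above:
            return name, score, "violating"
    return name, score, "ambiguous"  # last stage was still unsure: human review


def expected_cost_ms(reach_heavy: float = 0.10, reach_multimodal: float = 0.01) -> float:
    """Average inference time per item given the share reaching each later stage."""
    return 5 + reach_heavy * 50 + reach_multimodal * 200


STAGES: List[Stage] = [
    ("fast", make_stub({"nice photo": 0.01}), 0.05, 0.95),
    ("heavy", make_stub({"borderline joke": 0.97}), 0.10, 0.95),
    ("multimodal", make_stub({"innocent image + hateful caption": 0.95}), 0.20, 0.80),
]
```

Under these assumed reach fractions the average item costs about 12ms of inference, compared with 255ms if every item ran all three models.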
Language-specific models matter: hate speech in Hindi or Tagalog doesn't lift cleanly from an English-trained classifier. Mature platforms maintain ~20-30 language-specific classifiers, each with separate training pipelines. Low-resource languages rely on machine-translation-then-classify as a fallback with degraded accuracy.
Human-reviewer system: reviewers are the most expensive layer, so the system optimizes their throughput. The queue UI presents one item at a time with policy context (which categories the ML flagged, the creator's prior-violation count, the audience reach). Reviewers have ~30 seconds for clear cases and escalate hard cases to a second-tier reviewer. Dual-review for sensitive cases: any CSAM or terrorist-content escalation goes to two independent reviewers and a third tiebreaker on disagreement.
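The dual-review rule can be sketched as a function that only calls the tiebreaker on disagreement. The function name is hypothetical, and the handling of a three-way split (referring the item onward) is an assumption; the text does not say what happens in that case.

```python
from collections import Counter
from typing import Callable


def dual_review(first: str, second: str, tiebreaker: Callable[[], str]) -> str:
    """Resolve two independent reviewer decisions, with a third on disagreement."""
    if first == second:
        return first  # agreement: the third reviewer is never consulted
    third = tiebreaker()
    decision, count = Counter([first, second, third]).most_common(1)[0]
    if count >= 2:
        return decision
    return "refer_to_policy_team"  # assumption: all three reviewers disagreed
```

Passing the tiebreaker as a callable means the third review is only requested, and its cost only paid, when the first two reviewers disagree.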
Reviewer wellbeing is a real concern: viewing graphic content all day can cause PTSD-like effects. Production systems blur or grayscale imagery and mute audio by default, require an explicit reveal action per item, and pair this with a mandatory break cadence and counseling availability. This is operational policy, but it is worth naming in the interview because interviewers value candidates who think about human cost.
Reviewer decisions feed an active-learning loop: items where the ML was uncertain and a reviewer made a call become high-value training data. The labeling system tracks reviewer agreement; high-disagreement items get a third reviewer and become 'hard examples' that the next model version trains on with extra weight.
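One simple way to express "train on hard examples with extra weight" is a per-example weight computed from model uncertainty and reviewer disagreement. This is a sketch with invented weights; the uncertain band defaults to the 0.5-0.9 hate-speech escalation band mentioned earlier, and nothing here reflects a specific platform's labeling system.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_label(labels: Sequence[str]) -> str:
    """Most common reviewer label; ties resolve to the first one seen."""
    return Counter(labels).most_common(1)[0][0]


def training_weight(
    model_score: float,
    reviewer_labels: Sequence[str],
    uncertain_band: Tuple[float, float] = (0.5, 0.9),
    uncertain_bonus: float = 1.0,      # assumption: illustrative weight
    disagreement_bonus: float = 2.0,   # assumption: illustrative weight
) -> float:
    """Weight for one labeled item in the next training run (base weight 1.0)."""
    weight = 1.0
    low, high = uncertain_band
    if low <= model_score <= high:
        weight += uncertain_bonus        # the model was unsure, so the label is informative
    if len(set(reviewer_labels)) > 1:
        weight += disagreement_bonus     # reviewers disagreed: a 'hard example'
    return weight
```

An item scored 0.7 where reviewers split two to one gets weight 4.0, while a confidently scored item with unanimous reviewers keeps the base weight of 1.0.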
Appeal flow: when content is removed, the creator gets a notification with the policy category and a 'request review' button. The appeal queue is separate from the primary review queue and is staffed by senior reviewers with more context (policy precedent, the creator's full history). An appeal can result in (a) original decision upheld, (b) original decision reversed and content restored with creator notified, (c) referred to policy team for a precedent-setting case.
A well-designed appeal flow is critical for fairness and PR: a high-profile false-positive that doesn't get reversed quickly becomes a news story. The appeal SLA at major platforms is 24-72 hours for routine cases, faster for high-audience cases.
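The three appeal outcomes and the SLA can be sketched as follows. The state names, the 1M-view cutoff for "high-audience", and the specific 72-hour and 12-hour deadlines are illustrative picks within or below the 24-72 hour range quoted above, not documented platform values.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class AppealOutcome(Enum):
    UPHELD = "upheld"
    REVERSED = "reversed"
    REFERRED_TO_POLICY = "referred_to_policy"


def appeal_deadline(
    filed_at: datetime,
    views: int,
    high_audience_views: int = 1_000_000,  # assumption: cutoff for the fast lane
) -> datetime:
    """Routine appeals get 72h; high-audience appeals get a shorter deadline."""
    hours = 12 if views >= high_audience_views else 72
    return filed_at + timedelta(hours=hours)


def content_state_after(outcome: AppealOutcome) -> str:
    """Resulting content state for each of the three appeal outcomes."""
    if outcome is AppealOutcome.REVERSED:
        return "restored"
    if outcome is AppealOutcome.UPHELD:
        return "removed"
    return "pending_policy_decision"


filed = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(appeal_deadline(filed, views=10))
print(appeal_deadline(filed, views=2_000_000))
```

Computing the deadline at filing time lets the appeal queue sort by time remaining, so high-audience cases surface first without a separate queue.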
Fourth tradeoff: pre-publish vs post-publish moderation. Pre-publish blocks the user until the scan completes — adds latency, frustrates legitimate posters. Post-publish lets content go live immediately and removes after detection — faster UX but allows brief windows of visible violations. Most platforms use a hybrid: cheap hash and text scans run pre-publish (block on hit); expensive ML scans run post-publish with the content held out of public feeds until cleared. Mention this tradeoff explicitly — interviewers expect it.
Fifth: adversarial content. Bad actors evolve to evade classifiers (slight image rotations, leetspeak text, audio-track-only violations on muted-video uploads). Production systems run a red-team that continually probes the classifier with adversarial examples and retrains on those failures. The classifier-vs-adversary loop is permanent; the model is never 'done.'
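A red-team probe can be sketched as a harness that perturbs known-violating seeds and records which variants slip past the classifier. Everything here is a toy: the keyword classifier, the placeholder blocklist term, and the two perturbations are assumptions standing in for real models and real evasion techniques.

```python
from typing import Callable, Dict, List, Tuple

BLOCKLIST = {"badword"}  # placeholder term for illustration

LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})


def leetspeak(text: str) -> str:
    """Swap common letters for look-alike digits."""
    return text.translate(LEET)


def toy_classifier(text: str) -> bool:
    """Stand-in model: flags text containing a blocklisted term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def probe(
    classifier: Callable[[str], bool],
    seeds: List[str],
    perturbations: Dict[str, Callable[[str], str]],
) -> List[Tuple[str, str, str]]:
    """Return (seed, perturbation_name, variant) for every variant that evades."""
    misses = []
    for seed in seeds:
        if not classifier(seed):
            continue  # only count evasions of content the classifier already catches
        for name, perturb in perturbations.items():
            variant = perturb(seed)
            if not classifier(variant):
                misses.append((seed, name, variant))
    return misses


misses = probe(
    toy_classifier,
    ["this is a badword"],
    {"leetspeak": leetspeak, "uppercase": str.upper},
)
```

The returned misses are the failures the text describes retraining on: each one is a labeled violating example the current model gets wrong.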
Common mistakes
- Treating moderation as a single binary classifier — production needs per-category thresholds and multi-stage cascade
- Forgetting hash-match for known-bad content — far cheaper than ML for the most legally-sensitive cases
- Skipping the human-review tier — fully-automated moderation has false-positive rates that destroy user trust
- Ignoring the appeal flow — high-profile false-positives without reversal become PR disasters
- Not naming the adversarial-evasion problem — bad actors continually probe the classifier and the model must retrain
Likely follow-up questions
- How would you handle livestream moderation where you have seconds to act, not minutes?
- How would you design the moderation system to support a new language with no labeled training data?
- What changes if the platform supports end-to-end encrypted messages where the server can't read content?
- How would you handle a coordinated brigading attack where a thousand accounts post the same harmful content within an hour?
- How would you design the system to be auditable for a regulator who asks 'show me every moderation decision on X topic in the last year'?
Frequently asked questions
- How long is the Design Content Moderation interview?
- 60 minutes is typical at trust-and-safety teams (Meta Integrity, Google Trust & Safety, TikTok). Expect deep questions on per-category thresholds, the human-review tier, and the appeal flow.
- Do I need to know the actual ML model architectures?
- Naming the cascade (cheap screener → heavy classifier → multimodal cross-checker) and explaining why is enough. Drawing the actual transformer architecture is overkill — the senior signal is in the policy thresholds, the cascade economics, and the human-loop design.
- How is this different from spam detection?
- Spam detection is one policy category inside the broader moderation system. The difference: spam has a high tolerance for false negatives (worst case is a missed promotional message), while CSAM/terrorism have zero tolerance for false negatives. The interview question is the latter problem dressed in the former's vocabulary.
- What is the most important concept for Design Content Moderation?
- The cascade with per-category thresholds plus the human-review loop. The senior signal is recognizing that one classifier with one threshold can't serve both legal-mandate categories and gray-area speech, and proposing the multi-tier architecture that lets each category be tuned separately.