Design a Real-Time Fraud Detection System — System Design Interview Guide

The "design a real-time fraud detection system" question asks you to score every payment or account event for fraud risk within tens of milliseconds (the target below is under 50 ms at p99), block the high-risk ones, and let the rest through. The hard part is balancing false-positive cost (rejecting a legitimate customer) against false-negative cost (approving a fraudster), both of which are expensive, while serving an ML model on the hot path of every transaction.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

PayPal
Block
Visa
Amazon
Meta

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

Score every incoming event (payment authorization, account login, account creation, password reset) for fraud risk in real time
Return a decision (allow / review / block) within a strict latency budget on the hot path
Maintain a feature store with hundreds of per-user, per-device, per-IP, and per-merchant signals derived from historical event streams
Serve a fraud-scoring model that consumes those features and outputs a calibrated risk score
Surface flagged events to a human review queue with full context (event, user history, model explanation)
Feed labeled outcomes (chargebacks, manual review verdicts) back into the model training pipeline

Non-functional requirements

End-to-end latency: <50ms p99 on the hot path (the entire fraud check sits inside the calling service's latency budget)
Scale: ~50K events/sec at peak across all event types; ~10K of those are payment authorizations
Availability: 99.99%+; a fraud-system outage either takes payments down or opens a fraud-loss window
Model staleness: training data must be fresh within hours, not days (fraud patterns shift fast)
Feature freshness: hot features (transactions in last 5 minutes) must be visible within seconds of the event

Capacity estimation

As a working assumption, anchor on the order of magnitude of the largest payment platforms with embedded fraud systems: ~50K events/sec at peak across payment authorizations, account logins, password resets, signups, and money movements. Of those, the payment authorization stream is the most expensive (every authorization must be scored before the platform commits) at ~10K TPS peak.

Latency budget breakdown for the 50ms target: ~5ms feature fetch from the hot feature store, ~10ms model inference (one or two model passes — a fast linear or shallow tree-based first pass, optionally followed by a heavier deep model on borderline cases), ~5ms rule-engine evaluation (override rules: known bad IPs, sanctions-list matches), ~5ms decision logging + write to the event store, ~25ms buffer. Anything that doesn't fit gets cut from the hot path and moved offline.
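
The budget above is simple enough to check in a few lines. The following is a minimal sketch of that arithmetic; the stage names and the `fits_budget` helper are illustrative assumptions, not part of any real system.

```python
# Hot-path latency budget from the breakdown above (all values in ms).
# Stage names are illustrative labels, not real service names.
BUDGET_MS = 50

STAGES_MS = {
    "feature_fetch": 5,
    "model_inference": 10,
    "rule_engine": 5,
    "decision_logging": 5,
}


def remaining_buffer(stages: dict[str, int], budget_ms: int = BUDGET_MS) -> int:
    """Return the unallocated headroom after summing all stage budgets."""
    return budget_ms - sum(stages.values())


def fits_budget(stages: dict[str, int], extra_stage_ms: int,
                budget_ms: int = BUDGET_MS) -> bool:
    """True if a candidate new hot-path stage still leaves non-negative headroom."""
    return remaining_buffer(stages, budget_ms) - extra_stage_ms >= 0


if __name__ == "__main__":
    print(remaining_buffer(STAGES_MS))  # 25
    print(fits_budget(STAGES_MS, 30))   # False
```

A stage that fails `fits_budget` is a candidate for the offline path, which matches the rule that anything that does not fit gets cut from the hot path.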

Storage estimates: treating the ~50K events/sec peak as a sustained rate (an upper bound), 50K × 86,400 sec/day ≈ 4.3B events/day. At ~2 KB per event (event payload + computed features + model score + decision + reason codes) that is ~8.6 TB/day raw. Hot index (last 30 days for online feature aggregation) ~250 TB. Warm tier (90-180 days for model training) ~0.8-1.5 PB. Cold archive (regulatory retention, typically 5-7 years for payment fraud) is large but sits on cheap object storage.

Feature store: ~500M users × ~1 KB per user feature vector (50-100 features: transaction counts in 1h/24h/7d/30d windows, device counts, IP counts, velocity ratios, merchant diversity) = ~500 GB. ~1B devices × ~500 bytes = ~500 GB. ~100M IPs × ~500 bytes = ~50 GB. Plus per-merchant features (~10M merchants × ~1 KB) = ~10 GB. Total feature store ~1 TB hot, sharded for parallel reads.
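
The storage and feature-store figures above can be reproduced with a short script. This is a back-of-envelope sketch using only the numbers quoted in the text; it assumes decimal units and treats the peak event rate as sustained, so the results are upper bounds.

```python
# Back-of-envelope sizing using the figures quoted above.
# Decimal units (1 KB = 1,000 bytes) keep the arithmetic readable.
KB, GB, TB = 10**3, 10**9, 10**12

events_per_sec = 50_000   # peak rate, treated as sustained (upper bound)
event_bytes = 2 * KB

events_per_day = events_per_sec * 86_400
raw_tb_per_day = events_per_day * event_bytes / TB
hot_index_tb = raw_tb_per_day * 30     # 30-day online window
warm_tier_tb = raw_tb_per_day * 180    # upper end of the 90-180 day range

# entity type -> (entity count, bytes per feature vector)
feature_store = {
    "users": (500_000_000, 1 * KB),
    "devices": (1_000_000_000, 500),
    "ips": (100_000_000, 500),
    "merchants": (10_000_000, 1 * KB),
}
feature_store_gb = {
    name: count * size / GB for name, (count, size) in feature_store.items()
}
total_feature_store_gb = sum(feature_store_gb.values())

if __name__ == "__main__":
    print(f"{events_per_day / 1e9:.2f}B events/day")
    print(f"{raw_tb_per_day:.2f} TB/day raw")
    print(f"{hot_index_tb:.0f} TB hot, {warm_tier_tb:.0f} TB warm (upper end)")
    print(f"{total_feature_store_gb:.0f} GB feature store")
```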

Model size: a gradient-boosted tree model with ~1000 trees and depth ~6 is ~10-50 MB serialized, so it easily fits in memory on every inference node. A deep model would be larger (~100-500 MB) but is reserved for borderline cases. Model rollouts are gradual (shadow, then canary, then full deployment), as described under model serving in the high-level design.

High-level design

Five core layers: event ingestion + streaming, feature store, model serving, decision engine + rules, and the offline training pipeline.

Event ingestion accepts events from every part of the platform that needs scoring: payment authorization, login, signup, money movement, profile change. Each event includes a context payload (user ID, device fingerprint, IP, geolocation, amount, merchant, etc.). Ingestion is dual-purpose: synchronous for hot-path scoring, asynchronous for stream processing.

The feature store is the central piece. It serves two read paths: online (lookups in roughly a millisecond for hot-path scoring) and offline (bulk reads for model training). Online features are computed in two ways. Real-time aggregations (counts and sums over short windows such as the last 1 minute, 5 minutes, or 1 hour) are computed by a streaming job that consumes the event stream and writes to a low-latency key-value store. Batch features (counts and sums over longer windows such as the last 7 or 30 days) are computed nightly and written to the same store. The hot path reads both in a single multi-key fetch. Online/offline parity is enforced by sharing the same feature-computation code between the streaming job and the training pipeline: a 'feature framework' compiles one feature definition into both a Flink-like streaming job and a Spark-like batch job.
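
The single multi-key fetch might look like the sketch below. It assumes an in-memory dict as a stand-in for the key-value store; the `fetch_feature_vector` helper, the key format, and the feature names are all hypothetical and chosen only for illustration.

```python
from typing import Mapping, Sequence

# Hypothetical in-memory stand-in for the low-latency key-value store.
# Keys look like "<entity_type>:<entity_id>". Values are feature dicts written
# by either the streaming job (short windows) or the nightly batch job.
Store = Mapping[str, Mapping[str, float]]


def fetch_feature_vector(
    store: Store,
    keys: Sequence[str],
    feature_order: Sequence[str],
    default: float = 0.0,
) -> list[float]:
    """Merge the features for several entity keys into one ordered vector.

    Missing keys or features fall back to `default`, so an unseen device or IP
    does not fail the hot path.
    """
    merged: dict[str, float] = {}
    for key in keys:  # a real store would serve this as one multi-get round trip
        merged.update(store.get(key, {}))
    return [merged.get(name, default) for name in feature_order]


if __name__ == "__main__":
    store = {
        "user:u1": {"user_txn_count_5m": 3.0, "user_txn_count_30d": 42.0},
        "device:d9": {"device_user_count_30d": 2.0},
    }
    order = [
        "user_txn_count_5m",       # streaming-written
        "user_txn_count_30d",      # batch-written
        "device_user_count_30d",
        "ip_txn_count_1h",         # unseen IP, falls back to default
    ]
    print(fetch_feature_vector(store, ["user:u1", "device:d9", "ip:203.0.113.7"], order))
```

Prefixing feature names with the entity type avoids collisions when the per-entity dicts are merged.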

Model serving is a stateless inference layer. Each inference node holds the current model artifact in memory. Inference is a single API call that takes a feature vector and returns a calibrated risk score (probability of fraud, 0.0 to 1.0). Models are versioned and rolled out via shadow → canary → full deployment — every model decision is logged with the model version that produced it so post-hoc analysis can quantify per-version performance.
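
One way to picture the canary stage and the per-decision version logging is the sketch below. It assumes hash-based routing on an entity ID and models represented as plain callables; the function names, the log shape, and the version labels are hypothetical.

```python
import hashlib
from typing import Callable, Mapping


def rollout_bucket(entity_id: str) -> float:
    """Map an ID to a stable value in [0, 1) so routing is deterministic per entity."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def pick_model_version(entity_id: str, stable: str, candidate: str,
                       canary_fraction: float) -> str:
    """Route roughly `canary_fraction` of entities to the candidate version."""
    return candidate if rollout_bucket(entity_id) < canary_fraction else stable


def score_and_log(entity_id: str, features: list[float],
                  models: Mapping[str, Callable[[list[float]], float]],
                  stable: str, candidate: str, canary_fraction: float,
                  log: list[dict]) -> float:
    """Score with the routed model and record which version produced the score."""
    version = pick_model_version(entity_id, stable, candidate, canary_fraction)
    score = models[version](features)
    log.append({"entity_id": entity_id, "model_version": version, "score": score})
    return score


if __name__ == "__main__":
    # Toy models: constant scores stand in for real inference.
    models = {"v1": lambda f: 0.10, "v2": lambda f: 0.20}
    log: list[dict] = []
    for user in ("u1", "u2", "u3"):
        score_and_log(user, [0.0], models, "v1", "v2", 0.05, log)
    print(log)
```

Hashing the entity ID keeps a given user on the same version across requests, which makes per-version performance comparisons cleaner than random per-request routing.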

The decision engine combines the model score with a deterministic rule engine. Rules handle the cases where the model is unreliable or where compliance demands a hard policy: sanctioned-country block, known-bad-device block, transaction-amount cap for new accounts, velocity caps (no more than N transactions per hour per card). Rules are evaluated alongside the model; the final decision is the most-restrictive outcome (rule block dominates model allow). The engine outputs one of three decisions: allow, review (route to human queue), or block.
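
The most-restrictive-outcome rule can be sketched with an ordered enum. Everything concrete here is an assumption for illustration: the score cutoffs, the rule thresholds, the event field names, and the placeholder country code. Fixed score cutoffs are also a simplification of the amount-aware thresholding discussed in the deep dive.

```python
from enum import IntEnum


class Decision(IntEnum):
    """Ordered so that a larger value is more restrictive."""
    ALLOW = 0
    REVIEW = 1
    BLOCK = 2


# Hypothetical policy values, for illustration only.
SANCTIONED_COUNTRIES = {"XX"}     # placeholder country code
NEW_ACCOUNT_DAYS = 7
NEW_ACCOUNT_AMOUNT_CAP = 500.0
MAX_TXNS_PER_HOUR = 10


def model_decision(score: float, review_at: float = 0.3,
                   block_at: float = 0.8) -> Decision:
    """Translate a risk score into a decision using fixed cutoffs."""
    if score >= block_at:
        return Decision.BLOCK
    if score >= review_at:
        return Decision.REVIEW
    return Decision.ALLOW


def rule_decisions(event: dict) -> list[Decision]:
    """Evaluate deterministic rules; each rule that fires contributes an outcome."""
    outcomes = []
    if event.get("country") in SANCTIONED_COUNTRIES:
        outcomes.append(Decision.BLOCK)
    if (event.get("account_age_days", 0) < NEW_ACCOUNT_DAYS
            and event.get("amount", 0.0) > NEW_ACCOUNT_AMOUNT_CAP):
        outcomes.append(Decision.REVIEW)
    if event.get("card_txns_last_hour", 0) > MAX_TXNS_PER_HOUR:
        outcomes.append(Decision.BLOCK)
    return outcomes


def decide(event: dict, score: float) -> Decision:
    """Final decision is the most restrictive of the model and all fired rules."""
    return max([model_decision(score), *rule_decisions(event)])
```

For example, `decide({"country": "XX", "amount": 10.0, "account_age_days": 400}, 0.01)` returns `Decision.BLOCK` even though the model score alone would allow the event.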

The offline training pipeline runs continuously. It ingests the labeled outcomes from the chargeback feed (a chargeback weeks after the original transaction confirms it was fraud), the manual review feed (analyst verdicts on review-queue events), and customer support escalations (rejected transactions that the customer claims were legitimate). The labels are merged with the historical event stream and feature snapshots, and a training job re-fits the model on the most recent N months of data, typically retraining daily and rolling new models out weekly.

Deep dive — the hard problem

Four deep dives: feature engineering for fraud, false-positive cost and calibration, the chargeback-feedback delay, and the human review queue.

Feature engineering — the model is only as good as the features. The senior signal here is naming the categories of features and explaining why each helps. (a) Velocity features: counts of transactions/logins per entity per time window. Velocity is the strongest signal — a card running 50 transactions in 10 minutes is likely card-testing. (b) Diversity features: how many distinct merchants/IPs/devices an entity has touched. High diversity in a short window is suspicious. (c) Graph features: shared-device counts between users, shared-IP counts. A user sharing a device with 20 other accounts is likely a fraud ring node. (d) Historical features: account age, time since last password change, lifetime spend, lifetime chargeback count. (e) Cross-entity ratios: amount vs. average amount for this merchant, amount vs. average for this user, deviations from the user's typical hour-of-day or device pattern.
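
As an example of category (c), a shared-device count can be derived from login pairs. This is a small in-memory sketch under the assumption that logins are available as `(user_id, device_id)` tuples; a production system would more likely maintain these counts incrementally, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def shared_device_counts(logins: Iterable[tuple[str, str]]) -> dict[str, int]:
    """For each user, count the distinct *other* users seen on any of their devices."""
    users_by_device: dict[str, set[str]] = defaultdict(set)
    devices_by_user: dict[str, set[str]] = defaultdict(set)
    for user, device in logins:
        users_by_device[device].add(user)
        devices_by_user[user].add(device)

    counts = {}
    for user, devices in devices_by_user.items():
        neighbors = set().union(*(users_by_device[d] for d in devices))
        neighbors.discard(user)  # do not count the user as their own neighbor
        counts[user] = len(neighbors)
    return counts


if __name__ == "__main__":
    logins = [("a", "d1"), ("b", "d1"), ("c", "d1"), ("c", "d2"), ("d", "d3")]
    print(shared_device_counts(logins))
```

The resulting count is served like any other per-user feature, which keeps the graph signal at the feature-serving boundary.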

The streaming-versus-batch split matters: velocity in the last 5 minutes must be computed by a streaming aggregator (cannot wait for the next batch). Lifetime stats are fine in batch (refresh nightly). Mention this split — it's the difference between catching card-testing in real time and catching it the next morning.
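
The streaming side of that split can be illustrated with a sliding-window counter. This is a single-process sketch, not a streaming-engine job: it assumes timestamps arrive in order per key, whereas a real streaming job would also handle out-of-order events. The class and key names are hypothetical.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts events per key within a trailing window of `window_seconds`.

    Assumes timestamps for a given key are recorded in non-decreasing order.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, key: str, ts: float) -> None:
        """Register one event for `key` at time `ts` (seconds)."""
        self._events[key].append(ts)

    def count(self, key: str, now: float) -> int:
        """Return the number of events in the window (now - window, now]."""
        q = self._events[key]
        while q and q[0] <= now - self.window:
            q.popleft()  # evict events that have aged out of the window
        return len(q)


if __name__ == "__main__":
    counter = VelocityCounter(window_seconds=300)  # 5-minute window
    for ts in (0, 100, 200, 400):
        counter.record("card:1234", ts)
    print(counter.count("card:1234", now=400))  # 2
```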

False-positive cost — this is the most-asked senior-bar question. Every false positive (rejecting a legitimate transaction) has a cost: the customer abandons, the merchant loses revenue, and the brand takes a hit. Every false negative (approving a fraudulent transaction) has a cost: the platform absorbs the chargeback, plus the operational cost of dispute handling. The two costs are asymmetric and vary by event type. A blocked $5 coffee transaction is a tiny customer-experience cost. A blocked $5000 wire transfer is a major customer-experience cost. A missed $5 fraud is a small loss. A missed $5000 fraud is a major loss.

The calibration response is to set the block threshold dynamically by the cost ratio: for low-value transactions, tolerate more risk (a high score threshold); for high-value transactions, block at a lower risk score (a low score threshold). The model outputs a calibrated probability (not just a rank score), and the decision engine computes expected loss = risk × amount × loss_given_fraud and blocks when expected loss exceeds a tunable threshold. This only works with a calibrated model: among all events scored 0.05, roughly 5% must actually turn out to be fraud. Calibration is checked by binning predictions and plotting each bin's mean predicted score against its observed fraud rate; a well-calibrated model produces points along the diagonal. Mention calibration explicitly, because most candidates skip it.
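
Both halves of this argument fit in a short sketch: the expected-loss rule and the binned calibration check. The dollar threshold, the default `loss_given_fraud` of 1.0, and the function names are assumptions made for illustration.

```python
def expected_loss(p_fraud: float, amount: float, loss_given_fraud: float = 1.0) -> float:
    """Expected loss = risk x amount x loss_given_fraud."""
    return p_fraud * amount * loss_given_fraud


def should_block(p_fraud: float, amount: float, loss_threshold: float,
                 loss_given_fraud: float = 1.0) -> bool:
    """Block when expected loss exceeds the tunable threshold."""
    return expected_loss(p_fraud, amount, loss_given_fraud) > loss_threshold


def implied_score_threshold(amount: float, loss_threshold: float,
                            loss_given_fraud: float = 1.0) -> float:
    """Risk score above which a transaction of this amount would be blocked."""
    return min(1.0, loss_threshold / (amount * loss_given_fraud))


def calibration_bins(scores, labels, n_bins: int = 10):
    """Return (mean predicted, observed fraud rate, count) per non-empty bin.

    Bins are equal-width over [0, 1]; labels are 1 for fraud and 0 otherwise.
    A calibrated model has mean predicted close to observed rate in every bin.
    """
    score_sum = [0.0] * n_bins
    fraud_count = [0] * n_bins
    count = [0] * n_bins
    for score, label in zip(scores, labels):
        i = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        score_sum[i] += score
        fraud_count[i] += label
        count[i] += 1
    return [
        (score_sum[i] / count[i], fraud_count[i] / count[i], count[i])
        for i in range(n_bins) if count[i]
    ]


if __name__ == "__main__":
    threshold = 5.0  # hypothetical tolerated expected loss, in dollars
    print(should_block(0.05, 5.0, threshold))      # False: $5 coffee at 5% risk
    print(should_block(0.05, 5000.0, threshold))   # True: $5000 transfer at 5% risk
    print(implied_score_threshold(5000.0, threshold))
    print(calibration_bins([0.05] * 100, [1] * 5 + [0] * 95))
```

With a fixed expected-loss threshold, the implied score threshold falls as the amount rises, which is the "low threshold for high-value transactions" behavior described above.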

Chargeback-feedback delay: the ground-truth label arrives weeks or months after the original transaction. In most card networks a fraud chargeback can be filed up to 120 days after the original charge. This delay creates a label-latency problem: a model trained today has complete chargeback labels only for transactions from about 4 months ago and earlier. Pattern shifts in the intervening 4 months are invisible to chargeback labels alone.
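
The label-latency cutoff is a one-line date calculation. The sketch below uses the 120-day figure quoted above; the transaction record shape and the function names are assumptions.

```python
from datetime import date, timedelta

CHARGEBACK_WINDOW_DAYS = 120  # figure quoted in the text above


def label_maturity_cutoff(today: date,
                          window_days: int = CHARGEBACK_WINDOW_DAYS) -> date:
    """Transactions on or before this date have had the full chargeback window."""
    return today - timedelta(days=window_days)


def split_by_label_maturity(transactions: list[dict], today: date):
    """Split transactions into (mature, immature) by whether labels are complete.

    Each transaction is assumed to be a dict with a `date` field.
    """
    cutoff = label_maturity_cutoff(today)
    mature = [t for t in transactions if t["date"] <= cutoff]
    immature = [t for t in transactions if t["date"] > cutoff]
    return mature, immature


if __name__ == "__main__":
    today = date(2026, 5, 25)
    txns = [{"id": 1, "date": date(2026, 1, 10)}, {"id": 2, "date": date(2026, 4, 1)}]
    mature, immature = split_by_label_maturity(txns, today)
    print(label_maturity_cutoff(today), [t["id"] for t in mature], [t["id"] for t in immature])
```

Transactions in the immature set are the ones that need the faster proxy labels.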

The standard mitigations: (a) proxy labels — manual analyst review on flagged transactions provides a faster (hours-to-days) label, used to augment the chargeback feed; (b) account-takeover signals — sudden change in user behavior often precedes a chargeback by days, so a behavior-shift detector creates a fast proxy label; (c) chargeback rate as a monitoring signal — even before per-transaction labels arrive, aggregate chargeback rate trends can detect a fraud pattern shift and trigger model retraining; (d) shadow-mode model deployment for new model versions — run the new model alongside the old model for weeks, comparing decisions without blocking, before promoting to production.
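
Mitigation (c) can be sketched as a comparison of a recent chargeback rate against a trailing baseline. The window lengths, the 1.5x ratio, and the function names are hypothetical tuning choices, and the daily counts are assumed to be grouped by chargeback filing date rather than original transaction date.

```python
def chargeback_rate(days: list[tuple[int, int]]) -> float:
    """Aggregate rate over a list of (transactions, chargebacks) daily counts."""
    transactions = sum(t for t, _ in days)
    chargebacks = sum(c for _, c in days)
    return chargebacks / transactions if transactions else 0.0


def should_trigger_retrain(daily_counts: list[tuple[int, int]],
                           baseline_days: int = 28,
                           recent_days: int = 7,
                           ratio_threshold: float = 1.5) -> bool:
    """Flag a pattern shift when the recent rate exceeds the baseline by a ratio.

    `daily_counts` is ordered oldest first. The baseline window is the
    `baseline_days` immediately before the most recent `recent_days`.
    """
    if len(daily_counts) < baseline_days + recent_days:
        return False  # not enough history to compare
    recent = daily_counts[-recent_days:]
    baseline = daily_counts[-(baseline_days + recent_days):-recent_days]
    return chargeback_rate(recent) > ratio_threshold * chargeback_rate(baseline)


if __name__ == "__main__":
    steady = [(1000, 2)] * 35
    spiking = [(1000, 2)] * 28 + [(1000, 5)] * 7
    print(should_trigger_retrain(steady))   # False
    print(should_trigger_retrain(spiking))  # True
```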

Fourth deep dive: the human review queue. Borderline transactions (model score in the gray zone) route to human analysts. The queue is prioritized by expected loss — analysts spend their first hour on the highest-loss-potential items. The analyst's verdict (genuine vs. fraud) feeds back to the training pipeline as a high-quality label. Mention the analyst's tooling: the case viewer must show the event context, the historical activity for this user/device/IP, similar past cases, and the model's reason codes (which features drove the score). Without reason codes, analysts spend more time per case and the queue throughput drops.
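
Prioritizing the queue by expected loss maps naturally onto a heap. This is a minimal in-memory sketch; the `ReviewQueue` class, the case IDs, and the default `loss_given_fraud` of 1.0 are assumptions, and a production queue would be a durable, shared service.

```python
import heapq
import itertools


class ReviewQueue:
    """Priority queue that hands analysts the highest expected-loss case first."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seq = itertools.count()  # tie-breaker: earlier cases come first

    def push(self, case_id: str, p_fraud: float, amount: float,
             loss_given_fraud: float = 1.0) -> None:
        """Enqueue a case keyed by expected loss = risk x amount x loss_given_fraud."""
        loss = p_fraud * amount * loss_given_fraud
        # heapq is a min-heap, so negate the loss to pop the largest first.
        heapq.heappush(self._heap, (-loss, next(self._seq), case_id))

    def pop(self) -> tuple[str, float]:
        """Remove and return (case_id, expected_loss) for the top-priority case."""
        neg_loss, _, case_id = heapq.heappop(self._heap)
        return case_id, -neg_loss

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = ReviewQueue()
    queue.push("case-a", p_fraud=0.5, amount=100.0)
    queue.push("case-b", p_fraud=0.4, amount=5000.0)
    queue.push("case-c", p_fraud=0.6, amount=20.0)
    while queue:
        print(queue.pop()[0])  # case-b, then case-a, then case-c
```

Note that the highest-scoring case (`case-c`) is reviewed last here because its amount is small, which is the point of ranking by expected loss rather than raw score.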

Common mistakes

Treating fraud as a single ML model in isolation — production systems combine model + deterministic rules + human review, and the senior signal is in describing that combination
Skipping the streaming-vs-batch split for features — velocity features must be real-time, not nightly batch
Ignoring false-positive cost — blocking legitimate transactions is often more expensive than the fraud loss itself
Forgetting model calibration — a raw classifier rank without calibrated probabilities can't drive expected-loss thresholding
Skipping the chargeback-feedback delay — labels arrive months late, so model freshness can't rely on them alone

Likely follow-up questions

How would you detect a coordinated fraud ring where multiple accounts share devices, IPs, or shipping addresses?
What changes if regulatory rules require a human reason for every blocked transaction (explainability)?
How would you handle a sudden spike in fraud rate suggesting the model is missing a new attack vector?
How would you implement a 'challenge' step (step-up authentication) between allow and block?
How would you build a feature store that serves both real-time inference and offline training without skew?

Frequently asked questions

How long is a Design Fraud Detection interview?

60 minutes at fraud-focused or payments companies (PayPal, Block, Visa). Expect deep coverage of feature engineering + calibration + the chargeback feedback loop. At general companies it's often 45 minutes with less emphasis on the ML pipeline.

Do I need to know specific ML algorithms?

Naming the algorithm family (gradient-boosted trees for the primary scorer, optionally a deep model for borderline cases) is enough. Detailed hyperparameters or training-loop architecture is overkill for the system-design surface; that belongs in an ML-engineer interview.

What's the senior-bar topic?

Calibration and asymmetric cost. Specifically: explain that the model outputs a calibrated probability, the decision engine computes expected loss as risk × amount × loss-given-fraud, and the block threshold is set by the cost ratio. If you don't mention calibration, that's a partial-fail at senior.

Should I discuss graph-based fraud detection?

Mention it as a feature category (shared-device counts, shared-IP counts) without going deep on graph-neural-network architectures. The system-design surface is feature serving, not graph algorithms; keep it at the integration boundary unless the interviewer drills in.