Design a Payment Gateway — System Design Interview Guide
Design a payment gateway asks you to build the routing layer that sits between a merchant's checkout and the card networks: tokenize cards, route authorization requests through the right acquiring bank, handle 3D Secure step-up, and settle funds back to the merchant. The hard part is multi-network routing with deterministic retries and a tokenization vault that satisfies PCI scope.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Reported in interviews at
- PayPal
- Block
- Adyen
- Visa
- Mastercard
Sourced from Glassdoor, Levels.fyi, and Blind interview reports.
Functional requirements
- Accept a card from the merchant's checkout (raw PAN or pre-tokenized card)
- Tokenize the card into a network-scoped or vault-scoped token for reuse
- Route the authorization request to the appropriate acquiring bank based on card BIN, currency, and routing rules
- Handle 3D Secure (3DS) step-up authentication when required by the issuing bank or regional regulation
- Capture, refund, and reverse authorizations through the same routing path
- Settle authorized + captured funds to the merchant's bank on a daily schedule, less interchange and gateway fees
Non-functional requirements
- Authorization latency: <800ms p99 from merchant API call to authorization decision (3DS adds 2-30 seconds, user-driven)
- Availability: 99.99%+; every minute of downtime is direct merchant revenue loss
- Scale: ~100M+ transactions/day at large gateways, ~5K-10K TPS at peak (Black Friday, holiday surges)
- PCI DSS Level 1 compliance: card data isolated, encrypted, never logged in plaintext
- Deterministic retries: a network timeout to the acquiring bank must never produce a double-authorization
Capacity estimation
Anchor on public scale for the largest payment gateways: ~100M+ transactions/day = ~1200 TPS average, with peak load 5-10x average during Black Friday (~10K TPS at the platform level). Average authorization is a single round-trip to one acquiring bank (~300-500ms), but ~5-15% of transactions go through 3DS step-up which adds a user-interaction delay (2-30 seconds), and ~1-3% fall back to a secondary acquirer on first-attempt decline (the 'cascading' deep dive below).
Storage per transaction is ~3-5 KB (transaction row + routing decision log + network response payload + audit entries). Annual transaction storage: ~150 TB/year primary data + ~3-5x in network-response logs and audit trails for the dispute window (180 days minimum, often 540 days for regulatory recordkeeping).
Token vault: PCI-scoped, network-isolated. ~500M+ stored payment methods across all merchants × ~500 bytes per record (encrypted PAN + last4 + BIN + brand + tokenization metadata). Total ~250 GB: tiny in bytes, but every byte sits inside the PCI audit boundary. After the initial tokenization API call, plaintext PANs exist only in the vault and, transiently, in the acquirer connector that detokenizes at authorization time.
Routing table: per-merchant, per-BIN, per-currency rules deciding which acquirer to try first and what cascade order to use on decline. Typically 100K-1M rules across all merchants. Held in memory and refreshed on rule change (rules change slowly — minutes, not seconds).
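The estimates above can be reproduced with a few lines of arithmetic. This is a rough check using the figures quoted in this section; the 4 KB per-transaction size is an assumed midpoint of the 3-5 KB range, not a measured value.

```python
# Back-of-envelope capacity check using the figures quoted in this section.
SECONDS_PER_DAY = 86_400

tx_per_day = 100_000_000                  # ~100M transactions/day
avg_tps = tx_per_day / SECONDS_PER_DAY    # ~1,157, rounded to ~1,200 in the text
peak_tps_low, peak_tps_high = 5 * avg_tps, 10 * avg_tps

bytes_per_tx = 4_000                      # assumed midpoint of the 3-5 KB range
primary_tb_per_year = tx_per_day * 365 * bytes_per_tx / 1e12

stored_cards = 500_000_000                # ~500M stored payment methods
bytes_per_card = 500
vault_gb = stored_cards * bytes_per_card / 1e9

print(f"average TPS:      {avg_tps:,.0f}")
print(f"peak TPS (5-10x): {peak_tps_low:,.0f} to {peak_tps_high:,.0f}")
print(f"primary storage:  {primary_tb_per_year:,.0f} TB/year")
print(f"token vault:      {vault_gb:,.0f} GB")
```

The 10x peak comes out slightly above the ~10K TPS quoted because the text rounds; at this level of precision the two agree.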
The shape that matters: this is not a high-QPS problem. It's a routing + reliability problem. 10K TPS is small in modern infrastructure terms. The challenges are deterministic routing under acquirer outages, PCI isolation of the card vault, and exactly-once semantics across the authorization → capture → settlement lifecycle.
High-level design
Five core domains: card vault (tokenization), gateway API, routing engine, acquirer connectors, and settlement.
The card vault is a PCI-scoped service running in an isolated network segment. It accepts a raw PAN over a TLS endpoint, encrypts it with an HSM-backed key, stores the ciphertext in a dedicated isolated store, and returns an opaque token. Detokenization is allowed only from the acquirer-connector layer at authorization time, and every detokenization call is logged. All other services in the platform see only tokens, which shrinks the PCI audit scope to the vault's network segment plus the connectors that detokenize.
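A minimal in-memory sketch of the vault contract described above: opaque tokens, detokenization restricted to one caller, and an audit entry for every detokenization attempt. The class name, the caller label "acquirer-connector", and the `encrypt`/`decrypt` callables are all hypothetical; the callables stand in for HSM-backed operations, and the byte-reversal used in the demo is a placeholder, not encryption.

```python
import secrets
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CardVault:
    """Toy vault. `encrypt` and `decrypt` stand in for HSM-backed calls."""
    encrypt: Callable[[str], bytes]
    decrypt: Callable[[bytes], str]
    allowed_callers: frozenset = frozenset({"acquirer-connector"})
    _store: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def tokenize(self, pan: str) -> str:
        # The token is random, so it cannot be reversed into the PAN.
        token = "tok_" + secrets.token_urlsafe(16)
        self._store[token] = {
            "ciphertext": self.encrypt(pan),
            "bin": pan[:6],
            "last4": pan[-4:],
        }
        return token

    def detokenize(self, token: str, caller: str) -> str:
        allowed = caller in self.allowed_callers
        # Every attempt is logged, including denied ones.
        self.audit_log.append((caller, token, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{caller} may not detokenize")
        return self.decrypt(self._store[token]["ciphertext"])


# Demo only: byte reversal is a placeholder transform, NOT encryption.
vault = CardVault(
    encrypt=lambda pan: pan.encode()[::-1],
    decrypt=lambda blob: blob[::-1].decode(),
)
token = vault.tokenize("4111111111111111")
```

In this sketch a call such as `vault.detokenize(token, "gateway-api")` raises `PermissionError` and still leaves an audit entry, which is the property the paragraph above relies on.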
The gateway API is the merchant-facing surface. It accepts an authorization request, validates the input, looks up the token (or accepts a raw card and tokenizes inline), and forwards an enriched request to the routing engine. Every request carries a merchant-supplied idempotency key; duplicate requests within the idempotency TTL return the original result without a second authorization.
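The idempotency behavior can be sketched as a small store keyed by merchant and idempotency key. This is a single-process illustration with hypothetical names (`IdempotencyStore`, `run_once`); a production gateway would need an atomic insert (for example a unique constraint on the key) so that two concurrent duplicates cannot both reach the acquirer.

```python
import time
from typing import Callable


class IdempotencyStore:
    """Returns the stored result for a repeated key within the TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._results = {}  # (merchant_id, key) -> (stored_at, result)

    def run_once(self, merchant_id: str, key: str, operation: Callable[[], dict]) -> dict:
        now = self.clock()
        entry = self._results.get((merchant_id, key))
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # duplicate request: no second authorization
        result = operation()
        self._results[(merchant_id, key)] = (now, result)
        return result


calls = []


def authorize() -> dict:
    # Stand-in for the real authorization path.
    calls.append(1)
    return {"status": "AUTHORIZED", "auth_id": f"auth_{len(calls)}"}


store = IdempotencyStore(ttl_seconds=24 * 3600)
first = store.run_once("m_1", "order-42", authorize)
second = store.run_once("m_1", "order-42", authorize)
print(first == second, len(calls))
```

Scoping the key by merchant is an assumption made here so that two merchants can reuse the same key string without colliding.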
The routing engine decides which acquiring bank to send the transaction to. The decision uses card BIN (which determines the issuing bank and network), transaction currency, merchant routing rules, and recent acquirer health data. A merchant might have two acquirers per region (one primary, one fallback) and the engine picks based on cost, approval-rate history, and acquirer uptime.
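One way to picture the routing decision is a rule lookup followed by a health filter. The sketch below is deliberately narrow: it assumes the most specific BIN prefix wins and that each rule carries its own cascade order, and it leaves out cost and approval-rate scoring. `RoutingRule` and `pick_route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    merchant_id: str
    bin_prefix: str    # "" matches any BIN
    currency: str
    acquirers: tuple   # cascade order, primary first


def pick_route(rules, healthy, merchant_id, card_bin, currency):
    """Return the cascade order for a transaction, skipping unhealthy acquirers."""
    candidates = [
        r for r in rules
        if r.merchant_id == merchant_id
        and r.currency == currency
        and card_bin.startswith(r.bin_prefix)
    ]
    if not candidates:
        return []
    # Assumption: the rule with the longest matching BIN prefix wins.
    best = max(candidates, key=lambda r: len(r.bin_prefix))
    return [a for a in best.acquirers if a in healthy]


rules = [
    RoutingRule("m_1", "", "EUR", ("acq_a", "acq_b")),
    RoutingRule("m_1", "411111", "EUR", ("acq_b", "acq_a")),
]
all_up = {"acq_a", "acq_b"}
specific = pick_route(rules, all_up, "m_1", "411111", "EUR")     # BIN-specific rule
fallback = pick_route(rules, all_up, "m_1", "550000", "EUR")     # catch-all rule
degraded = pick_route(rules, {"acq_a"}, "m_1", "411111", "EUR")  # acq_b is down
print(specific, fallback, degraded)
```

Returning an ordered list rather than a single acquirer keeps the cascade order next to the routing decision, so the retry path does not have to repeat the lookup.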
Acquirer connectors are protocol adapters. Each acquiring bank speaks a slightly different dialect (often ISO 8583 over a persistent TCP socket, sometimes a REST API). The connector translates the platform's internal authorization message into the acquirer's wire format, opens or reuses a persistent connection, sends the message, waits for the response with a strict deadline (typically 8-15 seconds — card networks are slow), and translates the response back. Connectors are the only systems that see the detokenized PAN.
Settlement is a daily batch process. It aggregates captured authorizations per merchant, deducts gateway fees and interchange, and initiates a bank-transfer payout via an ACH or wire connector. A separate reconciliation job compares the gateway's internal ledger against the acquirer's daily settlement report — every mismatch surfaces as an alert for the operations team.
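The reconciliation job reduces to a three-way diff between the internal ledger and the acquirer's report. The sketch below assumes both sides can be keyed by a shared transaction id and carry captured amounts in minor units (cents); the function name and result keys are illustrative.

```python
def reconcile(ledger: dict, acquirer_report: dict) -> dict:
    """Compare captured amounts (minor units) keyed by transaction id."""
    ledger_ids = ledger.keys()
    report_ids = acquirer_report.keys()
    return {
        # We captured it, the acquirer never reported it.
        "missing_at_acquirer": sorted(ledger_ids - report_ids),
        # The acquirer settled something we have no record of.
        "missing_in_ledger": sorted(report_ids - ledger_ids),
        # Both sides know the transaction but disagree on the amount.
        "amount_mismatch": sorted(
            tx for tx in ledger_ids & report_ids
            if ledger[tx] != acquirer_report[tx]
        ),
    }


ledger = {"tx_1": 1000, "tx_2": 2500, "tx_3": 400}
report = {"tx_1": 1000, "tx_2": 2400, "tx_4": 900}
mismatches = reconcile(ledger, report)
print(mismatches)
```

Any non-empty list in the result would become an alert for the operations team; amounts are kept as integers so that comparisons are exact.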
The 3D Secure flow is a step-up authentication. When the issuing bank requires 3DS (driven by regional regulation like PSD2 SCA or by issuer risk policy), the authorization response carries a redirect URL. The merchant redirects the cardholder's browser to the issuing bank's 3DS challenge page; the bank authenticates the user (SMS code, biometric prompt in the bank app); the bank posts back a 3DS authentication result; the gateway re-submits the authorization with the 3DS proof attached. This is a multi-step asynchronous flow that complicates the idempotency surface; the 3DS state machine deep dive covers it.
Deep dive — the hard problem
Four deep dives: routing cascade with deterministic retries, the 3DS asynchronous authorization state machine, PCI scope minimization, and acquirer health monitoring.
Routing cascade — when the primary acquirer declines a transaction, the gateway can retry through a secondary acquirer (with merchant opt-in). Cascading lifts approval rates by 1-3% on average. But it introduces correctness hazards: every retry is a separate authorization request to a separate acquirer, each of which can succeed, decline, or time out. The gateway must guarantee that at most one acquirer holds an active authorization for the same logical transaction.
The mechanism is a transaction state machine with a single authoritative row per logical transaction (keyed by the merchant's idempotency key) and a child collection of attempt rows (one per acquirer attempted). The parent row's status is the rollup: SUCCESS if any attempt succeeded, DECLINED if all attempts declined, ERROR if any attempt is in an ambiguous state. Critically, before retrying through a secondary acquirer on a primary timeout, the gateway must first run a status-check against the primary to disambiguate the timeout — if the primary actually authorized successfully, retrying would create a duplicate authorization on the cardholder's account. The disambiguation call may take seconds; during that window the parent transaction is in a deferred state and the merchant sees 'processing' rather than an immediate answer.
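The parent/attempt rollup and the status-check-before-retry rule can be sketched as follows. The names are hypothetical, and `send` and `check_status` stand in for acquirer connector calls. One assumption is made explicit: when an attempt is still ambiguous after the status check, the sketch stops cascading and lets ERROR take precedence in the rollup, since that is the conservative reading of the rule above.

```python
from enum import Enum


class Attempt(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    TIMEOUT = "timeout"   # ambiguous: the acquirer may or may not have authorized


def rollup(outcomes) -> str:
    """Parent status derived from the child attempt outcomes."""
    if any(o is Attempt.TIMEOUT for o in outcomes):
        return "ERROR"     # assumption: ambiguity outranks everything else
    if any(o is Attempt.APPROVED for o in outcomes):
        return "SUCCESS"
    return "DECLINED"


def authorize_with_cascade(acquirers, send, check_status):
    attempts = []
    for acquirer in acquirers:
        outcome = send(acquirer)
        if outcome is Attempt.TIMEOUT:
            # Disambiguate before touching the next acquirer.
            outcome = check_status(acquirer)
        attempts.append((acquirer, outcome))
        if outcome is not Attempt.DECLINED:
            break  # approved, or still ambiguous: never cascade past either
    return rollup([o for _, o in attempts]), attempts


sent = []


def send(acquirer):
    sent.append(acquirer)
    return {"primary": Attempt.TIMEOUT, "secondary": Attempt.APPROVED}[acquirer]


# The primary timed out, but the status check shows it actually approved.
status, attempts = authorize_with_cascade(
    ["primary", "secondary"], send, check_status=lambda a: Attempt.APPROVED
)
print(status, sent)
```

In the demo the secondary acquirer is never contacted, so the cardholder ends up with exactly one authorization even though the first response was lost.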
The duplicate-authorization math under cascading: if every timeout were treated as a decline and immediately retried, ~1% of timeouts would result in duplicate auths (timeouts that were actually successes on the network). At 10K TPS and a 5% timeout rate, that's 500 timeouts/sec and ~5 duplicate auths/sec, or roughly 430K duplicates per day if peak load were sustained. That is unacceptable. With status-check-before-retry, duplicates drop to near-zero, at the cost of slower failure responses on timeout.
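Multiplying the stated rates through gives the duplicate rate directly. The three inputs are the figures quoted in this paragraph; the per-day number assumes peak load sustained for a full day, which overstates a real day but shows the order of magnitude.

```python
peak_tps = 10_000            # transactions per second at peak
timeout_rate = 0.05          # share of authorizations that time out
false_timeout_rate = 0.01    # share of timeouts that actually succeeded

timeouts_per_sec = peak_tps * timeout_rate
duplicates_per_sec = timeouts_per_sec * false_timeout_rate
duplicates_per_day = duplicates_per_sec * 86_400   # if peak were sustained all day

print(f"timeouts/sec:   {timeouts_per_sec:,.0f}")
print(f"duplicates/sec: {duplicates_per_sec:,.0f}")
print(f"duplicates/day: {duplicates_per_day:,.0f}")
```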
3DS state machine: the challenge introduces a multi-step async flow.
- Step 1: gateway submits the authorization; acquirer responds 'requires 3DS challenge'; gateway returns a redirect URL to the merchant.
- Step 2: cardholder is redirected to the issuing bank's challenge page; cardholder completes the challenge (or abandons).
- Step 3: bank posts back to a callback URL on the gateway with the authentication result.
- Step 4: gateway re-submits the authorization with the 3DS proof; acquirer authorizes (or declines if the auth fails for other reasons).
The gateway must hold transaction state across all four steps, with a timeout on step 3 (the cardholder might never complete the challenge; 10-20% of 3DS sessions are abandoned). The standard approach is a per-transaction TTL: if the 3DS callback doesn't arrive within ~15 minutes, the transaction transitions to ABANDONED and any held funds are released. The merchant gets a webhook on the terminal state.
Idempotency interacts with 3DS in a subtle way: if the merchant retries the original authorization with the same idempotency key during the 3DS window, the gateway must return the in-progress state ('requires 3DS challenge') with the same redirect URL — not initiate a new authorization. Production gateways treat 3DS-pending as a sticky state on the parent transaction row.
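The sticky 3DS-pending state and the challenge TTL can be sketched together. All names here are hypothetical, the redirect URL uses a reserved example domain, and the sketch assumes every new transaction is asked for a challenge. Step 4 (re-submitting to the acquirer with the 3DS proof) is collapsed into a state change.

```python
from dataclasses import dataclass

THREE_DS_TTL_SECONDS = 15 * 60   # the ~15 minute window from the text


@dataclass
class Transaction:
    idempotency_key: str
    state: str
    redirect_url: str
    challenge_started_at: float


class ThreeDSFlow:
    def __init__(self):
        self.transactions = {}   # idempotency key -> parent transaction row

    def authorize(self, key: str, now: float) -> Transaction:
        tx = self.transactions.get(key)
        if tx is None:
            # Step 1: assume the acquirer answered 'requires 3DS challenge'.
            url = f"https://issuer.example/challenge/{key}"
            tx = Transaction(key, "REQUIRES_3DS", url, now)
            self.transactions[key] = tx
        else:
            self._expire_if_stale(tx, now)
        # A retry gets the same row back, hence the same redirect URL.
        return tx

    def on_callback(self, key: str, authenticated: bool, now: float) -> Transaction:
        tx = self.transactions[key]
        self._expire_if_stale(tx, now)
        if tx.state == "REQUIRES_3DS":
            tx.state = "AUTHORIZED" if authenticated else "DECLINED"
        return tx   # callbacks arriving after a terminal state change nothing

    @staticmethod
    def _expire_if_stale(tx: Transaction, now: float) -> None:
        if tx.state == "REQUIRES_3DS" and now - tx.challenge_started_at > THREE_DS_TTL_SECONDS:
            tx.state = "ABANDONED"


flow = ThreeDSFlow()
first = flow.authorize("order-42", now=0.0)
retry = flow.authorize("order-42", now=60.0)     # merchant retries mid-challenge
same_url = retry.redirect_url == first.redirect_url
done = flow.on_callback("order-42", authenticated=True, now=120.0)

flow.authorize("order-43", now=0.0)
late = flow.on_callback("order-43", authenticated=True, now=16 * 60.0)
print(same_url, done.state, late.state)
```

Expiry is evaluated lazily on access here; a real gateway would more likely also run a sweeper job so that held funds are released and the webhook fires even if nobody touches the transaction again.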
Third deep dive: PCI scope minimization. The vault is the only service that stores card data, and the connector layer is the only other place a plaintext PAN ever appears. All other services receive tokens, never PANs. The connector layer detokenizes at the last moment before serializing the wire message to the acquirer, and the detokenization result lives only in process memory for the duration of the request. Mention scope minimization explicitly: interviewers reward it because it's the controlling design constraint for the whole gateway.
Fourth: acquirer health monitoring. Each acquirer connector exports approval rate, latency, and error rate per minute. The routing engine reads these metrics on every routing decision and downweights acquirers that are failing. A flapping acquirer (uptime < 95%) is automatically excluded for a cooldown period. This is operational hygiene more than core design, but it's the difference between a gateway that survives an acquirer outage and one that takes the merchant down with it.
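A rolling-window health tracker with a cooldown is one simple way to realize the exclusion rule. The sketch uses the success ratio over the last N calls as a stand-in for "uptime", which is an assumption; the class name, window size, and cooldown length are illustrative.

```python
from collections import deque


class AcquirerHealth:
    """Excludes an acquirer for a cooldown when its recent success ratio drops."""

    def __init__(self, window: int = 100, min_success_ratio: float = 0.95,
                 cooldown_seconds: float = 300.0):
        self.outcomes = deque(maxlen=window)   # True for success, False for error
        self.min_success_ratio = min_success_ratio
        self.cooldown_seconds = cooldown_seconds
        self.excluded_until = 0.0

    def record(self, ok: bool, now: float) -> None:
        self.outcomes.append(ok)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and sum(self.outcomes) / len(self.outcomes) < self.min_success_ratio:
            self.excluded_until = now + self.cooldown_seconds
            self.outcomes.clear()   # start with a clean window after the cooldown

    def is_routable(self, now: float) -> bool:
        return now >= self.excluded_until


health = AcquirerHealth(window=20, cooldown_seconds=300.0)
for i in range(20):
    # Two failures out of twenty calls: a 90% success ratio, below the 95% floor.
    health.record(ok=(i % 10 != 0), now=float(i))

during_cooldown = health.is_routable(now=20.0)
after_cooldown = health.is_routable(now=319.0)
print(during_cooldown, after_cooldown)
```

Waiting for a full window before judging avoids excluding an acquirer on its first failed call; the routing engine would consult `is_routable` when filtering the cascade order.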
Common mistakes
- Putting raw card data anywhere outside the vault — breaks PCI scope and pollutes the audit boundary
- Treating timeouts as declines without a status-check call — produces duplicate authorizations under cascading
- Forgetting the 3DS sticky-state requirement — merchant retries during the challenge window create duplicate auths
- Designing routing as a static map rather than a dynamic decision — can't respond to acquirer outages or approval-rate drops
- Skipping reconciliation against the acquirer's daily settlement report — the only way to catch missed transactions
Likely follow-up questions
- How would you support multi-currency transactions where the cardholder's currency differs from the merchant's settlement currency?
- What changes if you need to support card-network tokenization (network tokens that survive PAN reissuance)?
- How would you implement smart routing that maximizes approval rate per BIN-country pair using historical data?
- How would you handle a primary acquirer's persistent connection going down mid-authorization?
- How would you support recurring billing with automatic token refresh when the underlying card is reissued?
Frequently asked questions
- How is a payment gateway different from a full payments platform?
- A payment gateway focuses on the routing layer between the merchant and the card networks: tokenization, BIN routing, 3DS step-up, acquirer-cascade retries. A full payments platform (the Design Online Payments scenario) covers the gateway plus a card vault, charge engine, webhook delivery, settlement, and the merchant-facing API surface. In practice the line is blurry — modern PSPs do both. For an interview, the gateway scenario is more focused on routing and the 3DS state machine.
- Do I need to know ISO 8583 in detail?
- No — naming 'card-network message format' as the integration boundary is enough. Specific MTI codes (0100, 0110) and field positions are bonus signal but not required unless this is a payments-platform-team interview.
- What's the most-asked deep dive?
- Acquirer-cascade with deterministic retries. Specifically: 'a primary-acquirer call times out; how do you decide whether to retry through the secondary?' If you don't mention the status-check-before-retry pattern, expect that to count against you at the senior level.
- Should I discuss interchange fees and pricing?
- Mention interchange as the fee structure that motivates smart routing (different acquirers and card brands carry different interchange rates). Drilling into specific interchange tables is overkill unless this is a pricing-team interview.