Why does the scheduler need to be a singleton?

Because the scheduler decides task starts, and two schedulers can both decide to start the same task. For non-idempotent tasks this is a correctness bug. Leader election via a consensus cluster gives exactly-one-scheduler with sub-30-second failover.

Do I need to discuss Airflow or other named tools?

No. The interview is about the architecture pattern (DAG-based scheduling, leader election, executor pool, dependency resolution). Naming specific products is neither expected nor rewarded — the patterns are the signal.

How long is the Design Distributed Task Scheduler interview?

45-60 minutes. Expect deep drills on leader election (Why a singleton? How does failover work?), dependency resolution (How do you avoid O(N) scans?), and at least one of (backfill, SLA monitoring, dynamic DAGs).

What is the most important concept for Design Distributed Task Scheduler?

The single-scheduler-leader pattern with metadata-store as source of truth. The interviewer is looking for whether the candidate identifies that the scheduler must be a singleton for correctness, then proposes a coordinator-based leader election that survives the scheduler crashing.

System Design Questions

Design Distributed Task Scheduler — System Design Interview Guide

Design Distributed Task Scheduler is a system-design interview that asks you to build a DAG-based workflow scheduler at scale: users define directed acyclic graphs of tasks with dependencies, the scheduler runs each task when its parents complete, and the system handles retries, backfills, SLAs, and resource allocation across a fleet of executor nodes. The hard part is the leader/coordinator role — exactly one node must decide which task runs next, but the system must keep working when that node dies.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Reported in interviews at

Amazon
Google
Meta
Netflix
LinkedIn

Sourced from Glassdoor, Levels.fyi, and Blind interview reports.

Functional requirements

Define a workflow as a DAG of tasks with parent-child dependencies and a cron-like schedule
Execute each task when all its parents have succeeded; handle parallel branches in the DAG
Retry failed tasks N times with configurable backoff; mark the task as failed after max retries
Backfill: trigger historical DAG runs for past dates that were missed or need re-execution
Per-task resource declaration (CPU, memory, GPU); the scheduler matches tasks to executors with capacity
Per-task SLA: alert if a task hasn't completed within configured time from its scheduled start

Non-functional requirements

Scale: 100K active DAGs, 10M task instances per day across all DAGs
Scheduling latency: <30 sec from 'all parents succeeded' to 'task dispatched to executor'
Availability: 99.9%; scheduler downtime delays runs but never loses task state
Durability: a task that has started executing must never be lost — either it finishes or it's marked failed and retried; never silent
Correctness: no task runs before its parents succeed; no task runs twice (idempotent retry semantics)
Multi-tenancy: isolate users so one user's DAG storm doesn't starve another user's schedule

Capacity estimation

Anchor: a large platform runs ~100K DAGs with ~100 tasks each on average → ~10M task definitions, ~10M task instances per day at typical schedules.

At 10M task instances/day across 86400 sec, that's ~120 task-starts/sec average, ~500/sec at the daily peak (midnight UTC is a common scheduler peak because many cron DAGs are scheduled at hour boundaries).

Task metadata per instance: dag_id, task_id, run_id, start_time, end_time, status, retry_count, parent_run_ids, resource_request, executor_id. ~500 bytes per row. 10M rows/day × 500 bytes = 5 GB/day of new metadata. 30-day retention = 150 GB total — fits in a single sharded relational store.

DAG definition storage: a typical DAG definition is a few KB of code/config. 100K DAGs × 5 KB = 500 MB — trivial. Stored versioned so historical runs can be reproduced with the DAG version they were authored under.

Executor pool: typical platform runs ~5K executor nodes, each handling ~20 concurrent tasks. Aggregate capacity ~100K concurrent tasks. At 500 starts/sec and average task duration ~5 min, expected in-flight tasks = 500 × 300 = 150K — slightly over capacity, so the queue grows briefly at peak.

State updates: each task transitions through ~5-10 status updates (queued → scheduled → running → succeeded). At 10M tasks/day × 8 updates avg = 80M state updates/day = ~1K updates/sec. Modest write rate.

Log and artifact storage: each task emits stdout/stderr (~10 KB-1 MB per task) and may produce artifacts (model files, reports). 10M tasks × 100 KB average = 1 TB/day. Stored in cheap object storage with task metadata pointing at the URL.

High-level design

Four components: a metadata store (DAG definitions, task state, run history), a scheduler (decides what runs next), an executor pool (runs the tasks), and a coordinator (resolves the single-scheduler-leader role).

Metadata store: a sharded relational store holds DAG definitions, task runs, and dependency edges. Sharded by dag_id so all runs for one DAG live on one shard; cross-DAG queries are rare. Indexes on (dag_id, scheduled_time) and (status, scheduled_time) for the scheduler's hot queries.

Scheduler: a single leader process that runs in a loop. Each iteration: query the metadata store for tasks where (status = 'pending' AND all parents have status = 'succeeded'). For each such task, transition status to 'scheduled' and emit a dispatch message to the executor pool. Schedule is per-DAG-run — when a DAG-run starts at its cron time, all root tasks of the DAG become pending; as they succeed, their children become eligible.

The single-leader constraint matters: if two schedulers both saw a task as ready and both dispatched, the task would run twice. The coordinator (a small consensus cluster running Raft-style election) elects one scheduler as leader; followers stand by. On leader failure, a follower wins the next election and takes over the same metadata store. Worst-case scheduling pause ~10-30 sec during election.

Executor pool: stateless workers that pull dispatch messages from a queue. On receiving a dispatch, the executor runs the task in a container or process, captures stdout/stderr, and reports status updates back to the metadata store. Heartbeats every 30 sec; if an executor stops heartbeating, the scheduler marks its in-flight tasks as failed and retries them (with retry_count incremented).

Resource matching: tasks declare CPU/memory/GPU requirements. The dispatch queue is keyed by resource class — separate queues for high-CPU, high-memory, GPU. Executors register their capacity and only pull from queues they can satisfy. Bin-packing happens at the dispatch step; the scheduler doesn't reserve resources globally because resource availability changes second to second.

Backfill: a separate API takes a date range and a DAG and creates historical runs for each scheduled time in the range. These historical runs flow through the same scheduler and executor pool but with lower priority (separate dispatch queue) so they don't starve live schedules.

The DAG definition itself is versioned. When a user pushes a new DAG version, the next scheduled run uses the new version; in-flight runs continue with the version they started on. Backfills can choose which version to use.

Deep dive — the hard problem

Three deep dives: the scheduler leader election, the dependency-resolution algorithm, and SLA monitoring.

Scheduler leader election — why exactly one scheduler.

The scheduler is the authority on 'which task starts next.' If two schedulers both decide to dispatch task T, both dispatches reach the executor pool, both executors run T. For idempotent tasks this is recoverable (the second run sees results from the first and exits cleanly); for non-idempotent tasks (writes to a database, sending emails, billing operations) it's a bug.

The coordinator (a consensus cluster — 3 or 5 nodes running Raft) holds a leader lease. The scheduler that holds the lease runs the dispatch loop; others wait. Lease is renewed every few seconds; if the leader dies, the lease expires and a follower acquires it within ~10-30 sec.

Key property: when the leader transitions, the new leader reads metadata-store state to determine what to dispatch next. The metadata store, not the leader's memory, is the source of truth — so leader failover doesn't lose work in progress.

Fencing: when a new leader takes over, it bumps an epoch counter. Old leader's dispatch messages tagged with the previous epoch are rejected by executors. This prevents a split-brain where a network-partitioned old leader thinks it's still leader and double-dispatches.

Dependency resolution — naively, the scheduler queries 'tasks where all parents have succeeded' every iteration. At 10M tasks the query is expensive.

Production pattern: maintain a 'ready set' of tasks whose parents all completed. Updated incrementally — when a task succeeds, look up its children and check if all their parents are now done; if so, add to the ready set. The scheduler's dispatch loop just pulls from the ready set.

The incremental update is event-driven: each task status change triggers a downstream-check. Reduces dependency resolution to O(num_children) per task completion instead of O(num_tasks) per scheduler tick.

For very large DAGs (10K+ tasks in one DAG), the ready-set update is still bounded by the number of children of the completed task — typically small. The DAG structure makes this efficient.

SLA monitoring — each task can declare an SLA: 'should complete within 1 hour of scheduled start.' A separate SLA monitor process scans for tasks where now > scheduled_start + sla_duration and status is not in (succeeded, failed). For each, emit an alert.

Key design choice: the SLA monitor is its own process, not bundled in the scheduler. Scheduler is on the hot path of dispatching; SLA monitoring is best-effort alerting. Decoupling them means SLA alerts don't slow scheduling and scheduler crashes don't suppress alerts.

Fourth tradeoff: dynamic DAGs. Some workflows define their child tasks at runtime (e.g., 'spawn one child task per file in the input directory; the file count is only known when the parent task runs'). Support requires the task to emit a 'expand into N children' message back to the scheduler, and the scheduler to insert those children into the DAG at runtime. Adds significant complexity — most platforms support a limited form (mapped tasks with a known list of children at DAG-parse time, vs fully dynamic).

Fifth tradeoff: priority and starvation. A long-tail of low-priority backfills can starve a small high-priority workload if the dispatcher is FIFO. Production schedulers use weighted-fair-queueing across DAGs or priority classes — each user/team gets a guaranteed slice of executor capacity. Mention this for multi-tenant scale.

Common mistakes

Running multiple schedulers concurrently without leader election — double-dispatches non-idempotent tasks
Polling the metadata store on every tick for 'tasks where all parents succeeded' — O(N) scan that doesn't scale; need incremental ready-set updates
Bundling SLA monitoring into the scheduler — slows the hot path and tightly couples scheduling to alerting
Treating the executor pool as a single homogeneous queue — heterogeneous resource demands (CPU vs GPU vs memory) need per-class queues
Ignoring backfill priority — without separate queues, a 90-day backfill starves the live schedule

Likely follow-up questions

How would you support sub-DAGs where a task at runtime expands into a child DAG that the scheduler must trace through?
What changes if the total number of DAGs grows to 1M with 10K tasks each (10B task instances/day)?
How would you implement 'DAG-level concurrency limits' (e.g. at most 2 runs of this DAG executing at once)?
How would you support a 'pause this DAG' operation that lets in-flight tasks finish but holds new scheduled runs?
How would your design handle a task that produces external side effects and must not be retried without manual approval?
How would you implement cross-DAG dependencies where a task in DAG A waits for a task in DAG B to succeed?

Related system design scenarios

Frequently asked questions

Why does the scheduler need to be a singleton?: Because the scheduler decides task starts, and two schedulers can both decide to start the same task. For non-idempotent tasks this is a correctness bug. Leader election via a consensus cluster gives exactly-one-scheduler with sub-30-second failover.
Do I need to discuss Airflow or other named tools?: No. The interview is about the architecture pattern (DAG-based scheduling, leader election, executor pool, dependency resolution). Naming specific products is neither expected nor rewarded — the patterns are the signal.
How long is the Design Distributed Task Scheduler interview?: 45-60 minutes. Expect deep drills on leader election (Why a singleton? How does failover work?), dependency resolution (How do you avoid O(N) scans?), and at least one of (backfill, SLA monitoring, dynamic DAGs).
What is the most important concept for Design Distributed Task Scheduler?: The single-scheduler-leader pattern with metadata-store as source of truth. The interviewer is looking for whether the candidate identifies that the scheduler must be a singleton for correctness, then proposes a coordinator-based leader election that survives the scheduler crashing.