Skip to main content

Data Engineer Interview Questions for 2026: 40+ Questions Across SQL, Spark, Airflow, dbt + the New-Grad Pivot Guide

Data engineer roles still hire entry-level talent in 2026 while SWE pipelines have tightened. If you're a CS new grad considering the pivot, the DE interview tests SQL depth, ETL design, and distributed-systems intuition, with less LeetCode pressure than a pure SWE loop. This guide covers the 40+ questions you'll see, the tool-by-tool deep dives, and the new-grad pivot strategy.

By Alex Chen, Founder, InterviewChamp.AI · Last updated

34 min read

What does a data engineer interview test for in 2026?

A data engineer interview in 2026 tests four things: SQL fluency at the window-function-and-CTE level, ETL pipeline design with awareness of idempotency and backfills, distributed-systems intuition (partitioning, eventual consistency, the read-vs-write tradeoff), and at least one tool-specific deep dive on Spark, Airflow, dbt, or a cloud warehouse like Snowflake or BigQuery.

The LeetCode bar is meaningfully lower than a pure SWE loop. Most data engineering interviews substitute one or two algorithm rounds for a SQL whiteboard round and a system-design round focused on data flow rather than service architecture. That's both the opportunity and the trap for CS new grads pivoting in: less algorithm grinding, but a much larger surface area of tools and vocabulary to master.

The format varies by company size. A 4-5 round loop at large employers, sometimes a single onsite at mid-market, often a take-home plus one technical round at early-stage startups. The four axes above hold across nearly every tier. Master those, walk into any data engineering loop, hold your own.

Should CS new grads pivot from SWE to data engineering?

For most CS new grads who have been applying since spring 2025 with no offers, the honest answer in 2026 is: yes, it's worth a serious look. The data engineering pipeline is meaningfully less saturated than the pure SWE pipeline at the entry level. Employers reporting on r/cscareerquestions through Q1 2026 consistently note fewer applicants per opening at the entry-level data engineer role compared to the same companies' SWE postings. Class-of-2025 graduates pivoting into data engineering in late 2025 reported faster-than-average time-to-offer compared to the same cohort's SWE search.

But "less saturated" doesn't mean "easier." It means "different." Three honest reads before committing to the pivot:

Read 1: Data engineering isn't easier than SWE. It's different. The LeetCode bar is lower. The SQL bar is much higher. The system-design bar is comparable but oriented around data flow rather than service architecture. The vocabulary surface is wider. You need to know what a partition key is, what idempotency means, why a shuffle is expensive, what a backfill looks like in production. CS new grads who pivot expecting an easier interview burn out in week 2 when they realize they need to study a new stack from scratch.

Read 2: Your CS coursework helps more than you think. If you took a database course, the SQL is partly there. You'll need to refresh window functions and query optimization, but the foundations are intact. If you took a distributed-systems or networking course, the system-thinking transfers directly to data engineering's distributed-systems vocabulary (CAP theorem, eventual consistency, partitioning). If you wrote one Python backend in coursework, you have enough Python for the data engineering bar. The pivot is shorter than it looks for a CS grad, typically 30-45 days of focused prep, compared to a non-CS career changer.

Read 3: The pivot is honest only if you actually study the material. What's not honest: rewriting a SWE-only resume to claim production data engineering experience you don't have. Recruiters spot this within one phone screen. What is honest: positioning yourself as a CS grad with strong SQL from coursework and a Python background who has been studying data engineering and built two projects in Spark and Airflow to learn the stack. The pivot story works when the projects exist. It doesn't work when the resume claims experience the candidate can't talk about.

If those three reads still land favorably, the pivot is worth 30 days of focused prep. The rest of this guide walks the questions you'll face, tool by tool, plus a week-by-week study plan.

The 40+ data engineer interview questions you should rehearse

The rotation at the entry level is wider than a SWE loop because the tool surface is wider. The questions below cover what shows up at most companies in the 2026 cycle, organized by axis. Each question carries a brief sample-answer outline. Not a full answer, but the structure of a passing answer at the new-grad bar.

SQL interview questions for data engineers (10 questions)

SQL is the area where prep returns the most. It shows up in nearly every loop: the phone screen, the onsite whiteboard, and often the system-design follow-up. Master these ten.

1. Find the second-highest salary in a table without using LIMIT or TOP. Sample answer: Use a window function. SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rank FROM employees) ranked WHERE rank = 2. The interviewer is checking that you know window functions and that you handle ties correctly. DENSE_RANK over ROW_NUMBER so two employees tied at the top both rank 1 and the next rank is 2.

2. For each department, find the employee with the highest salary. Sample answer: Use RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) in a CTE, then filter to rank = 1. The trap is reaching for a GROUP BY department_id, MAX(salary) and then trying to join back. That works but reads as a junior pattern compared to the partition-by approach.

3. Calculate the running total of revenue by day. Sample answer: SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). The follow-up: same calculation but grouped by customer. Add PARTITION BY customer_id to the window. Knowing the difference between ROWS and RANGE is a senior signal.

4. Find customers who placed orders in three consecutive months. Sample answer: Multi-CTE with LAG() or self-joins. Compute the previous month's customer set, the month before that, and the current month, then intersect. The window-function version reads cleaner: LAG(month, 1) OVER (PARTITION BY customer_id) and LAG(month, 2) OVER (PARTITION BY customer_id), then filter where the difference is exactly one month.

5. Read this EXPLAIN plan and name two ways to speed it up. Sample answer: Look for full table scans, missing indexes, expensive joins (especially nested-loop joins on large tables), and unnecessary sorts. Common speedups: add an index on the filter column, rewrite a correlated subquery as a join, partition the table by the most common filter column. The interviewer is checking that you've actually read execution plans before, not just written SQL.

6. Write a query to find duplicates in a table. Sample answer: SELECT col1, col2, COUNT(*) FROM table GROUP BY col1, col2 HAVING COUNT(*) > 1. The follow-up: now delete the duplicates while keeping one row. Use a CTE with ROW_NUMBER() OVER (PARTITION BY duplicate_columns ORDER BY id) and delete rows where the row number is greater than 1.

7. Pivot a long-format table into a wide format. Sample answer: Use CASE WHEN expressions inside aggregate functions, one per pivoted column. Modern databases also support PIVOT syntax, but the CASE pattern is universal. Be ready to discuss when pivoting belongs in SQL vs in the BI layer.

8. Find the median value of a column. Sample answer: Use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column) if the database supports it. If not, use a window function to rank rows and pick the middle one (or average of the two middle ones for even counts). The median question screens for awareness that SQL doesn't have a built-in MEDIAN aggregate in most databases.

9. Optimize this slow query: a join between a large fact table and a large dimension table. Sample answer: Three common speedups: (a) ensure both join keys are indexed, (b) reduce the dimension table to only the columns and rows actually needed before joining, (c) consider a broadcast hint if your database supports it and the dimension is small enough. At the warehouse level, partitioning the fact table by the join key dramatically helps.

10. What's the difference between WHERE and HAVING? Sample answer: WHERE filters rows before grouping. HAVING filters groups after grouping. You can use aggregates in HAVING (e.g., HAVING COUNT(*) > 10) but not in WHERE. Modern query planners often optimize across both, but knowing the semantic difference is a basic competency check.

ETL design interview questions (8 questions)

ETL design questions probe whether you can think about pipelines as systems rather than scripts. Idempotency, backfills, schema evolution, and dead-letter handling come up in nearly every loop.

1. Design a daily pipeline that ingests yesterday's clickstream and loads it into the warehouse. Sample answer: A scheduled job (Airflow daily DAG) that pulls files dropped into object storage at midnight, parses and validates them, lands raw rows in a staging table, then transforms and merges into the production table. State the schedule, state the idempotency key (date + source-file hash), state what happens on failure (retry with exponential backoff, then dead-letter). One stated tradeoff: batch vs micro-batch. Batch is simpler at this scale. Micro-batch only if downstream consumers need fresh data.

2. The pipeline failed for three days. How do you backfill? Sample answer: Identify the missed date range, ensure the pipeline is idempotent (running it twice produces the same result), then trigger backfills one date at a time or in parallel depending on downstream constraints. State two risks: (a) schema may have evolved between then and now, so the historical data needs to be transformed under the old schema then forward-migrated, (b) resource cost. Running three days in parallel may overwhelm the warehouse.

3. Why does your ETL task need to be idempotent? Sample answer: Because failures happen, and failure recovery often means re-running the same task. If the task isn't idempotent, retrying produces duplicate rows or corrupt state. The standard pattern: include a deterministic key (date, source file, batch ID) and use upsert (MERGE in most warehouses) rather than insert.

4. How do you handle schema evolution upstream? Sample answer: Three layers of defense. (a) Schema validation at ingestion: reject or quarantine rows that don't match the expected schema. (b) Schema registry or contract: upstream producers register their schema, downstream consumers validate against the contract. (c) Schema-on-read at the warehouse: land the raw JSON or Parquet in the staging table, transform with an awareness that fields may be missing.

5. Design a pipeline that handles late-arriving data. Sample answer: Two strategies. (a) Event time + watermark: the pipeline keeps a window open for late-arriving events for some bounded period, then closes the window. (b) Re-run pattern: on a schedule (say, hourly), re-process the last N hours of data to absorb anything that arrived late. Most production pipelines use option (b) for simplicity. State the tradeoff: option (a) is precise but operationally complex. Option (b) is approximate but simple.

6. How do you handle a dead-letter queue? Sample answer: Bad messages (parse failures, schema mismatches, downstream rejections) get written to a dead-letter destination: a separate table, a separate file in object storage, or a separate Kafka topic. The dead-letter is monitored, and a separate process (often a daily alert) reviews and either fixes the upstream issue, manually replays the messages, or accepts the loss.

7. Design an incremental load instead of a full refresh. Sample answer: Track a high-water mark, the maximum value of updated_at (or last_modified_id) from the last successful run. On the next run, pull only rows with updated_at > previous_max. State two failure modes: (a) deleted rows aren't captured by updated_at so a separate delete-detection mechanism is needed, (b) clock skew between source and warehouse can cause missed rows. Use a small buffer (subtract a few minutes from the watermark).

8. What's the difference between a sensor and a regular task in Airflow? Sample answer: A sensor waits for an external condition to be met before allowing the downstream tasks to run. File landing in object storage, partition appearing in a warehouse, a value in a database row. Regular tasks just run on schedule. The trap is using a "poke" sensor that checks every 30 seconds and ties up a worker. Use "reschedule" mode for long-running sensors so the worker is freed between checks.

Distributed systems interview questions for data engineers (6 questions)

The data engineer system-design bar isn't about quorum protocols. It's about whether you understand the data-flow tradeoffs that come from running pipelines across many machines.

1. Explain CAP theorem in the context of a data warehouse. Sample answer: CAP says you can pick at most two of consistency, availability, and partition tolerance. In a distributed warehouse, partition tolerance is non-negotiable (the network can always fail), so the real choice is between consistency (every query sees the latest data) and availability (queries succeed even during partitions). Most cloud warehouses lean toward consistency for finalized data and toward availability for in-flight ingestion.

2. What's eventual consistency and when does it matter? Sample answer: Eventual consistency means writes propagate across replicas over time, so a read shortly after a write may return stale data. It matters in pipelines when downstream systems read from a replica that hasn't caught up. A freshly-written row may not be visible to the next task. Solutions: read-after-write consistency on the same connection, explicit waits, or designing the pipeline to be tolerant of small read-staleness windows.

3. Why is a shuffle expensive in Spark? Sample answer: A shuffle moves data across the network between partitions to satisfy operations like groupBy, join, or repartition. Network is the slowest layer in a distributed system. Two mitigations: (a) broadcast small tables to avoid shuffling them across joins, (b) pre-partition data by the join key so the shuffle is avoided entirely. State that wide transformations cause shuffles. Narrow transformations don't.

4. How would you partition a 100TB fact table? Sample answer: Partition by date (the most common access pattern is "last N days of data") at the day or month level. Sub-partition by a high-cardinality column if necessary. State the tradeoff: too-fine partitioning creates many small files (slow to scan), too-coarse partitioning misses pruning benefits. Most warehouses use 100MB-1GB per partition as a sweet spot.

5. What's a skewed key and how do you handle it? Sample answer: A skewed key is one value that appears far more often than others in a partitioned dataset. For example, 90% of events tagged with user_id = 'null' after upstream bug. In a join or groupBy, the partition holding the skewed key gets all the work. The others finish in seconds while the skewed one runs for hours. Two fixes: (a) salt the skewed key (append a random suffix, aggregate, then un-salt), (b) handle the skewed key in a separate code path.

6. Why is idempotency a distributed-systems concept, not just an ETL concept? Sample answer: Because in a distributed system, the same message can be delivered more than once. Network partitions, retry logic, at-least-once delivery semantics. Any consumer of a distributed message stream must be idempotent or risk duplicates. The data-engineering version of this concept is most visible in pipeline retries, but the root concept is from distributed systems.

Tool-specific: Spark interview questions (5 questions)

Spark is the highest-cost tool to learn from scratch but also the most-asked at large employers. Five questions cover most of the entry-level bar.

1. What's the difference between a narrow and a wide transformation? Sample answer: Narrow transformations (filter, map, flatMap, union) operate on a single partition. No data moves across the network. Wide transformations (groupBy, join, distinct, repartition) move data across partitions to reorganize by key. These trigger a shuffle. Knowing which transformations cause shuffles is the basic Spark performance vocabulary.

2. Explain lazy evaluation in Spark. Sample answer: Spark builds a DAG of transformations but doesn't execute anything until an action (collect, count, write, show) is called. This lets the optimizer reorder and combine transformations. For example, pushing a filter down before a join. The trap for new grads: writing df.filter(...) in a notebook and being surprised that nothing runs until they call show.

3. When would you use a broadcast join? Sample answer: When one side of the join is small enough to fit in memory on every executor (default threshold ~10MB but configurable). Broadcasting avoids the shuffle of the large table. Be ready to discuss the tradeoff: broadcasting uses more memory per executor but eliminates network I/O.

4. What's the difference between RDDs, DataFrames, and Datasets? Sample answer: RDDs are the low-level abstraction. Typed, unstructured, no optimizer. DataFrames are higher-level. Schema-aware, optimized by Catalyst (Spark's query planner). Datasets are typed DataFrames (Scala/Java only). In 2026, almost everyone uses DataFrames. RDD direct usage is rare outside legacy code or low-level performance tuning.

5. Your Spark job is slow. How do you debug it? Sample answer: Three steps. (a) Read the Spark UI: find the stage taking the most time, look for skewed task durations or large shuffle volumes. (b) Check partition counts: too few partitions under-use the cluster, too many create overhead. (c) Look for accidental wide transformations or collect() calls that pull data to the driver. State that "tuning executor memory" is the L4 senior answer. The new-grad answer is "I'd look at the UI and reduce shuffle volume first."

Tool-specific: Airflow interview questions (5 questions)

Airflow shows up in nearly every entry-level data engineer role posted in 2026. Five questions, sample answers below.

1. What's a DAG and what does it represent? Sample answer: A Directed Acyclic Graph. A workflow of tasks with dependencies, where each task runs once per scheduled interval. The "directed" part means each edge has a direction (task A runs before task B). The "acyclic" part means no cycles (you can't have A waiting on B that waits on A). In Airflow, a DAG is defined as a Python file in the dags folder.

2. Explain the difference between start_date, execution_date, and schedule_interval. Sample answer: start_date is when the DAG first becomes active. schedule_interval is the cadence: daily, hourly, cron expression. execution_date (now called logical_date in newer Airflow) is the start of the interval the run represents, not the wall-clock time the run actually started. The most common new-grad trap: assuming execution_date is "now" when in fact a daily DAG running at 1am on 2026-05-27 has execution_date = 2026-05-26 (the interval that just ended).

3. What's catchup and when should you turn it off? Sample answer: Catchup is Airflow's behavior of running every missed interval since start_date. Default is on. Turn it off (catchup=False) when you don't want historical runs to fire. For example, a pipeline that always processes "latest" data and shouldn't run hundreds of historical jobs the first time it's deployed.

4. How do you pass data between tasks in Airflow? Sample answer: Two options. XComs (cross-communications): small key-value pairs stored in Airflow's metadata database. Use for small values like dates, counts, or status flags. For large data, don't use XComs. Pass references (file paths, table names) and let downstream tasks read the data from object storage or the warehouse.

5. Design an Airflow DAG that ingests data from a REST API daily. Sample answer: One task fetches from the API for the relevant date (with the API's date parameter set to execution_date). One task validates the response. One task lands the data in object storage. One task triggers a warehouse load. State idempotency: re-running the same execution_date re-fetches and overwrites the same partition.

Tool-specific: dbt interview questions (4 questions)

dbt has become the dominant transformation tool in the modern data stack. Entry-level dbt questions stay close to the surface.

1. What's the difference between a model, a seed, and a snapshot in dbt? Sample answer: A model is a SQL file that defines a transformation. Runs against the warehouse to produce a table or view. A seed is a CSV file loaded as a static table, typically reference data like country codes. A snapshot captures slowly-changing dimensions over time. Useful for tracking historical changes to a row.

2. What's the difference between an incremental model and a table materialization? Sample answer: A table materialization rebuilds the entire table on every run. Simple but expensive at scale. An incremental model only processes new or changed rows since the last run, then merges into the existing table. Faster but requires correct unique_key configuration and is_incremental() logic in the model SQL.

3. How do you test data quality in dbt? Sample answer: Schema tests (uniqueness, not-null, accepted values, referential integrity) declared in YAML files alongside model definitions. Custom data tests for business-logic checks (e.g., "yesterday's revenue should be within 50% of the rolling average"). Tests run as part of dbt test and can fail the pipeline.

4. What does dbt run actually do? Sample answer: Reads all model files, builds a dependency DAG based on ref() calls between models, compiles each model's SQL with the right schema and database prefixes, then executes them in dependency order. State that dbt run only handles transformation. It doesn't ingest raw data. That's the job of an upstream ingestion tool (Fivetran, Airbyte, custom Python).

Tool-specific: Snowflake / Databricks / BigQuery interview questions (4 questions)

Cloud warehouse questions are mostly transferable across vendors at the entry level. The vocabulary differs but the concepts overlap.

1. What's a micro-partition (Snowflake) or a slot (BigQuery)? Sample answer: Both are units of storage and compute granularity. Snowflake's micro-partitions are immutable, columnar files (~50-500MB) automatically partitioned by ingestion order. BigQuery's slots are units of compute that process query work in parallel. Knowing the vocabulary of the target vendor is a quick credibility check.

2. How do you control cost in a cloud warehouse? Sample answer: Three levers. (a) Scope queries to the columns and partitions actually needed: full table scans on a 100TB table cost orders of magnitude more than a date-filtered query. (b) Right-size the compute: most vendors charge by warehouse size or slot count per second. (c) Cache intelligently: most warehouses cache recent query results. Structure dashboards to hit the cache where possible.

3. What's the difference between a materialized view and a regular view? Sample answer: A regular view is a saved query. Every read re-runs the underlying query against the source tables. A materialized view stores the result of the query as a physical table that's refreshed periodically (or incrementally). Materialized views trade storage cost and refresh latency for read performance.

4. Explain Snowflake's time travel. Sample answer: Snowflake retains historical versions of tables for a configurable retention window (default 1 day, up to 90 days on Enterprise). Time travel lets you query a table as of a past timestamp, restore a dropped table, or recover from accidental updates. Useful for debugging pipeline errors and for audit trails. Other warehouses have similar features under different names. BigQuery has table snapshots and time travel up to 7 days.

How to prepare for a data engineer interview (5-7 steps)

The 30-day plan in the schema attached to this article is built for a CS new grad with strong fundamentals (SQL from a database course, comfortable Python from coursework) and no production data engineering experience. Five steps, week by week:

Step 1, Week 1: SQL deep dive. 4 hours/day on window functions, CTEs, query optimization, and indexing. Use a free practice platform with real datasets. Write queries, run them, read execution plans. End the week able to handle the 10 SQL questions above without hesitation.

Step 2, Week 2: ETL design and Airflow. Read one short overview of Airflow scheduling. Install Airflow locally, build one toy DAG that pulls from a public API and lands data in SQLite. Study the 8 ETL design questions above.

Step 3, Week 3: Spark and distributed-systems fundamentals. Work through the Spark transformation model. Build one toy Spark job that aggregates a moderate-size CSV. Layer in distributed-systems vocabulary.

Step 4, Week 4: Tool-specific and behavioral. Pick the tool the target role lists most prominently, usually dbt or a specific cloud warehouse. Build a small portfolio project (3-4 dbt models, or a small notebook against a free BigQuery account). Spend the last two days on behavioral prep using the STAR framework.

Step 5: Run timed mock interviews in weeks 3 and 4. Three mocks minimum. One SQL whiteboard, one system design, one behavioral. AI mock interview tools are particularly effective for SQL and system design at the entry level. They catch vocabulary slips and freeze-points in real time. Add one human peer mock for pressure inoculation. See the mock interview practice guide for CS new grads for the full mock methodology. The four-mode approach (solo / peer / paid / AI) applies as much to DE prep as to SWE prep.

The plan compresses to 20 days for someone with prior data engineering exposure and stretches to 45 days for someone weaker on SQL. Don't shorten the SQL week. SQL fluency is the single most-tested skill for the entire interview loop.

Data engineer interview format by company tier

The interview format varies meaningfully by company size and stage. The table below maps what to expect at each tier so you can calibrate prep effort.

TierTypical loopOnsite durationLeetCode barTool depthCommon topics
FAANG / very-large techRecruiter screen, technical phone screen, 4-5 onsite rounds4-5 hoursOne algorithm round (low-medium difficulty)One deep dive on Spark or proprietary toolingURL shortener-style data pipeline, partitioning at scale, A/B test data architecture
Big Tech (Uber, Airbnb, DoorDash, etc.)Same as FAANG, sometimes slightly shorter4 hoursOne algorithm roundOne Spark or dbt deep diveReal-time pipelines, payments, ride-matching data flows
Mid-stage data team (Series C-D, named data-heavy startups)Recruiter, phone screen, take-home or onsite3-4 hoursRare or skipped entirelyHeavy on the actual stack: dbt, Snowflake, the company's own pipelinesProduct-specific pipelines, the team's actual problems
Series A-B startupRecruiter, technical phone screen, 1-2 onsite rounds2-3 hoursAlmost neverBroad, expected to be a generalistBuild the entire pipeline; how would you handle our exact use case
Enterprise / consultingProcess-heavy, sometimes multi-stage2-4 hoursNoneVendor-specific (often Informatica, Talend, or specific cloud vendor)Integration design, data governance, compliance
Government / contractor1-3 interviews, often video-based1-2 hoursNoneCloud-specific (often AWS or Azure depending on contract)Existing-system understanding, security clearance discussions

Two reads from the table:

First, the LeetCode bar is dramatically lower than a SWE loop. Below the FAANG/Big Tech tier, algorithm rounds are rare or skipped entirely. Don't over-grind LeetCode for a data engineering pivot. The marginal return below 100 problems is high. Above 200 problems it's vanishingly small unless you're explicitly targeting FAANG.

Second, the format below FAANG is more varied and more friendly to new grads. Mid-stage and startup data teams often replace the formal system-design round with a "tell me about a data project you built" conversation. If your pivot strategy is to land at a mid-market or startup data team first and grow from there, your prep target is the SQL + dbt + Airflow stack, not the FAANG-bar Spark deep dive.

If I'm being honest about where I'd put my 30 days: SQL week (skip the cleverness, drill the basics), one solid dbt portfolio repo on GitHub, one Airflow DAG that actually runs and handles a backfill. The Spark week pays off if you're chasing FAANG; otherwise it's optional and you can cover the basics in 2 days. Most mid-market hiring managers won't care if you've never run a Spark job.

What's the difference between data engineer / data scientist / data analyst interviews?

CS new grads exploring the data career space often conflate these three. The interview formats (and the day-to-day work) differ meaningfully. Calibrate which role you're targeting before committing to a prep plan.

AxisData EngineerData ScientistData Analyst
Primary skillPipeline design + toolingStatistics + MLSQL + business reasoning
SQL barHigh (window functions, optimization)Medium-high (analytical SQL)High (analytical SQL, dashboards)
Python barMedium (data-processing patterns)High (numpy, pandas, scikit-learn, sometimes PyTorch)Low-medium (pandas, basic ETL scripts)
Math / stats barLow (basic probability)High (statistics, regression, A/B tests)Medium (interpret stats, A/B test results)
ML barAlmost noneCore competencyAlmost none
LeetCode barLow (one round at FAANG; rare elsewhere)Low (one round at FAANG)None
ToolingSpark, Airflow, dbt, warehousesJupyter, scikit-learn, ML librariesTableau / Looker, SQL, sometimes dbt
System-design roundYes, data flow orientedSometimes, ML system designRare
Day-to-dayBuild and maintain pipelinesBuild models, run experimentsBuild dashboards, answer business questions
Entry-level salary range (US, 2026)$85K-$130K$95K-$140K$70K-$105K

The honest pivot recommendation for a CS new grad: data engineer if you enjoy systems thinking and tooling; data scientist if you took stats and ML courses and enjoyed them; data analyst if you want fast entry-level access at the cost of slower long-term growth. Most CS grads pivot best into data engineering. The system-thinking transfers most directly from CS coursework.

How to switch from software engineer to data engineer in 2026

The pivot story works when three things line up: a resume that signals data engineering interest, a portfolio that backs the signal, and a prep plan that closes the technical gap. Here's how each piece works for the specific case of an 11-months-applying CS new grad with one real internship.

Resume positioning. Don't rewrite SWE internships to claim data engineering experience you didn't have. Do reframe the language to highlight the data-engineering-adjacent work that was there. Original line: "Built backend services in Python and Node.js." Reframed line: "Built Python data-processing services for [X] API integration, handling [Y]M records per day, with error handling and idempotent retries." Both lines describe the same work. The second one signals data engineering vocabulary to the resume screener.

Add a data engineering project section. Two projects, each documented at a real-code level on GitHub:

  • A dbt + free-tier Snowflake project: 3-4 models, one schema test, deployed against a real warehouse. Document the data flow in the README. 6-10 hours of work.
  • An Airflow + Python project: one DAG that pulls from a public API daily, transforms in Python, lands in PostgreSQL or SQLite, with error handling and idempotent retries. Document the architecture in the README. 8-12 hours of work.

The portfolio is the credibility anchor. Recruiters who are skeptical of the pivot become believers when they see a real dbt repo and a real Airflow DAG.

Drop SWE-specific lines. If your resume currently lists 15 LeetCode-flavored projects (e.g., "Implemented Dijkstra's algorithm"), cut them. They don't signal data engineering and they crowd out the data engineering signal. Keep one or two algorithm-heavy projects if they have real data behind them (e.g., "Implemented A/B test result analyzer using statistical significance testing").

Update the LinkedIn headline. Current: "Software Engineer | Python, Java, JavaScript." Updated: "Data Engineer | Python, SQL, dbt, Airflow, Snowflake | CS new grad." The headline is what recruiters search for. If you don't have "data engineer" in there, you don't show up in their searches.

The hard delta: what's actually different. The technical gap from CS-new-grad-SWE to entry-level data engineer is real but small. Three weeks of focused SQL practice, one week of Airflow + dbt, one week of Spark fundamentals, and one good portfolio project closes most of it. The hard delta isn't technical. It's confidence. CS new grads who have been rejected from SWE roles for 11 months often arrive at data engineering interviews undersold and self-deprecating. Don't. The pivot is honest. The skills transfer. The hiring market is friendlier on this side. Walk in like you belong.

The honest answer on detection / verification. Some pivoters worry that recruiters will see the SWE-to-DE pivot as a red flag. They don't. Recruiters see cross-discipline pivots all the time and respect them when the portfolio backs the claim. The only way the pivot reads as a red flag is if the candidate can't talk about the dbt or Airflow projects on their resume. Build the projects. Be able to talk about them. The pivot works.

Data engineer interview prep: 30-day study plan for new grads

For the specific case of a CS new grad with strong SQL fundamentals and zero production data engineering experience, here's a day-by-day schedule that produces an interview-ready candidate in 30 days.

Week 1 (days 1-7): SQL deep dive

  • Day 1: Window functions overview: RANK, DENSE_RANK, ROW_NUMBER, LAG, LEAD. Write 10 queries using each.
  • Day 2: CTEs and recursive CTEs. Write 8 multi-CTE queries.
  • Day 3: Self-joins and complex joins. Write 6 self-join queries.
  • Day 4: Query optimization basics: read EXPLAIN plans, identify slow steps. Optimize 5 slow queries.
  • Day 5: Indexing strategy: when to add an index, when not to. Walk through 5 index-design questions.
  • Day 6: Advanced patterns: pivots, medians, running totals, gap-and-island problems.
  • Day 7: Practice mock: 30 minutes of timed SQL whiteboard, narrating as you write. Self-review.

Week 2 (days 8-14): ETL design and Airflow

  • Day 8: Read one short Airflow overview. Install Airflow locally. Run the example DAG.
  • Day 9: Build a toy DAG that pulls from a public API daily, transforms in Python, lands in SQLite.
  • Day 10: Add error handling and idempotency to the toy DAG. Test by running the same date twice.
  • Day 11: Study the 8 ETL design questions above. Write out sample answers.
  • Day 12: Backfill exercise. Set up a 7-day backfill on your toy DAG. Watch what fails.
  • Day 13: Schema evolution exercise. Change the API's response schema mid-flight. Watch your DAG handle it.
  • Day 14: Document the project on GitHub with a real README.

Week 3 (days 15-21): Spark and distributed-systems fundamentals

  • Day 15: Spark transformation model: narrow vs wide, lazy evaluation, DAG construction. Read one short overview.
  • Day 16: Build a toy Spark job that aggregates a 1-10GB CSV.
  • Day 17: Read the Spark UI for your toy job. Identify the most expensive stage. Optimize it.
  • Day 18: Distributed systems vocabulary: CAP, eventual consistency, idempotency, partition strategy.
  • Day 19: Skewed-key exercise. Simulate a skewed dataset, observe the slow task, apply the salt fix.
  • Day 20: Mock SQL whiteboard. Second timed run. Compare against day 7.
  • Day 21: Mock system design. 45 minutes on "design a daily clickstream pipeline." Self-review.

Week 4 (days 22-30): Tool-specific and behavioral

  • Day 22: Pick the tool the target role lists. Read one overview.
  • Day 23: Build a small portfolio project (3-4 dbt models, or 1-2 BigQuery notebooks).
  • Day 24: Add tests to the portfolio project. Document in the README.
  • Day 25: Behavioral prep. Write 5-6 short STAR-framework stories from coursework + your one real internship.
  • Day 26: Behavioral mock. 30 minutes on common questions. Self-review.
  • Day 27: Final SQL mock. Third timed run. Should be noticeably better than day 7.
  • Day 28: Final system design mock. 45 minutes on a different pipeline prompt. Self-review.
  • Day 29: Resume polish, LinkedIn update, portfolio README polish.
  • Day 30: Rest day. Walk into Monday's phone screen rested, not crammed.

Stick close to the plan. The two highest-risk skips: Week 1 SQL (everyone underestimates how hard the SQL whiteboard is at the entry level) and the day-21 system design mock (most new grads have never run a 45-minute data-pipeline system design under a clock).

Common data engineer interview mistakes for new grads

After watching enough new grads run data engineer loops, the same patterns show up across most failures:

1. Treating SQL as easy. SQL is the single most-graded skill in a data engineer loop. New grads who took one database course in college often arrive thinking SQL is solved, then bomb a window-function question. Spend the week on SQL. The window-function bar is universal at the entry level.

2. Skipping the portfolio. A pivoter without a dbt or Airflow project on GitHub reads as "career-changer who hasn't done the work." A pivoter with two documented projects reads as "career-changer who actually pivoted." The 18-22 hours of project work is the single best ROI prep activity for the pivot specifically.

3. Inflating experience. Don't claim production data engineering experience you don't have. Recruiters detect this within one phone screen. The honest framing ("CS grad with strong SQL fundamentals, pivoting into data engineering, built two projects to learn the stack") works. The dishonest framing ("experienced data engineer with 2 years of production Spark") gets you caught and blacklisted from the recruiter's pipeline.

4. Naming tools without depth. A resume that lists Spark, Airflow, dbt, Snowflake, Databricks, BigQuery, and Kafka but can't pass a basic question on any of them reads as a buzzword resume. Better: list three tools and be able to talk about all three at a meaningful depth.

5. Skipping the SQL whiteboard practice. Most new grads practice SQL by typing in an editor. The whiteboard is different. No autocomplete, no execution, no error messages. Practice writing SQL by hand at least three times in the prep window. The first time is shocking. The third time is fine.

6. Studying for the wrong tier. A new grad targeting a mid-market data team who studies for the FAANG Spark bar wastes a week. A new grad targeting FAANG who studies only dbt under-prepares. Identify the tier of the target roles in the first 48 hours of prep and calibrate accordingly.

7. Underestimating the system-design round. The data engineering system-design round is meaningfully different from a SWE system-design round. Data flow rather than service architecture. Partitioning rather than load balancing. Batch vs streaming tradeoffs rather than read-replicas vs primary. New grads who prep for SWE-style system design get caught when the interviewer asks about backfills or schema evolution. The system design basics for new grads guide covers the SWE-style framework. Layer in the data-flow concepts from this article on top.

Key terms

The vocabulary surface is wider than a SWE interview. Get these wrong and the interviewer flags you. Get them right and you pass that axis.

ETL vs ELT
ETL extracts data from source systems, transforms it in a separate staging engine, then loads it into the warehouse. ELT extracts, loads the raw data into the warehouse first, and transforms in place using the warehouse's compute. Modern data stacks are predominantly ELT. Cloud warehouses are cheap and fast enough to transform at scale, and keeping the raw data makes debugging trivial.
Batch processing vs stream processing
Batch processing runs on a scheduled interval (hourly, daily) over a fixed window of historical data. Stream processing runs continuously over events as they arrive, with windowing semantics for aggregation. Most production pipelines in 2026 are still batch. Streaming is reserved for use cases where freshness matters (fraud detection, real-time dashboards). Hybrid (micro-batch) approaches are common.
Data warehouse vs data lake vs lakehouse
A data warehouse stores structured, schema-on-write data optimized for analytical queries (Snowflake, BigQuery, Redshift). A data lake stores raw, schema-on-read data in cheap object storage (S3, GCS, Azure Blob), supporting any file format. A lakehouse combines both: schema-on-write performance on top of object storage (Databricks, Iceberg-based architectures). The data lake era peaked in 2019. The lakehouse era is current.
OLTP vs OLAP
OLTP (Online Transaction Processing) is the database pattern behind operational systems: many small reads and writes per second, optimized for low latency on indexed lookups. Application databases (Postgres, MySQL) are OLTP. OLAP (Online Analytical Processing) is the pattern behind analytical systems: fewer queries per second but each query scans much more data, optimized for columnar storage and aggregation. Warehouses are OLAP. Data engineers spend most of their time moving data from OLTP systems into OLAP systems.
Idempotency
The property that running an operation multiple times produces the same result as running it once. Critical in distributed pipelines because retries, redelivery, and replays all violate the assumption that an operation runs exactly once. The canonical idempotency pattern in data engineering: identify the operation with a deterministic key (date, batch ID, source-file hash), use upsert (MERGE) rather than insert, and design transformations to be replay-safe.
Backfill
The process of re-running a pipeline for a past date range: either because the pipeline failed for several days, or because business logic changed and historical data needs to be re-computed under the new rules. Backfills probe idempotency, schema evolution, and resource awareness in interviews. "Walk me through how you'd backfill three days of missed clickstream data" is a canonical Airflow interview question.
Partition / partitioning
Splitting a large dataset into smaller, independently-storable chunks by a key, typically date, user_id, or geographic region. Partitioning enables query pruning (only scan the partitions matching the filter) and parallel processing (each worker handles a partition independently). Most modern warehouses partition automatically. User-defined partitioning is for high-cardinality keys.
Shuffle
The data-movement step in a distributed computation when records need to be regrouped by key across the network, for example during a join or groupBy. Shuffles are the slowest operation in Spark because network I/O is the slowest layer. Wide transformations cause shuffles. Narrow transformations don't. Most Spark performance work is shuffle reduction.
Sensor (Airflow)
An Airflow task type that waits for an external condition before allowing downstream tasks to run: a file landing in object storage, a partition appearing in a warehouse, a value in a database row. Sensors come in poke mode (worker checks every N seconds, holding a worker slot) and reschedule mode (worker frees the slot between checks). Long-running sensors should use reschedule mode.
Schema-on-read vs schema-on-write
Schema-on-write enforces structure at the time of writing. The warehouse rejects malformed data at load time. Schema-on-read defers structure to the time of querying. The data lake stores raw files, and queries impose schema dynamically. Warehouses are schema-on-write. Data lakes are schema-on-read. Lakehouses are a hybrid.
Slowly Changing Dimension (SCD)
A pattern for tracking how a dimension table's rows change over time. Type 1 overwrites the old value (no history). Type 2 inserts a new row with valid-from and valid-to dates (full history). Type 3 keeps the current and one previous value in two columns (limited history). SCDs come up in dbt interviews via the snapshot construct, which implements Type 2 automatically.
CDC (Change Data Capture)
A pattern for replicating changes from an OLTP source database into a warehouse by reading the database's transaction log rather than polling for changes. CDC enables near-real-time replication with low overhead on the source. Common tools: Debezium, Fivetran with CDC enabled, AWS DMS. CDC questions probe whether you understand the alternatives to scheduled batch ingestion.

The vocabulary is wider than a SWE interview because the tooling is wider. Knowing the words above gets you through the credibility check. Knowing what's behind them (when each pattern applies, what the tradeoffs are) gets you through the technical depth check.

Related guides

The data engineer interview is one of several interview formats a CS new grad may face in the 2026 cycle. The following guides close adjacent gaps:

Pick the gap, jump to the matching cornerstone, close the gap, then return to DE prep. That is the loop.


The data engineer interview in 2026 is one of the most CS-new-grad-friendly entry points into the broader tech-hiring market. Lower LeetCode bar than a SWE loop, broader vocabulary surface, more entry-level openings per applicant. The pivot is real and honest if you study the material. 30 days of focused prep on the SQL + ETL + tool stack closes the gap from CS-new-grad-SWE to interview-ready entry-level data engineer.

InterviewChamp.AI runs realistic data-engineering mocks that show up on every interview surface: the SQL whiteboard, the Airflow design conversation, the Spark deep-dive, the behavioral round. One install, every surface. Start a practice session, narrate as you write the query, get scored on what your interviewer is scoring, and walk into Monday's phone screen ready.


About the author: Alex Chen is the founder of InterviewChamp.AI, building AI interview prep for the new-grad CS market and writing about the modern interview gauntlet from the inside.

Related guides

Interview Process

System Design Interview Guide for CS New Grads (2026): Framework, Templates, Cheat Sheet

The new-grad system design interview is a vocabulary check, a structure check, and a communication check, not a senior architect evaluation. This guide gives you a 4-step framework, a 12-template cheat sheet, a 45-minute time budget, the five canonical problems that carry 80% of new-grad rotations, and a side-by-side of HLD vs LLD vs machine-learning-system-design. Built for the CS new grad who has solved 600 LeetCode problems but never drawn a load balancer.

Alex Chen ·

Read more →
Interview Process

The 2026 CS New-Grad Interview Loop: Phone Screen to Offer at Every Tier

The 2026 CS new-grad interview loop runs five steps (recruiter screen, technical screen, onsite, debrief, offer) but the shape of each step now depends on tier of company. This guide maps the loop for FAANG, mid-tier public, startup, consultancy, and research lab, with 2026 timelines and how AI-fraud concerns brought in-person rounds back.

Alex Chen ·

Read more →
Interview Process

Accounting Interview Questions for 2026: 40+ Questions for Staff Accountants, Big 4 Candidates, and CPA Pivots

Accounting interview questions in 2026 test six things at once: do you know GAAP cold, can you walk a transaction from journal entry to the three financial statements, can you read a balance sheet under pressure, do you understand the difference between Big 4 audit and corporate close work, can you handle the behavioral round without sounding rehearsed, and can you reason through a case study when the prompt is intentionally vague. If you're an accounting grad, a CPA candidate, or pivoting from finance/ops into staff accountant work, the technical bar isn't the killer. It's framing what you know in 60 seconds while a senior manager watches you on Zoom. This guide walks 40+ questions across six categories, the Big 4 vs corporate vs public-accounting split, and the four-week prep plan that actually works.

Alex Chen ·

Read more →

Frequently asked questions

What does a data engineer interview test for in 2026?
A data engineer interview in 2026 tests four things: SQL fluency at the window-function-and-CTE level, ETL pipeline design with awareness of idempotency and backfills, distributed-systems intuition (partitioning, eventual consistency, the read-vs-write tradeoff), and at least one tool-specific deep dive on Spark, Airflow, dbt, or a cloud warehouse like Snowflake or BigQuery. The LeetCode bar is meaningfully lower than a pure SWE loop. Most data engineering interviews substitute one or two algorithm rounds for a SQL whiteboard round and a system-design round focused on data flow rather than service architecture.
Should CS new grads pivot from SWE to data engineering in 2026?
For most CS new grads applying since spring 2025 with no offers, yes. The data engineering pipeline is meaningfully less saturated than the pure SWE pipeline at the entry level, with employers reporting fewer applicants per opening per Class-of-2025 megathread reporting on r/cscareerquestions. The honest read: data engineering isn't easier, it's different. Lower LeetCode bar, higher SQL bar, higher system-thinking bar around data flow. The pivot is honest if you actually study the material, not if you slap 'data engineer' on a resume that's still SWE-only inside.
What's the difference between data engineer, data scientist, and data analyst interviews?
Data engineer interviews test pipelines and infrastructure: how do you move 100GB of clickstream from Kafka to a warehouse, what happens when the pipeline fails, how do you backfill three days of missed data. Data scientist interviews test statistics, machine learning, and experimentation: A/B test design, regression, feature engineering. Data analyst interviews test SQL plus business reasoning: given a metric drop, what query would you run, how would you slice it. SQL appears in all three; only data engineering also tests Spark, Airflow, distributed systems, and warehouse internals.
How many rounds is a data engineer interview loop?
Entry-level data engineer loops are typically 4-5 rounds at large employers: a recruiter screen, a technical phone screen (SQL + light Python), an onsite SQL round (whiteboard SQL with progressive complexity), a system design round focused on data flow rather than service architecture, and a behavioral round. Senior loops add a tool-specific deep dive (Spark, Airflow, dbt). The onsite typically runs 4-5 hours. The interview is shorter than a FAANG SWE loop and rarely includes more than one algorithm round.
What SQL questions do data engineers actually get asked in 2026?
Window functions (RANK, DENSE_RANK, LAG, LEAD), self-joins, multi-CTE queries, query optimization (read an EXPLAIN plan, name two ways to speed it up), and indexing strategy. The hard SQL round is rarely 'write this query from scratch'. It's 'here's a slow query, debug it' or 'here's a business question, write the query, walk me through the execution plan.' The window-function bar is universal at the entry level; if you can't write a query that ranks customers by spend within each region, that's a fail signal.
What is ETL and how is it different from ELT?
ETL stands for Extract, Transform, Load. Data is pulled from source systems, transformed in a staging area (often a separate compute engine), then loaded into a target warehouse. ELT (Extract, Load, Transform) pulls the raw data into the warehouse first and uses the warehouse's compute to transform in place. ELT has become the dominant pattern in 2026 because cloud warehouses are cheap and fast enough to transform at scale, and because keeping the raw data in the warehouse makes debugging and replays trivial. Most modern data stacks (dbt + Snowflake or BigQuery) are ELT, not ETL.
What Spark interview questions should I prepare for?
Five categories: lazy evaluation and DAG construction (why doesn't my code do anything until I call collect or write), narrow vs wide transformations (filter and map vs groupBy and join), shuffle costs and partition strategy, broadcast joins vs sort-merge joins, and the difference between RDDs, DataFrames, and Datasets. Memory tuning (executor memory, partition count) shows up at senior levels but is rarely the focus for entry-level. The most common entry-level Spark trap: claiming 'Spark scales linearly' without acknowledging that wide transformations and skewed keys break linearity.
What Airflow interview questions should I prepare for?
DAGs and operators (what's a DAG, what's an operator, what's a task), scheduling semantics (start_date, schedule_interval, catchup, the difference between execution_date and the actual run time), idempotency (why your task must be safe to re-run), backfills (how to replay missed days), task dependencies and XComs (passing small values between tasks). Airflow questions often double as system-design questions: 'design a daily pipeline that ingests yesterday's clickstream, transforms it, and loads it into the warehouse.' Know what a sensor is and when to use it.
How do I prepare for a data engineer interview in 30 days as a CS new grad?
Week 1: SQL deep dive. 4 hours/day on window functions, CTEs, query optimization, and indexing. Use a free practice platform with real datasets. Week 2: ETL + Airflow. Read one short overview of Airflow scheduling and idempotency, build one toy DAG locally, study 8 common ETL design questions. Week 3: Spark + distributed systems. Work through the Spark transformation model, read one short overview of partitioning and shuffles, study the canonical distributed-systems concepts (CAP, eventual consistency, idempotency). Week 4: tool-specific (dbt, Snowflake, BigQuery, whatever the role uses) + behavioral. Run timed mock interviews in weeks 3 and 4.
Is the pivot from SWE to data engineering dishonest if I haven't done DE work?
The pivot itself is fine. Engineers cross-train into new disciplines all the time. What's dishonest is lying about experience. Acceptable: 'I'm a CS grad with strong SQL from coursework and a backend Python background; I've been studying data engineering and built two projects in Spark and Airflow to learn the stack.' Not acceptable: rewriting a SWE-only resume to claim production data engineering experience you don't have. Recruiters can tell the difference within one phone screen, and being caught lying is worse than being honest about the pivot.
What's the easiest data engineer interview tool to learn for a new grad?
dbt is the highest-ROI learning target. It's the most-asked tool in 2026 entry-level data engineer interviews, the learning curve is meaningfully shorter than Spark or Airflow, and the free Cloud tier lets you build a real portfolio project in 6-8 hours. After dbt, Airflow is the next-easiest because you can run a local instance and build toy DAGs without infrastructure. Spark is the steepest curve at the entry level. Meaningful Spark depth takes 2-3 weeks of focused work. Snowflake or BigQuery is the lightest study because the SQL is mostly transferable from any SQL background.
What's the salary range for entry-level data engineers in 2026?
Entry-level data engineer base salaries in the US run $85K-$130K depending on company tier and location, with mid-market employers concentrated at $90K-$110K. Total comp at large public tech employers can push past $150K with equity and bonus. Data engineering pays comparably to SWE at the entry level in most markets: slightly lower at the absolute top, slightly higher at the mid-market. The narrative that data engineering pays less is outdated.
Do I need a Python background for a data engineer interview?
Yes, but not at the algorithm-grinding level a SWE interview demands. Entry-level data engineer interviews test Python at a 'write a function that processes a list of dicts' level: list comprehensions, basic file I/O, the pandas API at a competent level, and the requests library for pulling from an API. If you can complete the easier half of a LeetCode list-and-string problem set comfortably, you're ready for the Python bar of a data engineer interview.
What does a system design round look like for data engineers?
It's a 45-60 minute conversation where the interviewer asks you to design a data pipeline rather than a service. Example prompts: design a pipeline that ingests 100GB of clickstream per day into the warehouse, design a real-time fraud detection feature, design a metrics dashboard backend. The expected answer shape is data-flow oriented: sources, ingestion, storage, processing, serving, instead of the service-oriented load-balancer-and-cache pattern of SWE system design. Tradeoff awareness around batch vs streaming, partitioning strategy, and backfill design is the senior signal.
What's a backfill and why does it matter in data engineer interviews?
A backfill is the process of re-running a pipeline for a past date range, either because the pipeline failed for several days, or because business logic changed and historical data needs to be re-computed under the new rules. Backfills matter in interviews because they probe idempotency (does running the same logic twice produce the same result?), schema evolution (what happens if the table schema changed between then and now?), and resource awareness (re-running a year of daily jobs is expensive). 'Walk me through how you'd backfill three days of missed clickstream data' is a canonical Airflow interview question.