10 X Machine Learning Engineer (New Grad) Interview Questions (2026)

X's new-grad MLE loop in 2026 adds one ML-specific round on top of the standard SWE coding and behavioral set. Expect questions on recommendation algorithms, feature engineering, online learning, and the production reality of ranking content for hundreds of millions of users. The role sits close to the For You timeline and ads relevance teams.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-25

Loop overview

New-grad MLEs report a 5-7 week timeline in 2026. Phone screen is coding (60 min) plus a short ML-conceptual section. Onsite is one ML deep-dive (research project or paper discussion), two coding rounds, one ML system design, and one behavioral. The recommendation-systems angle is the dominant flavor — be ready for it.

Behavioral (3)

Walk me through an ML project where you built something users interacted with.

Frequently asked

Outline

Pick one project. Cover problem framing, data, modeling, evaluation, and what you'd do differently. Have specific numbers: dataset size, baseline vs final metric, online performance if you have it. Surface-level course projects lose; depth in one applied project wins.

Why ML and why X?

Frequently asked

Outline

Be specific: the recommendation problem at X's scale is genuinely hard (cold start, real-time signals, billions of items, adversarial content). What draws you to it? Mention what you've read or seen about the For You algorithm. Vague 'I love ML' answers fail.

Tell me about a time you debugged a model that worked offline but failed in production.

Occasionally asked

Outline

STAR. Pick a real story (training/serving skew, data drift, label leakage, distribution shift). Cover how you noticed (which metric flagged), how you isolated the cause, and what you fixed. If you don't have one, frame a course project or competition with the equivalent dynamics. Don't fabricate.

Coding (LeetCode patterns) (2)

Implement a function that computes the dot product between two sparse vectors.

Frequently asked

Outline

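Represent each vector as a hash map (index → value) or a sorted list of (index, value) tuples. Hash map: iterate over the smaller map, look up each index in the larger one, and sum the products. For vectors with m and n nonzeros this is O(min(m, n)) expected time. Sorted lists: walk both with two pointers, advancing whichever side has the smaller index, in O(m + n). If one vector is far sparser than the other, binary-searching the denser list gives O(m log n). Discuss tradeoffs: the hash map is faster per lookup, while sorted lists use less memory and traverse sequentially, which is cache-friendly.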
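A minimal sketch of both representations, assuming the vectors are already stored sparsely (the function names are illustrative, not from any library):

```python
def sparse_dot_hash(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {index: value} maps."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map, look up in the larger
    return sum(v * b[i] for i, v in a.items() if i in b)


def sparse_dot_sorted(a: list[tuple[int, float]], b: list[tuple[int, float]]) -> float:
    """Dot product of two sparse vectors stored as index-sorted (index, value) lists."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        ia, ib = a[i][0], b[j][0]
        if ia == ib:  # shared nonzero index: accumulate the product
            total += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif ia < ib:  # advance the pointer that is behind
            i += 1
        else:
            j += 1
    return total
```

For the dense vectors [1, 0, 0, 2, 3] and [0, 3, 0, 4, 0], both functions return 8.0, since index 3 is the only position where both are nonzero.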

Implement a function that returns the K most similar items to a query embedding.

Occasionally asked

Outline

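Naive: compute cosine similarity to all N items and keep a min-heap of size K. That costs O(N * d) for the similarities plus O(N log K) for the heap, so O(N * d + N log K) overall. Walk through normalization: pre-normalize the item embeddings (and the query) to unit length so cosine similarity reduces to a dot product. Discuss ANN alternatives (HNSW, IVF) at scale and the recall/latency tradeoff.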
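A brute-force sketch of the naive approach, assuming item embeddings are held in a dict and normalized once ahead of time (`normalize` and `top_k_similar` are hypothetical helper names):

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def top_k_similar(query: list[float], items: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Return up to k (score, item_id) pairs with the highest cosine similarity, best first.

    Assumes every embedding in `items` was already passed through normalize().
    """
    q = normalize(query)
    heap: list[tuple[float, str]] = []  # min-heap: weakest of the current top-k sits at heap[0]
    for item_id, emb in items.items():
        score = sum(a * b for a, b in zip(q, emb))  # dot product of unit vectors
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item_id))  # evict the weakest, O(log k)
    return sorted(heap, reverse=True)
```

The min-heap keeps memory at O(K) regardless of catalog size. An ANN index replaces the linear scan, not the heap logic.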

Technical (5)

Explain how a two-tower retrieval model works for content recommendation.

Frequently asked

Outline

User tower encodes user features (history, demographics, context) to an embedding. Item tower encodes content features to an embedding in the same space. Score = dot product. Training: positive pairs are real interactions, negatives are sampled (in-batch is common). At serve time: pre-compute item embeddings, online encode the user, ANN lookup for top-K. Discuss the cold-start problem and how you'd address it.
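The training objective with in-batch negatives can be sketched as a softmax over the batch. This minimal version assumes the two towers have already produced the embeddings (the towers themselves, typically MLPs, are omitted) and that row i of each batch is a real interaction pair:

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_embs: list[list[float]], item_embs: list[list[float]],
                          temperature: float = 1.0) -> float:
    """Mean cross-entropy where user i's positive is item i and every other
    item in the batch serves as a sampled negative."""
    total = 0.0
    for i, u in enumerate(user_embs):
        logits = [dot(u, v) / temperature for v in item_embs]  # score = dot product
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[i]  # -log softmax(logits)[i]
    return total / len(user_embs)
```

Because every item in the batch doubles as a negative, no separate negative sampler is needed, which is why in-batch negatives are popular. At serve time only `dot` survives: item embeddings are precomputed and indexed, and the user embedding is computed online.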

Design a ranker for a personalized feed of posts.

Frequently asked

Outline

Two-stage: candidate generation (cheap and broad; retrieves a few hundred candidates per user) and a ranker (a heavier model that scores each candidate). Features: user history, post freshness, author affinity, engagement signals. Discuss the label problem (what counts as a positive: a click, dwell time, an engagement, or a weighted mix?), online vs offline training, and the cold-start case for new users or new posts.
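A sketch of the two-stage flow, with the "weighted mix" answer to the label problem as the stage-2 score. The event names and weights in `EVENT_WEIGHTS` are hypothetical placeholders, not X's values; in practice weights like these are tuned with online experiments:

```python
import heapq
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Hypothetical per-event weights for a multi-task ranker's predicted probabilities.
EVENT_WEIGHTS = {"click": 1.0, "like": 2.0, "reply": 5.0}


def blended_score(predictions: dict[str, float]) -> float:
    """Combine per-event predicted probabilities into one ranking score."""
    return sum(w * predictions.get(event, 0.0) for event, w in EVENT_WEIGHTS.items())


def two_stage_rank(pool: Iterable[T], cheap_score: Callable[[T], float],
                   expensive_score: Callable[[T], float],
                   n_retrieve: int, n_show: int) -> list[T]:
    # Stage 1: a cheap score over the whole pool keeps only n_retrieve candidates.
    candidates = heapq.nlargest(n_retrieve, pool, key=cheap_score)
    # Stage 2: the expensive scorer runs only on the survivors.
    return sorted(candidates, key=expensive_score, reverse=True)[:n_show]
```

A point worth raising in the interview: a post the ranker would score highly can still be missed if candidate generation drops it, so retrieval recall bounds the final feed quality.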

How would you evaluate a recommendation system before shipping?

Frequently asked

Outline

Multi-stage: offline (NDCG, MAP, AUC on held-out data; check for distribution shift between training and evaluation data), shadow (run the new model alongside the production one on live traffic without serving its results, and compare predictions and latency), A/B test (a small percentage of traffic, with primary metrics tied to the business: session length, engagement rate, retention). Discuss the limits of each: offline metrics often don't predict online wins.
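Offline NDCG is worth being able to write from memory. A minimal NDCG@K sketch, assuming graded relevance labels listed in the order the model ranked the items and the linear-gain variant (some definitions use 2^rel - 1 as the gain instead):

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top k positions: rel / log2(rank + 1)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0. In practice you compute it per user or per request and average, which is also where slicing by new vs returning users exposes cold-start regressions.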

Given a labeled dataset with severe class imbalance, what techniques would you apply?

Frequently asked

Outline

Loss weighting (give the minority class a higher loss weight) is the easy first move. Alternatively, over-sample the minority class (SMOTE, simple replication) or under-sample the majority. Tune the decision threshold at inference (the default 0.5 is rarely right), and note that weighting or resampling shifts the predicted probabilities, so recalibrate if downstream systems consume them as probabilities. Mention focal loss as a more sophisticated alternative. Discuss why accuracy is the wrong metric (use ROC AUC, precision-recall AUC, or F1).
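Two of those moves can be sketched in a few lines: a class-weighted log loss and a threshold sweep on a validation set. This assumes binary labels in {0, 1}; setting `pos_weight` to the negative/positive count ratio is a common heuristic, not a rule:

```python
import math


def weighted_log_loss(probs: list[float], labels: list[int], pos_weight: float) -> float:
    """Binary cross-entropy with positive examples up-weighted by pos_weight."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def f1_at_threshold(scores: list[float], labels: list[int], threshold: float) -> float:
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: list[float], labels: list[int]) -> float:
    """Pick the F1-maximizing threshold, trying each distinct validation score."""
    return max(sorted(set(scores)), key=lambda t: f1_at_threshold(scores, labels, t))
```

With validation scores [0.1, 0.2, 0.3, 0.35, 0.4, 0.9] and labels [0, 0, 0, 1, 1, 1], the default 0.5 threshold gives F1 = 0.5, while the sweep picks 0.35 and reaches F1 = 1.0. Tune the threshold on validation data, never on the test set.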

Design an online learning system that updates the ranker as users interact with the feed.

Occasionally asked

Outline

Concept-level. Logging pipeline captures interactions with full context (features used, predictions, outcomes). Training job consumes the log stream — could be mini-batch (every N minutes) or true online (one example at a time, less common). Discuss feedback loops (the ranker influences what users see, which influences training data), the cold-start failure mode for new content, and how you'd detect when the model degrades.
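The "true online" variant can be illustrated with per-example SGD on a logistic model. This is a toy sketch, assuming each logged interaction arrives as sparse {feature: value} inputs plus a 0/1 outcome. It scores each example before training on it (often called progressive validation), so the rolling loss doubles as a degradation signal. It does not address the feedback loop itself:

```python
import math
from collections import deque


class OnlineLogisticRanker:
    """Per-example SGD logistic regression over sparse {feature: value} inputs."""

    def __init__(self, lr: float = 0.1, window: int = 1000):
        self.w: dict[str, float] = {}
        self.lr = lr
        self.recent_losses: deque = deque(maxlen=window)  # rolling window for monitoring

    def predict(self, x: dict[str, float]) -> float:
        z = sum(self.w.get(f, 0.0) * v for f, v in x.items())
        z = max(min(z, 35.0), -35.0)  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: dict[str, float], y: int) -> float:
        """Score the logged example first, record its loss, then take one SGD step."""
        p = self.predict(x)
        eps = 1e-12
        loss = -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps)))
        self.recent_losses.append(loss)
        grad = p - y  # derivative of log loss with respect to the logit
        for f, v in x.items():
            self.w[f] = self.w.get(f, 0.0) - self.lr * grad * v
        return loss

    def rolling_loss(self) -> float:
        """Mean loss over the recent window; a sustained rise suggests degradation."""
        if not self.recent_losses:
            return float("nan")
        return sum(self.recent_losses) / len(self.recent_losses)
```

A mini-batch pipeline follows the same logic, just accumulating gradients over N minutes of logs before each step. Alerting on `rolling_loss()` against a baseline is one simple way to catch a model that starts to drift.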

X interview tips

Have one ML project you can defend three layers deep. Recommendation-flavored projects (a movie recommender, a search ranker) play best for X's interview style.
Know your evaluation methodology cold. Most MLE no-hires aren't from getting the model wrong — they're from not having a defensible eval story.
ML system design at X centers on serving, evaluation, and feedback loops. Spend prep on the production side, not on training infrastructure.
Brush up on recommendation-system fundamentals: two-tower retrieval, candidate generation vs ranking, the cold-start problem, multi-task learning.
X's MLE org runs lean. Engineers ship models end-to-end (data + training + serving). Generalist skills win over hyper-specialization.

Frequently asked questions

How long is X's MLE new-grad interview process in 2026?

Most reports show 5-7 weeks from recruiter outreach to offer. The ML deep-dive round adds time vs the SWE loop.

What's the difference between X's SWE and MLE new-grad loops?

MLE adds one ML deep-dive round, replaces one coding round with ML coding, and the system design round is ML-flavored (ranking, retrieval, feedback loops).

Do I need a published paper to interview for new-grad MLE at X?

No. A research project with rigorous evaluation, an internship with applied ML work, or a strong end-to-end personal project is sufficient. Quality of methodology beats venue.

Should I prepare for recommendation-systems specifically?

Yes. X's MLE work is heavily ranking-and-retrieval flavored. Spend prep on the recommendation-systems patterns (two-tower retrieval, ranking, online evaluation) rather than generic ML topics.

What ML libraries should I be comfortable with for X interviews?

PyTorch is most common. Know NumPy fluently. Familiarity with approximate nearest-neighbor search (libraries such as FAISS, index types such as HNSW) helps for the system design round.
