
Machine Learning Interview Questions for 2026: 40+ Questions Across Theory, Deep Learning, ML System Design, LLMs (New-Grad Edition)

ML interview questions in 2026 test seven things: classical ML theory, the workhorse algorithms, deep learning and transformer mechanics, the LLM stack everyone is shipping into prod, ML system design at the pipeline level, coding implementation in Python and NumPy, and one or two case-study prompts about a real model decision. This guide covers 40+ questions across all seven, plus the part new grads agonize about most: how to compete for an ML role without research papers.

By Alex Chen, Founder, InterviewChamp.AI · Last updated

31 min read

What machine learning interview questions actually test in 2026

Machine learning interview questions in 2026 test seven things, weighted roughly like this at the new-grad bar: ML fundamentals (30%), classical algorithms (20%), deep learning and transformers (15%), the LLM stack (10%), ML system design (15%), coding implementation in Python and NumPy (5%), and one or two short case studies on a real model decision (5%). The fundamentals slice is the floor, and the floor is harder than candidates expect.

The other shift since 2024: every ML interview now includes at least one LLM question, even when the role has nothing to do with frontier reasoning models. RAG architecture, fine-tuning vs prompting, hallucination control, and basic serving concepts (KV-cache, batching, quantization) are now table-stakes vocabulary. The candidate who can recite mixture-of-experts but freezes on "what's the difference between L1 and L2 regularization" loses the round.

The 2026 hiring environment for ML roles is bimodal. Frontier-model labs filter heavily on research signal at the entry level: papers, top-conference internships, or a top-tier PhD program. Mid-market tech companies and ML-heavy startups hire CS new grads who can show production-ML competence: a finished Kaggle competition, a deployed side project, a clean ML system design round. The path for a CS new grad without research papers runs through MLE roles at mid-market tech companies, not through frontier labs at the entry level. That route is real, and it is producing offers in 2026. The frontier-lab route from undergrad requires either a paper or a year of carefully chosen research-internship signal that most CS programs don't naturally generate.

This guide builds for the mid-market MLE pipeline: 40+ questions across the seven categories, the canonical ML system design prompts at the new-grad bar, the coding patterns interviewers ask cold, and the honest framing of how a new grad without papers competes.

How ML interview questions differ from general SWE interviews

A SWE interview tests algorithm fluency, problem decomposition, and communication. An ML interview adds three orthogonal axes most CS new grads have never been graded on:

Mathematical depth at the model level. Not university-PhD-deep, but deeper than coursework typically goes. You are expected to derive the gradient of logistic regression on the fly, explain why scaled dot-product attention divides by sqrt(d_k), and write the formula for binary cross-entropy from memory. Candidates who memorized the formulas without internalizing the derivations get caught the first time the interviewer asks "why?"

Production-ML system thinking. A SWE system design round designs a service. An ML system design round designs a pipeline plus a service. Feature stores, training pipelines, model registries, online vs offline inference, monitoring for drift, rollback strategies. The vocabulary surface is wider, and the canonical references (production ML books, ML system design articles) are less standardized than SWE system design. Read 2-3 ML system design references and pick the vocabulary that overlaps.

Domain literacy. NLP roles ask about tokenization and transformer-based LLMs. Vision roles ask about CNNs, augmentation, and segmentation losses. Recsys roles ask about embedding tables, two-tower architectures, and negative sampling. The breadth round tests fundamentals across all of them; the depth round tests one. Read the JD before the interview; figure out which domain depth they want.

A SWE new grad who interviewed at five ML companies in 2026 described the difference like this: "SWE interviews graded whether I could solve the problem. ML interviews graded whether I understood why I was solving it that way." That distinction is the whole game. Memorization clears the breadth round. Understanding clears the depth round and the system-design round.

The 40+ machine learning interview questions you should rehearse

What follows is a structured rehearsal set covering the seven categories. Each question has a sample answer outline: not a full canned response, but the bones of what a strong new-grad answer covers. Adapt the language to your own voice; the structure is the load-bearing part.

ML fundamentals interview questions (10 Q)

This is the floor. Miss any of these and the breadth round becomes uphill from question three.

Q1. Explain the bias-variance tradeoff.

Bias is the error from a model being too simple to capture the underlying pattern (underfitting). Variance is the error from a model being too sensitive to the specific training data (overfitting). High-bias models miss the signal; high-variance models fit the noise. The tradeoff: as you decrease bias by adding model capacity, variance increases. Cross-validation is the diagnostic. If training loss is low and validation loss is high, you're variance-bound. If both are high, you're bias-bound.

Q2. What is overfitting and how do you detect it?

Overfitting is when a model learns patterns specific to the training set that do not generalize. Detection: training loss continues to decrease while validation loss starts increasing. Mitigation: more data, regularization (L1, L2, dropout, early stopping), simpler model, or data augmentation. The interview-relevant nuance is that a random train-test split alone does not reveal overfitting to the training distribution when the deployment distribution differs; you need a held-out test set drawn from the deployment distribution, not just a random split of the training set.

Q3. What's the difference between L1 and L2 regularization?

L2 (ridge) adds the sum of squared weights to the loss; L1 (lasso) adds the sum of absolute values. The mathematical consequence: L2 shrinks weights toward zero but rarely produces exact zeros; L1 produces sparse solutions where many weights are exactly zero. Use L2 when all features are believed to contribute; use L1 when you want feature selection built into training. Elastic net combines both with a mixing parameter.
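The sparsity difference can be seen in closed form under one simplifying assumption: if the features are orthonormal, ridge rescales each least-squares coefficient while lasso soft-thresholds it. The sketch below assumes the objectives ½‖y − Xw‖² + (λ/2)‖w‖² and ½‖y − Xw‖² + λ‖w‖₁; the function names and the example coefficients are illustrative only.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    # L2 penalty with orthonormal features: every coefficient is rescaled,
    # so nonzero coefficients stay nonzero.
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    # L1 penalty with orthonormal features: soft-thresholding,
    # so coefficients with |w| <= lam become exactly zero.
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

w_ols = np.array([3.0, -0.4, 0.2, -2.0])  # hypothetical least-squares fit
w_ridge = ridge_shrink(w_ols, lam=0.5)
w_lasso = lasso_shrink(w_ols, lam=0.5)
print("ridge:", w_ridge)
print("lasso:", w_lasso)
```

With general (correlated) features there is no such closed form for lasso, but the qualitative behavior is the same.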

Q4. What is cross-validation and when do you use it?

Cross-validation is the technique of partitioning training data into k folds, training on k-1 folds and validating on the held-out fold, then rotating. Common: 5-fold or 10-fold CV. Use it when your dataset is small enough that a single train-validation split has high variance, when you're tuning hyperparameters and want a less noisy signal, or when you need to defend a model choice in interview prose without overfitting to one validation set.
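The fold rotation described above can be sketched as a small index generator. This is a minimal version, assuming shuffled, non-stratified folds; `kfold_indices` is a name made up for this example.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)  # folds differ in size by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

For time-series data a shuffled split like this leaks future information; a forward-chaining split is the usual substitute.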

Q5. How do you evaluate a classification model with imbalanced classes?

Accuracy is misleading on imbalanced data. A model that predicts the majority class always can hit 99% accuracy on a 1% positive class. Use precision, recall, F1, and the area under the precision-recall curve. ROC-AUC is less informative on extreme imbalance because the false-positive rate is dominated by the large negative class. The 2026 best practice for imbalanced classification: report PR-AUC and the F1 at the operating threshold, not accuracy.
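The majority-class trap is easy to reproduce. The sketch below computes the metrics from scratch for a classifier that always predicts the negative class on a dataset with 1% positives; `binary_metrics` is an illustrative helper, and the convention of returning 0.0 for an undefined precision is an assumption of this example.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = float(np.mean(y_pred == y_true))
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = np.array([1] * 10 + [0] * 990)   # 1% positive class
always_negative = np.zeros_like(y_true)    # predicts the majority class every time
metrics = binary_metrics(y_true, always_negative)
print(metrics)
```

Accuracy comes out at 0.99 while recall and F1 are both zero, which is the point of the question.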

Q6. What is the difference between supervised, unsupervised, and self-supervised learning?

Supervised learning has labeled examples (input, label) pairs and learns the mapping. Unsupervised learning has no labels and learns structure (clusters, density, embeddings). Self-supervised learning generates labels from the input itself (predict the next word in a sentence, predict a masked image patch) and uses those to pretrain large models without manual annotation. Most production-grade transformer-based LLMs are pretrained self-supervised, then fine-tuned supervised.

Q7. What does it mean for a model to generalize?

Generalization is the property of performing well on data drawn from the same distribution but not seen during training. A model that generalizes well has captured the underlying signal rather than memorizing training noise. Generalization is bounded by the data: a model trained on 2024 distributions does not generalize to 2026 distributions if the data has shifted. The interview-relevant follow-up: how do you measure generalization on a moving target? Answer: rolling validation windows or hold-out sets refreshed from production.

Q8. What is feature engineering and why does it matter?

Feature engineering is the process of transforming raw inputs into features the model can learn from. Examples: bucketing ages into ranges, computing interaction features (price * quantity), normalizing or standardizing scales, one-hot encoding categorical variables, embedding high-cardinality categoricals. It matters because most classical models can only learn as well as the features allow; bad features bound the performance ceiling. Deep learning reduces but does not eliminate the need for feature engineering, especially for tabular data where gradient boosting plus thoughtful features still wins benchmarks.

Q9. What are the common evaluation metrics for regression?

Mean squared error (MSE) and root mean squared error (RMSE) penalize large errors more. Mean absolute error (MAE) is less sensitive to outliers. R-squared measures variance explained. The choice depends on the cost function of the business problem. If a 10-unit miss is twice as bad as a 5-unit miss, MAE fits. If it's four times as bad, MSE fits. Always report the metric in the units of the prediction, not unitless R-squared, when talking to a non-ML stakeholder.
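A small worked comparison makes the penalty difference concrete. In this sketch (toy numbers chosen for illustration), two sets of predictions have the same MAE, but the one that concentrates its error in a single large miss has four times the MSE.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2))

def rmse(y_true, y_pred):
    return mse(y_true, y_pred) ** 0.5

y_true = np.array([100.0, 100.0, 100.0, 100.0])
small_misses = np.array([105.0, 95.0, 105.0, 95.0])     # every prediction off by 5
one_big_miss = np.array([100.0, 100.0, 100.0, 120.0])   # one prediction off by 20

for name, pred in [("small misses", small_misses), ("one big miss", one_big_miss)]:
    print(name, "MAE:", mae(y_true, pred), "MSE:", mse(y_true, pred), "RMSE:", rmse(y_true, pred))
```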

Q10. What is data leakage and how do you prevent it?

Data leakage is when information from the validation or test set influences training, producing artificially inflated metrics that do not hold in production. Common forms: fitting feature transformers (scalers, imputers) on the full dataset instead of on the training split only, using future information in time-series problems, label leakage from features that are computed after the outcome. Prevention: build the data pipeline so that test data is held out from feature engineering as well as model training, and audit features for any that would not be available at inference time.
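The transformer-fitting rule looks like this in practice. A minimal sketch with synthetic data and a hand-rolled standard scaler: the statistics come from the training split only and are then reused, unchanged, on the test split.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
X_train, X_test = X[:80], X[80:]

# Fit the scaler on the training split only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# Apply the same training statistics to both splits.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print("train mean after scaling:", X_train_scaled.mean(axis=0))
print("test mean after scaling: ", X_test_scaled.mean(axis=0))
```

The test split will not come out with exactly zero mean and unit variance, and that is expected: it is being measured with the training set's ruler, which is what happens at inference time.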

Classical ML algorithms interview questions (8 Q)

Tree-based models and linear models still dominate production tabular-ML in 2026. Drill these even if you plan to do deep-learning work later.

Q11. Walk me through linear regression and how it's trained.

Linear regression models a real-valued target as a weighted sum of features plus a bias: y = w·x + b. The loss is mean squared error. Training: solve in closed form via the normal equation (X^T X)^-1 X^T y for small datasets, or use gradient descent for larger ones where matrix inversion is too expensive. The model assumes linearity, independence of errors, homoscedasticity, and normality of residuals. Expect to be asked to recite these assumptions.
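A minimal sketch of the closed-form route, assuming a bias column appended to X and noiseless synthetic data so the recovered weights can be checked. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly; `np.linalg.lstsq` is a more robust alternative when X^T X is ill-conditioned.

```python
import numpy as np

def linear_regression_closed_form(X, y):
    # Append a column of ones so the bias is learned as the last coefficient.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (Xb^T Xb) theta = Xb^T y instead of computing the inverse.
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta[:-1], theta[-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0   # synthetic target, no noise
w, b = linear_regression_closed_form(X, y)
print("weights:", w, "bias:", b)
```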

Q12. Walk me through logistic regression and derive the gradient.

Logistic regression models the probability of a binary outcome as sigmoid(w·x + b). The loss is binary cross-entropy: -[y log(p) + (1-y) log(1-p)] where p is the sigmoid output. The gradient of the per-example loss with respect to the weights is (p - y) * x, a clean form that drops out of the sigmoid derivative algebra. This derivation is one of the most commonly asked ML fundamentals questions. Memorize it and be able to write it out under live observation.
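For reference, the derivation for a single example can be written out in four lines. The symbol z for the pre-sigmoid logit is introduced here for convenience; everything else follows the definitions above.

```latex
\begin{aligned}
z &= w \cdot x + b, \qquad p = \sigma(z), \qquad \sigma'(z) = p\,(1 - p) \\
L &= -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr] \\
\frac{\partial L}{\partial p} &= -\frac{y}{p} + \frac{1 - y}{1 - p} = \frac{p - y}{p\,(1 - p)} \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial p}\,\sigma'(z) = p - y,
\qquad
\frac{\partial L}{\partial w} = (p - y)\,x,
\qquad
\frac{\partial L}{\partial b} = p - y
\end{aligned}
```

The p(1 − p) factor from the sigmoid derivative cancels the denominator from the log terms, which is why the final form is so compact.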

Q13. How does a decision tree work?

A decision tree recursively splits the feature space on the feature and threshold that maximally reduce a loss (Gini impurity or entropy for classification, MSE for regression). Each leaf is a prediction (class label or mean). Pros: interpretable, handles non-linear interactions, no scaling needed. Cons: overfits easily, unstable to small data changes. In production you almost never use a single tree. You use an ensemble.
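The split search at a single node can be sketched as follows. This is a simplified version, assuming one numeric feature, Gini impurity, and candidate thresholds at midpoints between consecutive unique values; a full tree would repeat this over all features and recurse into each side.

```python
import numpy as np

def gini(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(x, y):
    """Return (threshold, weighted_gini) for the best split on one feature,
    or (None, parent_gini) if no split improves on the unsplit node."""
    best_t, best_score = None, gini(y)
    values = np.unique(x)  # sorted unique feature values
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2.0
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
print("threshold:", threshold, "weighted Gini:", impurity)
```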

Q14. What's the difference between bagging and boosting?

Bagging trains multiple models in parallel on bootstrapped samples and averages predictions; reduces variance. Random Forest is the canonical bagging method. Boosting trains models sequentially, each new model focusing on the errors of the previous; reduces bias. Gradient Boosting (and the production-grade variants like XGBoost, LightGBM, CatBoost) is the canonical boosting method. Boosting almost always wins on tabular data; bagging is more parallelizable and more forgiving of bad hyperparameters.

Q15. Explain gradient boosting at a high level.

Gradient boosting fits successive weak learners (typically shallow trees) to the residuals of the previous ensemble. At each step, compute the negative gradient of the loss with respect to the current prediction (the pseudo-residual, which equals the ordinary residual for squared error), train a new tree to predict it, and add that tree to the ensemble scaled by a small learning rate. This is functional gradient descent in the space of functions. Production variants (XGBoost, LightGBM) add second-order information, regularization, and tree-construction optimizations.
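A toy version of the loop fits in a few dozen lines. This sketch assumes squared-error loss, a single numeric feature, and depth-1 trees (stumps) as the weak learner; it illustrates the mechanics only and has none of the optimizations of the production libraries. All function names are made up for the example.

```python
import numpy as np

def fit_stump(x, target):
    """Depth-1 regression tree on one feature: pick the threshold with lowest squared error."""
    best = None
    values = np.unique(x)
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2.0
        left, right = target[x <= t], target[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=100, lr=0.1):
    f0 = float(y.mean())                      # initial constant prediction
    pred = np.full(y.shape, f0, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        # For L = 0.5 * (y - pred)^2 the negative gradient is the residual.
        negative_gradient = y - pred
        stump = fit_stump(x, negative_gradient)
        pred = pred + lr * predict_stump(stump, x)
        stumps.append(stump)
    return f0, lr, stumps

def predict(model, x):
    f0, lr, stumps = model
    return f0 + lr * sum(predict_stump(s, x) for s in stumps)

x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
model = gradient_boost(x, y)
baseline_mse = float(np.mean((y - y.mean()) ** 2))
final_mse = float(np.mean((y - predict(model, x)) ** 2))
print("baseline MSE:", baseline_mse, "boosted MSE:", final_mse)
```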

Q16. How does k-means clustering work?

K-means partitions n points into k clusters by alternating two steps until convergence: (1) assign each point to its nearest cluster centroid, (2) update each centroid to the mean of its assigned points. Initialization matters: k-means++ is the standard improvement over random initialization. K-means assumes spherical, equally-sized clusters; it fails on non-convex shapes (use DBSCAN) or imbalanced cluster sizes (use Gaussian Mixture Models with custom priors). The k must be chosen. Elbow plot or silhouette score are the standard heuristics.

Q17. What is an SVM and what does the kernel trick do?

A Support Vector Machine finds the maximum-margin hyperplane separating two classes. The "support vectors" are the points closest to the boundary. For non-linearly separable data, the kernel trick maps inputs into a higher-dimensional space where they become linearly separable, without explicitly computing the high-dimensional representation. Common kernels: linear, polynomial, RBF (Gaussian). SVMs were dominant in the 2000s; in 2026 they're a fallback method for small datasets, mostly displaced by gradient boosting for tabular and deep learning for unstructured.

Q18. When would you use a tree-based model vs a linear model?

Tree-based models for tabular data with non-linear interactions, mixed feature types, no need for feature scaling, and tolerance for non-smooth predictions. Linear models for interpretable models with monotonic relationships, when the data is high-dimensional and sparse (text TF-IDF), when you need fast inference, or when calibrated probability estimates matter. Production rule of thumb for 2026 tabular data: try gradient boosting first; switch to logistic regression only if interpretability or inference latency forces it.

Deep learning + transformer interview questions (8 Q)

Q19. Walk me through forward and backward propagation for a 2-layer neural network.

Forward: input x → hidden pre-activation z1 = W1·x + b1, h = activation(z1) → output pre-activation z2 = W2·h + b2, y = activation(z2) → loss L. Backward: compute dL/dy from the loss function (e.g., proportional to (y - target) for MSE), then chain-rule back through the activations and linear layers: dL/dz2 = dL/dy ⊙ activation'(z2), dL/dW2 = dL/dz2 · h^T, dL/dh = W2^T · dL/dz2, dL/dW1 = (dL/dh ⊙ activation'(z1)) · x^T. Be able to write this from scratch in numpy.

Q20. What activation function should you use and why?

ReLU is the default for hidden layers: cheap, helps with vanishing gradients, but suffers from the dying-ReLU problem. Leaky ReLU and ELU mitigate dying ReLU. Sigmoid and tanh are mostly avoided in hidden layers because of vanishing gradients on deep networks; sigmoid stays as the output activation for binary classification. GELU is standard inside transformer-based LLMs. Softmax is the multi-class output activation.

Q21. What's the difference between SGD, momentum, and Adam?

SGD updates parameters by the gradient times a learning rate. Momentum adds a velocity term that accumulates past gradients, smoothing the trajectory and helping escape shallow local minima. Adam combines momentum with per-parameter adaptive learning rates (RMSProp). AdamW decouples weight decay from the gradient update, which is standard for training transformer-based LLMs. Default for most production models in 2026: AdamW with a cosine learning rate schedule and a warmup phase.
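The three update rules can be written side by side. This is a sketch under common textbook conventions (momentum as an accumulated velocity, Adam with bias correction, AdamW with decay applied directly to the weights); the hyperparameter defaults and the toy quadratic objective are assumptions for illustration.

```python
import numpy as np

def sgd_step(w, grad, lr):
    return w - lr * grad

def momentum_step(w, grad, velocity, lr, beta=0.9):
    velocity = beta * velocity + grad          # running sum of past gradients
    return w - lr * velocity, velocity

def adamw_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)               # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to w directly, not folded into grad.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

def grad_fn(w):
    return w  # gradient of the toy objective 0.5 * ||w||^2

w0 = np.array([5.0, -3.0])
w_sgd = w0.copy()
w_mom, vel = w0.copy(), np.zeros_like(w0)
w_adam, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)
for t in range(1, 201):
    w_sgd = sgd_step(w_sgd, grad_fn(w_sgd), lr=0.1)
    w_mom, vel = momentum_step(w_mom, grad_fn(w_mom), vel, lr=0.1)
    w_adam, m, v = adamw_step(w_adam, grad_fn(w_adam), m, v, t, lr=0.1)

print("SGD:", w_sgd, "momentum:", w_mom, "AdamW:", w_adam)
```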

Q22. What is dropout and how does it regularize?

Dropout randomly zeroes a fraction of activations during training, forcing the network to not rely on any single neuron. At inference, all activations are kept. In the original formulation they are then scaled by the keep probability (1 minus the dropout rate); most frameworks instead use inverted dropout, which scales the surviving activations by 1/(1 - rate) during training so that inference needs no scaling. The effect is similar to training an ensemble of subnetworks. Standard rate: 0.1 to 0.5. Dropout is less commonly used inside transformer blocks at the scale of frontier reasoning models (replaced by other regularization plus large-batch training), but still standard in smaller deep nets and in the embedding layer.
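A minimal sketch of the inverted-dropout variant that most frameworks implement: surviving activations are rescaled by 1/(1 - rate) during training, so the expected activation is unchanged and inference is a pass-through. The function signature is an assumption of this example.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: rescale at training time so inference is a no-op."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob   # True where the activation survives
    return x * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, rate=0.3, training=True, rng=rng)
eval_out = dropout(x, rate=0.3, training=False, rng=rng)
print("train mean:", train_out.mean(), "eval mean:", eval_out.mean())
```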

Q23. What is batch normalization vs layer normalization?

Batch norm normalizes each feature across the batch dimension; layer norm normalizes each example across the feature dimension. Batch norm is standard in CNNs for vision; layer norm is standard in transformers and any sequence model where batch statistics are unreliable. Layer norm is also more stable on small batches and at inference time (no running statistics needed). The 2026 transformer architecture defaults to RMSNorm in many modern variants, which is a simpler version of layer norm without the centering step.
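The difference between the three normalizations is mostly a question of which axis the statistics are computed over. The sketch below assumes a 2-D input of shape (batch, features) and leaves out the learnable scale and shift parameters and batch norm's running statistics, so it shows only the normalization step itself.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Per-feature statistics, computed across the batch (axis 0).
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

def layer_norm(x, eps=1e-5):
    # Per-example statistics, computed across the features (last axis).
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # Like layer norm but without subtracting the mean.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(32, 16))   # (batch, features)
bn, ln, rn = batch_norm(x), layer_norm(x), rms_norm(x)
print("batch norm, per-feature mean:", bn.mean(axis=0)[:3])
print("layer norm, per-example mean:", ln.mean(axis=-1)[:3])
print("RMSNorm, per-example mean square:", np.mean(rn ** 2, axis=-1)[:3])
```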

Q24. Write the scaled dot-product attention formula and explain the sqrt(d_k) divisor.

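Attention(Q, K, V) = softmax((Q · K^T) / sqrt(d_k)) · V. The division by sqrt(d_k) (where d_k is the key dimension) prevents the dot products from growing large in magnitude as d_k increases, which would push softmax into regions with vanishingly small gradients. The reasoning: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by sqrt(d_k) brings the variance back to about 1. Without scaling, the softmax saturates and training becomes less stable as d_k grows (the original transformer used d_k = 64). This is one of the most-asked transformer interview questions in 2026.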
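The growth of the dot products can be checked numerically. This sketch assumes query and key components drawn independently from a standard normal distribution, which is an idealization of what a freshly initialized model produces.

```python
import numpy as np

rng = np.random.default_rng(0)
variances = {}
for d_k in (16, 64, 256):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    raw = np.sum(q * k, axis=1)          # unscaled dot products
    scaled = raw / np.sqrt(d_k)          # scaled as in the attention formula
    variances[d_k] = (float(raw.var()), float(scaled.var()))
    print(f"d_k={d_k}: raw variance {raw.var():.1f}, scaled variance {scaled.var():.2f}")
```

The raw variance tracks d_k while the scaled variance stays near 1, so the softmax sees inputs of similar magnitude regardless of head size.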

Q25. What is multi-head attention and why do we use it?

Multi-head attention runs h parallel attention operations on different linear projections of Q, K, V, then concatenates the outputs and projects back. Each head can learn to attend to different relationships (syntax in one head, semantics in another, long-range in a third). Using multiple smaller heads instead of one large head improves the model's ability to capture diverse attention patterns at the same parameter count. Standard: 8 to 32 heads in transformer-based LLMs.
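The split, attend, concatenate, project sequence can be sketched in NumPy. This is a minimal self-attention version for a single sequence, assuming square projection matrices with no biases and no masking; the weight names Wq, Wk, Wv, Wo are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention over x of shape (seq_len, d_model); weights are (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                      # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape, weights.shape)
```

Each head works in a d_model / n_heads subspace, which is why the parameter count matches a single full-width head.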

Q26. What's the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures?

Encoder-only (the original BERT architecture) processes the full input bidirectionally and is used for classification, embedding, and retrieval. Decoder-only (the architecture used in production by frontier reasoning models) generates tokens left-to-right with causal masking. Encoder-decoder (the original transformer, T5-style) reads an input with the encoder and generates an output with the decoder, used for translation and summarization. Decoder-only has become the dominant LLM architecture in 2026 because it generalizes well to both classification and generation.

LLM interview questions (5 Q)

Q27. How does retrieval-augmented generation (RAG) work and when do you choose it over fine-tuning?

RAG retrieves relevant documents from an external knowledge base at inference time and includes them in the prompt to the LLM. The model generates the answer conditioned on the retrieved context. Choose RAG when the knowledge base changes often, when you need to cite sources, or when the corpus is too large to fit in context. Choose fine-tuning when you need to change model behavior (tone, format, task-specific reasoning) or when retrieval latency is unacceptable. Hybrid is common: fine-tune for behavior, RAG for facts.

Q28. What is prompt engineering and what are its limits?

Prompt engineering is the practice of structuring the input to a transformer-based LLM to get better outputs without changing the model weights. Techniques: few-shot examples, chain-of-thought prompts ("let's think step by step"), structured output formats (JSON schemas, XML tags), role specification ("you are a careful editor"). Limits: prompts hit a ceiling on tasks the base model doesn't know; brittleness across model versions; cost from long context. For high-volume production tasks, fine-tuning a smaller model often beats prompting a large one on cost-per-quality.

Q29. What are the common failure modes of LLMs and how do you mitigate them?

Five common failures: hallucination (confident wrong facts), prompt injection (user input rewrites system instructions), context-length overflow, inconsistent output formats, and silent failure under adversarial inputs. Mitigations: RAG with explicit citation, structured output (JSON schema enforcement), guardrail models that screen prompts and outputs, evaluation harnesses that catch regression at deploy time, and a fallback path when the LLM call fails or returns malformed output. Always assume the model will fail at the worst possible time and design the surrounding system accordingly.

Q30. How would you evaluate an LLM-powered feature?

Two layers. Offline: a benchmark dataset with reference outputs, scored by automated metrics (BLEU, ROUGE for generation; exact match for classification) plus model-graded evaluation (a stronger LLM grades the candidate outputs) plus human review on a sample. Online: user-facing metrics like clickthrough, conversion, time-to-complete, and explicit feedback. The two layers catch different failures. Offline catches regressions; online catches mismatches between automated metrics and actual user value. Most production LLM features in 2026 maintain a small held-out human-graded eval set that runs on every model update.

Q31. How would you reduce the inference cost of a production LLM endpoint?

Five levers, in order of typical impact: distill to a smaller model fine-tuned on the production task (10-100x cheaper if quality holds), quantize the model (4-bit or 8-bit, 2-4x speedup with small quality loss), use speculative decoding (a small draft model proposes tokens, large model verifies, 2-3x speedup on most tasks), batch requests at the serving layer (3-5x throughput gain at constant latency budget), cache common prompts and responses (eliminates duplicate work entirely). Most production deployments stack at least three of these.

ML system design interview questions (5 Q)

These are the canonical prompts you'll see at the new-grad bar. Each one carries a sample structure that gets you to a passing answer in 45 minutes.

Q32. Design a recommendation system for a video app.

Clarify: what's the business metric (CTR, watch time, day-1 retention)? What's the candidate set size and the user volume? Sketch: offline pipeline ingests user-video interactions, builds user and video embeddings (two-tower architecture is the standard starting point), trains a ranking model on top. Online pipeline: at request time, retrieve top-K candidate videos via approximate nearest neighbor search on user embedding, rerank with the ranking model, return top-N. Monitor: model freshness (retrain cadence), embedding drift, online metrics matching offline metrics. State one tradeoff: two-tower architecture trades recall quality for serving speed; reranker recovers some quality.

Q33. Design a fraud detection system for transactions.

Clarify: real-time or batch (real-time is harder, sub-100ms latency)? What's the cost of a false positive vs false negative? Sketch: offline pipeline ingests historical transactions, engineers features (recency, frequency, amount aggregates over rolling windows, graph features over the merchant-card network), trains a gradient boosting model. Online: at request time, look up features from a feature store, run inference, return a fraud score. Threshold tuning is owned by a downstream team based on cost-per-error. Monitoring: precision and recall at the operating threshold, drift in feature distributions, false-positive complaints from the customer-service team.

Q34. Design a content moderation classifier for user-uploaded text.

Clarify: what categories (hate speech, spam, threats, harassment)? Multi-label or multi-class? Latency budget (synchronous post vs background scan)? Sketch: offline pipeline trains a transformer-based encoder on labeled content, with class-imbalanced loss handling (focal loss or class weights). Online: inference at post time, with a confidence threshold above which content is auto-removed and below which it goes to a human review queue. The escalation queue is the load-bearing part; pure auto-classification fails the long tail. State one tradeoff: precision-recall tradeoff is owned by the policy team, not the ML team. Show the metric, let the business set the threshold.

Q35. Design a search ranking system.

Clarify: query volume, latency budget, ranking signals available, business metric (CTR, conversion, dwell time). Sketch: candidate retrieval first (BM25 or dense retrieval, 1000-10000 candidates), then a learning-to-rank model reranks the top candidates (LambdaMART for tabular features or a neural ranker for query-document interactions). Features: query-document text match, click logs, freshness, personalization signals. Two key principles: train the ranker on click data with appropriate position bias correction, evaluate offline with NDCG before shipping online to an A/B test.
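The offline NDCG check mentioned above can be computed in a few lines. This sketch uses the exponential-gain form, (2^rel − 1) / log2(rank + 1), which is one common convention; a linear-gain variant also exists, so confirm which one the team uses. Relevance grades are listed in the order the ranker returned the documents.

```python
import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(rank + 1) for rank = 1..k
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect_order = [3, 2, 1, 0]    # most relevant document ranked first
reversed_order = [0, 1, 2, 3]   # most relevant document ranked last
print("perfect:", ndcg_at_k(perfect_order, k=4))
print("reversed:", ndcg_at_k(reversed_order, k=4))
```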

Q36. Design an ML-powered ad bidding system.

Clarify: what's the auction format (first-price, second-price, dynamic)? What's the optimization objective (CTR maximizing, CPA target, ROI)? Sketch: offline pipeline trains a CTR prediction model on impression-click data and a conversion prediction model on click-conversion data. Online: at auction time, predict CTR and CVR for the user-ad pair, compute expected value, compare against advertiser bid floor, submit final bid. Cold-start problem (new ads, new users) is solved with exploration policies like Thompson sampling or epsilon-greedy. State one tradeoff: exploration vs exploitation directly trades short-term revenue for long-term model accuracy.

ML coding interview questions (4 Q)

The coding round is shorter than a SWE coding round but expects ML-specific patterns implemented from scratch in numpy.

Q37. Implement linear regression with gradient descent in numpy.

import numpy as np

def linear_regression_gd(X, y, lr=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        y_pred = X @ w + b
        error = y_pred - y
        dw = (2 / n) * X.T @ error
        db = (2 / n) * error.sum()
        w -= lr * dw
        b -= lr * db
    return w, b

The grading is on the gradient derivation, the broadcasting in numpy, and whether you set a sane initialization. Bonus: add a regularization term, vectorize correctly, or note the closed-form solution exists.

Q38. Implement k-means clustering from scratch.

def kmeans(X, k, max_iter=100):
    n, d = X.shape
    # initialize centroids from k distinct random data points
    idx = np.random.choice(n, k, replace=False)
    centroids = X[idx].astype(float)
    for _ in range(max_iter):
        # assign each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids, axis=2)
        labels = distances.argmin(axis=1)
        # update centroids as cluster means; an empty cluster keeps its
        # previous centroid instead of producing a NaN mean
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

The interviewer will probe: how do you handle empty clusters? What's the time complexity? How would you initialize with k-means++? Be ready for the follow-up.
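For the k-means++ follow-up, a minimal sketch of the seeding step: the first centroid is a uniformly random point, and each later centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. It assumes the data contains at least two distinct points (otherwise the probabilities are undefined); the function name and the toy blobs are illustrative.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, rng):
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]              # first centroid: uniform at random
    for _ in range(1, k):
        chosen = np.array(centroids)
        # Squared distance from every point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                     # far-away points are more likely
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
X = np.vstack([c + rng.normal(scale=0.1, size=(50, 2)) for c in centers])
init = kmeans_plus_plus_init(X, k=3, rng=rng)
print(init)
```

On well-separated data like this, the seeds almost always land in different clusters, which is the failure that plain random initialization is prone to.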

Q39. Implement scaled dot-product attention from scratch.

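def softmax(x, axis=-1):
    # subtract the max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    # K.transpose(-2, -1) is the PyTorch idiom; in NumPy use swapaxes
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V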

The probe: write softmax stably (subtract max before exponential), explain the sqrt(d_k) divisor, extend to multi-head. This is a common second-round-deep-dive question at ML-platform-heavy companies.

Q40. Implement a 2-layer neural network with backprop from scratch.

The full implementation is too long for this guide, but the structure: forward pass computes z1 = W1 @ x + b1, hidden = ReLU(z1), z2 = W2 @ hidden + b2, output = sigmoid(z2), loss = binary_cross_entropy(output, y). Backward pass: dL/dz2 = output - y (the sigmoid and cross-entropy derivatives cancel, so this is the gradient at the pre-sigmoid logit, not at the output), dL/dW2 = dL/dz2 @ hidden.T, dL/dh = W2.T @ dL/dz2, dL/dW1 = (dL/dh * relu_derivative(z1)) @ x.T. Update with W -= lr * dW. The grading is on the chain rule, not the numpy vectorization. Get the math right; clean the vectorization second.
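As a starting point for rehearsal, here is a compact sketch of that structure rather than a full implementation. It assumes examples are stored as columns (X has shape (d_in, n)), averages the loss over the batch, and uses plain full-batch gradient descent on synthetic two-blob data; layer sizes, learning rate, and initialization scale are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(params, X, y):
    """X: (d_in, n) with examples as columns; y: (1, n) binary labels."""
    W1, b1, W2, b2 = params
    n = X.shape[1]
    # Forward pass.
    z1 = W1 @ X + b1
    h = np.maximum(z1, 0.0)                       # ReLU
    z2 = W2 @ h + b2
    out = sigmoid(z2)
    eps = 1e-12                                   # guards against log(0)
    loss = float(-np.mean(y * np.log(out + eps) + (1 - y) * np.log(1 - out + eps)))
    # Backward pass (chain rule, averaged over the batch).
    dz2 = (out - y) / n                           # gradient at the pre-sigmoid logit
    dW2 = dz2 @ h.T
    db2 = dz2.sum(axis=1, keepdims=True)
    dh = W2.T @ dz2
    dz1 = dh * (z1 > 0)                           # ReLU derivative
    dW1 = dz1 @ X.T
    db1 = dz1.sum(axis=1, keepdims=True)
    return loss, (dW1, db1, dW2, db2)

def train(X, y, hidden=8, lr=0.1, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    d_in = X.shape[0]
    params = [rng.normal(scale=0.5, size=(hidden, d_in)), np.zeros((hidden, 1)),
              rng.normal(scale=0.5, size=(1, hidden)), np.zeros((1, 1))]
    losses = []
    for _ in range(epochs):
        loss, grads = forward_backward(params, X, y)
        params = [p - lr * g for p, g in zip(params, grads)]
        losses.append(loss)
    return params, losses

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-2, 1, size=(2, 100)), rng.normal(2, 1, size=(2, 100))])
y = np.hstack([np.zeros((1, 100)), np.ones((1, 100))])
params, losses = train(X, y)
print("first loss:", losses[0], "last loss:", losses[-1])
```

A finite-difference check on one weight is a quick way to confirm the backward pass before trusting the training curve.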

ML case-study interview questions (3 Q)

These are not algorithm questions. They're "tell me about a real ML decision" questions, and they distinguish candidates who have built models from candidates who have only studied them.

Q41. Walk me through an ML project you've worked on.

This is the most-asked behavioral-ML question. Expected structure: business problem, dataset, model choice and one tradeoff, evaluation, what you'd do differently. A passing answer at the new-grad bar covers a Kaggle competition or a deployed side project in 5-7 minutes, with one specific number cited (model score, dataset size, latency, error rate) and one honest reflection (what didn't work, what you'd change). The candidates who freeze on this question almost always don't have a finished project; they have half-finished projects they cannot fully describe.

Q42. You shipped a model and your business metric dropped. Walk me through your debug process.

Five steps: confirm the metric drop is statistically real (not a single-day spike), check whether feature distributions have shifted (data drift), check whether the label distribution has shifted (concept drift), audit the model serving path for bugs (wrong feature lookup, wrong feature scaling at inference vs training), and check whether downstream systems consuming the model output have changed (different threshold, different business logic). State the diagnosis tree out loud; the interviewer is grading your structure, not whether you guess the right answer.

Q43. You have 10x the training budget. What would you do first?

The answer that distinguishes new-grad candidates: more data, not a bigger model. The 2026 default for most production ML is that data-quality investments return more than parameter-count increases, especially below the frontier-model scale. State the principle, then describe one specific experiment: 10x the labeled training set, train the same architecture, measure the validation lift. If that lift is small, then experiment with model capacity. Going straight to "I'd train a bigger model" without that diagnostic is a common new-grad red flag.

How to prepare for a machine learning interview (5 steps)

A 4-week prep plan for a CS new grad whose ML coursework is conversational but who hasn't drilled the interview-specific patterns. Adjust if your starting point differs.

  1. Week 1: ML fundamentals deep dive. 3-4 hours per day on bias-variance, overfitting, regularization, evaluation metrics, cross-validation, and the canonical classical algorithms (linear and logistic regression, trees, gradient boosting, k-means, SVM). End the week able to write the gradient of logistic regression loss from memory.

  2. Week 2: deep learning + transformers + LLMs. Backprop math, activation functions, optimizers, normalization. Then transformer attention math, multi-head, positional encodings, encoder-only vs decoder-only. Finish with the LLM stack: RAG architecture, fine-tuning vs prompting, common failure modes, basic serving concepts. Build one toy project this week (a small classifier trained on a real dataset, or a small open-weight model fine-tuned on a domain-specific task).

  3. Week 3: ML system design practice. 5 canonical prompts at 45 minutes each, narrated out loud: recommendation, fraud, content moderation, search ranking, ad bidding. Box-and-arrow diagrams, vocabulary correct, one tradeoff per major choice. Stop at the new-grad bar; do not derive loss functions or push into senior-level capacity math.

  4. Week 4: ML coding + behavioral prep + 3 mocks. Implement gradient descent, k-means, attention, and a 2-layer net from scratch in numpy. Write STAR stories around your Kaggle competition and your deployed side project. Run 3 mock interviews: one breadth, one system design, one coding.

  5. Run timed mocks for the rounds that scare you. If ML system design feels weakest, do 3 system-design mocks instead of one. If breadth questions tank you under pressure, do 3 rapid-fire mocks. Mock discipline is what closes the gap between knowing-the-material and saying-it-on-camera under live observation.
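As a reference point for the Week 4 from-scratch drills, here is a minimal k-means sketch. It assumes points are tuples of floats and uses plain random initialization rather than k-means++. It is written in standard-library Python so it runs anywhere; in an interview you would likely write the same two steps with NumPy arrays. The names `sq_dist` and `kmeans` are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # plain random init, not k-means++
    assignments = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are final
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Two well-separated toy clusters (made-up data).
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(pts, k=2)
print(centroids, assignments)
```

Be ready to say out loud what the sketch leaves out: k-means++ initialization, multiple restarts, and a tolerance-based stopping rule.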

Machine learning interview format by role type

The same ML knowledge gets tested differently across role types. The breakdown for the five most common ML-adjacent roles in 2026:

| Role | ML breadth depth | ML coding depth | System design depth | LLM focus | Production-ML focus |
|---|---|---|---|---|---|
| Machine Learning Engineer (MLE) | High | Medium (numpy from scratch) | High (training + serving pipeline) | Medium | High |
| Applied Scientist | Very High | High (numpy + research code) | Medium | Medium-High | Medium |
| Data Scientist | Medium | Low (pandas mostly) | Low (analysis pipeline only) | Low | Low |
| ML Research Scientist | Highest (PhD-level) | High (research code) | Low | Highest | Low |
| Software Engineer (ML-adjacent) | Medium | Low (just enough ML to support the platform) | Medium | Low | High (platform + serving) |

Two patterns to notice. First, the MLE pipeline is the most balanced load: system design, coding, breadth, LLM all at moderate-to-high depth. This is the pipeline a CS new grad without papers should target. Second, ML Research Scientist is the path that filters most heavily on research signal, and the candidates who land it at the new-grad level either have papers or have built a competitive research portfolio outside of school. Most CS new grads do not fit this profile and shouldn't optimize for it at the entry level.

ML interview cheat sheet for the morning warmup

A one-page reference of the top 20 patterns, organized for the morning-of warmup. The act of writing this from memory is the prep; carrying it in is the safety net.

| # | Concept | One-line summary | Asked in |
|---|---|---|---|
| 1 | Bias-variance | High bias = underfit, high variance = overfit | Breadth |
| 2 | L1 vs L2 | L1 produces sparsity, L2 shrinks toward zero | Breadth |
| 3 | Cross-validation | k-fold partitioning to reduce evaluation variance | Breadth |
| 4 | Logistic regression gradient | (sigmoid(w·x) - y) * x | Coding, Breadth |
| 5 | Imbalanced classes | PR-AUC and F1, not accuracy | Breadth |
| 6 | Bagging vs boosting | Parallel ensemble vs sequential ensemble | Breadth |
| 7 | Gradient boosting | Fit residuals with weak learners; XGBoost is canonical | Breadth |
| 8 | Decision tree splits | Minimize Gini/entropy/MSE at each split | Breadth |
| 9 | k-means | Alternate assign-and-update; k-means++ init | Coding, Breadth |
| 10 | Backprop math | Chain rule through layers; memorize 2-layer derivation | Coding |
| 11 | ReLU vs GELU | ReLU for hidden layers, GELU inside transformers | Breadth |
| 12 | Adam vs SGD | Adam = momentum + adaptive LR per param | Breadth |
| 13 | Dropout | Zero activations during training; rescale at inference | Breadth |
| 14 | Layer norm | Per-example normalization (transformer default) | Breadth |
| 15 | Attention math | softmax(QK^T / sqrt(d_k)) V | Coding, Breadth |
| 16 | Multi-head | Multiple parallel attentions; concatenate, project | Breadth |
| 17 | Encoder vs decoder | Decoder-only is the 2026 LLM default | Breadth, LLM |
| 18 | RAG vs fine-tuning | RAG for changing knowledge, fine-tune for behavior | LLM |
| 19 | Inference cost levers | Distill > quantize > speculative > batch > cache | LLM, System design |
| 20 | Two-tower retrieval | Standard recsys architecture; ANN at serving | System design |

Memorize the top half. The bottom half is the polish.
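The attention formula in row 15 is short enough to implement from memory. The following is a minimal single-head sketch with no masking and no batching. It assumes Q, K, and V are lists of row vectors, and it is written in plain Python for portability rather than in NumPy. The function names are illustrative.

```python
import math

def softmax(row):
    """Softmax over one list of scores."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no masking, no batching."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Scaled dot product of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output column at a time.
        output.append([
            sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))
        ])
    return output

# A zero query scores every key equally, so the output is the mean of the values.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # [[2.0, 3.0]]
```

The usual follow-up is why the scores are divided by sqrt(d_k): without the scaling, dot products grow with the key dimension and push the softmax toward a near one-hot distribution with very small gradients.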

Common machine learning interview mistakes for CS new grads

The most-reported mistakes from new-grad ML interviews in the 2025-2026 hiring cycle, in roughly the order of frequency:

Bluffing on math you don't remember. Asked to derive the gradient of logistic regression, the candidate handwaves through chain rule, gets stuck, panics, gives a wrong final answer. The recovery posture beats the bluff. Say "I remember this involves chain rule and the sigmoid derivative; let me work through it out loud" and you get partial credit. Saying "the answer is X" confidently and being wrong gets zero.
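For reference, here is the derivation that question is asking for, written for a single example. It assumes the binary cross-entropy loss, a label y in {0, 1}, and the notation z = w^T x with sigma as the sigmoid; the symbols are chosen for this sketch.

```latex
\begin{aligned}
L(w) &= -\bigl[\, y \log \sigma(z) + (1 - y)\log\bigl(1 - \sigma(z)\bigr) \bigr],
  \qquad z = w^{\top} x \\
\sigma'(z) &= \sigma(z)\bigl(1 - \sigma(z)\bigr) \\
\frac{\partial L}{\partial z}
  &= -\bigl[\, y\bigl(1 - \sigma(z)\bigr) - (1 - y)\,\sigma(z) \bigr]
   = \sigma(z) - y \\
\nabla_w L &= \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial w}
   = \bigl(\sigma(w^{\top} x) - y\bigr)\, x
\end{aligned}
```

The sigmoid derivative appears in the middle of the chain rule and then cancels against the denominators from the log terms, which is why the final gradient is just the prediction error times the input.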

Going deep on LLM mechanics while weak on fundamentals. A candidate who can explain mixture-of-experts but freezes on "explain cross-validation" loses the round. The breadth round grades fundamentals first. Get those airtight before reading the latest transformer-architecture paper.

Treating an ML system design round like a SWE system design round. SWE system design designs a service; ML system design designs a pipeline plus a service. New-grad candidates often skip the training pipeline entirely and jump straight to "I'd put a load balancer in front of the model." The training pipeline (data ingestion, feature engineering, model training, evaluation, retraining cadence) is half the answer.

Naming a model architecture without saying why. "I'd use a transformer-based encoder" without stating the access pattern, the data type, and the inference latency budget is the canonical red flag. Name the constraint first, the architecture second.

Not having a Kaggle competition or side project to talk about. The behavioral-ML round always asks "walk me through an ML project you've worked on." Candidates without a finished project freeze. The recovery is "I completed one Kaggle competition where I tried X; here's what I learned." That answer beats "I haven't shipped a model yet, but I've taken three ML courses."

Confusing precision and recall. Asked the difference, the candidate flips the definitions. Precision = of the items the model labeled positive, what fraction were truly positive. Recall = of the items truly positive, what fraction did the model label positive. Write both definitions on the back of an index card and review them the morning of the interview. This is a binary check. Get it right or fail the breadth round.
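The two definitions are easy to check against a small made-up example. The sketch below counts true positives, false positives, and false negatives for binary labels. The function name and the toy labels are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of everything labeled positive, the fraction that is truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything truly positive, the fraction the model labeled positive.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # 2 of 3 predicted positives are right; 2 of 4 actual positives are found
```

Here the model flags three items and two are correct (precision 2/3), while it finds two of the four real positives (recall 1/2).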

Skipping the production-ML lifecycle in MLE interviews. MLE roles grade monitoring, drift detection, rollback, A/B testing, and feature-store discipline. Candidates who study only model architecture and skip the lifecycle questions get caught at the system-design round. Spend at least one day on the production lifecycle, especially monitoring patterns and drift diagnostics.

One thing I'd add from watching CS new grads do this: the candidate who shows up with one finished side project and a clean STAR story about it almost always outperforms the candidate with three half-finished projects and a longer resume. The interviewers are not counting projects; they're grading the depth of conversation about one. Finish one. Talk about that one. Done.

How to compete for ML interviews without research papers

This is the question most CS new grads agonize about silently. The honest read for 2026:

Research papers are required for frontier-model lab entry, not for MLE entry. Mid-market tech companies and ML-heavy startups hire CS new grads on production-ML competence. A finished Kaggle competition, a deployed side project, and a strong showing on an ML system design round together make a competitive resume for those companies.

The Kaggle pivot is the single biggest-return 30-day move. If your resume has no ML projects, one finished Kaggle competition closes the gap. Pick a tabular classification competition (shorter learning curve than vision or NLP at the entry level). Spend 15-20 hours building a baseline plus one round of feature engineering. Submit. Write a one-page kernel explaining what you tried and why. The score does not need to medal. The signal is "this person built and submitted an ML model end-to-end."

The deployed-endpoint signal beats the half-finished-research signal. Recruiters do not read research-project descriptions in resume bullets. They look at GitHub READMEs and deployed-endpoint URLs. A simple binary classifier deployed behind an API endpoint with a one-page README explaining the architecture and one tradeoff is graded higher than a half-finished research project that nobody can read.

Side projects should be small, finished, and explainable. Three patterns that hit the MLE-resume bar in 2026:

  1. A binary classifier on a public dataset with a deployed REST endpoint and a README explaining one model decision.
  2. A retrieval-augmented question-answering bot over a public document corpus, with a working frontend.
  3. A small open-weight model fine-tuned on a domain-specific task, with a writeup of the fine-tuning approach and one before-after metric.

Each takes 15-30 hours for a CS new grad and produces a portfolio line that holds up in interview discussion.

The behavioral story is the glue layer. The recruiter and hiring manager don't read your code; they read your project's one-page README and ask you to walk through it for 5-7 minutes. Practice that walkthrough out loud until it lands cleanly. Specific dataset, specific model choice, one tradeoff stated honestly, one number cited, one honest reflection on what didn't work. That story is more valuable than the technical depth of the project.

The CS new grads who landed MLE offers in the 2025-2026 cycle without papers all did some version of this: one finished Kaggle competition, one deployed side project, 30 days of focused prep on the seven question categories above, three mock interviews in the last week. The pattern is repeatable. Run it.

Key terms

Bias-variance tradeoff
The tension between underfitting (high bias, model too simple) and overfitting (high variance, model too sensitive to training data). Diagnosed by comparing training and validation loss. The single most-asked ML fundamentals concept.
Regularization
Techniques that constrain model capacity to prevent overfitting. L1 (lasso) produces sparse weights; L2 (ridge) shrinks weights toward zero; dropout zeroes random activations during training; early stopping halts training when validation loss climbs.
Cross-validation
k-fold partitioning of training data to estimate generalization more reliably than a single train-validation split. Standard k = 5 or 10.
Gradient boosting
An ensemble method that fits successive weak learners (typically shallow trees) to the residuals of the previous ensemble. Production-grade variants (XGBoost, LightGBM, CatBoost) dominate tabular-ML benchmarks in 2026.
Attention + self-attention
The core operation inside transformer-based LLMs. Computes a weighted sum of value vectors where the weights are softmax(QK^T / sqrt(d_k)). Self-attention is when Q, K, V come from the same input; cross-attention is when they come from different sources.
Encoder-only vs decoder-only vs encoder-decoder
Three transformer architectures. Encoder-only (BERT-style) for classification and embedding. Decoder-only (the 2026 default for transformer-based LLMs) for generation. Encoder-decoder (T5-style) for translation and summarization.
RAG (Retrieval-Augmented Generation)
Retrieving relevant documents at inference time and including them in the prompt to the LLM. Used when the knowledge base changes often or when source citation is required.
Fine-tuning vs prompting
Two ways to adapt an LLM to a task. Prompting modifies the input; fine-tuning modifies the weights. Prompt for fast iteration; fine-tune when behavior change is needed at scale and inference cost matters.
ML system design
The architecture discipline of designing end-to-end ML pipelines: data ingestion, feature engineering, model training, evaluation, serving, and monitoring. Distinct from SWE system design in that it includes a training pipeline as well as a serving pipeline.
Drift (data drift, concept drift)
The phenomenon of a deployed model's performance degrading because either the input distribution (data drift) or the input-output relationship (concept drift) has shifted from the training distribution. Monitoring for drift is a production-ML lifecycle requirement.
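To make the cross-validation entry concrete, here is a minimal sketch of k-fold index partitioning. It assumes the data has already been shuffled and does no stratification. The generator name `k_fold_indices` is illustrative, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, 3):
    print(len(train_idx), val_idx)
```

You would train one model per fold on `train_idx`, score it on `val_idx`, and report the mean and spread of the k scores.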


About the author: Alex Chen is the founder of InterviewChamp.AI, building AI interview prep for the new-grad CS market and writing about the modern interview gauntlet from the inside.

Frequently asked questions

What machine learning interview questions should I prepare for in 2026?
Prepare across seven categories: ML fundamentals (bias-variance, overfitting, regularization, evaluation metrics), classical algorithms (linear and logistic regression, trees, gradient boosting, k-means, SVM), deep learning and transformers (backprop, attention, the difference between encoder-only and decoder-only architectures), LLMs (tokenization, RAG, fine-tuning vs prompting, hallucination control), ML system design (feature stores, training pipelines, online vs offline serving, monitoring), coding implementation (numpy gradient descent, k-means from scratch, attention from scratch at the senior bar), and one or two case studies on a real model choice. Entry-level loops weight fundamentals + classical algorithms heavily (~50%), deep learning and LLMs ~25%, ML system design ~15%, coding ~5%, and case studies ~5%. Research-paper depth is rarely tested at the new-grad bar.
Can a CS new grad without research papers get an ML engineer interview in 2026?
Yes, but the path is narrower and the prep has to be sharper. The honest read for 2026: pure ML Research Scientist roles at frontier labs do filter on papers (or top-3 conference internships), but ML Engineer (MLE), Applied Scientist, and Production ML roles at mid-market tech companies and ML-heavy startups hire CS new grads who can ship. The interview pivot: show ML literacy through a Kaggle competition (even one you didn't medal in), a deployed side project, or a strong showing on an ML system design round. Two finished side projects on a resume beat one half-finished research project nobody can read.
What's the difference between an ML engineer interview and a data scientist interview?
ML Engineer interviews test production-ML competence: can you train a model, package it, serve it behind an API at sub-100ms latency, monitor it for drift, and roll it back when it breaks. Heavy on system design, devops adjacency, and software engineering fundamentals. Data Scientist interviews test statistics, experimentation, and business reasoning: A/B test design, regression, feature engineering, hypothesis testing. SQL appears in both. Only MLE interviews go deep on Spark, Kubernetes, model serving infrastructure, and the production lifecycle. A CS new grad with strong software fundamentals usually fits the MLE pipeline better than the DS pipeline.
How many rounds is a machine learning engineer interview loop?
Entry-level MLE loops at large employers run 5-6 rounds: recruiter screen, technical phone screen (ML fundamentals + light coding), an ML coding round (implement gradient descent or k-means from scratch in numpy), an ML breadth round (10-15 rapid-fire fundamentals questions), an ML system design round (design a recommendation pipeline or a feature store), and a behavioral round. Some companies add a domain-specific deep dive: NLP, computer vision, recsys, or LLM serving depending on the team. Senior loops add one more depth round. Total onsite typically 4-5 hours.
What ML fundamentals questions get asked in every 2026 interview?
Six show up nearly every loop: explain the bias-variance tradeoff, what is overfitting and how do you detect it, what regularization techniques have you used and when, how do you evaluate a classification model with imbalanced classes, what's the difference between L1 and L2 regularization, and walk me through cross-validation. These are the floor. Miss any of them and you fail the breadth round regardless of how deep your transformer knowledge goes. Drill these first, drill them cold.
What deep learning interview questions should I expect?
Five categories: forward and backward propagation (write the math for a 2-layer network), activation functions and when to use them (ReLU vs sigmoid vs tanh vs GELU), optimizers (SGD vs Adam vs AdamW, momentum, learning rate schedules), regularization specific to deep nets (dropout, batch norm, layer norm, weight decay), and architectural fundamentals (CNNs for vision, RNNs vs transformers for sequence). At ML-engineer-heavy companies expect at least one practical question: 'your validation loss is increasing while training loss decreases, what do you check first?'
What transformer interview questions should I prepare for?
Attention math is the entry ticket. Be able to write the scaled dot-product attention formula, explain why we scale by sqrt(d_k), and describe the difference between self-attention and cross-attention. Beyond that: multi-head attention (why multiple heads instead of one big one), positional encodings (absolute vs RoPE), the encoder-only vs decoder-only vs encoder-decoder distinction, KV-cache and why it matters at inference, and tokenization (BPE, SentencePiece, where they differ). At ML-platform-heavy companies, also know about FlashAttention and why memory I/O matters more than FLOPs for transformer inference.
What LLM interview questions are companies asking in 2026?
Six topics dominate: how does RAG work and when do you choose it over fine-tuning, what's the difference between prompting and fine-tuning, what are the common failure modes of LLMs (hallucination, prompt injection, context-length issues), how do you evaluate an LLM-powered feature (offline benchmarks vs online metrics), what's the difference between dense and sparse retrieval, and how would you reduce inference cost on a production LLM endpoint. At ML-platform companies expect deeper questions on quantization, distillation, speculative decoding, and serving infrastructure for transformer-based LLMs.
What does an ML system design interview look like for a new grad?
A 45-60 minute conversation where the interviewer asks you to design an end-to-end ML pipeline. Common prompts: design a recommendation system for a video app, design a fraud detector for transactions, design a content moderation classifier, design a job-matching system. Expected structure: clarify the problem (what's the business metric you're optimizing), define inputs and labels, sketch the offline training pipeline (data ingestion, feature engineering, model training, evaluation), sketch the online serving pipeline (feature lookup, model inference, latency budget, monitoring), and state one tradeoff per major choice. New-grad bar stops at the box-and-arrow diagram plus correct vocabulary; you do not need to derive the loss function.
How do I prepare for an ML interview as a CS new grad in 30 days?
Week 1: ML fundamentals deep dive. Bias-variance, overfitting, regularization, evaluation metrics, the canonical classical algorithms. 3-4 hours per day from a structured textbook plus 30 minutes of flashcards. Week 2: deep learning and transformers. Math of backprop, transformer attention, the LLM stack (RAG, fine-tuning, prompting). Build one toy project (train a small classifier on a real dataset, fine-tune one small open-weight model on a domain task). Week 3: ML system design. Practice 5 canonical prompts (recommendation, fraud, content moderation, search ranking, ad bidding) at exactly 45 minutes each, narrating out loud. Week 4: ML coding (numpy gradient descent, k-means from scratch, simple attention from scratch) + behavioral prep + 3 timed mock interviews. The Kaggle competition or side project should already be on the resume by week 1; if not, start it now in parallel.
Do I need a Kaggle medal to get an ML interview?
No, but you do need to show you've shipped something. The Kaggle medal narrative is overweighted on the internet. In practice, recruiters skim for evidence of finished ML work: a Kaggle competition you completed (even without a medal), a side project with a deployed endpoint, a write-up explaining one model decision you made and why. The signal isn't 'top 1% on Kaggle.' The signal is 'this person has shipped a model end-to-end and can talk about it.' For 2026 new grads, one finished Kaggle competition plus one deployed side project on your resume hits the bar at most mid-market ML-engineer-friendly companies.
What's the most common ML interview mistake new grads make?
Bluffing on math they don't fully know. The classic version: asked to derive the gradient of a logistic regression loss, the candidate handwaves through chain rule, gets stuck halfway, panics, gives a wrong final answer. The recovery posture beats the bluff. Saying 'I remember this is a chain-rule problem and the derivation involves the sigmoid derivative, sigmoid(z) times (1 - sigmoid(z)), but let me work through it step by step out loud' is graded higher than confidently producing a wrong derivation. Interviewers reward calibrated uncertainty more than they punish gaps.
How important is a strong LLM section for a new-grad ML interview in 2026?
More important than it was in 2024, less important than the internet implies. Most non-research ML interviews in 2026 ask 1-2 LLM questions in a 5-round loop: usually at the RAG-vs-fine-tuning level, occasionally a transformer-internals question. Going deep on LLM mechanics is a strong tiebreaker signal at LLM-heavy companies, but it does not substitute for fundamentals. The pattern that loses offers: candidate who can explain mixture-of-experts and speculative decoding but freezes on 'explain cross-validation.' Get the fundamentals airtight first.
What's the difference between an ML interview at a frontier-model lab vs a mid-market tech company?
Depth and selection. Frontier labs filter heavily on research signal (papers, top conference publications, top-school PhD or top-3 internship) and grade research-quality thinking: derivations, ablations, original architecture proposals. Mid-market tech companies grade production-ML competence: can you train a model, deploy it, monitor it, and iterate. The new-grad-without-papers strategy is to skip the frontier-lab pipeline at the entry level and target mid-market ML-engineer roles, where strong CS fundamentals plus one deployed side project plus one finished Kaggle competition is a competitive resume. Frontier-lab entry happens later, after 2-3 years of production ML experience.