AI and machine learning interviews test conceptual understanding, mathematical intuition, practical implementation knowledge, and awareness of modern developments in LLMs and deep learning. The field moves quickly: interviewers expect candidates to know not just classical ML but also transformer architectures, fine-tuning, and the practical considerations of deploying AI systems. Here's what you need to know.
Foundational ML Concepts Every Candidate Must Know
"What is the bias-variance tradeoff?" This is the fundamental tension in supervised machine learning. Bias is error from incorrect assumptions in the learning algorithm โ a high-bias model is too simple and misses patterns in the data (underfitting). Variance is sensitivity to fluctuations in the training data โ a high-variance model learns the training data too precisely, including its noise, and generalizes poorly to new data (overfitting). The tradeoff: models complex enough to capture true patterns are also complex enough to learn noise; models simple enough to ignore noise may also be too simple to capture patterns. Regularization techniques (L1/L2 regularization, dropout, early stopping) add controlled bias to reduce variance. Cross-validation and train/validation/test splits are how you detect where a model falls on this spectrum.
"Explain overfitting and the techniques used to prevent it." Overfitting occurs when a model learns the training data too well โ memorizing noise and specific examples rather than generalizing patterns. Signs: high training accuracy, significantly lower validation accuracy. Prevention techniques: L2 regularization (Ridge) adds a penalty proportional to the square of parameter magnitudes to the loss function, discouraging large weights. L1 regularization (Lasso) uses the absolute value of parameters, encouraging sparsity (some weights become exactly zero โ effectively feature selection). Dropout randomly zeroes some neural network neurons during training, preventing co-adaptation. Early stopping monitors validation performance and stops training when it starts degrading. Data augmentation artificially increases training data by applying transformations (flips, rotations, crops for images). Cross-validation provides a more robust performance estimate than a single train/test split.
"What is the difference between supervised, unsupervised, and reinforcement learning?" Supervised learning trains on labeled data โ input-output pairs โ learning to map inputs to outputs. Classification (predicting a category) and regression (predicting a number) are supervised tasks. Unsupervised learning finds patterns in unlabeled data โ discovering structure without being told what to look for. Clustering (grouping similar data points), dimensionality reduction (finding compact representations), and anomaly detection are unsupervised tasks. Reinforcement learning trains an agent to take actions in an environment to maximize a cumulative reward. The agent learns through trial and error โ no labeled data, just feedback on whether actions led to good or bad outcomes. Games, robotics control, and recommendation systems use reinforcement learning.
Feature Engineering and Data Preprocessing
"What is feature scaling and why does it matter?" Feature scaling normalizes the range of input features. Many algorithms (gradient descent, SVM, KNN, neural networks) are sensitive to feature scale โ a feature measured in millions will dominate a feature measured in single digits, causing the model to effectively ignore the smaller-scale feature. Normalization (Min-Max scaling) scales each feature to [0,1]. Standardization (Z-score) transforms features to mean 0 and standard deviation 1. Decision trees and random forests are notably insensitive to feature scaling โ they split on thresholds rather than distances. When in doubt, standardize โ it's robust to outliers and compatible with most algorithms.
"How do you handle missing data?" Multiple strategies exist. Deletion: remove rows with missing values (only if missing data is random and the loss of data won't significantly reduce dataset size). Mean/median/mode imputation: replace missing values with the column's mean (for numeric), median (more robust to outliers), or mode (for categorical). Model-based imputation: use a predictive model (KNN imputation, MICE) to predict missing values from other features. Indicator variables: add a binary column indicating which rows had missing values โ preserves the "missingness" as a feature. The choice depends on how much data is missing, whether missingness is random or systematic, and the algorithm being used.
Neural Networks and Deep Learning
"What is backpropagation?" Backpropagation is the algorithm used to calculate gradients โ how much each parameter contributed to the error โ so gradient descent can update them to reduce error. Forward pass: compute the prediction and calculate the loss. Backward pass: compute the gradient of the loss with respect to each parameter, working backwards from the output layer to the input layer using the chain rule of calculus. The computed gradients tell the optimizer which direction to adjust each weight and by how much. This process repeats for each batch of training data over many epochs until the model converges to a solution.
"What are transformers and why did they revolutionize NLP?" Transformers (introduced in "Attention Is All You Need," 2017) are a neural network architecture built on the self-attention mechanism. Self-attention allows each position in a sequence to attend to all other positions, capturing long-range dependencies more effectively than RNNs, which process sequences sequentially and struggle with long-distance relationships. Transformers process all positions in parallel (enabling much faster training on GPUs), scale effectively with more data and compute, and transfer well โ a transformer pre-trained on large text corpora can be fine-tuned for specific tasks with small datasets. GPT, BERT, T5, and virtually all modern LLMs are transformer-based.
Modern AI: LLMs and Practical Deployment
"What is RAG (Retrieval Augmented Generation) and when do you use it?" RAG combines an LLM with a retrieval system. When a query is received, relevant documents are retrieved from a knowledge base (using vector similarity search) and included in the LLM's context along with the query. The LLM generates a response grounded in the retrieved documents. RAG solves LLM hallucination problems for domain-specific knowledge: instead of relying on the model's training data (which may be outdated or lack specific knowledge), the model answers based on authoritative retrieved documents. Use RAG for: customer service bots with product documentation, internal knowledge base search, legal and medical question answering over authoritative sources.
"What is fine-tuning an LLM, and when is it better than prompting?" Fine-tuning adjusts the weights of a pre-trained model on a domain-specific dataset, adapting the model's style, tone, format, or specialized knowledge. It's better than prompting when: you need consistent output format that's hard to enforce via prompting alone, you have a specialized domain requiring terminology or reasoning not well represented in training data, you need improved performance on a specific task with labeled examples available, or you need to reduce token costs by encoding knowledge into weights rather than sending long prompts. Prompting (and RAG) should be tried first โ fine-tuning is expensive, requires training infrastructure, and is irreversible for the base model. PEFT techniques (LoRA, QLoRA) make fine-tuning significantly more efficient by training only a small number of additional parameters.
