
Python Deep Learning and Neural Networks Interview Questions

1. What is a neural network and how does forward propagation work mathematically?
2. Explain backpropagation mathematically. How does the chain rule enable computing gradients through many layers?
3. What are the most common activation functions and why did ReLU replace sigmoid/tanh as the default?
4. What are vanishing and exploding gradients, and what techniques are used to address them?
5. Why does weight initialization matter in neural networks, and what is the difference between Xavier and He initialization?
6. How does Batch Normalization work mathematically and why does it stabilize training?
7. Compare SGD, SGD with momentum, RMSProp, and Adam optimizers. When do you choose each?
8. How does Dropout work mathematically, and why does it act as regularization?
9. Explain how convolutional layers work and why they are well-suited to image data.
10. How do RNNs work and why did LSTMs solve the long-range dependency problem?
11. What is the self-attention mechanism in Transformers and why did it replace RNNs for sequence modeling?
12. What loss functions does PyTorch provide for classification and regression, and which to use when?
13. What is transfer learning and how do you fine-tune a pretrained model in PyTorch?
14. How does PyTorch's Dataset and DataLoader pipeline work, and what are the key performance considerations?
15. Why is learning rate scheduling important and what are the most common strategies?
16. What are the most effective regularization strategies for deep learning and how do they differ from classical ML regularization?
17. What are embedding layers in deep learning and how are they different from one-hot encoding?
18. How do you save and load PyTorch models correctly, and what is included in a proper checkpoint?
19. What is mixed precision training and how does it speed up deep learning with torch.cuda.amp?
20. What is the difference between model.eval(), torch.no_grad(), and torch.inference_mode()? When do you use each?
21. How do you use GPUs in PyTorch and what are the key patterns for writing device-agnostic code?
22. What are the differences between Batch Norm, Layer Norm, Group Norm, and Instance Norm?
23. What is an autoencoder and what can a well-trained latent space be used for?
24. How do you diagnose a neural network that is not training correctly from its loss curves?
25. What is the mathematical setup of a Generative Adversarial Network (GAN) and what training challenges do they have?
26. What is torch.compile and how does it speed up PyTorch model execution?
27. Why do Transformers need positional encodings and how does sinusoidal encoding work?
28. What are the most impactful hyperparameters to tune in deep learning and what is the recommended search order?
29. What is an encoder-decoder architecture and how is it used for sequence-to-sequence tasks?
30. What is model quantization in deep learning and how does PyTorch support it?
31. What does a production-quality PyTorch training loop look like, incorporating all best practices?
32. How does batch size affect deep learning training mathematically and practically?
33. How do you choose the right layer type (Linear, Conv, Attention) for a given input modality?
34. What evaluation metrics are most commonly used in deep learning tasks and how do you implement them in PyTorch?
35. How do you export a PyTorch model for production deployment using TorchScript or ONNX?
36. What is knowledge distillation and how does it compress large neural networks into smaller ones?
37. What is self-supervised learning and how do contrastive methods like SimCLR learn representations?
38. How would you implement and train a simple feedforward neural network in PyTorch from scratch, without using nn.Sequential?

1. What is a neural network and how does forward propagation work mathematically?

A neural network is a parameterised function composed of stacked layers. Each layer applies a linear transformation followed by a non-linear activation: h = σ(Wx + b), where W is a weight matrix, b is a bias vector, and σ is an activation function. Stacking L such layers gives a universal function approximator capable of learning arbitrarily complex input–output mappings, provided the network is wide or deep enough.

Forward propagation simply evaluates this composed function left to right: the input x passes through layer 1, the output becomes the input to layer 2, and so on until the final layer produces a prediction. The entire computation is a directed acyclic graph (DAG) of tensor operations — exactly the structure PyTorch's autograd engine records to enable automatic differentiation.

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # W1, b1
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # W2, b2

    def forward(self, x):
        h = self.relu(self.fc1(x))  # h = ReLU(W1 x + b1)
        return self.fc2(h)           # y = W2 h + b2

model = TwoLayerNet(in_dim=10, hidden_dim=64, out_dim=1)
x = torch.randn(32, 10)   # batch of 32 inputs
y_hat = model(x)           # forward pass — calls model.forward(x)
print(y_hat.shape)         # torch.Size([32, 1])

Why depth matters: a network with one sufficiently wide hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy (universal approximation theorem), but deeper networks can represent certain functions exponentially more efficiently. A function that needs an exponentially wide shallow network may be captured by a compact deep one, because each layer can reuse and compose features built by earlier layers.

What does each layer in a neural network compute?
Why does PyTorch's autograd record the forward-pass computation graph?
2. Explain backpropagation mathematically. How does the chain rule enable computing gradients through many layers?

Backpropagation is the algorithm for computing the gradient of a scalar loss L with respect to every parameter in the network. It exploits the chain rule of calculus: if the loss depends on parameter W through intermediate quantities h₁, h₂, ..., hₙ, then ∂L/∂W = (∂L/∂hₙ)(∂hₙ/∂hₙ₋₁)···(∂h₁/∂W). Backprop applies the chain rule systematically starting from the loss and working backwards through each layer, accumulating local gradients.

At each layer, two quantities are needed: the local gradient (how does the layer's output change with its input/weights?) and the upstream gradient (how does the loss change with this layer's output?). Multiplying them gives the gradient flowing to the layer's parameters and to its input, which becomes the upstream gradient for the preceding layer.
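
To make the upstream-times-local bookkeeping concrete, the sketch below backpropagates by hand through a tiny two-layer network and compares the result with autograd. The shapes, the seed, and names such as `dW1_manual` are illustrative assumptions, and biases are omitted to keep it short.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)                       # batch of 4, 3 features
y = torch.randn(4, 1)
W1 = torch.randn(3, 5, requires_grad=True)  # first-layer weights
W2 = torch.randn(5, 1, requires_grad=True)  # second-layer weights

# Forward pass, keeping intermediates for the backward pass
z1 = x @ W1                  # pre-activation
h1 = torch.relu(z1)          # hidden activation
y_hat = h1 @ W2
loss = ((y_hat - y) ** 2).mean()
loss.backward()              # autograd reference

# Manual backward pass: upstream gradient times local gradient at each step
with torch.no_grad():
    d_yhat = 2 * (y_hat - y) / y.numel()   # dL/dy_hat for the mean squared error
    dW2_manual = h1.T @ d_yhat             # dL/dW2
    d_h1 = d_yhat @ W2.T                   # upstream gradient arriving at layer 1
    d_z1 = d_h1 * (z1 > 0).float()         # ReLU local gradient is 0 or 1
    dW1_manual = x.T @ d_z1                # dL/dW1

print(torch.allclose(dW1_manual, W1.grad, atol=1e-5))
print(torch.allclose(dW2_manual, W2.grad, atol=1e-5))
```

Each manual line mirrors one node of the graph that `loss.backward()` walks in reverse, which is why the two sets of gradients agree.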

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 1)
)
loss_fn  = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

# --- Standard training step ---
optimizer.zero_grad()    # 1. Clear old gradients (they accumulate!)
y_hat = model(x)         # 2. Forward pass — build computation graph
loss  = loss_fn(y_hat, y)# 3. Compute scalar loss
loss.backward()          # 4. Backprop — traverse graph in reverse
                         #    populates .grad for every parameter
optimizer.step()         # 5. Update parameters: W -= lr * W.grad

# Inspect gradients of first layer
print(model[0].weight.grad.shape)  # torch.Size([64, 10])

# Manual chain rule for a single neuron:
# loss = (y_hat - y)^2, y_hat = w*x + b
# dL/dw = 2*(y_hat - y) * x  <- upstream * local
w = torch.tensor([2.0], requires_grad=True)
x_s = torch.tensor([3.0])
y_s = torch.tensor([1.0])
loss_s = (w * x_s - y_s) ** 2
loss_s.backward()
print(w.grad)   # tensor([30.]) == 2*(2*3-1)*3
In backpropagation, what two quantities are multiplied at each layer to compute the gradient?
Why must optimizer.zero_grad() be called before each backward pass in PyTorch?
3. What are the most common activation functions and why did ReLU replace sigmoid/tanh as the default?

Activation functions introduce non-linearity — without them, stacking linear layers would collapse into a single linear transformation. Several families exist, each with different mathematical properties that affect training dynamics.

Common Activation Functions

| Function | Formula | Range | Key property |
| --- | --- | --- | --- |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Saturates when x is far from 0, causing vanishing gradients |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centred; still saturates |
| ReLU | max(0, x) | [0, ∞) | Non-saturating for x>0; sparse; fast |
| Leaky ReLU | max(αx, x), α≈0.01 | (-∞, ∞) | Fixes ReLU's dying neuron problem |
| GELU | x·Φ(x) | (-∞, ∞) | Used in BERT/GPT; smooth probabilistic gate |
| Softmax | eˣⁱ/Σeˣʲ | (0, 1), sums to 1 | Multi-class output; yields a probability distribution |

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(F.relu(x))         # [0, 0, 0, 0, 1, 2, 3]  (zeroes negatives)
print(torch.sigmoid(x))  # (0,1): saturates near 0 and 1 at extremes
print(torch.tanh(x))     # (-1,1): saturates near ±1
print(F.leaky_relu(x, negative_slope=0.01))  # small slope for x<0
print(F.gelu(x))         # smooth variant used in transformers

# Softmax: multi-class final layer
logits = torch.tensor([2.0, 1.0, 0.1])
probs  = F.softmax(logits, dim=0)
print(probs)              # [0.659, 0.242, 0.099] — sums to 1.0

# In a model: prefer nn.ReLU() (in-place optional with inplace=True)
import torch.nn as nn
act = nn.ReLU()  # stateless — can be shared across layers

Why ReLU replaced sigmoid: for large networks the vanishing gradient problem made sigmoid/tanh networks nearly untrainable. For a neuron deep in the network, the gradient arriving from backprop has already been multiplied by many sigmoid derivatives — each at most 0.25 — so the gradient shrinks exponentially with depth. ReLU's derivative is exactly 1 for positive inputs (no shrinkage in that direction), allowing gradients to flow through deep networks without exponential decay. The trade-off is the 'dying ReLU' problem where neurons receiving strongly negative inputs get stuck outputting zero permanently, addressed by Leaky ReLU and ELU variants.
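
The 0.25 bound and its compounding effect can be checked numerically. The sketch below is a deliberately simplified illustration: it ignores the weight matrices (an assumption made to isolate the activation's contribution) and only looks at the best case for sigmoid.

```python
import torch

# Sigmoid's derivative peaks at x = 0, where it equals 0.25
x = torch.tensor(0.0, requires_grad=True)
torch.sigmoid(x).backward()
max_sigmoid_grad = x.grad.item()

# Best-case gradient scaling after passing through `depth` sigmoid layers
for depth in (5, 10, 20):
    print(depth, max_sigmoid_grad ** depth)

# ReLU's derivative on the active side is 1, so the same product stays at 1
z = torch.tensor(2.0, requires_grad=True)
torch.relu(z).backward()
relu_grad = z.grad.item()
print(relu_grad ** 20)
```

Even in this best case the sigmoid factor falls below one in a million after ten layers, while the ReLU factor on an active path is unchanged.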

Why did sigmoid/tanh activations cause problems in deep networks?
What is the 'dying ReLU' problem?
4. What are vanishing and exploding gradients, and what techniques are used to address them?

Vanishing gradients occur when gradients shrink exponentially as they are backpropagated through many layers — the product of many small numbers (e.g. sigmoid derivatives ≤ 0.25) approaches zero, making early layer weights unable to update meaningfully. Exploding gradients are the opposite: the product of many large numbers causes gradients to grow exponentially, destabilising training with numerically infinite or NaN updates.

Both problems worsen with depth. The root mathematical cause is that repeated matrix multiplication of the weight matrices during backprop concentrates the gradient spectrum: if weight matrices have singular values consistently less than 1, gradients vanish; if greater than 1, they explode. Several techniques address this:

Solutions to Gradient Problems

| Technique | Addresses | How it helps |
| --- | --- | --- |
| ReLU / Leaky ReLU | Vanishing | Gradient = 1 for positive inputs, so no shrinkage |
| Batch Normalisation | Both | Normalises layer inputs; stabilises gradient magnitude |
| Residual connections (ResNet) | Vanishing | Gradient highway: with y = x + F(x), ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x), so part of the gradient passes through unchanged |
| Gradient clipping | Exploding | Caps gradient norm before the update step |
| Careful weight init (Xavier/He) | Both | Keeps variance stable across layers at init |
| LSTM/GRU gates | Vanishing (RNNs) | Gating controls gradient flow through time |

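
The singular-value argument above can be illustrated with an idealised experiment. The sketch below repeatedly multiplies a unit vector (standing in for a gradient) by a scaled orthogonal matrix, so every singular value equals the chosen scale; real weight matrices and activation derivatives are messier, and `backprop_norm` is a hypothetical helper written for this example.

```python
import torch

torch.manual_seed(0)
dim, depth = 64, 50

# Random orthogonal matrix: every singular value is 1
Q, _ = torch.linalg.qr(torch.randn(dim, dim))

def backprop_norm(scale: float) -> float:
    """Norm of a unit vector after `depth` multiplications by (scale * Q)^T."""
    g = torch.ones(dim) / dim ** 0.5   # unit-norm stand-in for a gradient
    for _ in range(depth):
        g = (scale * Q).T @ g
    return g.norm().item()

print(backprop_norm(0.9))   # roughly 0.9 ** 50
print(backprop_norm(1.0))   # stays near 1
print(backprop_norm(1.1))   # roughly 1.1 ** 50
```

A 10% deviation from 1 in either direction changes the gradient norm by several orders of magnitude over 50 layers, which is why initialisation, normalisation, and clipping all aim to keep this factor close to 1.
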
import torch
import torch.nn as nn

# Gradient clipping — applied AFTER backward(), BEFORE optimizer.step()
model = nn.LSTM(input_size=10, hidden_size=128, num_layers=3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 20, 10)   # (batch, seq_len, input_size)
output, _ = model(x)
loss = output.sum()

optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip!
optimizer.step()

# Residual connection in code:
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim)
        )
    def forward(self, x):
        return x + self.net(x)  # gradient flows through x directly
What is the mathematical root cause of vanishing gradients in deep networks?
How do residual connections (skip connections) help gradients flow in deep networks?
5. Why does weight initialization matter in neural networks, and what is the difference between Xavier and He initialization?

If weights are initialized too small, activations and gradients shrink layer by layer — a form of vanishing gradient from the start. If too large, they explode. The goal of principled initialisation is to keep the variance of activations and gradients roughly constant across all layers at the start of training.

Xavier (Glorot) initialisation draws weights from a distribution with variance 2/(fan_in + fan_out). It was derived for linear or tanh-like activations as a compromise between keeping activation variance constant in the forward pass (which requires variance 1/fan_in) and keeping gradient variance constant in the backward pass (which requires 1/fan_out). He (Kaiming) initialisation uses variance 2/fan_in, derived for ReLU activations specifically: since ReLU zeroes out half of its inputs on average, the second moment of the output is halved, and doubling the initial weight variance compensates for this. Using Xavier with ReLU in layers of equal width roughly halves the second moment of the activations at every layer, so the signal eventually vanishes.
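
The difference between the two schemes shows up clearly in a stack of equal-width ReLU layers. The sketch below is a minimal experiment under stated assumptions: weights are drawn from a normal distribution with the Xavier or He variance (rather than the uniform variants), biases are omitted, and `final_rms` is a hypothetical helper.

```python
import torch

torch.manual_seed(0)
width, depth = 512, 10

def final_rms(weight_std: float) -> float:
    """RMS activation after `depth` ReLU layers with N(0, weight_std^2) weights."""
    h = torch.randn(1024, width)
    for _ in range(depth):
        W = torch.randn(width, width) * weight_std
        h = torch.relu(h @ W)
    return h.pow(2).mean().sqrt().item()

xavier_std = (2.0 / (width + width)) ** 0.5   # Var = 2 / (fan_in + fan_out)
he_std = (2.0 / width) ** 0.5                 # Var = 2 / fan_in

rms_xavier = final_rms(xavier_std)
rms_he = final_rms(he_std)
print(f"Xavier: {rms_xavier:.4f}   He: {rms_he:.4f}")
```

With He scaling the RMS activation stays on the order of 1 after ten layers, while with Xavier scaling it has shrunk by more than an order of magnitude.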

import torch
import torch.nn as nn

# nn.Linear's default is kaiming_uniform_ with a=sqrt(5), which works out to
# U(-1/sqrt(fan_in), 1/sqrt(fan_in)); this is NOT the He gain for ReLU
layer = nn.Linear(256, 128)
print(layer.weight.std())  # approximately 1/sqrt(3*256) ≈ 0.036

# Explicit initialisation
def init_weights(m):
    if isinstance(m, nn.Linear):
        # Xavier: good for sigmoid/tanh activations
        nn.init.xavier_uniform_(m.weight)
        # He/Kaiming: good for ReLU activations (must be requested explicitly)
        # nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
model.apply(init_weights)  # apply init_weights to every sub-module

# Verifying activation variance stays stable across layers:
x = torch.randn(100, 784)
for layer in model:
    x = layer(x)
    print(f'{layer.__class__.__name__}: std={x.std():.3f}')
# init_weights above applies Xavier; swap in the kaiming_uniform_ line to
# compare how each scheme changes the activation scale through the ReLU layers
Why does Xavier initialization use variance 2/(fan_in + fan_out) rather than a constant?
Why should He initialization be used with ReLU activations instead of Xavier?
6. How does Batch Normalization work mathematically and why does it stabilize training?

Batch Normalisation (BN) normalises the pre-activation values within a mini-batch to have zero mean and unit variance, then rescales them with learnable parameters γ (scale) and β (shift): BN(x) = γ · (x - μ_B) / √(σ²_B + ε) + β, where μ_B and σ²_B are the batch mean and variance, and ε is a small constant for numerical stability.
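
The formula can be checked directly against `nn.BatchNorm1d` in training mode. The sketch below assumes a freshly constructed layer (γ = 1, β = 0) and an arbitrary input scale and shift; note that the normalisation uses the biased batch variance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8) * 3 + 5          # batch of 32, 8 features

bn = nn.BatchNorm1d(8)                  # gamma = 1, beta = 0 at initialisation
bn.train()
out_ref = bn(x)

# Same computation by hand, using the biased batch variance
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
out_manual = bn.weight * x_hat + bn.bias

print(torch.allclose(out_ref, out_manual, atol=1e-4))
print(out_manual.mean(dim=0))                  # close to 0 per feature
print(out_manual.std(dim=0, unbiased=False))   # close to 1 per feature
```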

BN was originally motivated by internal covariate shift: the distribution of each layer's inputs changes during training as the preceding layers' weights update, forcing each layer to continuously adapt to a moving target. By renormalising inputs at each layer, BN stabilises this distribution. Later analyses suggest that much of the benefit may instead come from a smoother optimisation landscape. In practice, BN also provides a mild regularisation effect (the mini-batch statistics inject noise), reduces sensitivity to the learning rate, and reduces the need for dropout in many architectures.

import torch
import torch.nn as nn

# BatchNorm1d: for fully-connected layers (normalises over batch dim)
# BatchNorm2d: for conv layers (normalises per channel over batch+spatial)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),    # BN BEFORE or AFTER activation — varies by paper
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# BatchNorm behaves DIFFERENTLY in train vs eval mode!
model.train()   # uses batch mean/var during forward pass
model.eval()    # uses running mean/var (exponential moving avg)

# Always call model.eval() at inference time:
with torch.no_grad():
    model.eval()
    preds = model(torch.randn(1, 784))  # inference — correct behavior

# Manual: BN keeps running stats during training
bn = nn.BatchNorm1d(256)
print(bn.running_mean.shape)  # torch.Size([256]); updated on each forward call in train mode
What critical difference exists between BatchNorm behavior in training mode vs eval mode?
What is 'internal covariate shift' and how does Batch Normalization address it?
7. Compare SGD, SGD with momentum, RMSProp, and Adam optimizers. When do you choose each?

All these optimizers share the same goal — updating parameters to reduce loss — but differ in how they use gradient history to adapt the update step. Understanding the mechanics helps diagnose slow training and poor generalisation.

Optimizer Comparison

| Optimizer | Update rule (simplified) | Key advantage | Limitation |
| --- | --- | --- | --- |
| SGD | θ ← θ - η·g | Simple, no memory overhead | Slow convergence, sensitive to lr |
| SGD + Momentum | v ← βv + g; θ ← θ - η·v | Accelerates consistent directions, damps oscillation | Still global lr |
| RMSProp | θ ← θ - η·g / √(E[g²]+ε) | Adapts lr per parameter; good for RNNs | No momentum term |
| Adam | Combines momentum + RMSProp; bias-corrected | Robust default; fast convergence | Can generalise worse than SGD on some tasks |

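
The update rules in the table can be written out by hand and compared with `torch.optim`. The sketch below minimises the toy objective sum(θ²) for three steps; the objective, the starting point, and the helper `run_torch` are assumptions chosen for illustration.

```python
import torch

def run_torch(opt_cls, steps=3, **kwargs):
    """Minimise sum(theta^2) with a torch optimizer and return the final theta."""
    theta = torch.tensor([1.0, -2.0], requires_grad=True)
    opt = opt_cls([theta], **kwargs)
    for _ in range(steps):
        opt.zero_grad()
        (theta ** 2).sum().backward()
        opt.step()
    return theta.detach()

lr, beta, b1, b2, eps = 0.1, 0.9, 0.9, 0.999, 1e-8

# SGD with momentum by hand: v <- beta*v + g ; theta <- theta - lr*v
theta_m = torch.tensor([1.0, -2.0])
v = torch.zeros(2)
for _ in range(3):
    g = 2 * theta_m                      # gradient of sum(theta^2)
    v = beta * v + g
    theta_m = theta_m - lr * v

# Adam by hand: bias-corrected first and second moment estimates
theta_a = torch.tensor([1.0, -2.0])
m, s = torch.zeros(2), torch.zeros(2)
for t in range(1, 4):
    g = 2 * theta_a
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g ** 2
    m_hat, s_hat = m / (1 - b1 ** t), s / (1 - b2 ** t)
    theta_a = theta_a - lr * m_hat / (s_hat.sqrt() + eps)

sgd_ref = run_torch(torch.optim.SGD, lr=lr, momentum=beta)
adam_ref = run_torch(torch.optim.Adam, lr=lr, betas=(b1, b2), eps=eps)
print(torch.allclose(theta_m, sgd_ref, atol=1e-5))
print(torch.allclose(theta_a, adam_ref, atol=1e-5))
```

Note how Adam's first steps have magnitude close to `lr` regardless of the gradient scale, because the gradient is divided by the root of its own running second moment.
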
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# SGD — baseline, works but needs careful lr tuning
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + Momentum — adds velocity; β=0.9 is standard
opt_mom = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                           weight_decay=1e-4)  # L2 regularisation

# Adam — adaptive learning rate + momentum; best default for DL
opt_adam = torch.optim.Adam(model.parameters(),
                             lr=1e-3,      # default, usually works
                             betas=(0.9, 0.999),  # decay rates for 1st/2nd moment averages
                             eps=1e-8,
                             weight_decay=1e-5)

# AdamW — Adam with decoupled weight decay (better than Adam + L2)
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3,
                               weight_decay=1e-2)

# Learning rate schedulers — change lr during training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt_adam, T_max=100)
for epoch in range(100):
    # ... training loop ...
    scheduler.step()  # decrease lr following cosine curve

When to choose: Adam is the safe default for most deep learning tasks. SGD with momentum often achieves better final generalisation on image classification, at the cost of more careful learning-rate tuning. AdamW, which decouples weight decay from the adaptive gradient update, is now the standard for training and fine-tuning transformers, including large language models. RMSProp remains a reasonable choice for RNNs.

What does the momentum term in SGD with momentum physically represent?
What is the key difference between Adam and AdamW?

8. How does Dropout work mathematically, and why does it act as regularization?

During training, Dropout randomly sets each neuron's output to zero with probability p (the drop probability) and scales the remaining activations by 1/(1-p) to preserve the expected sum. This means each forward pass trains a different thinned sub-network — with n neurons, there are 2ⁿ possible sub-networks, and each training step updates a random one.

The regularisation effect comes from several mechanisms: (1) it prevents co-adaptation: neurons cannot rely on specific other neurons always being present, so each must learn useful features independently; (2) it approximates training an exponentially large ensemble of weight-sharing sub-networks and averaging their predictions at test time (where Dropout is disabled); (3) the multiplicative noise acts similarly to L2 regularisation on the weights. At inference, Dropout is disabled and all neurons are active; the 1/(1-p) scaling during training ensures the expected value of each neuron's output is the same during training and inference.
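
The expected-value argument can be verified with a hand-written version of inverted dropout. The sketch below builds its own Bernoulli keep-mask instead of calling `nn.Dropout`; the tensor size and seed are arbitrary choices.

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.ones(1_000_000)

# Inverted dropout by hand: Bernoulli keep-mask, then rescale the survivors
keep_mask = (torch.rand_like(x) >= p).float()   # 1 with probability 1 - p
dropped = x * keep_mask / (1 - p)

print(dropped.unique())        # only two values: 0 and 1/(1-p)
print(dropped.mean().item())   # close to the original mean of 1.0
```

Because the rescaling happens during training, the inference path needs no correction factor at all.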

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Dropout(p=0.5),              # drop 50% of neurons
    nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(256, 10)
)

# Training: Dropout is ACTIVE (neurons randomly zeroed)
model.train()
x = torch.ones(1, 784)
out1 = model(x)
out2 = model(x)  # different result! different neurons dropped each time

# Inference: Dropout is DISABLED (all neurons active)
model.eval()
with torch.no_grad():
    out3 = model(x)
    out4 = model(x)  # same result — deterministic

# Inverted Dropout (PyTorch default):
# Scale by 1/(1-p) DURING training, not during inference
# => test-time output has correct expected value without scaling
dp = nn.Dropout(p=0.5)
dp.train()  # a standalone module has its own train/eval flag
x_in = torch.ones(10)
print(dp(x_in))  # ~5 zeros, remaining values are 2.0 (scaled by 1/0.5)
Why does inverted Dropout scale surviving activations by 1/(1-p) during training rather than scaling at inference?
Why does Dropout function as an implicit ensemble method?
9. Explain how convolutional layers work and why they are well-suited to image data.

A convolutional layer applies a set of learnable filters (kernels) by sliding each filter over the spatial dimensions of the input and computing a dot product at each position. For a 2D image, a layer with k×k kernels, C_in input channels, and C_out output channels has k×k×C_in×C_out weights plus C_out biases. This produces one feature map per output channel, where each value represents the response of that filter at a specific spatial location.

CNNs are powerful for images because of two structural inductive biases they encode: (1) translation equivariance — the same filter is applied everywhere, so if an object moves in the image, the corresponding feature map activation moves identically; (2) parameter sharing — instead of a separate weight per input-output pixel pair (as a fully-connected layer would require), the filter weights are shared across all spatial locations, drastically reducing parameters and improving sample efficiency.
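
Both properties can be demonstrated in a few lines. The sketch below rewrites a convolution as one shared matrix multiplication over extracted patches, then checks translation equivariance; circular padding is an assumption made so the shift is exact at the image borders, and `conv_circular` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)          # (batch, C_in, H, W)
weight = torch.randn(4, 3, 3, 3)     # (C_out, C_in, k, k)
bias = torch.randn(4)

out_ref = F.conv2d(x, weight, bias, stride=1, padding=1)

# Extract every 3x3 patch as a column: (batch, C_in*k*k, H*W)
patches = F.unfold(x, kernel_size=3, padding=1)
# The same weight matrix is applied to every patch (parameter sharing)
out_flat = weight.view(4, -1) @ patches + bias.view(1, -1, 1)
out_manual = out_flat.view(2, 4, 8, 8)
print(torch.allclose(out_ref, out_manual, atol=1e-4))

def conv_circular(t: torch.Tensor) -> torch.Tensor:
    """3x3 convolution with wrap-around padding."""
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), weight, bias)

# Translation equivariance: shifting the input shifts the feature map
shift_then_conv = conv_circular(torch.roll(x, shifts=2, dims=3))
conv_then_shift = torch.roll(conv_circular(x), shifts=2, dims=3)
print(torch.allclose(shift_then_conv, conv_then_shift, atol=1e-4))
```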

import torch
import torch.nn as nn

# Standard Conv2d usage
# Input:  (batch, C_in, H, W)
# Output: (batch, C_out, H', W')
conv = nn.Conv2d(
    in_channels=3,    # RGB image
    out_channels=64,  # 64 filters
    kernel_size=3,    # 3x3 kernel
    stride=1,
    padding=1,        # 'same' padding — preserves H and W
)

x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB 32x32 images
out = conv(x)
print(out.shape)  # torch.Size([8, 64, 32, 32])

n_params = 3 * 64 * 3 * 3 + 64  # weights + biases
print('Parameters:', n_params)   # 1792
# Compare: FC layer 3*32*32 -> 64*32*32 would need 3072 * 65536 ≈ 201M weights!

# Typical CNN block:
cnn_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves spatial dims
)
What is 'parameter sharing' in convolutional layers and why is it beneficial?
What does 'translation equivariance' mean in the context of CNNs?
10. How do RNNs work and why did LSTMs solve the long-range dependency problem?

A vanilla RNN processes a sequence step-by-step, maintaining a hidden state hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b) that acts as a compressed memory of everything seen so far. The problem is that this hidden state must be updated at every step — and during backpropagation through time (BPTT), gradients are multiplied by Wₕ repeatedly. If the spectral radius of Wₕ is less than 1, gradients vanish over long sequences; if greater than 1, they explode. In practice, vanilla RNNs cannot effectively learn dependencies longer than ~10–20 steps.

LSTMs introduce a separate cell state cₜ (the long-term memory) and three gates (forget, input, and output), each controlled by sigmoid activations. The forget gate fₜ = σ(Wf[hₜ₋₁, xₜ] + bf) decides what to erase from cₜ₋₁; the input gate decides what new information to write; the output gate controls what the hidden state exposes. The key mathematical insight is that the cell state update is additive: cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ. Along the cell-state path, the gradient from step t to step t−1 is scaled only by the forget gate fₜ rather than by a weight matrix and a tanh derivative, so when fₜ stays close to 1 the gradient flows across many steps with little shrinkage. This largely mitigates the vanishing gradient problem for long sequences.
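
The gate equations can be reproduced by hand for a single time step and compared with `nn.LSTMCell`. The sketch below relies on PyTorch's documented parameter layout, in which the four gate blocks are stacked in the order input, forget, cell candidate, output; the sizes and seed are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=6)
x_t = torch.randn(2, 4)
h_prev, c_prev = torch.randn(2, 6), torch.randn(2, 6)

h_ref, c_ref = cell(x_t, (h_prev, c_prev))

# All four gate pre-activations in one matrix product, then split
gates = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
c_tilde = torch.tanh(g)

c_t = f * c_prev + i * c_tilde      # additive cell-state update
h_t = o * torch.tanh(c_t)           # output gate decides what is exposed

print(torch.allclose(c_t, c_ref, atol=1e-5), torch.allclose(h_t, h_ref, atol=1e-5))
```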

import torch
import torch.nn as nn

# LSTM usage in PyTorch
lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,       # stacked LSTM
    batch_first=True,   # input shape: (batch, seq, features)
    dropout=0.2,        # applied between stacked layers
    bidirectional=False
)

x = torch.randn(32, 50, 64)   # (batch=32, seq_len=50, input=64)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # (32, 50, 128) — all time-step hidden states
print(h_n.shape)     # (2, 32, 128)  — final hidden state, both layers
print(c_n.shape)     # (2, 32, 128)  — final cell state, both layers

# GRU: simplified LSTM with only 2 gates — often comparable quality
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
out_gru, h_gru = gru(x)

# For classification, use the LAST hidden state:
last_h = output[:, -1, :]  # (32, 128) — last time step
classifier = nn.Linear(128, 5)
logits = classifier(last_h)
Why can vanilla RNNs not learn long-range dependencies effectively?
Why do LSTMs use additive cell state updates rather than the multiplicative updates of vanilla RNNs?
11. What is the self-attention mechanism in Transformers and why did it replace RNNs for sequence modeling?

Self-attention computes a weighted sum of all input vectors, where the weight between positions i and j reflects how much position i should 'attend to' position j. Concretely, input vectors are linearly projected into queries (Q), keys (K), and values (V), and the attention output is: Attention(Q, K, V) = softmax(QKᵀ/√dₖ) · V. The division by √dₖ prevents the dot products from growing large in high-dimensional spaces, which would push softmax into saturation.

Multi-head attention runs H parallel attention heads with different Q/K/V projections, then concatenates and projects their outputs, so each head can learn to attend to a different type of relationship. The critical advantage over RNNs: a single self-attention layer relates any two positions directly, so the path between them has length O(1) regardless of distance, while an RNN needs O(n) sequential steps to relate positions n apart. The pairwise scores cost O(n²) compute and memory per layer, but they are all computed in parallel rather than step by step. This makes transformers trainable in parallel across the sequence length, enabling training on vastly larger datasets.
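
The role of the √dₖ factor is easy to see empirically. The sketch below assumes query and key components are independent with unit variance, which is an idealisation of what trained projections produce.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10_000, d_k)   # unit-variance query vectors (assumed)
k = torch.randn(10_000, d_k)   # unit-variance key vectors (assumed)

raw = (q * k).sum(dim=-1)      # one dot product per row
scaled = raw / math.sqrt(d_k)

raw_std = raw.std().item()
scaled_std = scaled.std().item()
print(raw_std)                 # close to sqrt(d_k) = 8
print(scaled_std)              # close to 1

# Larger logits push softmax towards a one-hot output
logits = torch.randn(8)
peak_unscaled = torch.softmax(logits * math.sqrt(d_k), dim=0).max().item()
peak_scaled = torch.softmax(logits, dim=0).max().item()
print(peak_unscaled, peak_scaled)
```

Without the scaling, the largest attention weight sits much closer to 1, which is the saturated regime where softmax gradients become very small.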

import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.shape[-1]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        return weights @ V, weights

# PyTorch's built-in multi-head attention
mha = nn.MultiheadAttention(
    embed_dim=512,
    num_heads=8,    # 8 heads, each with dim=64
    dropout=0.1,
    batch_first=True
)

seq_len, batch, d_model = 20, 4, 512
x = torch.randn(batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)  # Q=K=V=x for self-attention
print(out.shape)         # (4, 20, 512)
print(attn_weights.shape)# (4, 20, 20) — weight of each position pair
Why is the dot product in scaled dot-product attention divided by √d_k?
What is the key efficiency advantage of self-attention over RNNs for long sequences?
12. What loss functions does PyTorch provide for classification and regression, and which to use when?

The choice of loss function should match the output type and the probabilistic assumption about the data-generating process — it is the mathematical link between model predictions and the training signal.

Common PyTorch Loss Functions

| Task | Loss | PyTorch class | Notes |
| --- | --- | --- | --- |
| Binary classification | Binary cross-entropy | nn.BCEWithLogitsLoss | Takes logits (pre-sigmoid); numerically stable |
| Multi-class classification | Cross-entropy | nn.CrossEntropyLoss | Takes logits; combines log-softmax + NLLLoss |
| Regression | MSE | nn.MSELoss | Sensitive to outliers |
| Regression (robust) | MAE / Huber | nn.L1Loss / nn.HuberLoss | Huber blends L1+L2; robust to outliers |
| Multi-label classification | BCE per label | nn.BCEWithLogitsLoss | Each label independent; not mutually exclusive |
| Contrastive / metric learning | Triplet margin | nn.TripletMarginLoss | Learns embeddings |

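
The "combines" and "numerically stable" notes in the table can be checked directly. The sketch below uses random logits for the equivalence checks and one deliberately extreme logit (40.0, an arbitrary choice) to show where the two-step sigmoid-then-BCE version loses precision in float32.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))

# CrossEntropyLoss equals NLLLoss applied to log-softmax outputs
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll, atol=1e-6))

# BCEWithLogitsLoss equals BCELoss on sigmoid outputs for moderate logits
bin_logits = torch.randn(8)
bin_labels = torch.randint(0, 2, (8,)).float()
bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_labels)
bce_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_labels)
print(torch.allclose(bce_logits, bce_probs, atol=1e-5))

# Extreme logit: sigmoid rounds to exactly 1.0 in float32, so the two-step
# version has to clamp its log term and no longer reports the true loss
big, zero = torch.tensor([40.0]), torch.tensor([0.0])
stable = nn.BCEWithLogitsLoss()(big, zero).item()
two_step = nn.BCELoss()(torch.sigmoid(big), zero).item()
print(stable)     # about 40, the true loss
print(two_step)   # clamped value, far from 40
```
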
import torch
import torch.nn as nn

# Binary classification — output is a single logit (no sigmoid)
bce = nn.BCEWithLogitsLoss()  # applies sigmoid internally
logit = torch.tensor([2.0, -1.0, 0.5])
label = torch.tensor([1.0, 0.0, 1.0])
loss = bce(logit, label)

# Multi-class — outputs are raw logits per class (no softmax)
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 10)    # batch of 8, 10 classes
targets = torch.randint(0, 10, (8,))  # class indices 0-9
loss = ce(logits, targets)

# Class-weighted cross-entropy — for imbalanced datasets
weights = torch.tensor([1.0]*9 + [10.0])  # up-weight class 9
ce_weighted = nn.CrossEntropyLoss(weight=weights)

# Regression
mse = nn.MSELoss()
huber = nn.HuberLoss(delta=1.0)  # L2 for |error|<1, L1 for |error|>1
pred = torch.randn(32, 1)
true = torch.randn(32, 1)
print(mse(pred, true), huber(pred, true))
Why is nn.BCEWithLogitsLoss preferred over applying torch.sigmoid followed by nn.BCELoss?
What is a key difference between nn.CrossEntropyLoss and nn.BCEWithLogitsLoss?
13. What is transfer learning and how do you fine-tune a pretrained model in PyTorch?

Transfer learning reuses a model trained on a large dataset (typically ImageNet for vision, or a large text corpus for NLP) as a starting point for a related task with less data. The pretrained model has already learned general features (edges, textures, shapes for images; grammar, semantics for text) — fine-tuning adapts these features to the target task without needing to learn them from scratch.

Two common strategies: (1) Feature extraction: freeze all pretrained layers and train only a new task-specific head; (2) Full fine-tuning: unfreeze some or all pretrained layers and train end-to-end with a small learning rate to avoid overwriting the useful pretrained representations. A common practical pattern is to first train only the head for a few epochs, so that the large, noisy gradients from a randomly initialised head do not corrupt the pretrained backbone, then unfreeze and fine-tune everything together with a smaller lr.
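
The head-first, then unfreeze pattern can be sketched without downloading any weights. In the example below the small `backbone` is a hypothetical stand-in for a pretrained network, `set_trainable` and `count_trainable` are helper names invented for this sketch, and the learning rates are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone plus a new task head
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
head = nn.Linear(32, 5)
model = nn.Sequential(backbone, head)

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter of `module`."""
    for p in module.parameters():
        p.requires_grad = flag

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Stage 1: train only the head while the backbone stays frozen
set_trainable(backbone, False)
stage1_opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
n_stage1 = count_trainable(model)

# Stage 2: unfreeze everything, with a smaller lr for the backbone
set_trainable(backbone, True)
stage2_opt = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-4},
])
n_stage2 = count_trainable(model)

print(n_stage1, n_stage2)
```

Creating a fresh optimizer for stage 2 is a design choice in this sketch; it keeps the two stages' learning rates and optimizer state cleanly separated.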

import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# --- Strategy 1: Feature extraction ---
# Freeze ALL pretrained parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final FC layer for our task (e.g. 5 classes)
in_features = backbone.fc.in_features  # 2048 for ResNet-50
backbone.fc = nn.Linear(in_features, 5)
# Only backbone.fc.parameters() have requires_grad=True

# --- Strategy 2: Full fine-tuning ---
backbone2 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone2.fc = nn.Linear(backbone2.fc.in_features, 5)
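# Use layer-wise lr: smaller lr for early layers (values are illustrative).
# Every block must appear in a param group, otherwise it is never updated.
stem = list(backbone2.conv1.parameters()) + list(backbone2.bn1.parameters())
optimizer = torch.optim.AdamW([
    {'params': stem,                          'lr': 1e-5},
    {'params': backbone2.layer1.parameters(), 'lr': 1e-5},
    {'params': backbone2.layer2.parameters(), 'lr': 3e-5},
    {'params': backbone2.layer3.parameters(), 'lr': 3e-5},
    {'params': backbone2.layer4.parameters(), 'lr': 1e-4},
    {'params': backbone2.fc.parameters(),     'lr': 1e-3},
], weight_decay=1e-2)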

# Verify which parameters will be updated
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total     = sum(p.numel() for p in backbone.parameters())
print(f'Trainable: {trainable:,} / Total: {total:,}')
Why is a smaller learning rate recommended for pretrained layers during fine-tuning?
What does setting param.requires_grad = False accomplish in PyTorch?
14. How does PyTorch's Dataset and DataLoader pipeline work, and what are the key performance considerations?

PyTorch's data loading follows a clean two-class design: Dataset encapsulates how to access a single sample (index → (X, y)), and DataLoader wraps a Dataset to handle batching, shuffling, and parallel data loading. Separating these responsibilities makes it easy to write dataset-specific logic once and reuse the same efficient loading infrastructure.

The most critical performance consideration is that the data loading pipeline must keep the GPU continuously fed — the GPU should never sit idle waiting for the next batch. Key knobs: num_workers launches subprocesses that prefetch batches in parallel with the GPU computation; pin_memory=True allocates batch tensors in pinned (non-pageable) CPU memory, enabling faster CPU→GPU transfers via DMA; prefetch_factor controls how many batches each worker prefetches ahead.
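To make the batching behaviour concrete, here is a small sketch on toy data. It uses `TensorDataset` and `num_workers=0` (loading in the main process) purely so that it runs anywhere; the sizes are arbitrary assumptions chosen to show the effect of `drop_last`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples with 3 features each (values are arbitrary).
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y = torch.arange(10)
dataset = TensorDataset(X, y)

# num_workers=0 loads in the main process, which keeps this sketch portable.
keep_last = DataLoader(dataset, batch_size=4, shuffle=False, drop_last=False, num_workers=0)
drop_last = DataLoader(dataset, batch_size=4, shuffle=False, drop_last=True, num_workers=0)

# Collect the size of every batch each loader yields.
sizes_keep = [xb.shape[0] for xb, _ in keep_last]
sizes_drop = [xb.shape[0] for xb, _ in drop_last]
print(sizes_keep)
print(sizes_drop)
```

With 10 samples and a batch size of 4, the final batch is incomplete; `drop_last=True` discards it, which keeps every batch the same size at the cost of skipping a few samples per epoch.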

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class TabularDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        # Convert to tensors once at construction (not per __getitem__)
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)    # required: the sampler uses this to know the dataset size

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # single sample

dataset = TabularDataset(X_train, y_train)

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,             # shuffle each epoch
    num_workers=4,            # parallel data loading
    pin_memory=True,          # faster CPU->GPU transfer
    drop_last=True,           # drop incomplete final batch
    persistent_workers=True,  # keep workers alive between epochs
)

# Training loop
for X_batch, y_batch in loader:
    X_batch = X_batch.cuda(non_blocking=True)  # async transfer
    y_batch = y_batch.cuda(non_blocking=True)
    # ... forward, backward, step
What must every custom PyTorch Dataset class implement?
Why does pin_memory=True in DataLoader improve training throughput?
15. Why is learning rate scheduling important and what are the most common strategies?

A fixed learning rate is a poor choice for most training runs: a rate that is too high causes instability early on and prevents fine convergence late in training, while a rate low enough to converge precisely makes early progress slow. Learning rate schedulers systematically vary the lr during training to get the best of both worlds: fast progress early, precise convergence later.

Common LR Schedules
| Schedule | Behaviour | Best for |
| --- | --- | --- |
| StepLR | Multiply lr by γ every N epochs | Quick experiments; baseline |
| CosineAnnealingLR | lr follows cosine curve from η_max to η_min | Most DL tasks; smooth decay |
| OneCycleLR | Warmup from low to high lr, then decay, all in one cycle | Fast training (super-convergence) |
| ReduceLROnPlateau | Reduce lr when validation metric stops improving | Unknown training time; auto-adapts |
| CyclicLR | Cycle between base_lr and max_lr repeatedly | Escaping sharp minima |
| Warmup then decay (no single built-in class; can be composed, e.g. LinearLR + CosineAnnealingLR via SequentialLR) | Linear warmup then cosine decay | Large transformers, LLMs |
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# CosineAnnealingLR — smooth decay from max to min lr
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# OneCycleLR — requires total_steps at init
n_epochs, steps_per_epoch = 10, 100
scheduler_1cycle = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    total_steps=n_epochs * steps_per_epoch,
    pct_start=0.3,   # 30% of steps for warmup
)

# ReduceLROnPlateau — triggered by validation metric
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

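# Training loop. The three schedulers above are alternatives: in a real run,
# attach only one of them to the optimizer. OneCycleLR would be stepped once
# per batch (inside train_one_epoch) rather than once per epoch.
for epoch in range(100):
    train_one_epoch(model, optimizer, loader)
    val_loss = validate(model, val_loader)

    scheduler_cos.step()                 # epoch-based scheduler
    # scheduler_plateau.step(val_loss)   # metric-based alternative
    print(f'LR: {optimizer.param_groups[0]["lr"]:.6f}')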
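The "linear warmup then cosine decay" row of the table is not a single built-in scheduler, so it can help to see the schedule as a plain function of the step count. The function name `warmup_cosine_lr` and all constants below are illustrative assumptions.

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     base_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly; the last warmup step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total, warmup = 1000, 100
schedule = [warmup_cosine_lr(s, total, warmup, base_lr=1e-3, min_lr=1e-6)
            for s in range(total + 1)]

print(schedule[0], schedule[warmup - 1], schedule[-1])
```

A function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the optimizer's base lr instead of the absolute value.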
Why is a warmup phase (low → high lr) commonly used at the start of training large models?
When is ReduceLROnPlateau the most appropriate scheduler?
16. What are the most effective regularization strategies for deep learning and how do they differ from classical ML regularization?

Deep neural networks have millions of parameters and can trivially memorise training data. Classical regularisation (L1/L2 on weights) still applies, but modern deep learning has developed additional techniques that often work better or are used in combination.

DL Regularization Techniques
| Technique | How it works | Best applied to |
| --- | --- | --- |
| L2 (weight decay) | Penalises large weights: adds λ‖w‖² to loss | All DL models; use AdamW for correct implementation |
| Dropout | Randomly zero neurons during training | Fully-connected layers; less common in conv/transformer |
| Data augmentation | Artificially increase diversity of training set | Vision (flips, crop, colour jitter, mixup, cutmix) |
| Early stopping | Stop training when val loss stops improving | Any model; simple and effective baseline |
| Label smoothing | Soften one-hot labels: 1-ε+ε/K on the true class, ε/K on every other class (K classes, as in PyTorch's CrossEntropyLoss) | Classification; improves calibration |
| Stochastic depth | Randomly drop entire residual blocks during training | Very deep networks (ResNets, ViTs) |
import torch
import torch.nn as nn
import torchvision.transforms as T

# Data augmentation for images
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(32, padding=4),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Label smoothing: penalises overconfident predictions
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# A 10-class example: true label 3 becomes
# [0.01, 0.01, 0.01, 0.91, 0.01, ...] instead of [0,0,0,1,0,...]

# Mixup augmentation (manual implementation)
def mixup_batch(x, y, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_a, y_b = y, y[idx]
    return x_mix, y_a, y_b, lam

# Early stopping: track best val loss, restore best weights
max_epochs, patience = 100, 5   # illustrative values
best_val_loss = float('inf')
patience_count = 0
for epoch in range(max_epochs):
    # ... train for one epoch here ...
    val_loss = validate(model, val_loader)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        patience_count = 0
    else:
        patience_count += 1
    if patience_count >= patience:
        break
model.load_state_dict(torch.load('best_model.pt'))  # restore best weights
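As a check on what label smoothing does to the targets, the following sketch builds the smoothed distribution by hand and compares the resulting loss with `nn.CrossEntropyLoss(label_smoothing=...)`. The logits are random and the batch of four targets is an arbitrary assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, eps = 10, 0.1
logits = torch.randn(4, num_classes)
targets = torch.tensor([3, 0, 7, 3])

# Smoothed targets: (1 - eps) on the true class plus eps / K spread over all K classes.
one_hot = F.one_hot(targets, num_classes).float()
smoothed = (1.0 - eps) * one_hot + eps / num_classes

# Cross-entropy against the soft targets, averaged over the batch.
manual_loss = -(smoothed * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
builtin_loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, targets)

print(smoothed[0])
print(manual_loss.item(), builtin_loss.item())
```

The first row of `smoothed` has 0.91 at index 3 and 0.01 elsewhere, matching the 10-class example in the comment above.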
What does label smoothing achieve compared to using hard one-hot targets?
Why is data augmentation one of the most effective regularization strategies for vision models?
17. What are embedding layers in deep learning and how are they different from one-hot encoding?

An embedding layer is a learnable lookup table that maps discrete tokens (words, categories, user IDs) to dense, low-dimensional real-valued vectors. It is mathematically a matrix E ∈ ℝ^{V×d} (vocabulary size × embedding dimension), and looking up token i simply retrieves row i. This is equivalent to multiplying a one-hot vector by E, but it is implemented as a direct row lookup costing O(d) rather than a matrix multiply costing O(V·d).

The key advantage over one-hot encoding is that embeddings are learned — similar tokens (synonyms, related categories) naturally end up with similar embedding vectors because they appear in similar contexts during training. This gives embeddings semantic meaning and enables generalisation: the model can leverage the fact that 'Paris' and 'Berlin' are semantically similar even if 'Berlin' was rare in training data, because their embedding vectors will be nearby.
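The equivalence between a row lookup and a one-hot matrix product can be checked numerically. This is a minimal sketch with deliberately tiny, arbitrary sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 6, 4          # tiny sizes chosen for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 5, 2])

# Path 1: direct row lookup.
looked_up = embedding(token_ids)                          # shape (3, 4)

# Path 2: one-hot vectors multiplied by the same matrix E.
one_hot = F.one_hot(token_ids, vocab_size).float()        # shape (3, 6)
multiplied = one_hot @ embedding.weight                   # shape (3, 4)

print(torch.allclose(looked_up, multiplied))
```

Both paths return the same vectors; the lookup simply avoids materialising the V-dimensional one-hot vectors and multiplying by mostly zeros.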

import torch
import torch.nn as nn

vocab_size  = 10000
embed_dim   = 128

embedding = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=embed_dim,
    padding_idx=0    # token 0 gets a fixed zero vector (PAD token)
)

# Input: integer token IDs
token_ids = torch.tensor([[1, 5, 23, 0], [42, 7, 0, 0]])  # (2, 4)
embedded  = embedding(token_ids)
print(embedded.shape)  # (2, 4, 128) — each token -> 128-dim vector

# Pre-trained embeddings (e.g. GloVe, Word2Vec)
pretrained = torch.randn(vocab_size, embed_dim)  # replace with real vectors
embedding.weight.data.copy_(pretrained)
# Freeze pretrained embeddings:
# embedding.weight.requires_grad = False

# In a text model:
class TextClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm  = nn.LSTM(embed_dim, 256, batch_first=True)
        self.fc    = nn.Linear(256, 5)
    def forward(self, x):
        e = self.embed(x)          # (B, L, 128)
        _, (h, _) = self.lstm(e)
        return self.fc(h[-1])
What is the computational advantage of using an embedding layer over multiplying a one-hot vector by a weight matrix?
Why do similar tokens end up with similar embedding vectors after training?
18. How do you save and load PyTorch models correctly, and what is included in a proper checkpoint?

PyTorch provides two main ways to persist a model: saving the full model object (convenient but fragile to class definition changes) or saving only the state dictionary (recommended for production and reproducibility). The state dict is a Python OrderedDict mapping names to tensors: the learnable parameters plus persistent buffers such as BatchNorm running statistics. Together with the model class definition, it contains everything needed to recreate the model's learned state.

A proper training checkpoint includes more than just model weights — it must also save the optimizer state (which contains momentum buffers and adaptive learning rate accumulators in Adam), the current epoch and step, the best validation metric, and the random number generator state, so that training can be resumed exactly where it left off without any change in behaviour.
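To see what these state dictionaries actually hold, the sketch below takes one optimizer step and inspects the keys, then round-trips the weights through an in-memory buffer (used here instead of a file only to keep the example self-contained).

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so that Adam has populated its per-parameter state.
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()

model_keys = list(model.state_dict().keys())
opt_keys = sorted(optimizer.state_dict().keys())
# Per-parameter Adam state for the first parameter (index 0).
adam_buffers = sorted(optimizer.state_dict()['state'][0].keys())
print(model_keys, opt_keys, adam_buffers)

# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
restored = nn.Linear(10, 5)
restored.load_state_dict(torch.load(buffer))
```

The Adam entries (`exp_avg`, `exp_avg_sq`) are the moment estimates mentioned above; dropping them on resume would restart the optimizer's statistics from zero.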

import torch
import torch.nn as nn

model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── Recommended: save/load state_dict ──
torch.save(model.state_dict(), 'model_weights.pt')

model_new = nn.Linear(10, 5)               # same architecture
model_new.load_state_dict(torch.load('model_weights.pt'))
model_new.eval()                            # ALWAYS call eval() for inference

# ── Full training checkpoint ──
def save_checkpoint(path, epoch, model, optimizer, best_val_loss):
    torch.save({
        'epoch':         epoch,
        'model_state':   model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'best_val_loss': best_val_loss,
        'rng_state':     torch.get_rng_state(),
    }, path)

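def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    # Restore the CPU RNG; CUDA RNG state would need torch.cuda.get_rng_state_all() at save time
    torch.set_rng_state(ckpt['rng_state'])
    return ckpt['epoch'], ckpt['best_val_loss']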

# Loading on a different device: map tensors to CPU first, then move to the
# target device (works even if the GPU the checkpoint was saved from is unavailable)
model.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu')
)
model = model.to('cuda')
Why is saving model.state_dict() preferred over saving the entire model object with torch.save(model)?
Why is the optimizer state included in a training checkpoint alongside the model weights?
19. What is mixed precision training and how does it speed up deep learning with torch.cuda.amp?

Modern GPUs (Volta and later) have dedicated hardware for 16-bit floating-point operations (FP16 / BFloat16) that can be 2–8× faster than FP32 for matrix multiplications. Mixed precision training runs the forward pass and gradient computations in FP16 (or BF16) for speed, while maintaining a master copy of the weights in FP32 for numerical precision during the optimizer update.

Loss scaling addresses a key challenge: FP16's limited dynamic range (smallest positive ≈ 6×10⁻⁸) can cause small gradient values to underflow to zero. The scaler multiplies the loss by a large scalar before backward (inflating gradients into FP16's representable range), then divides the gradients back before the optimizer step. PyTorch's GradScaler automates this and dynamically adjusts the scale factor.
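The underflow problem and the effect of scaling can be reproduced on CPU with a single number. This is only an arithmetic illustration of the idea, not how `GradScaler` is implemented; the gradient value and the scale factor are illustrative assumptions.

```python
import torch

grad_fp32 = torch.tensor(1e-8, dtype=torch.float32)   # a small "true" gradient value

# Casting directly to FP16 underflows: 1e-8 is below the smallest FP16 subnormal.
direct_fp16 = grad_fp32.to(torch.float16)

# Scaling first keeps the value representable in FP16.
scale = 2.0 ** 16                                      # illustrative scale factor
scaled_fp16 = (grad_fp32 * scale).to(torch.float16)

# Unscale in FP32, as is done before the optimizer step.
recovered = scaled_fp16.to(torch.float32) / scale

print(direct_fp16.item(), scaled_fp16.item(), recovered.item())
```

The direct cast yields exactly zero, while the scaled path recovers the original value up to FP16 rounding error.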

import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler  # newer releases: torch.amp.GradScaler('cuda')

model     = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler    = GradScaler()           # manages loss scaling automatically

x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()

for step in range(100):
    optimizer.zero_grad()

    # autocast: runs eligible ops in FP16 automatically
    # (torch.autocast takes device_type; torch.cuda.amp.autocast does not)
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        y_hat = model(x)           # FP16 matrix multiply
        loss  = nn.MSELoss()(y_hat, y)

    # Scale loss -> backward -> unscale gradients -> optimizer step
    scaler.scale(loss).backward()  # inflate loss to prevent underflow
    scaler.unscale_(optimizer)     # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip after unscale
    scaler.step(optimizer)         # skip step if gradients are inf/NaN
    scaler.update()                # adjust scale factor for next step

# BFloat16 (bfloat16): supported on Ampere (e.g. A100) and newer GPUs
# - Same exponent range as FP32, so gradient underflow is rare and a scaler is usually unnecessary
# - Less precision (7-bit mantissa vs 10-bit for FP16)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    y_hat = model(x)  # no scaler needed with BF16
What problem does loss scaling solve in FP16 mixed precision training?
Why does BFloat16 not require a GradScaler while FP16 does?
20. What is the difference between model.eval(), torch.no_grad(), and torch.inference_mode()? When do you use each?

These three mechanisms serve different but complementary purposes that are often confused. Understanding the distinction prevents subtle bugs in training, validation, and inference code.

eval vs no_grad vs inference_mode
| Mechanism | What it controls | Effect |
| --- | --- | --- |
| model.eval() | Layer behaviour (Dropout, BatchNorm) | Disables Dropout; BatchNorm uses running stats instead of batch stats |
| model.train() | Layer behaviour (Dropout, BatchNorm) | Enables Dropout; BatchNorm uses current batch stats |
| torch.no_grad() | Gradient tracking | Stops building the computation graph; saves memory; results cannot be backpropagated through |
| torch.inference_mode() | Gradient tracking + view tracking | Stricter than no_grad; can be somewhat faster; returned tensors cannot be used in autograd at all |
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.BatchNorm1d(64), nn.Dropout(0.5),
    nn.Linear(64, 1)
)

# ─── Training ───────────────────────────────────────────────────────
model.train()   # Dropout ACTIVE, BatchNorm uses batch stats
x = torch.randn(32, 10)
out1 = model(x)
out2 = model(x)  # DIFFERENT — Dropout randomly drops each call

# ─── Validation (compute val loss, need backward later? No) ─────────
model.eval()
with torch.no_grad():
    # Dropout OFF, BatchNorm uses running stats, no computation graph
    val_out = model(x)
    val_loss = nn.MSELoss()(val_out, torch.zeros(32, 1))

# ─── Inference / deployment ─────────────────────────────────────────
model.eval()
with torch.inference_mode():  # fastest; cannot go back to autograd
    pred = model(torch.randn(1, 10))

# COMMON BUG: forgetting model.eval() at inference
# model.eval() and torch.no_grad() are INDEPENDENT — you need BOTH:
# - model.eval() alone: still builds graph (memory waste)
# - torch.no_grad() alone: Dropout still active (wrong predictions)
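The independence of the two mechanisms can be observed directly from tensor attributes. The following is a small sketch on a throwaway model; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
x = torch.randn(32, 10)

# eval() only: Dropout is off, so repeated calls agree, but autograd still tracks the output.
model.eval()
eval_a, eval_b = model(x), model(x)
eval_deterministic = torch.equal(eval_a, eval_b)
eval_tracks_grad = eval_a.requires_grad

# no_grad() only: no graph is built, but Dropout is still active in train mode.
model.train()
with torch.no_grad():
    nograd_out = model(x)
nograd_tracks_grad = nograd_out.requires_grad

# inference_mode(): outputs are inference tensors that autograd cannot use later.
model.eval()
with torch.inference_mode():
    inf_out = model(x)
inf_is_inference = inf_out.is_inference()

print(eval_deterministic, eval_tracks_grad, nograd_tracks_grad, inf_is_inference)
```

`eval()` changes what the layers compute, while `no_grad()` and `inference_mode()` change what autograd records; neither implies the other.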
What happens if you call model.eval() but forget torch.no_grad() during validation?
What does BatchNorm do differently in eval mode vs train mode?
21. How do you use GPUs in PyTorch and what are the key patterns for writing device-agnostic code?

PyTorch's device abstraction allows the same code to run on CPU, single GPU, or multiple GPUs with minimal changes. The fundamental operations are moving tensors to a device with .to(device) or .cuda(), and ensuring model and data tensors always reside on the same device before any computation.

A critical performance concept: CPU–GPU data transfers are expensive (PCIe bandwidth is limited vs. GPU memory bandwidth). Minimise them by loading data onto the GPU once per batch, pre-computing dataset statistics on CPU, and avoiding frequent tensor transfers inside the training loop.

import torch
import torch.nn as nn

# Device-agnostic code pattern
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using: {device}')  # cuda or cpu

# Move model to device
model = nn.Linear(10, 5).to(device)

# Move data to device in the training loop
for X_batch, y_batch in loader:
    X_batch = X_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    y_hat = model(X_batch)
    # ...

# Check which device a tensor is on
t = torch.randn(3)
print(t.device)         # cpu
t_gpu = t.cuda()        # or t.to('cuda:0')
print(t_gpu.device)     # cuda:0

# Apple Silicon
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

# Memory diagnostics
print(torch.cuda.memory_allocated() / 1e9, 'GB allocated')
print(torch.cuda.max_memory_allocated() / 1e9, 'GB peak')
torch.cuda.empty_cache()  # release unused cached GPU memory

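# Multi-GPU: DistributedDataParallel (DDP) preferred over DataParallel.
# DDP runs one process per GPU (e.g. launched with torchrun) and needs an
# initialised process group, so it cannot be constructed in a plain script:
#   torch.distributed.init_process_group(backend='nccl')
#   local_rank = int(os.environ['LOCAL_RANK'])   # set by torchrun
#   model_ddp = nn.parallel.DistributedDataParallel(
#       model.to(local_rank), device_ids=[local_rank])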
Why is non_blocking=True in tensor.to(device) beneficial during training?
What is the most important rule to avoid runtime errors when doing GPU computations in PyTorch?
22. What are the differences between Batch Norm, Layer Norm, Group Norm, and Instance Norm?

All normalisation variants compute mean and variance and apply the same transformation (x-μ)/√(σ²+ε) — they differ only in which dimensions the mean and variance are computed over. This seemingly small difference has large practical consequences depending on the architecture and batch size.

Normalisation Comparison
| Method | Normalises over | Best for | Key limitation |
| --- | --- | --- | --- |
| BatchNorm | Batch + spatial dims per channel | CNNs, large batch MLP | Breaks with batch_size=1; train/eval difference |
| LayerNorm | All features per sample | Transformers, NLP, RNNs | Slower than BN on large spatial dims |
| InstanceNorm | Spatial dims per channel per sample | Style transfer, GAN | Loses channel statistics |
| GroupNorm | Spatial dims per group of channels per sample | Object detection, small batch | Requires choosing n_groups |
import torch
import torch.nn as nn

# BatchNorm1d: normalise over batch for FC layers
# Input: (N, C) or (N, C, L)
bn = nn.BatchNorm1d(num_features=128)

# LayerNorm: normalise over feature dim(s) — no dependency on batch
# Input: (*, normalized_shape)  — last dims are normalised
ln = nn.LayerNorm(normalized_shape=128)  # used in transformers
ln_2d = nn.LayerNorm([128, 8, 8])        # can normalise spatial too

# GroupNorm: split channels into groups, normalise per group per sample
# Input: (N, C, *)  — C must be divisible by num_groups
gn = nn.GroupNorm(num_groups=8, num_channels=128)

# InstanceNorm: each sample, each channel independently
inst = nn.InstanceNorm2d(num_features=128)

# Example: why LayerNorm is used in transformers
d_model = 512
x = torch.randn(4, 20, d_model)   # (batch, seq_len, d_model)
# BatchNorm would normalise over batch and seq_len per feature dim —
# unstable at inference when batch=1 (as in autoregressive generation)
# LayerNorm normalises over d_model for each (batch, seq) position independently
print(ln(x).shape)  # (4, 20, 512) — each position normalised independently
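Since the four variants differ only in the dimensions used for the statistics, one helper can reproduce all of them. The sketch below compares a hand-written `normalise` (a hypothetical helper) against the `torch.nn` layers in their default configuration, with BatchNorm in training mode; the tensor shape is an arbitrary assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8, 5, 5)          # (N, C, H, W)
eps = 1e-5                            # default eps of the torch.nn norm layers

def normalise(t: torch.Tensor, dims) -> torch.Tensor:
    """Apply (t - mean) / sqrt(var + eps) with statistics taken over `dims`."""
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, unbiased=False, keepdim=True)
    return (t - mean) / torch.sqrt(var + eps)

# BatchNorm: statistics over batch and spatial dims, one pair per channel.
bn_manual = normalise(x, (0, 2, 3))
bn_module = nn.BatchNorm2d(8).train()(x)

# LayerNorm: statistics over all features of each sample.
ln_manual = normalise(x, (1, 2, 3))
ln_module = nn.LayerNorm([8, 5, 5])(x)

# InstanceNorm: statistics over spatial dims, per sample and per channel.
in_manual = normalise(x, (2, 3))
in_module = nn.InstanceNorm2d(8)(x)

# GroupNorm with 2 groups: statistics over each group of 4 channels plus spatial dims.
gn_manual = normalise(x.reshape(4, 2, 4 * 5 * 5), (2,)).reshape(x.shape)
gn_module = nn.GroupNorm(num_groups=2, num_channels=8)(x)

for name, a, b in [('batch', bn_manual, bn_module), ('layer', ln_manual, ln_module),
                   ('instance', in_manual, in_module), ('group', gn_manual, gn_module)]:
    print(name, torch.allclose(a, b, atol=1e-5))
```

Only the BatchNorm line mixes information across samples, which is why it is the only variant whose output for one sample depends on the rest of the batch.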
Why is LayerNorm preferred over BatchNorm in transformer architectures?
When is GroupNorm particularly useful compared to BatchNorm?
23. What is an autoencoder and what can a well-trained latent space be used for?

An autoencoder is a neural network trained to reconstruct its input through a bottleneck. The encoder f: X → Z maps inputs to a lower-dimensional latent space Z, and the decoder g: Z → X̂ reconstructs the input. Training minimises the reconstruction loss (e.g. MSE for continuous inputs, binary cross-entropy for binary) without any labels — it is an unsupervised learning technique.

The bottleneck forces the encoder to learn a compressed, information-dense representation. A well-trained latent space can be used for: (1) dimensionality reduction and visualisation (it can capture non-linear structure that PCA, being linear, cannot); (2) anomaly detection (normal samples reconstruct well; anomalies have high reconstruction error); (3) de-noising (train with noisy input and clean target, as in denoising autoencoders); (4) generative modelling (Variational Autoencoders / VAEs impose a probabilistic structure on Z that enables generation).

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()  # pixel values in [0,1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

ae = Autoencoder()
optimizer = torch.optim.Adam(ae.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for X_batch, _ in loader:   # labels not used!
    X_flat = X_batch.view(X_batch.size(0), -1)  # flatten images
    X_hat  = ae(X_flat)
    loss   = criterion(X_hat, X_flat)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Anomaly detection at inference:
ae.eval()
with torch.no_grad():
    X_hat = ae(test_samples)
    recon_error = ((test_samples - X_hat) ** 2).mean(dim=1)
# High recon_error => anomalous sample
Why can autoencoders detect anomalies based on reconstruction error?
What is the difference between a standard autoencoder and a Variational Autoencoder (VAE)?
24. How do you diagnose a neural network that is not training correctly from its loss curves?

Reading loss curves is one of the most important practical skills in deep learning. The shape of the training and validation loss over time reveals the failure mode and guides the fix.

Common Training Failure Modes
| Loss curve shape | Diagnosis | Likely fix |
| --- | --- | --- |
| Loss is NaN from the start | Exploding gradients or bad init | Gradient clipping, lower lr, check data for inf/NaN |
| Loss doesn't decrease at all | Vanishing gradient, lr too low, dead neurons | Check activations, raise lr, use He init + ReLU |
| Loss decreases then plateaus early | Learning rate too high or model too small | Reduce lr / lr schedule, increase capacity |
| Train loss low, val loss high (large gap) | Overfitting | More regularisation: dropout, weight decay, augmentation, early stopping |
| Both losses plateau at high value | Underfitting (high bias) | Increase model capacity, train longer, reduce regularisation |
| Loss oscillates wildly | Learning rate too high | Reduce lr, use lr schedule, check batch size |
import torch
import torch.nn as nn

# Checking for gradient issues
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()

    # Check for NaN/Inf in loss
    if not torch.isfinite(loss):
        print(f'Step {step}: non-finite loss = {loss.item()}')
        break

    # Monitor gradient norms
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.data.norm(2).item() ** 2
    total_norm = total_norm ** 0.5
    if step % 100 == 0:
        print(f'Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}')

    optimizer.step()

# Check dead ReLU units: units that output zero for every sample in X
def count_dead_neurons(model, X):
    dead_fractions = []
    def hook(m, inp, out):
        # A unit counts as dead only if it is inactive across the whole batch
        always_zero = (out <= 0).all(dim=0)
        dead_fractions.append(always_zero.float().mean().item())
    handles = [l.register_forward_hook(hook)
               for l in model.modules() if isinstance(l, nn.ReLU)]
    with torch.no_grad():
        model(X)
    for h in handles:
        h.remove()
    return dead_fractions  # fraction of dead units per ReLU layer
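Gradient clipping, listed above as a fix for exploding gradients, also doubles as a monitoring tool because `clip_grad_norm_` returns the norm it measured. A minimal sketch, assuming deliberately large random inputs to provoke large gradients; `grad_norm` is a hypothetical helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)
X, y = torch.randn(64, 10) * 50.0, torch.randn(64, 5)   # large inputs -> large gradients

loss = nn.MSELoss()(model(X), y)
loss.backward()

def grad_norm(module: nn.Module) -> float:
    """Global L2 norm over all parameter gradients."""
    squares = [p.grad.norm(2).item() ** 2 for p in module.parameters() if p.grad is not None]
    return sum(squares) ** 0.5

norm_before = grad_norm(model)
# clip_grad_norm_ rescales gradients in place and returns the norm measured before clipping.
returned = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
norm_after = grad_norm(model)
print(norm_before, returned, norm_after)
```

Logging the returned value each step gives the same signal as the manual loop while also bounding the update size.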
If training loss decreases steadily but validation loss diverges upward, what is the most likely diagnosis?
What does a training loss that never decreases (stays near its initial value from epoch 1) typically indicate?
25. What is the mathematical setup of a Generative Adversarial Network (GAN) and what training challenges do they have?

A GAN consists of two competing networks: a generator G that maps random noise z ~ p(z) to fake data samples, and a discriminator D that classifies inputs as real or fake. They play a minimax game with objective: min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]. At the Nash equilibrium, G produces samples from the true data distribution and D outputs 0.5 for every input (cannot distinguish real from fake).

In practice, GANs suffer from several well-known training challenges: mode collapse (G learns to produce only a subset of modes of the data distribution); training instability (the minimax game does not converge reliably); and vanishing generator gradient (when D becomes too good early on, it correctly classifies fake samples with near-certainty, giving G near-zero gradient signal). These led to many GAN variants — DCGAN (convolutional architecture), WGAN (Wasserstein distance instead of JS divergence), and progressive growing GANs.
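Two of the claims above reduce to short calculations: the value of the objective at equilibrium, and how small the generator's gradient becomes when D is confident. The sketch below works with a single discriminator logit `a` (so D = sigmoid(a)); the value a = -10 is an arbitrary assumption representing a confident "fake" verdict.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

# Value of the minimax objective at equilibrium, where D outputs 0.5 everywhere.
equilibrium_value = math.log(0.5) + math.log(1.0 - 0.5)   # equals -log(4)

# Suppose D is very confident a generated sample is fake: logit a = -10, D = sigmoid(a).
a = -10.0
d_fake = sigmoid(a)

# Saturating generator loss log(1 - D): derivative w.r.t. the logit is -sigmoid(a).
grad_saturating = -d_fake
# Non-saturating generator loss -log(D): derivative w.r.t. the logit is -(1 - sigmoid(a)).
grad_non_saturating = -(1.0 - d_fake)

print(equilibrium_value, grad_saturating, grad_non_saturating)
```

The saturating form gives the generator almost no signal in this regime, whereas the non-saturating form keeps the gradient close to -1. Training G against a target of ones with a binary cross-entropy loss corresponds to the non-saturating form.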

import torch
import torch.nn as nn

latent_dim, img_dim = 100, 784

# Generator: noise -> fake image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, img_dim), nn.Tanh()
)

# Discriminator: image -> real (1) or fake (0)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1)  # raw logit; use BCEWithLogitsLoss
)

criterion = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(generator.parameters(),     lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real_imgs, _ in loader:
    real_imgs = real_imgs.view(-1, img_dim)
    bs = real_imgs.size(0)

    # Train Discriminator
    z = torch.randn(bs, latent_dim)
    fake_imgs = generator(z).detach()  # detach: don't update G here
    loss_D = (criterion(discriminator(real_imgs), torch.ones(bs, 1))
            + criterion(discriminator(fake_imgs), torch.zeros(bs, 1)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train Generator
    z = torch.randn(bs, latent_dim)
    loss_G = criterion(discriminator(generator(z)), torch.ones(bs, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
What is mode collapse in GAN training?
Why does detach() need to be called on generator output when training the discriminator?
26. What is torch.compile and how does it speed up PyTorch model execution?

Introduced in PyTorch 2.0, torch.compile applies just-in-time (JIT) compilation to a PyTorch model or function. Rather than executing each operation eagerly (PyTorch's default), it captures the computation as a graph on the first call, optimises it (fusing operations, eliminating redundant memory reads/writes), and compiles it to efficient machine code using a backend (TorchInductor by default, which generates Triton kernels for GPUs and C++ code for CPUs).

The primary benefit is kernel fusion: instead of launching a separate GPU kernel for each operation (e.g. separate kernels for matrix multiply, add bias, and ReLU), the compiler fuses them into a single kernel that reads and writes GPU memory once. GPU memory bandwidth is often the bottleneck for transformer-style models, so reducing memory round-trips translates directly into throughput gains. Speedups in the tens of percent are commonly reported for training and inference on modern hardware, though the actual gain depends on the model, input shapes, and GPU.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.GELU(),
    nn.Linear(1024, 512), nn.GELU(),
    nn.Linear(512, 10)
)

# Compile the model — first call triggers compilation (may take 30s+)
compiled_model = torch.compile(model)

# Usage is identical to a regular model
x = torch.randn(256, 1024).cuda()
compiled_model = compiled_model.cuda()
out = compiled_model(x)   # warm-up: triggers compilation
out = compiled_model(x)   # subsequent calls use compiled kernels

# Compilation modes (trade-off between compile time and runtime speed)
model_default = torch.compile(model)                           # balanced default
model_reduce  = torch.compile(model, mode='reduce-overhead')   # lowers per-call overhead (uses CUDA graphs)
model_max     = torch.compile(model, mode='max-autotune')      # slowest to compile, aims for fastest runtime

# Measure speedup
import time
x = torch.randn(512, 1024, device='cuda')
for _ in range(5): model(x)   # warm-up
torch.cuda.synchronize()      # finish pending GPU work before starting the clock
t0 = time.time()
for _ in range(100): model(x)
torch.cuda.synchronize()
print('Eager:', time.time() - t0)

for _ in range(5): compiled_model(x)   # warm-up; a new input shape may trigger recompilation
torch.cuda.synchronize()
t0 = time.time()
for _ in range(100): compiled_model(x)
torch.cuda.synchronize()
print('Compiled:', time.time() - t0)
What is the primary technique torch.compile uses to accelerate model execution?
Why does the first call to a torch.compile'd model take much longer than subsequent calls?
27. Why do Transformers need positional encodings and how does sinusoidal encoding work?

Self-attention is permutation equivariant — swapping two positions in the input produces the same output with those two positions swapped, because attention treats all positions symmetrically. Without positional information, a transformer cannot distinguish 'The dog bit the man' from 'The man bit the dog'. Positional encodings inject sequence order information into the token embeddings before they enter the transformer.

The original 'Attention is All You Need' paper uses sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}), where pos is the position and i is the dimension index. Each dimension oscillates at a different frequency, giving a unique fingerprint to every position. The key properties: (1) each position has a unique encoding; (2) for any fixed offset k, the encoding of position pos+k is a linear function of the encoding of position pos (a rotation of each sin/cos pair), allowing the model to reason about relative distances; (3) it is defined for every position, so it can in principle be applied to sequence lengths unseen during training.

import torch
import math

def sinusoidal_positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sin
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cos
    return pe  # (max_len, d_model)

import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = sinusoidal_positional_encoding(max_len, d_model)
        self.register_buffer('pe', pe)  # not a parameter; saved with model

    def forward(self, x):   # x: (batch, seq_len, d_model)
        x = x + self.pe[:x.size(1)]  # add pos encoding to each embedding
        return self.dropout(x)

# Modern alternative: Rotary Position Embeddings (RoPE)
# Used in LLaMA, Mistral — encodes relative rather than absolute position
# Applied directly to Q and K matrices before attention computation
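Property (2), that PE(pos+k) is a linear function of PE(pos), follows from the angle-addition identities and can be verified numerically. The helper names `pe_row` and `shift` below are hypothetical, and d_model, pos and k are arbitrary assumptions.

```python
import math

def pe_row(pos: int, d_model: int) -> list:
    """Sinusoidal encoding for one position: [sin, cos, sin, cos, ...]."""
    row = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000.0 ** (i / d_model))
        row += [math.sin(pos * freq), math.cos(pos * freq)]
    return row

def shift(row: list, k: int, d_model: int) -> list:
    """Map PE(pos) to PE(pos + k) with a per-frequency 2x2 rotation that depends only on k."""
    out = []
    for j, i in enumerate(range(0, d_model, 2)):
        freq = 1.0 / (10000.0 ** (i / d_model))
        s, c = row[2 * j], row[2 * j + 1]
        # sin(a + b) = sin a cos b + cos a sin b;  cos(a + b) = cos a cos b - sin a sin b
        out += [s * math.cos(k * freq) + c * math.sin(k * freq),
                c * math.cos(k * freq) - s * math.sin(k * freq)]
    return out

d_model, pos, k = 16, 7, 5
predicted = shift(pe_row(pos, d_model), k, d_model)
actual = pe_row(pos + k, d_model)
max_error = max(abs(p - a) for p, a in zip(predicted, actual))
print(max_error)
```

Because the rotation depends only on the offset k and not on pos, the same linear map relates any two positions that are k apart.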
Why do Transformers need positional encodings when RNNs do not?
What is the key advantage of using different sinusoidal frequencies across embedding dimensions for positional encoding?
28. What are the most impactful hyperparameters to tune in deep learning and what is the recommended search order?

Deep learning has many hyperparameters, but they are not equally important. Empirical research and practitioner experience have established a rough hierarchy of impact. Tuning in the wrong order wastes compute: finding the optimal dropout rate is pointless if the learning rate is still wildly off.

Hyperparameter Importance Hierarchy
Priority | Hyperparameter | Typical search range
1 (highest) | Learning rate | Log-uniform: 1e-5 to 1e-1
1 | Batch size | 32, 64, 128, 256, 512
2 | Model architecture (depth, width) | Task-specific; start from established baselines
2 | Optimizer (Adam vs SGD + momentum) | Usually Adam/AdamW first
3 | Weight decay / L2 penalty | Log-uniform: 1e-5 to 1e-1
3 | LR schedule and warmup | Cosine with 5-10% warmup steps
4 (lower) | Dropout rate | 0.0, 0.1, 0.2, 0.5
4 | Batch norm epsilon / momentum | Rarely tuned; defaults usually fine
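The log-uniform ranges in the table matter because learning rates and weight decay vary over orders of magnitude. A minimal sketch in plain Python, assuming the 1e-5 to 1e-1 range from the table (`sample_log_uniform` is an illustrative helper, not a library function), compares how often each sampling scheme lands below 1e-3:

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample uniformly in log space, then map back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
n = 10_000
low, high = 1e-5, 1e-1

uniform_samples = [rng.uniform(low, high) for _ in range(n)]
log_samples = [sample_log_uniform(low, high, rng) for _ in range(n)]

# Fraction of trials spent in the two smallest decades (1e-5 to 1e-3)
frac_small_uniform = sum(x < 1e-3 for x in uniform_samples) / n
frac_small_log = sum(x < 1e-3 for x in log_samples) / n
print(f'uniform: {frac_small_uniform:.3f}  log-uniform: {frac_small_log:.3f}')
```

Uniform sampling spends roughly 1% of its trials on the lower two decades, while log-uniform sampling gives every decade about the same share of the budget.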
import optuna
import torch
import torch.nn as nn

def objective(trial):
    # Optuna suggests hyperparameters — log-uniform search for lr
    lr         = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    wd         = trial.suggest_float('weight_decay', 1e-5, 1e-1, log=True)
    n_layers   = trial.suggest_int('n_layers', 2, 6)
    hidden_dim = trial.suggest_categorical('hidden_dim', [128, 256, 512])
    dropout    = trial.suggest_float('dropout', 0.0, 0.5)

    layers = []
    in_dim = 784
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                   nn.Dropout(dropout)]
        in_dim = hidden_dim
    model = nn.Sequential(*layers, nn.Linear(in_dim, 10))

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
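    # train_and_evaluate is a placeholder for your own training routine;
    # it is expected to return validation accuracy
    val_acc = train_and_evaluate(model, optimizer, n_epochs=10)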
    return val_acc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best params:', study.best_params)
Why should the learning rate be tuned before other hyperparameters like dropout?
Why is log-uniform sampling preferred over uniform sampling when searching for learning rates?
29. What is an encoder-decoder architecture and how is it used for sequence-to-sequence tasks?

Encoder-decoder (seq2seq) architectures handle tasks where the input and output are sequences of potentially different lengths — machine translation, summarisation, speech recognition, image captioning. The encoder processes the full input sequence and produces a context representation; the decoder generates the output sequence token by token, conditioning each prediction on the context and all previously generated tokens.

In transformer-based seq2seq, the encoder uses bidirectional self-attention (each position attends to all input positions), while the decoder uses two attention mechanisms: masked self-attention (each output position can only attend to previous output positions, preserving the autoregressive property) and cross-attention (each decoder position attends to all encoder output positions to draw relevant information from the input).
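To make the masked self-attention concrete, here is a small sketch in plain Python (no PyTorch, so each number can be followed by hand). It assumes the common additive-mask convention, where disallowed positions receive -inf before the softmax; `causal_mask` and `softmax` are illustrative helpers.

```python
import math

def causal_mask(n):
    """Additive mask: 0 where attention is allowed (j <= i), -inf for future positions."""
    return [[0.0 if j <= i else float('-inf') for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Identical raw attention scores for every query position
scores = [[1.0, 2.0, 3.0] for _ in range(3)]
mask = causal_mask(3)
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])
```

Row i of `weights` is the attention distribution for output position i: entries to the right of the diagonal are exactly zero, so no information flows from future tokens.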

import torch
import torch.nn as nn

# PyTorch's built-in Transformer (encoder-decoder)
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True
)

# Source and target sequences
src = torch.randn(4, 20, 512)   # (batch, src_len, d_model)
tgt = torch.randn(4, 15, 512)   # (batch, tgt_len, d_model)

# Causal mask: prevent decoder from attending to future target tokens
tgt_len = tgt.size(1)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len)

out = transformer(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # (4, 15, 512)

# Teacher forcing: at training time, feed ground-truth previous tokens
# to the decoder (not its own previous predictions)
# At inference: autoregressive — use model's own previous output:
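# tgt_embed (token ids -> d_model vectors) and generator (d_model -> vocab
# logits) are assumed helper modules, e.g. nn.Embedding and nn.Linear
def greedy_decode(model, src, tgt_embed, generator, max_len, sos_idx, eos_idx):
    memory = model.encoder(src)   # src: (1, src_len, d_model), already embedded
    ys = torch.full((1, 1), sos_idx, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1)).to(src.device)
        out  = model.decoder(tgt_embed(ys), memory, tgt_mask=mask)
        next_token = generator(out[:, -1]).argmax(dim=-1, keepdim=True)  # (1, 1)
        ys = torch.cat([ys, next_token], dim=1)
        if next_token.item() == eos_idx: break
    return ys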
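Teacher forcing comes down to how the target sequence is shifted. The sketch below is plain Python with hypothetical token ids; `teacher_forcing_pairs` is an illustrative helper. It builds the decoder input and the labels from one ground-truth target.

```python
def teacher_forcing_pairs(target_ids, sos_idx, eos_idx):
    """Build (decoder_input, labels) from one ground-truth target sequence.

    The decoder sees <sos> plus the true tokens and, at each step,
    must predict the token that comes next.
    """
    full = [sos_idx] + list(target_ids) + [eos_idx]
    decoder_input = full[:-1]   # everything except the final token
    labels = full[1:]           # the same sequence shifted left by one
    return decoder_input, labels

# Hypothetical ids: 1 = <sos>, 2 = <eos>, 7/8/9 = ordinary tokens
decoder_input, labels = teacher_forcing_pairs([7, 8, 9], sos_idx=1, eos_idx=2)
print(decoder_input)  # [1, 7, 8, 9]
print(labels)         # [7, 8, 9, 2]
```

With the causal mask in place, all positions can be trained in one parallel forward pass, because position t only sees the ground-truth tokens up to t.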
What is the purpose of the causal (subsequent) mask in the transformer decoder?
What is teacher forcing in seq2seq training?
30. What is model quantization in deep learning and how does PyTorch support it?

Quantization reduces model size and inference latency by representing weights and activations in lower-precision integer formats (INT8, INT4, INT2) rather than FP32 or FP16. A 32-bit float weight is replaced by an 8-bit integer plus a scale factor and zero-point: x_float = scale × (x_int - zero_point). This yields 4× memory reduction for INT8, enabling larger models to fit on limited hardware and significantly faster integer arithmetic on CPUs and mobile accelerators.
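The scale and zero-point mapping can be traced by hand. This is a minimal sketch in plain Python of asymmetric (affine) quantization to unsigned 8-bit, assuming the value range is taken from the data and extended to include 0; `quantize` and `dequantize` are illustrative helpers, not PyTorch functions.

```python
def quantize(values, num_bits=8):
    """Affine quantization to unsigned integers; assumes values are not all zero."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)   # range is extended to include 0
    hi = max(max(values), 0.0)   # so that 0.0 is exactly representable
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """x_float = scale * (x_int - zero_point)"""
    return [scale * (qi - zero_point) for qi in q]

weights = [-1.2, -0.3, 0.0, 0.45, 2.1]
q, scale, zero_point = quantize(weights)
recovered = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(q, round(scale, 5), zero_point)
print(max_err)  # bounded by half a quantization step for this input
```

Each weight is stored in one byte instead of four, and the round-trip error here stays within scale / 2, which is the accuracy cost that calibration and QAT try to control.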

Three main approaches: (1) Post-Training Quantization (PTQ) — quantize a trained FP32 model without retraining, using a small calibration dataset to determine optimal scale factors; (2) Quantization-Aware Training (QAT) — simulate quantization noise during training (fake quantization), allowing the model to adapt and typically recovering the accuracy lost by PTQ; (3) Dynamic quantization — weights are quantized ahead of time, activations quantized dynamically at inference (simplest, good baseline for RNNs).

import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic, prepare, convert

# ─── Dynamic Quantization (simplest: weights stored as INT8, activations quantized on the fly) ───
model_fp32 = nn.LSTM(input_size=64, hidden_size=128)
model_int8 = quantize_dynamic(
    model_fp32,
    qconfig_spec={nn.Linear, nn.LSTM},
    dtype=torch.qint8
)
print('FP32 size:', sum(p.numel() * 4 for p in model_fp32.parameters()), 'bytes')
# INT8 model is ~4x smaller

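# ─── Post-Training Static Quantization ───
# Eager-mode static quantization needs QuantStub/DeQuantStub to mark where
# tensors enter and leave the quantized region
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10),
    torch.quantization.DeQuantStub(),
)
model.eval()  # post-training quantization operates on a model in eval mode
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = prepare(model)  # insert observer modules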

# Calibrate with representative data
model_prepared.eval()
with torch.no_grad():
    for X_cal, _ in calibration_loader:
        model_prepared(X_cal)

model_int8 = convert(model_prepared)  # convert to INT8

# ─── Modern approach: bitsandbytes / llm.int8() for LLMs ───
# 8-bit quantization of LLM weights with minimal accuracy loss
# Allows running 7B+ parameter models on consumer GPUs
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained('gpt2', load_in_8bit=True)
What are the two components needed to convert a quantized INT8 value back to approximate FP32?
What is the key advantage of Quantization-Aware Training (QAT) over Post-Training Quantization (PTQ)?
31. What does a production-quality PyTorch training loop look like, incorporating all best practices?

A well-structured training loop separates concerns cleanly: data loading, forward pass, loss computation, backpropagation, gradient management, metric tracking, and model persistence. Each step has specific pitfalls that silently degrade results.

import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler
from torch.utils.data import DataLoader

def train_epoch(model, loader, optimizer, criterion, device, scaler):
    model.train()
    total_loss, n_correct, n_total = 0.0, 0, 0

    for X, y in loader:
        X, y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)  # frees grads instead of zero-filling (default since PyTorch 2.0)

        # torch.autocast accepts device_type; torch.cuda.amp.autocast does not
        with torch.autocast(device_type='cuda', dtype=torch.float16):
            logits = model(X)
            loss   = criterion(logits, y)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item() * X.size(0)
        n_correct  += (logits.argmax(1) == y).sum().item()
        n_total    += X.size(0)

    return total_loss / n_total, n_correct / n_total

@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
    model.eval()
    total_loss, n_correct, n_total = 0.0, 0, 0
    for X, y in loader:
        X, y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)
        logits = model(X)
        loss   = criterion(logits, y)
        total_loss += loss.item() * X.size(0)
        n_correct  += (logits.argmax(1) == y).sum().item()
        n_total    += X.size(0)
    return total_loss / n_total, n_correct / n_total

# Main training loop
best_val_acc = 0
for epoch in range(n_epochs):
    tr_loss, tr_acc = train_epoch(model, train_loader, optimizer,
                                   criterion, device, scaler)
    vl_loss, vl_acc = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    if vl_acc > best_val_acc:
        best_val_acc = vl_acc
        torch.save(model.state_dict(), 'best.pt')
    print(f'Epoch {epoch:3d}: tr={tr_loss:.4f}/{tr_acc:.3f}  '
          f'val={vl_loss:.4f}/{vl_acc:.3f}')
Why is optimizer.zero_grad(set_to_none=True) preferred over optimizer.zero_grad()?
Why should loss.item() be called to accumulate running loss rather than loss directly?
32. How does batch size affect deep learning training mathematically and practically?

Batch size controls the trade-off between gradient estimate quality and training speed. With batch size B, the gradient is estimated as the average of the per-sample loss gradients over B samples; the variance of this estimate is proportional to σ²/B, where σ² is the per-sample gradient variance. Larger batches give lower-variance (more accurate) gradient estimates, but with diminishing returns: doubling the batch size halves the variance, so the standard error shrinks only by a factor of √2, while the compute cost per step doubles.
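The σ²/B relationship is easy to confirm with a simulation. The sketch below is plain Python and treats each "per-sample gradient" as a draw from a standard normal distribution, which is a simplifying assumption for illustration only; `batch_mean_variance` is an illustrative helper.

```python
import random
import statistics

rng = random.Random(0)

def batch_mean_variance(batch_size, n_trials=4000):
    """Empirical variance of the mean of batch_size unit-variance samples."""
    means = [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(n_trials)
    ]
    return statistics.pvariance(means)

var_8 = batch_mean_variance(8)    # theory: 1/8
var_32 = batch_mean_variance(32)  # theory: 1/32
ratio = var_8 / var_32            # theory: 4
print(f'var(B=8)={var_8:.4f}  var(B=32)={var_32:.4f}  ratio={ratio:.2f}')
```

Quadrupling the batch size cuts the variance by about 4×, but the standard error (the square root) only halves, which is the diminishing return described above.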

Generalisation effect: empirically, large batches often lead to sharper minima that generalise worse than the flatter minima found by small batches. The noise in small-batch SGD acts as implicit regularisation: the stochastic gradient trajectory tends to find broader minima, which are more robust to small perturbations. This is the 'large batch training problem'. Mitigations: the linear scaling rule (scale lr proportionally with batch size) and warmup. Gradient accumulation is a separate, memory-driven technique: it reproduces the gradient of a large batch by summing scaled micro-batch gradients when the full batch does not fit in memory, so it inherits large-batch behaviour rather than small-batch noise.
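The equivalence between accumulated micro-batch gradients and one large-batch gradient can be verified on a toy problem. This sketch is plain Python with a one-parameter linear model and made-up data; it assumes equal-sized micro-batches and a mean-reduced loss, which is why each micro-batch gradient is divided by the number of accumulation steps.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data for a one-parameter model y ~ w * x
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0, -2.0, 3.5, 3.0, -1.0, 6.5, 0.2, 2.1]
w = 0.3

# Gradient of one batch containing all 8 samples
full_grad = mse_grad(w, xs, ys)

# Same gradient, accumulated over 4 micro-batches of 2 samples each
micro_batch = 2
k = len(xs) // micro_batch
accumulated = 0.0
for step in range(k):
    xb = xs[step * micro_batch:(step + 1) * micro_batch]
    yb = ys[step * micro_batch:(step + 1) * micro_batch]
    accumulated += mse_grad(w, xb, yb) / k  # scale by 1/K, as in the loop above the step

print(full_grad, accumulated)
```

The two values agree up to floating-point rounding. Layers that use batch statistics, such as BatchNorm, are an exception, because they still see only one micro-batch at a time.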

import torch
import torch.nn as nn

model     = nn.Linear(10, 1)
criterion = nn.MSELoss()

# Gradient accumulation: simulate batch_size=1024 with micro_batch=32
accumulation_steps = 32   # effective_batch_size = 32 * 32 = 1024
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for step, (X, y) in enumerate(loader):
    # Forward and backward every micro-batch
    loss = criterion(model(X), y) / accumulation_steps  # scale by 1/K
    loss.backward()  # gradients accumulate, not cleared

    if (step + 1) % accumulation_steps == 0:
        # Clip and step only after accumulating K micro-batches
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

# Linear scaling rule: if you double batch size, double the lr
base_lr    = 1e-3
base_batch = 256
new_batch  = 1024
new_lr     = base_lr * (new_batch / base_batch)  # 4e-3
# But use warmup to stabilise the larger lr at the start
Why do large batch sizes sometimes lead to worse generalization than small batch sizes?
What does gradient accumulation achieve, and when is it useful?
33. How do you choose the right layer type (Linear, Conv, Attention) for a given input modality?

Each layer type encodes different structural assumptions (inductive biases) about the data. Using a layer whose assumptions match the data's structure allows the model to learn faster and with less data than a generic alternative.

Layer Selection by Modality and Structure
Data type | Structure | Recommended layer | Reason
Tabular | No spatial/sequential structure | Linear (MLP) | Feature order carries no meaning; no shared structure to exploit
Images | 2D spatial locality + translation equivariance | Conv2d | Same pattern anywhere in image; fewer params than FC
Text/sequences | Long-range dependencies, variable length | Transformer (self-attention) | O(1) path length between any two positions
Short sequences / time series | Local temporal patterns | Conv1d or LSTM | Local: Conv1d; long-range: LSTM
Graphs | Irregular node connectivity | Graph Conv (GCN/GAT) | Aggregates neighbor information per node
Point clouds | Permutation invariant 3D | PointNet / sparse conv | Must handle unordered sets
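The "fewer params than FC" entry for images can be quantified with simple arithmetic. The sketch below is plain Python and assumes a 32×32 RGB input mapped to 32 feature maps of the same spatial size; the helper names are illustrative.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a Conv2d layer; independent of image size."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features

height, width = 32, 32
conv_params = conv2d_params(in_channels=3, out_channels=32, kernel_size=3)

# A dense layer producing the same output volume (32 maps of 32x32)
fc_params = linear_params(3 * height * width, 32 * height * width)

print(conv_params)  # 896
print(fc_params)    # 100696064
```

Weight sharing is the reason: the same 3×3 kernel is reused at every spatial location, so the convolution's parameter count does not grow with image resolution.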
import torch
import torch.nn as nn

# Tabular data: simple MLP
mlp = nn.Sequential(
    nn.Linear(30, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1)
)

# Image: CNN
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global average pooling -> (B, 64, 1, 1)
    nn.Flatten(),
    nn.Linear(64, 10)
)

# Text: embedding + transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=4, dim_feedforward=512,
    dropout=0.1, batch_first=True
)
text_model = nn.Sequential(
    nn.Embedding(10000, 256),
    nn.TransformerEncoder(encoder_layer, num_layers=4)
)

# Time series: Conv1d (local patterns) or LSTM (sequential patterns)
ts_cnn = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=5, padding=2)
ts_rnn = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)
Why is Conv2d more parameter-efficient than a fully-connected layer for image data?
When would you choose LSTM over a Transformer for sequence modelling?
34. What evaluation metrics are most commonly used in deep learning tasks and how do you implement them in PyTorch?

The choice of evaluation metric should match the task's real-world objective, not just be the easiest to compute. The training loss and the evaluation metric are often different — models are trained with cross-entropy but evaluated with accuracy, F1, mAP, or BLEU depending on the application.

Metrics by Task
Task | Primary metric | When it falls short
Classification (balanced) | Accuracy | Misleading on imbalanced classes
Classification (imbalanced) | F1 / AUC-ROC / PR-AUC | PR-AUC better than ROC-AUC for severe imbalance
Object detection | mAP (mean Average Precision) | Doesn't account for localisation precision at all scales
Regression | MAE / RMSE / R² | RMSE sensitive to outliers; R² can be negative
Machine translation | BLEU score | Doesn't capture semantic similarity
Language generation | Perplexity / ROUGE / BERTScore | Perplexity measures model fit, not the quality of generated text
Segmentation | Intersection over Union (IoU / mIoU) | Sensitive to class imbalance
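The first two rows of the table can be illustrated with a degenerate classifier. This sketch is plain Python on a made-up dataset with 5% positives; `binary_metrics` is an illustrative helper that returns 0.0 where a ratio is undefined.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    accuracy = sum(1 for l, p in zip(labels, preds) if l == p) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Made-up imbalanced dataset: 5 positives, 95 negatives
labels = [1] * 5 + [0] * 95
always_negative = [0] * 100   # a "model" that never predicts the positive class

acc, prec, rec, f1 = binary_metrics(labels, always_negative)
print(f'accuracy={acc:.2f}  recall={rec:.2f}  f1={f1:.2f}')
```

The classifier scores 95% accuracy while finding none of the positives, which is why F1 or PR-AUC is the safer headline metric when classes are imbalanced.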
import torch
from torchmetrics import Accuracy, F1Score, AUROC, MeanSquaredError

# torchmetrics: handles accumulation across batches correctly
n_classes = 5
acc  = Accuracy(task='multiclass', num_classes=n_classes)
f1   = F1Score(task='multiclass', num_classes=n_classes, average='macro')
auroc = AUROC(task='multiclass', num_classes=n_classes)

model.eval()
with torch.no_grad():
    for X, y in val_loader:
        logits = model(X)
        preds  = logits.argmax(dim=1)
        probs  = torch.softmax(logits, dim=1)
        acc.update(preds, y)
        f1.update(preds, y)
        auroc.update(probs, y)

print(f'Val Acc:  {acc.compute():.4f}')
print(f'Val F1:   {f1.compute():.4f}')
print(f'Val AUROC:{auroc.compute():.4f}')

# Manual accuracy (without torchmetrics)
all_preds, all_labels = [], []
with torch.no_grad():
    for X, y in val_loader:
        preds = model(X).argmax(1)
        all_preds.append(preds.cpu())
        all_labels.append(y.cpu())
preds  = torch.cat(all_preds)
labels = torch.cat(all_labels)
accuracy = (preds == labels).float().mean()
print(f'Accuracy: {accuracy:.4f}')
Why is accuracy a misleading metric for highly imbalanced classification datasets?
What does the torchmetrics library's .update() and .compute() pattern accomplish?
35. How do you export a PyTorch model for production deployment using TorchScript or ONNX?

Research-time PyTorch models depend on Python's interpreter and PyTorch's eager execution mode, which often add too much latency and too many dependencies for production deployment. Two standard serialisation formats allow deploying PyTorch models without Python: TorchScript (PyTorch-native, supports dynamic shapes better) and ONNX (framework-agnostic, runs on TensorRT, OpenVINO, CoreML, ONNX Runtime across many hardware targets).

TorchScript compiles a PyTorch model into an intermediate representation (IR) that can run in C++ via LibTorch, without any Python dependency. It is created either via torch.jit.trace (records operations from a concrete example — doesn't handle data-dependent control flow) or torch.jit.script (analyzes Python source — handles control flow but requires type annotations and a subset of Python). ONNX export traces the model similarly and serialises it to the ONNX protobuf format, which can then be run on any ONNX-compatible runtime.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 5)
).eval()

example_input = torch.randn(1, 10)

# ─── TorchScript: tracing ───────────────────────────────────────────
traced = torch.jit.trace(model, example_input)
traced.save('model_traced.pt')
# Load and run without original Python class:
loaded = torch.jit.load('model_traced.pt')
output = loaded(torch.randn(4, 10))

# ─── TorchScript: scripting (handles if/for in forward) ────────────
@torch.jit.script
def activation(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:
        return torch.relu(x)
    return torch.tanh(x)

# ─── ONNX export ────────────────────────────────────────────────────
torch.onnx.export(
    model,
    example_input,
    'model.onnx',
    input_names=['features'],
    output_names=['logits'],
    dynamic_axes={'features': {0: 'batch_size'},  # variable batch
                  'logits':   {0: 'batch_size'}},
    opset_version=17,
)

# Validate exported ONNX model
import onnx, onnxruntime as ort
onnx.checker.check_model('model.onnx')
sess = ort.InferenceSession('model.onnx')
result = sess.run(None, {'features': example_input.numpy()})
What is the key difference between torch.jit.trace and torch.jit.script?
What advantage does ONNX export provide over TorchScript for production deployment?
36. What is knowledge distillation and how does it compress large neural networks into smaller ones?

Knowledge distillation (Hinton et al., 2015) trains a small student network to mimic the output distribution of a large, accurate teacher network. Instead of training only on hard labels (the correct class as a one-hot vector), the student is also trained to match the teacher's soft probabilities — the full output distribution including small probabilities assigned to incorrect classes.

The soft probabilities carry richer information than hard labels: if the teacher assigns 0.7 to 'cat' and 0.25 to 'dog', this communicates that the image looks somewhat cat-like but also dog-like — a nuanced signal the student can learn from. A temperature parameter T sharpens or softens this distribution: p_i = exp(z_i/T) / Σ exp(z_j/T). Higher T produces a softer, more uniform distribution that exposes the teacher's confidence relationships across all classes, giving the student a richer gradient signal. The distillation loss combines the cross-entropy with hard labels and the KL divergence with the teacher's soft targets.
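The effect of the temperature can be seen on a single set of logits. The sketch below is plain Python with made-up teacher logits for three classes; `softmax_with_temperature` and `entropy` are illustrative helpers.

```python
import math

def softmax_with_temperature(logits, T):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)"""
    scaled = [z / T for z in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

teacher_logits = [5.0, 3.0, 0.5]           # made-up logits: cat, dog, car
p_t1 = softmax_with_temperature(teacher_logits, T=1.0)
p_t4 = softmax_with_temperature(teacher_logits, T=4.0)

print([round(p, 3) for p in p_t1])
print([round(p, 3) for p in p_t4])
print(entropy(p_t1), entropy(p_t4))
```

At T=4 the top class keeps its rank but loses probability mass to the other classes, so the relative similarity between 'cat' and 'dog' becomes a much larger part of the training signal.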

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = BigModel().eval()      # pretrained, frozen
student = SmallModel()           # to be trained

T           = 3.0   # temperature — soften the distributions
alpha       = 0.7   # weight for distillation vs hard-label loss
ce_loss     = nn.CrossEntropyLoss()
kl_div_loss = nn.KLDivLoss(reduction='batchmean')

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

for X, y_hard in loader:
    # Teacher forward (no grad)
    with torch.no_grad():
        teacher_logits = teacher(X)

    # Student forward
    student_logits = student(X)

    # Hard-label cross-entropy
    loss_hard = ce_loss(student_logits, y_hard)

    # Soft-target KL divergence (temperature-scaled)
    student_soft = F.log_softmax(student_logits / T, dim=1)
    teacher_soft = F.softmax(teacher_logits / T, dim=1)
    loss_kl = kl_div_loss(student_soft, teacher_soft) * (T ** 2)
    # T^2 scaling: compensates for the T-scaled gradients

    loss = alpha * loss_kl + (1 - alpha) * loss_hard
    optimizer.zero_grad(); loss.backward(); optimizer.step()
Why are 'soft probabilities' from a teacher network more informative than one-hot hard labels for training a student?
Why is the KL divergence term multiplied by T² in the distillation loss?
37. What is self-supervised learning and how do contrastive methods like SimCLR learn representations?

Self-supervised learning (SSL) is a form of unsupervised learning where the model is trained on a pretext task defined entirely from the data itself — no human-provided labels. The learned representations can then be transferred to downstream tasks with few or no labels (linear probe, fine-tuning).

Contrastive methods like SimCLR define a pretext task based on augmentation invariance: for each input, create two random augmented views (crops, colour jitter, flips) and train the model so that representations of the two views of the same image are similar (positive pair), while representations of views from different images are dissimilar (negative pairs). The NT-Xent loss (normalised temperature-scaled cross-entropy) implements this: for a batch of N images (2N views), each view must identify its matching view among the 2N-1 other views, of which one is the positive and 2(N-1) are negatives.
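The loss can be written out explicitly for a tiny batch. This is a sketch in plain Python with hand-picked 2-D "embeddings" and an assumed temperature of 0.5; it follows the convention that view i and view (i + N) mod 2N form a positive pair, and `nt_xent` is an illustrative helper rather than a library function.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nt_xent(views, n, tau):
    """Mean NT-Xent loss over 2n views; view i pairs with view (i + n) % (2n)."""
    z = [normalize(v) for v in views]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)
        # Scaled cosine similarity to every other view (self is excluded)
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
                for j in range(2 * n) if j != i}
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += log_denominator - sims[pos]   # -log softmax at the positive
    return total / (2 * n)

# N = 2 images, two views each; rows 0-1 are first views, rows 2-3 second views
aligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.1]]

loss_aligned = nt_xent(aligned, n=2, tau=0.5)
loss_mismatched = nt_xent(mismatched, n=2, tau=0.5)
print(loss_aligned, loss_mismatched)
```

The loss is lower when each view sits close to its positive and far from the negatives, which is exactly the geometry the encoder is trained to produce.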

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T

# Augmentation pipeline: two random views of the same image
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

class SimCLRLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.tau = temperature

    def forward(self, z1, z2):
        # L2-normalise projections to unit sphere
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        # All 2N representations as rows
        z  = torch.cat([z1, z2], dim=0)   # (2N, d)
        # Pairwise cosine similarities / temperature
        sim_matrix = z @ z.T / self.tau   # (2N, 2N)
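        # Each view's positive is its counterpart in the other half of the batch
        n = z1.size(0)
        labels = torch.cat([torch.arange(n, 2*n), torch.arange(n)]).to(z.device)
        # Exclude self-similarity by setting the diagonal to -inf; deleting the
        # diagonal entries instead would shift column indices and break labels
        self_mask = torch.eye(2*n, dtype=torch.bool, device=z.device)
        sim_matrix = sim_matrix.masked_fill(self_mask, float('-inf'))
        return F.cross_entropy(sim_matrix, labels)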

# After pretraining: linear evaluation
# Freeze backbone, train linear head on downstream task
backbone = resnet50_pretrained
for p in backbone.parameters(): p.requires_grad = False
linear_head = nn.Linear(2048, num_classes)
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)
What are positive and negative pairs in contrastive self-supervised learning?
What does the temperature parameter τ (tau) control in the NT-Xent contrastive loss?
38. How would you implement and train a simple feedforward neural network in PyTorch from scratch, without using nn.Sequential?

This question tests whether you understand the full PyTorch workflow: defining a custom nn.Module, implementing forward, and running the standard train loop. It is a common practical screen in ML engineering interviews.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ─── 1. Define the model ────────────────────────────────────────────
class FeedForwardNet(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                 dropout: float = 0.1):
        super().__init__()
        self.fc1     = nn.Linear(in_dim, hidden_dim)
        self.bn1     = nn.BatchNorm1d(hidden_dim)
        self.relu    = nn.ReLU()
        self.drop    = nn.Dropout(dropout)
        self.fc2     = nn.Linear(hidden_dim, out_dim)
        self._init_weights()

    def _init_weights(self):
        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.fc1(x)))
        x = self.drop(x)
        return self.fc2(x)

# ─── 2. Create data ──────────────────────────────────────────────────
torch.manual_seed(42)
X = torch.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).long()  # binary label
ds     = TensorDataset(X, y)
loader = DataLoader(ds, batch_size=64, shuffle=True)

# ─── 3. Instantiate model, loss, optimizer ───────────────────────────
device    = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model     = FeedForwardNet(20, 64, 2, dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# ─── 4. Training loop ─────────────────────────────────────────────────
for epoch in range(30):
    model.train()
    epoch_loss = 0.0
    for X_b, y_b in loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch % 5 == 0:
        print(f'Epoch {epoch:3d}: loss={epoch_loss / len(loader):.4f}')

Key interview checkpoints: (1) subclass nn.Module and call super().__init__(); (2) define all layers as attributes in __init__; (3) implement forward; (4) follow the zero-grad → forward → loss → backward → step order; (5) call model.train() before training and model.eval() before evaluation.

What is the required order of operations in a standard PyTorch training step?
Why must all learnable layers be defined as attributes in nn.Module's __init__ rather than created inside forward()?