Python Deep Learning and Neural Networks Interview Questions

1. What is a neural network and how does forward propagation work mathematically?
2. Explain backpropagation mathematically. How does the chain rule enable computing gradients through many layers?
3. What are the most common activation functions and why did ReLU replace sigmoid/tanh as the default?
4. What are vanishing and exploding gradients, and what techniques are used to address them?
5. Why does weight initialization matter in neural networks, and what is the difference between Xavier and He initialization?
6. How does Batch Normalization work mathematically and why does it stabilize training?
7. Compare SGD, SGD with momentum, RMSProp, and Adam optimizers. When do you choose each?
8. How does Dropout work mathematically, and why does it act as regularization?
9. Explain how convolutional layers work and why they are well-suited to image data.
10. How do RNNs work and why did LSTMs solve the long-range dependency problem?
11. What is the self-attention mechanism in Transformers and why did it replace RNNs for sequence modeling?
12. What loss functions does PyTorch provide for classification and regression, and which to use when?
13. What is transfer learning and how do you fine-tune a pretrained model in PyTorch?
14. How does PyTorch's Dataset and DataLoader pipeline work, and what are the key performance considerations?
15. Why is learning rate scheduling important and what are the most common strategies?
16. What are the most effective regularization strategies for deep learning and how do they differ from classical ML regularization?
17. What are embedding layers in deep learning and how are they different from one-hot encoding?
18. How do you save and load PyTorch models correctly, and what is included in a proper checkpoint?
19. What is mixed precision training and how does it speed up deep learning with torch.cuda.amp?
20. What is the difference between model.eval(), torch.no_grad(), and torch.inference_mode()? When do you use each?
21. How do you use GPUs in PyTorch and what are the key patterns for writing device-agnostic code?
22. What are the differences between Batch Norm, Layer Norm, Group Norm, and Instance Norm?
23. What is an autoencoder and what can a well-trained latent space be used for?
24. How do you diagnose a neural network that is not training correctly from its loss curves?
25. What is the mathematical setup of a Generative Adversarial Network (GAN) and what training challenges do they have?
26. What is torch.compile and how does it speed up PyTorch model execution?
27. Why do Transformers need positional encodings and how does sinusoidal encoding work?
28. What are the most impactful hyperparameters to tune in deep learning and what is the recommended search order?
29. What is an encoder-decoder architecture and how is it used for sequence-to-sequence tasks?
30. What is model quantization in deep learning and how does PyTorch support it?
31. What does a production-quality PyTorch training loop look like, incorporating all best practices?
32. How does batch size affect deep learning training mathematically and practically?
33. How do you choose the right layer type (Linear, Conv, Attention) for a given input modality?
34. What evaluation metrics are most commonly used in deep learning tasks and how do you implement them in PyTorch?
35. How do you export a PyTorch model for production deployment using TorchScript or ONNX?
36. What is knowledge distillation and how does it compress large neural networks into smaller ones?
37. What is self-supervised learning and how do contrastive methods like SimCLR learn representations?
38. How would you implement and train a simple feedforward neural network in PyTorch from scratch, without using nn.Sequential?


1. What is a neural network and how does forward propagation work mathematically?

A neural network is a parameterised function composed of stacked layers. Each layer applies a linear transformation followed by a non-linear activation: h = σ(Wx + b), where W is a weight matrix, b is a bias vector, and σ is an activation function. Stacking L such layers gives a universal function approximator capable of learning arbitrarily complex input–output mappings, provided the network is wide or deep enough.

Forward propagation simply evaluates this composed function left to right: the input x passes through layer 1, the output becomes the input to layer 2, and so on until the final layer produces a prediction. The entire computation is a directed acyclic graph (DAG) of tensor operations — exactly the structure PyTorch's autograd engine records to enable automatic differentiation.

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # W1, b1
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # W2, b2

    def forward(self, x):
        h = self.relu(self.fc1(x))  # h = ReLU(W1 x + b1)
        return self.fc2(h)           # y = W2 h + b2

model = TwoLayerNet(in_dim=10, hidden_dim=64, out_dim=1)
x = torch.randn(32, 10)   # batch of 32 inputs
y_hat = model(x)           # forward pass — calls model.forward(x)
print(y_hat.shape)         # torch.Size([32, 1])
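
To make the formula h = σ(Wx + b) concrete, the following sketch recomputes the forward pass with explicit matrix products and compares it with the module-based result. The layer sizes and variable names are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed, not prescribed by the text)
fc1 = nn.Linear(10, 64)
fc2 = nn.Linear(64, 1)
x = torch.randn(32, 10)

# Module-based forward pass
y_module = fc2(torch.relu(fc1(x)))

# Same computation with explicit matrices.
# nn.Linear stores weight as (out_features, in_features),
# so a batch of row vectors is multiplied by W^T.
h = torch.clamp(x @ fc1.weight.T + fc1.bias, min=0)  # h = ReLU(W1 x + b1)
y_manual = h @ fc2.weight.T + fc2.bias               # y = W2 h + b2

forward_matches = torch.allclose(y_module, y_manual, atol=1e-6)
print(forward_matches)
```

The two results agree up to floating-point tolerance, which shows that an nn.Linear layer is an affine map and nothing more.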

Why depth matters: a network with one wide hidden layer can theoretically approximate any function (universal approximation theorem), but deeper networks can represent certain functions exponentially more efficiently — a function that needs an exponentially wide shallow network may be captured by a compact deep one, because each layer can reuse and compose features built by earlier layers.

What does each layer in a neural network compute?

✗ A weighted vote among the previous layer's outputs
✓ A linear transformation of its input followed by a non-linear activation function
✗ The gradient of the loss with respect to its parameters
✗ A random projection of the input to reduce dimensionality

Why does PyTorch's autograd record the forward-pass computation graph?

✗ To verify that the network is architecturally correct
✓ To enable automatic differentiation: it traverses the recorded graph in reverse to compute gradients of the loss with respect to all parameters
✗ To cache forward-pass results for faster repeated inference
✗ To enforce that the computation remains deterministic across runs

2. Explain backpropagation mathematically. How does the chain rule enable computing gradients through many layers?

Backpropagation is the algorithm for computing the gradient of a scalar loss L with respect to every parameter in the network. It exploits the chain rule of calculus: if the loss depends on parameter W through intermediate quantities h₁, h₂, ..., hₙ, then ∂L/∂W = (∂L/∂hₙ)(∂hₙ/∂hₙ₋₁)···(∂h₁/∂W). Backprop applies the chain rule systematically starting from the loss and working backwards through each layer, accumulating local gradients.

At each layer, two quantities are needed: the local gradient (how does the layer's output change with its input/weights?) and the upstream gradient (how does the loss change with this layer's output?). Multiplying them gives the gradient flowing to the layer's parameters and to its input, which becomes the upstream gradient for the preceding layer.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 1)
)
loss_fn  = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

# --- Standard training step ---
optimizer.zero_grad()    # 1. Clear old gradients (they accumulate!)
y_hat = model(x)         # 2. Forward pass — build computation graph
loss  = loss_fn(y_hat, y)  # 3. Compute scalar loss
loss.backward()          # 4. Backprop — traverse graph in reverse
                         #    populates .grad for every parameter
optimizer.step()         # 5. Update parameters: W -= lr * W.grad

# Inspect gradients of first layer
print(model[0].weight.grad.shape)  # torch.Size([64, 10])

# Manual chain rule for a single neuron:
# loss = (y_hat - y)^2, y_hat = w*x + b
# dL/dw = 2*(y_hat - y) * x  <- upstream * local
w = torch.tensor([2.0], requires_grad=True)
x_s = torch.tensor([3.0])
y_s = torch.tensor([1.0])
loss_s = (w * x_s - y_s) ** 2
loss_s.backward()
print(w.grad)   # tensor([30.]) == 2*(2*3-1)*3
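
The upstream-times-local rule can also be written out by hand for a whole two-layer network and checked against autograd. This is a minimal sketch with assumed shapes and a mean-squared-error loss; the names (d_yhat, d_h, d_z1) are illustrative.

```python
import torch

torch.manual_seed(0)
dt = torch.float64
x = torch.randn(32, 10, dtype=dt)
y = torch.randn(32, 1, dtype=dt)
W1 = torch.randn(64, 10, dtype=dt, requires_grad=True)
b1 = torch.zeros(64, dtype=dt, requires_grad=True)
W2 = torch.randn(1, 64, dtype=dt, requires_grad=True)
b2 = torch.zeros(1, dtype=dt, requires_grad=True)

# Forward pass, keeping the intermediates needed for the backward pass
z1 = x @ W1.T + b1
h = torch.relu(z1)
y_hat = h @ W2.T + b2
loss = ((y_hat - y) ** 2).mean()
loss.backward()  # reference gradients from autograd

# Manual backward pass: each line is (upstream gradient) x (local gradient)
with torch.no_grad():
    d_yhat = 2 * (y_hat - y) / y.numel()  # dL/dy_hat for the mean-squared error
    dW2 = d_yhat.T @ h                    # dL/dW2
    db2 = d_yhat.sum(dim=0)               # dL/db2
    d_h = d_yhat @ W2                     # gradient passed down to the hidden layer
    d_z1 = d_h * (z1 > 0).to(dt)          # ReLU local gradient: 1 where z1 > 0, else 0
    dW1 = d_z1.T @ x                      # dL/dW1
    db1 = d_z1.sum(dim=0)                 # dL/db1

print(torch.allclose(dW1, W1.grad), torch.allclose(dW2, W2.grad))
```

Note how d_h, the gradient with respect to the hidden activations, is exactly the upstream gradient that the first layer needs.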

In backpropagation, what two quantities are multiplied at each layer to compute the gradient?

✗ The activation value and the learning rate
✓ The local gradient (how the layer's output changes with its weights) and the upstream gradient (how the loss changes with the layer's output)
✗ The weight magnitude and the bias gradient
✗ The batch size and the loss value

Why must optimizer.zero_grad() be called before each backward pass in PyTorch?

✗ It resets the model weights to their initial values
✓ PyTorch accumulates gradients by default; without zeroing, gradients from successive batches add together, corrupting the update
✗ It clears the computation graph so the next forward pass can begin
✗ It is only necessary when using SGD, not Adam

3. What are the most common activation functions and why did ReLU replace sigmoid/tanh as the default?

Activation functions introduce non-linearity — without them, stacking linear layers would collapse into a single linear transformation. Several families exist, each with different mathematical properties that affect training dynamics.

Common Activation Functions
Function	Formula	Range	Key property
Sigmoid	1/(1+e⁻ˣ)	(0, 1)	Saturates for large |x|, which causes vanishing gradients
Tanh	(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)	(-1, 1)	Zero-centred; still saturates
ReLU	max(0, x)	[0, ∞)	Non-saturating for x>0; sparse; fast
Leaky ReLU	max(αx, x) α≈0.01	(-∞,∞)	Fixes ReLU's dying neuron problem
GELU	x·Φ(x)	(-∞,∞)	Used in BERT/GPT; smooth probabilistic gate
Softmax	eˣⁱ/Σeˣʲ	(0,1) sums to 1	Multi-class output — probability distribution

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(F.relu(x))         # [0, 0, 0, 0, 1, 2, 3]  (zeroes negatives)
print(F.sigmoid(x))      # (0,1) — saturates near 0 and 1 at extremes
print(F.tanh(x))         # (-1,1) — saturates near ±1
print(F.leaky_relu(x, negative_slope=0.01))  # small slope for x<0
print(F.gelu(x))         # smooth variant used in transformers

# Softmax: multi-class final layer
logits = torch.tensor([2.0, 1.0, 0.1])
probs  = F.softmax(logits, dim=0)
print(probs)              # [0.659, 0.242, 0.099] — sums to 1.0

# In a model: prefer nn.ReLU() (in-place optional with inplace=True)
import torch.nn as nn
act = nn.ReLU()  # stateless — can be shared across layers

Why ReLU replaced sigmoid: for large networks the vanishing gradient problem made sigmoid/tanh networks nearly untrainable. For a neuron deep in the network, the gradient arriving from backprop has already been multiplied by many sigmoid derivatives — each at most 0.25 — so the gradient shrinks exponentially with depth. ReLU's derivative is exactly 1 for positive inputs (no shrinkage in that direction), allowing gradients to flow through deep networks without exponential decay. The trade-off is the 'dying ReLU' problem where neurons receiving strongly negative inputs get stuck outputting zero permanently, addressed by Leaky ReLU and ELU variants.
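
The shrinkage argument can be checked on a toy case. The sketch below chains 20 scalar activations (a deliberately simplified stand-in for a deep network, with no weights) and compares the gradient that reaches the input.

```python
import torch

depth = 20  # illustrative depth

# Chain of sigmoids: every step multiplies the gradient by sigma'(.) <= 0.25
x_sig = torch.tensor(0.5, requires_grad=True)
out = x_sig
for _ in range(depth):
    out = torch.sigmoid(out)
out.backward()

# Chain of ReLUs on a positive input: every step multiplies the gradient by 1
x_relu = torch.tensor(0.5, requires_grad=True)
out = x_relu
for _ in range(depth):
    out = torch.relu(out)
out.backward()

print(x_sig.grad.item())   # tiny: bounded above by 0.25 ** 20
print(x_relu.grad.item())  # 1.0
```

In a real network the weight matrices also scale the gradient, so this isolates only the activation-derivative factor.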

Why did sigmoid/tanh activations cause problems in deep networks?

✗ They are computationally too slow for large batches
✓ Their derivatives are small (sigmoid's is at most 0.25) and approach zero when the unit saturates, so gradients shrink exponentially with depth during backpropagation: the vanishing gradient problem
✗ They cannot represent non-linear functions
✗ They require all input features to be normalized

What is the 'dying ReLU' problem?

✗ ReLU neurons output values that are too large, causing gradient explosion
✓ Neurons that receive consistently negative inputs always output zero and have zero gradient, so they never update and stay permanently inactive
✗ ReLU cannot be used in the output layer for regression
✗ ReLU gradients become undefined at x=0

4. What are vanishing and exploding gradients, and what techniques are used to address them?

Vanishing gradients occur when gradients shrink exponentially as they are backpropagated through many layers — the product of many small numbers (e.g. sigmoid derivatives ≤ 0.25) approaches zero, making early layer weights unable to update meaningfully. Exploding gradients are the opposite: the product of many large numbers causes gradients to grow exponentially, destabilising training with numerically infinite or NaN updates.

Both problems worsen with depth. The root mathematical cause is that repeated matrix multiplication of the weight matrices during backprop concentrates the gradient spectrum: if weight matrices have singular values consistently less than 1, gradients vanish; if greater than 1, they explode. Several techniques address this:

Solutions to Gradient Problems
Technique	Addresses	How it helps
ReLU / Leaky ReLU	Vanishing	Gradient = 1 for positive inputs — no shrinkage
Batch Normalisation	Both	Normalises layer inputs; stabilises gradient magnitude
Residual connections (ResNet)	Vanishing	Gradient highway: for y = x + F(x), ∂L/∂x = ∂L/∂y · (I + ∂F/∂x), so the identity term carries the gradient through unchanged
Gradient clipping	Exploding	Caps gradient norm before the update step
Careful weight init (Xavier/He)	Both	Ensures variance stable across layers at init
LSTM/GRU gates	Vanishing (RNNs)	Gating controls gradient flow through time

import torch
import torch.nn as nn

# Gradient clipping — applied AFTER backward(), BEFORE optimizer.step()
model = nn.LSTM(input_size=10, hidden_size=128, num_layers=3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 20, 10)   # (batch, seq_len, input_size)
output, _ = model(x)
loss = output.sum()

optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip!
optimizer.step()

# Residual connection in code:
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim)
        )
    def forward(self, x):
        return x + self.net(x)  # gradient flows through x directly
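
The singular-value argument can be illustrated numerically. This sketch assumes a purely linear stack (activation derivatives are ignored) and uses a scaled orthogonal matrix so that every singular value equals the chosen scale; the helper name backprop_norm is hypothetical.

```python
import torch

torch.manual_seed(0)
dim, depth = 64, 50

# Orthogonal matrix from a QR decomposition: all singular values are exactly 1
Q, _ = torch.linalg.qr(torch.randn(dim, dim, dtype=torch.float64))

def backprop_norm(scale):
    """Norm of a unit gradient after `depth` multiplications by (scale * Q)^T."""
    W = scale * Q
    g = torch.ones(dim, dtype=torch.float64) / dim ** 0.5  # unit-norm gradient
    for _ in range(depth):
        g = W.T @ g
    return g.norm().item()

shrunk = backprop_norm(0.9)  # equals 0.9 ** 50, roughly 5e-3
grown = backprop_norm(1.1)   # equals 1.1 ** 50, roughly 1.2e2
print(shrunk, grown)
```

A 10% deviation from unit singular values already changes the gradient norm by more than four orders of magnitude over 50 layers, which is why clipping, normalisation, and careful initialisation matter.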

What is the mathematical root cause of vanishing gradients in deep networks?

✗ The learning rate is set too low
✓ Backpropagation repeatedly multiplies by weight matrices and activation derivatives; if these consistently have singular values less than 1, the gradient magnitude decays exponentially with depth
✗ The network has too many parameters relative to the training data
✗ Batch normalisation removes too much gradient information

How do residual connections (skip connections) help gradients flow in deep networks?

✗ They increase the learning rate automatically for deep layers
✓ They add a direct path from the input to the output of a block; the gradient can flow through this identity shortcut unchanged, bypassing the potentially problematic layers
✗ They remove layers that produce vanishing gradients from the network
✗ They clone the gradient and send it to all layers simultaneously

5. Why does weight initialization matter in neural networks, and what is the difference between Xavier and He initialization?

If weights are initialized too small, activations and gradients shrink layer by layer — a form of vanishing gradient from the start. If too large, they explode. The goal of principled initialisation is to keep the variance of activations and gradients roughly constant across all layers at the start of training.

Xavier (Glorot) initialisation draws weights from a distribution with variance 2/(fan_in + fan_out). It was derived for linear or tanh-like activations (symmetric and roughly linear around zero): preserving activation variance in the forward pass requires variance 1/fan_in, preserving gradient variance in the backward pass requires 1/fan_out, and 2/(fan_in + fan_out) is the compromise between the two. He (Kaiming) initialisation uses variance 2/fan_in and was derived specifically for ReLU: since ReLU zeroes out half of its inputs on average, the second moment of the output is halved, and doubling the weight variance compensates. Using Xavier with ReLU in layers of equal width therefore shrinks the activation variance by roughly half per layer, so the signal fades in deep stacks.

import torch
import torch.nn as nn

# nn.Linear's default is kaiming_uniform_ with a=sqrt(5), i.e. U(-1/sqrt(fan_in), 1/sqrt(fan_in))
layer = nn.Linear(256, 128)
print(layer.weight.std())  # approximately 1/sqrt(3*256) ≈ 0.036, smaller than He's sqrt(2/256) ≈ 0.088

# Explicit initialisation
def init_weights(m):
    if isinstance(m, nn.Linear):
        # He/Kaiming: matches the ReLU activations used in the model below
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        # Xavier: the better fit for sigmoid/tanh activations
        # nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
model.apply(init_weights)  # apply init_weights to every sub-module

# Verifying activation scale stays stable across layers:
x = torch.randn(100, 784)
with torch.no_grad():
    for sub in model:
        x = sub(x)
        print(f'{sub.__class__.__name__}: std={x.std():.3f}')
# With He init + ReLU: std stays of order 1 instead of shrinking layer by layer
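
The difference between the two schemes becomes visible in a deeper stack. The sketch below pushes random inputs through 20 equal-width ReLU layers under each initialisation; the width, depth, and the helper name final_std are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def final_std(init_fn, width=512, depth=20, seed=0):
    """Std of the activations after `depth` Linear+ReLU layers initialised with init_fn."""
    torch.manual_seed(seed)
    h = torch.randn(256, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)
            h = torch.relu(layer(h))
    return h.std().item()

xavier_std = final_std(nn.init.xavier_normal_)
he_std = final_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'))

print(f'Xavier: {xavier_std:.2e}')  # collapses towards zero (variance roughly halves per layer)
print(f'He:     {he_std:.2e}')      # stays of order 1
```

With equal fan_in and fan_out, Xavier's variance is 1/fan_in, so each ReLU layer halves the second moment and 20 layers shrink it by about 2⁻²⁰.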

Why does Xavier initialization use variance 2/(fan_in + fan_out) rather than a constant?

✗ It is computationally cheaper than computing the exact variance
✓ It is derived analytically to keep the variance of activations (forward pass) and gradients (backward pass) roughly constant under linear activations, preventing signals from growing or shrinking across layers
✗ It prevents the bias terms from being initialized to zero
✗ It is based on the number of training examples rather than layer dimensions

Why should He initialization be used with ReLU activations instead of Xavier?

✗ He initialization is newer and always better than Xavier
✓ ReLU zeros out half the inputs on average, halving the output variance per layer; He initialization doubles the initial weight variance (using 2/fan_in) to compensate for this zeroing effect
✗ Xavier initialization produces negative weights that ReLU cannot process
✗ He initialization prevents the dying ReLU problem by using larger weights

6. How does Batch Normalization work mathematically and why does it stabilize training?

Batch Normalisation (BN) normalises the pre-activation values within a mini-batch to have zero mean and unit variance, then rescales them with learnable parameters γ (scale) and β (shift): BN(x) = γ · (x - μ_B) / √(σ²_B + ε) + β, where μ_B and σ²_B are the batch mean and variance, and ε is a small constant for numerical stability.

Batch Normalisation was originally motivated by internal covariate shift: the distribution of each layer's inputs changes during training as the preceding layers' weights update, forcing each layer to continuously adapt to a moving target. Renormalising inputs at each layer was proposed as a way to stabilise this distribution, although later analyses attribute much of the benefit to a smoother optimisation landscape rather than to reduced covariate shift. In practice, BN also provides a mild regularisation effect (the mini-batch statistics inject noise), reduces sensitivity to the learning rate and to initialisation, and reduces the need for dropout in many architectures.

import torch
import torch.nn as nn

# BatchNorm1d: for fully-connected layers (normalises over batch dim)
# BatchNorm2d: for conv layers (normalises per channel over batch+spatial)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),    # BN BEFORE or AFTER activation — varies by paper
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# BatchNorm behaves DIFFERENTLY in train vs eval mode!
model.train()   # uses batch mean/var during forward pass
model.eval()    # uses running mean/var (exponential moving avg)

# Always call model.eval() at inference time:
with torch.no_grad():
    model.eval()
    preds = model(torch.randn(1, 784))  # inference — correct behavior

# BN keeps running statistics as buffers
bn = nn.BatchNorm1d(256)
print(bn.running_mean.shape)  # torch.Size([256]); updated on each forward call in train mode
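
The formula BN(x) = γ · (x - μ_B) / √(σ²_B + ε) + β can be reproduced by hand and compared with nn.BatchNorm1d in training mode. This is a sketch with an assumed feature count and batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
bn.train()                       # use batch statistics
x = torch.randn(16, 4) * 3 + 5   # batch with non-zero mean and non-unit variance

out_module = bn(x)

# Manual computation of the same formula
mu = x.mean(dim=0)                   # batch mean per feature
var = x.var(dim=0, unbiased=False)   # biased batch variance is used for normalisation
out_manual = bn.weight * (x - mu) / torch.sqrt(var + bn.eps) + bn.bias

print(torch.allclose(out_module, out_manual, atol=1e-5))

# Running mean after one step: (1 - momentum) * 0 + momentum * mu, with momentum = 0.1
print(torch.allclose(bn.running_mean, 0.1 * mu, atol=1e-6))
```

In eval mode the same formula is applied with running_mean and running_var in place of the batch statistics.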

What critical difference exists between BatchNorm behavior in training mode vs eval mode?

✗ In eval mode, the learnable parameters gamma and beta are frozen
✓ In training mode, BN uses the current mini-batch mean and variance; in eval mode, it uses running statistics accumulated during training, because batch statistics at inference would make predictions depend on the batch composition and size
✗ In eval mode, BN is skipped entirely to improve inference speed
✗ Batch size must be at least 32 in eval mode for BN to work correctly

What is 'internal covariate shift' and how does Batch Normalization address it?

✗ The change in gradient sign that causes oscillation during training; BN clips gradients automatically
✓ The change in the distribution of each layer's inputs during training as earlier layers update; BN re-normalizes these distributions at every layer, giving each subsequent layer a more stable input to work with
✗ The shift in batch statistics caused by imbalanced class distributions; BN reweights samples
✗ The numerical drift in floating-point weights; BN quantises weights to prevent it

7. Compare SGD, SGD with momentum, RMSProp, and Adam optimizers. When do you choose each?

All these optimizers share the same goal — updating parameters to reduce loss — but differ in how they use gradient history to adapt the update step. Understanding the mechanics helps diagnose slow training and poor generalisation.

Optimizer Comparison
Optimizer	Update rule (simplified)	Key advantage	Limitation
SGD	θ ← θ - η·g	Simple, no memory overhead	Slow convergence, sensitive to lr
SGD + Momentum	v ← βv + g; θ ← θ - η·v	Accelerates consistent directions, damps oscillation	Still global lr
RMSProp	θ ← θ - η·g / √(E[g²]+ε)	Adapts lr per parameter; good for RNNs	No momentum term
Adam	Combines momentum + RMSProp; bias-corrected	Robust default; fast convergence	Can generalise worse than SGD on some tasks

import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# SGD — baseline, works but needs careful lr tuning
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD + Momentum — adds velocity; β=0.9 is standard
opt_mom = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                           weight_decay=1e-4)  # L2 regularisation

# Adam — adaptive learning rate + momentum; best default for DL
opt_adam = torch.optim.Adam(model.parameters(),
                             lr=1e-3,      # default, usually works
                             betas=(0.9, 0.999),  # momentum terms
                             eps=1e-8,
                             weight_decay=1e-5)

# AdamW — Adam with decoupled weight decay (better than Adam + L2)
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3,
                               weight_decay=1e-2)

# Learning rate schedulers — change lr during training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt_adam, T_max=100)
for epoch in range(100):
    # ... training loop ...
    scheduler.step()  # decrease lr following cosine curve

When to choose: Adam is the safe default for most deep learning tasks. SGD with momentum often achieves better final generalisation on image classification, at the cost of more careful learning-rate tuning. AdamW, which decouples weight decay from the adaptive update, is now the standard choice for training and fine-tuning transformers, including large language models. RMSProp is mostly of interest for RNNs, where per-parameter scaling helps, and plain SGD serves as a simple baseline.
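
To make the "momentum + RMSProp, bias-corrected" description concrete, here is a sketch that applies the textbook Adam update by hand for three steps and compares it with torch.optim.Adam. The quadratic loss sum(θ²) is an assumption chosen so that the gradient is simply 2θ.

```python
import torch

torch.manual_seed(0)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

theta_ref = torch.randn(5, dtype=torch.float64, requires_grad=True)
theta = theta_ref.detach().clone()  # copy updated by hand
opt = torch.optim.Adam([theta_ref], lr=lr, betas=(beta1, beta2), eps=eps)

m = torch.zeros_like(theta)  # first moment (momentum)
v = torch.zeros_like(theta)  # second moment (RMSProp-style scaling)

for t in range(1, 4):
    # Reference step with the built-in optimizer
    opt.zero_grad()
    (theta_ref ** 2).sum().backward()
    opt.step()

    # Manual step; the gradient of sum(theta^2) is 2 * theta
    g = 2 * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)

print(torch.allclose(theta, theta_ref.detach(), atol=1e-9))
```

On the first step m_hat / sqrt(v_hat) is ±1 for every coordinate, so each parameter moves by almost exactly lr regardless of gradient scale, which is the per-parameter adaptivity the table refers to.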

What does the momentum term in SGD with momentum physically represent?

✗ The ratio of the learning rate to the batch size
✓ A velocity vector accumulating past gradient directions; it amplifies updates in consistent directions and damps oscillation in inconsistent directions, like a ball rolling downhill
✗ The exponential moving average of the parameter values
✗ The second-order derivative (Hessian) of the loss

What is the key difference between Adam and AdamW?

✗ AdamW uses a different momentum estimate than Adam
✓ AdamW decouples weight decay from the gradient-based update. In Adam, adding L2 regularization to the loss couples the decay with the adaptive learning rate scaling, which weakens its intended regularizing effect; AdamW applies the decay directly to the weights
✗ AdamW removes the bias-correction terms that Adam uses
✗ AdamW is only applicable to transformer models

8. How does Dropout work mathematically, and why does it act as regularization?

During training, Dropout randomly sets each neuron's output to zero with probability p (the drop probability) and scales the remaining activations by 1/(1-p) to preserve the expected sum. This means each forward pass trains a different thinned sub-network — with n neurons, there are 2ⁿ possible sub-networks, and each training step updates a random one.

The regularisation effect comes from several mechanisms: (1) it prevents co-adaptation: neurons cannot rely on specific other neurons always being present, so each must learn features that are useful on their own; (2) it can be viewed as training an exponentially large ensemble of weight-sharing sub-networks, whose averaged prediction the full network approximates at test time (where Dropout is disabled); (3) the multiplicative noise acts as an adaptive, data-dependent penalty on the weights, which for linear models resembles L2 regularisation. At inference, Dropout is disabled and all neurons are active. The 1/(1-p) scaling during training ensures the expected value of each neuron's output is the same during training and inference.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Dropout(p=0.5),              # drop 50% of neurons
    nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(256, 10)
)

# Training: Dropout is ACTIVE (neurons randomly zeroed)
model.train()
x = torch.ones(1, 784)
out1 = model(x)
out2 = model(x)  # different result! different neurons dropped each time

# Inference: Dropout is DISABLED (all neurons active)
model.eval()
with torch.no_grad():
    out3 = model(x)
    out4 = model(x)  # same result — deterministic

# Inverted Dropout (PyTorch default):
# Scale by 1/(1-p) DURING training, not during inference
# => test-time output has correct expected value without scaling
dp = nn.Dropout(p=0.5)
dp.train()  # a standalone module has its own mode; model.train() would not affect it
x_in = torch.ones(10)
print(dp(x_in))  # ~5 zeros, remaining values are 2.0 (scaled by 1/0.5)
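
The mask-and-rescale rule is short enough to write by hand. The function below is a minimal sketch of inverted dropout (the name inverted_dropout is hypothetical and this is not PyTorch's internal implementation), followed by a check that the mean is preserved.

```python
import torch

def inverted_dropout(x, p, training=True):
    """Zero each element with probability p and rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x  # identity at inference
    mask = (torch.rand_like(x) >= p).to(x.dtype)  # 1 with probability 1-p, else 0
    return x * mask / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(100_000)
y = inverted_dropout(x, p=0.5)

print(y.unique())        # only two values occur: 0 and 1/(1-p) = 2
print(y.mean().item())   # close to 1, the mean of the input
print(torch.equal(inverted_dropout(x, p=0.5, training=False), x))
```

Because the rescaling already happens during training, the inference path needs no correction factor.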

Why does inverted Dropout scale surviving activations by 1/(1-p) during training rather than scaling at inference?

✗ It makes the computation faster during training
✓ Scaling during training ensures that the expected output of each neuron is the same during training and at inference time, so no scaling adjustment is needed at inference, simplifying deployment
✗ Scaling during training prevents the vanishing gradient problem
✗ It makes Dropout mathematically equivalent to L1 regularization

Why does Dropout function as an implicit ensemble method?

✗ It trains copies of the model on different subsets of the data
✓ Each training step trains a different thinned sub-network (random subset of neurons); at inference, the full network approximates averaging the predictions of this exponentially large collection of sub-networks
✗ It builds a separate model for each dropout probability value
✗ Dropout automatically runs multiple forward passes and averages their results

9. Explain how convolutional layers work and why they are well-suited to image data.

A convolutional layer applies a set of learnable filters (kernels) by sliding each filter over the spatial dimensions of the input and computing a dot product at each position. For a 2D image, a layer with k×k kernels, C_in input channels and C_out output channels has k×k×C_in×C_out weights in total, plus C_out biases. This produces one feature map per output channel, where each value represents the response of that filter at a specific spatial location.

CNNs are powerful for images because of two structural inductive biases they encode: (1) translation equivariance — the same filter is applied everywhere, so if an object moves in the image, the corresponding feature map activation moves identically; (2) parameter sharing — instead of a separate weight per input-output pixel pair (as a fully-connected layer would require), the filter weights are shared across all spatial locations, drastically reducing parameters and improving sample efficiency.

import torch
import torch.nn as nn

# Standard Conv2d usage
# Input:  (batch, C_in, H, W)
# Output: (batch, C_out, H', W')
conv = nn.Conv2d(
    in_channels=3,    # RGB image
    out_channels=64,  # 64 filters
    kernel_size=3,    # 3x3 kernel
    stride=1,
    padding=1,        # 'same' padding — preserves H and W
)

x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB 32x32 images
out = conv(x)
print(out.shape)  # torch.Size([8, 64, 32, 32])

n_params = 3 * 64 * 3 * 3 + 64  # weights + biases
print('Parameters:', n_params)   # 1792
# Compare: FC layer 3*32*32 -> 64*32*32 would need 3*32*32*64*32*32 ≈ 201M weights!

# Typical CNN block:
cnn_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves spatial dims
)
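
Translation equivariance can be verified directly. The sketch below assumes circular padding so that the property holds exactly, including at the image borders; with zero padding it holds only away from the edges.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode='circular')
x = torch.randn(2, 1, 16, 16)
shift = (3, 5)  # shift by 3 rows and 5 columns (illustrative values)

with torch.no_grad():
    shift_then_conv = conv(torch.roll(x, shifts=shift, dims=(2, 3)))
    conv_then_shift = torch.roll(conv(x), shifts=shift, dims=(2, 3))

equivariant = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-5)
print(equivariant)
```

Shifting the input and then convolving gives the same feature maps as convolving and then shifting, which is a direct consequence of applying the same filter weights at every position.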

What is 'parameter sharing' in convolutional layers and why is it beneficial?

✗ Each layer shares weights with the previous layer to save memory
✓ The same filter weights are used at every spatial position in the input. Instead of unique weights per location (as in a fully-connected layer), one set of weights scans the entire feature map, drastically reducing parameters
✗ All convolutional layers in a network share the same weight matrix
✗ Parameter sharing refers to sharing weights between the encoder and decoder in autoencoders

What does 'translation equivariance' mean in the context of CNNs?

✗ The network's output is always the same regardless of input position
✓ If the input pattern shifts spatially, the corresponding feature map activation shifts by the same amount; the learned filter detects the pattern regardless of where in the image it appears
✗ The network can classify images regardless of their resolution
✗ Translation equivariance means the order of convolutional layers can be swapped

10. How do RNNs work and why did LSTMs solve the long-range dependency problem?

A vanilla RNN processes a sequence step-by-step, maintaining a hidden state hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b) that acts as a compressed memory of everything seen so far. The problem is that this hidden state is rewritten at every step, and during backpropagation through time (BPTT) gradients are multiplied repeatedly by Wₕ and by the tanh derivative. If the spectral radius of Wₕ is less than 1, gradients vanish over long sequences; if it is well above 1, they can explode. In practice, vanilla RNNs struggle to learn dependencies longer than roughly 10–20 steps.

LSTMs introduce a separate cell state cₜ (the long-term memory) and three gates (forget, input, and output), each controlled by sigmoid activations. The forget gate fₜ = σ(Wf[hₜ₋₁, xₜ] + bf) decides what to erase from cₜ₋₁; the input gate iₜ decides how much of the candidate c̃ₜ to write; the output gate controls what the hidden state exposes. The key mathematical insight is that the cell state update is additive: cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ. Along this direct path the gradient is scaled only by the forget gate (∂cₜ/∂cₜ₋₁ = fₜ elementwise), so a forget gate close to 1 lets gradients flow across many steps without repeated shrinkage by Wₕ and tanh derivatives. This largely mitigates, rather than completely eliminates, the vanishing gradient problem for long sequences.

import torch
import torch.nn as nn

# LSTM usage in PyTorch
lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,       # stacked LSTM
    batch_first=True,   # input shape: (batch, seq, features)
    dropout=0.2,        # applied between stacked layers
    bidirectional=False
)

x = torch.randn(32, 50, 64)   # (batch=32, seq_len=50, input=64)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # (32, 50, 128) — all time-step hidden states
print(h_n.shape)     # (2, 32, 128)  — final hidden state, both layers
print(c_n.shape)     # (2, 32, 128)  — final cell state, both layers

# GRU: simplified LSTM with only 2 gates — often comparable quality
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
out_gru, h_gru = gru(x)

# For classification, use the LAST hidden state:
last_h = output[:, -1, :]  # (32, 128) — last time step
classifier = nn.Linear(128, 5)
logits = classifier(last_h)

Why can vanilla RNNs not learn long-range dependencies effectively?

They cannot process variable-length sequences

✗ Try again.

During backpropagation through time, gradients are multiplied by the same weight matrix at each step — if its spectral radius is less than 1, the gradient vanishes exponentially with sequence length

✓ Correct! Well done.

Vanilla RNNs can only process up to 10 tokens due to memory constraints

✗ Try again.

Vanilla RNNs do not support batch processing

✗ Try again.

Why do LSTMs use additive cell state updates rather than the multiplicative updates of vanilla RNNs?

Additive updates are computationally faster

✗ Try again.

Additive updates allow the gradient to flow backwards through time without repeated multiplication — information can be preserved or erased in the cell state without the gradient shrinking exponentially with each step

✓ Correct! Well done.

Multiplicative updates would require normalising the hidden state

✗ Try again.

LSTMs use multiplicative updates — only the output gate is additive

✗ Try again.

11. What is the self-attention mechanism in Transformers and why did it replace RNNs for sequence modeling?

Self-attention computes a weighted sum of all input vectors, where the weight between positions i and j reflects how much position i should 'attend to' position j. Concretely, input vectors are linearly projected into queries (Q), keys (K), and values (V), and the attention output is: Attention(Q, K, V) = softmax(QKᵀ/√dₖ) · V. The division by √dₖ prevents the dot products from growing large in high-dimensional spaces, which would push softmax into saturation.
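
The role of the √dₖ factor can be checked empirically. The sketch below assumes query and key components are independent standard normal values (an idealisation of freshly initialised projections) and estimates the variance of their dot product with and without scaling; the function name and sample counts are arbitrary choices.

```python
import math
import random

def dot_product_variance(d_k, n_samples=2000, scale=False, seed=0):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(qi * ki for qi, ki in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / n_samples
    return sum((d - mean) ** 2 for d in dots) / n_samples

raw_var = dot_product_variance(64)
scaled_var = dot_product_variance(64, scale=True)
print(f"d_k=64 raw variance:    {raw_var:.1f}")     # expected to be close to d_k
print(f"d_k=64 scaled variance: {scaled_var:.2f}")  # expected to be close to 1
```

Under these assumptions the unscaled scores have a standard deviation of about √dₖ, so a few of them dominate the softmax; dividing by √dₖ brings the variance back to roughly 1 regardless of the head dimension.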

Multi-head attention runs H parallel attention heads with different Q/K/V projections, then concatenates and projects their outputs — each head can learn to attend to different types of relationships simultaneously. The critical advantage over RNNs: self-attention connects any two positions in the sequence in O(1) operations regardless of their distance, while RNNs need O(n) sequential steps to connect positions n apart. This makes transformers trainable in parallel across the sequence length, enabling training on vastly larger datasets.

import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.shape[-1]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        return weights @ V, weights

# PyTorch's built-in multi-head attention
mha = nn.MultiheadAttention(
    embed_dim=512,
    num_heads=8,    # 8 heads, each with dim=64
    dropout=0.1,
    batch_first=True
)

seq_len, batch, d_model = 20, 4, 512
x = torch.randn(batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)  # Q=K=V=x for self-attention
print(out.shape)         # (4, 20, 512)
print(attn_weights.shape)# (4, 20, 20) — weight of each position pair

Why is the dot product in scaled dot-product attention divided by √d_k?

To normalise the attention weights so they sum to 1

✗ Try again.

For large d_k, dot products grow in magnitude, pushing softmax into saturation with near-zero gradients; dividing by √d_k keeps the scores in a well-behaved range

✓ Correct! Well done.

It converts the dot product from cosine similarity to Euclidean distance

✗ Try again.

It ensures the attention weights are symmetrical between positions i and j

✗ Try again.

What is the key efficiency advantage of self-attention over RNNs for long sequences?

Self-attention uses less GPU memory than RNNs

✗ Try again.

Self-attention connects any two positions in a single operation regardless of distance, enabling full parallelisation across the sequence; RNNs must process tokens sequentially and need O(n) steps to connect distant positions

✓ Correct! Well done.

Self-attention does not require backpropagation

✗ Try again.

Self-attention has fewer parameters than LSTM for the same hidden dimension

✗ Try again.

12. What loss functions does PyTorch provide for classification and regression, and which to use when?

The choice of loss function should match the output type and the probabilistic assumption about the data-generating process — it is the mathematical link between model predictions and the training signal.

Common PyTorch Loss Functions
Task	Loss	PyTorch class	Notes
Binary classification	Binary cross-entropy	nn.BCEWithLogitsLoss	Takes logits (pre-sigmoid); numerically stable
Multi-class classification	Cross-entropy	nn.CrossEntropyLoss	Takes logits; combines log-softmax + NLLLoss
Regression	MSE	nn.MSELoss	Sensitive to outliers
Regression (robust)	MAE / Huber	nn.L1Loss / nn.HuberLoss	Huber blends L1+L2; robust to outliers
Multi-label classification	BCE per label	nn.BCEWithLogitsLoss	Each label independent — not mutually exclusive
Contrastive / metric learning	Triplet margin	nn.TripletMarginLoss	Learns embeddings

import torch
import torch.nn as nn

# Binary classification — output is a single logit (no sigmoid)
bce = nn.BCEWithLogitsLoss()  # applies sigmoid internally
logit = torch.tensor([2.0, -1.0, 0.5])
label = torch.tensor([1.0, 0.0, 1.0])
loss = bce(logit, label)

# Multi-class — outputs are raw logits per class (no softmax)
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 10)    # batch of 8, 10 classes
targets = torch.randint(0, 10, (8,))  # class indices 0-9
loss = ce(logits, targets)

# Class-weighted cross-entropy — for imbalanced datasets
weights = torch.tensor([1.0]*9 + [10.0])  # up-weight class 9
ce_weighted = nn.CrossEntropyLoss(weight=weights)

# Regression
mse = nn.MSELoss()
huber = nn.HuberLoss(delta=1.0)  # L2 for |error|<1, L1 for |error|>1
pred = torch.randn(32, 1)
true = torch.randn(32, 1)
print(mse(pred, true), huber(pred, true))
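
The numerical-stability note for nn.BCEWithLogitsLoss can be illustrated in plain Python. The sketch compares a naive "sigmoid, then log" computation with the algebraically equivalent stable form max(x, 0) − x·y + log(1 + e^(−|x|)). The helper names are hypothetical, and PyTorch's internal implementation may differ in detail.

```python
import math

def bce_naive(logit, label):
    """Sigmoid followed by log: breaks once sigmoid rounds to exactly 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def bce_with_logits(logit, label):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Moderate logit: both versions agree
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# Extreme logit: sigmoid(40) rounds to 1.0, so the naive version hits log(0)
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
print(bce_with_logits(40.0, 0.0))   # finite loss of about 40
```

The stable form never evaluates log on a value that has already been rounded to 0, which is why passing raw logits to the loss is the safer default.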

Why is nn.BCEWithLogitsLoss preferred over applying torch.sigmoid followed by nn.BCELoss?

BCEWithLogitsLoss is faster because it skips the sigmoid computation

✗ Try again.

Numerically: sigmoid(x) saturates to exactly 0 or 1 in floating point for large |x|, so the log that follows returns -inf (or has to be clamped); BCEWithLogitsLoss uses the log-sum-exp trick to combine sigmoid and log into one stable expression

✓ Correct! Well done.

BCEWithLogitsLoss automatically adjusts the threshold from 0.5

✗ Try again.

BCELoss produces incorrect gradients when the prediction is exactly 0 or 1

✗ Try again.

What is a key difference between nn.CrossEntropyLoss and nn.BCEWithLogitsLoss?

CrossEntropyLoss uses MSE internally; BCEWithLogitsLoss uses absolute error

✗ Try again.

CrossEntropyLoss is for mutually exclusive multi-class problems (one label per sample); BCEWithLogitsLoss treats each output independently and is used for multi-label problems where multiple classes can be true simultaneously

✓ Correct! Well done.

BCEWithLogitsLoss requires softmax to be applied before passing logits

✗ Try again.

CrossEntropyLoss does not support class weighting

✗ Try again.

13. What is transfer learning and how do you fine-tune a pretrained model in PyTorch?

Transfer learning reuses a model trained on a large dataset (typically ImageNet for vision, or a large text corpus for NLP) as a starting point for a related task with less data. The pretrained model has already learned general features (edges, textures, shapes for images; grammar, semantics for text) — fine-tuning adapts these features to the target task without needing to learn them from scratch.

Two common strategies: (1) Feature extraction: freeze all pretrained layers and train only a new task-specific head; (2) Full fine-tuning: unfreeze some or all pretrained layers and train end-to-end with a small learning rate to avoid overwriting the useful pretrained representations. A common practical pattern is to first train only the head for a few epochs (so that the large, noisy gradients produced by a randomly initialised head do not corrupt the pretrained backbone), then unfreeze and fine-tune everything together with a smaller lr.

import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# --- Strategy 1: Feature extraction ---
# Freeze ALL pretrained parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final FC layer for our task (e.g. 5 classes)
in_features = backbone.fc.in_features  # 2048 for ResNet-50
backbone.fc = nn.Linear(in_features, 5)
# Only backbone.fc.parameters() have requires_grad=True

# --- Strategy 2: Full fine-tuning ---
backbone2 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone2.fc = nn.Linear(backbone2.fc.in_features, 5)
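# Use layer-wise lr: smaller lr for early layers.
# Only parameters passed to the optimizer are updated: conv1, bn1, layer2 and
# layer3 are left out below, so add groups for them to fine-tune the whole network.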
optimizer = torch.optim.AdamW([
    {'params': backbone2.layer1.parameters(), 'lr': 1e-5},
    {'params': backbone2.layer4.parameters(), 'lr': 1e-4},
    {'params': backbone2.fc.parameters(),    'lr': 1e-3},
], weight_decay=1e-2)

# Verify which parameters will be updated (Strategy 1 model)
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total     = sum(p.numel() for p in backbone.parameters())
print(f'Trainable: {trainable:,} / Total: {total:,}')

Why is a smaller learning rate recommended for pretrained layers during fine-tuning?

Pretrained layers always converge faster so they need less lr

✗ Try again.

Large updates to pretrained weights would overwrite useful representations learned from millions of examples; a small lr makes gentle adjustments to adapt the features while preserving the general-purpose knowledge

✓ Correct! Well done.

PyTorch enforces smaller lr for frozen parameters automatically

✗ Try again.

Smaller lr prevents the fine-tuning from changing the output layer's weights

✗ Try again.

What does setting param.requires_grad = False accomplish in PyTorch?

The parameter is deleted from the model to save memory

✗ Try again.

The parameter is excluded from gradient computation during backward() and will not be updated by the optimizer — effectively freezing that layer

✓ Correct! Well done.

The parameter is set to zero and kept constant

✗ Try again.

It makes the parameter shared across multiple layers

✗ Try again.

14. How does PyTorch's Dataset and DataLoader pipeline work, and what are the key performance considerations?

PyTorch's data loading follows a clean two-class design: Dataset encapsulates how to access a single sample (index → (X, y)), and DataLoader wraps a Dataset to handle batching, shuffling, and parallel data loading. Separating these responsibilities makes it easy to write dataset-specific logic once and reuse the same efficient loading infrastructure.
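
The division of labour described above can be sketched without PyTorch. `SquaresDataset` and `simple_loader` are hypothetical names: the first follows the map-style contract (`__len__` plus `__getitem__`), the second is a heavily simplified stand-in for DataLoader that only covers index sampling, fetching, and grouping into batches (no worker processes, pinned memory, or tensor collation).

```python
import random

class SquaresDataset:
    """Toy map-style dataset: index -> (x, y) with y = x squared."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx, idx * idx

def simple_loader(dataset, batch_size, shuffle=False, drop_last=False, seed=0):
    """Simplified stand-in for DataLoader: sample indices, fetch items, group them."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break
        samples = [dataset[i] for i in chunk]
        xs, ys = zip(*samples)   # "collate": list of pairs -> pair of sequences
        yield list(xs), list(ys)

batches = list(simple_loader(SquaresDataset(10), batch_size=4, drop_last=True))
print(len(batches))   # 2 full batches; the final batch of 2 samples is dropped
print(batches[0])     # ([0, 1, 2, 3], [0, 1, 4, 9])
```

Because the loader only relies on `len(dataset)` and `dataset[i]`, any dataset that honours this contract can reuse the same batching logic, which is the point of the two-class design.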

The most critical performance consideration is that the data loading pipeline must keep the GPU continuously fed — the GPU should never sit idle waiting for the next batch. Key knobs: num_workers launches subprocesses that prefetch batches in parallel with the GPU computation; pin_memory=True allocates batch tensors in pinned (non-pageable) CPU memory, enabling faster CPU→GPU transfers via DMA; prefetch_factor controls how many batches each worker prefetches ahead.

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class TabularDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        # Convert to tensors once at construction (not per __getitem__)
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)    # required — DataLoader uses this for indexing

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # single sample

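# Synthetic stand-in data so the example runs end to end
X_train = np.random.rand(10_000, 20).astype(np.float32)
y_train = np.random.randint(0, 3, size=10_000)
dataset = TabularDataset(X_train, y_train)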

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,             # shuffle each epoch
    num_workers=4,            # parallel data loading
    pin_memory=True,          # faster CPU->GPU transfer
    drop_last=True,           # drop incomplete final batch
    persistent_workers=True,  # keep workers alive between epochs
)

# Training loop
for X_batch, y_batch in loader:
    X_batch = X_batch.cuda(non_blocking=True)  # async transfer
    y_batch = y_batch.cuda(non_blocking=True)
    # ... forward, backward, step

What must every custom PyTorch Dataset class implement?

__init__, __call__, and __str__

✗ Try again.

__len__ (number of samples) and __getitem__ (return one sample given an index)

✓ Correct! Well done.

__iter__ and __next__ to support iteration

✗ Try again.

__len__ and __repr__ for display purposes

✗ Try again.

Why does pin_memory=True in DataLoader improve training throughput?

It caches the entire dataset on the GPU to avoid repeated transfers

✗ Try again.

Pinned memory is non-pageable CPU memory — CUDA can transfer it to the GPU via DMA without CPU involvement, making the transfer faster and allowing the GPU computation and data transfer to overlap

✓ Correct! Well done.

It prevents the operating system from freeing batch tensors between iterations

✗ Try again.

It moves the DataLoader workers to GPU threads

✗ Try again.

15. Why is learning rate scheduling important and what are the most common strategies?

A fixed learning rate is a poor compromise for most training runs: a rate large enough for fast early progress is too large late in training, where it keeps the loss bouncing around a minimum instead of settling into it, while a rate small enough for fine convergence makes early progress needlessly slow. Learning rate schedulers systematically vary the lr during training to get the best of both worlds: fast progress early, precise convergence later.

Common LR Schedules
Schedule	Behaviour	Best for
StepLR	Multiply lr by γ every N epochs	Quick experiments; baseline
CosineAnnealingLR	lr follows cosine curve from η_max to η_min	Most DL tasks; smooth decay
OneCycleLR	Warmup from low to high lr, then decay — all in one cycle	Fast training (super-convergence)
ReduceLROnPlateau	Reduce lr when validation metric stops improving	Unknown training time; auto-adapts
CyclicLR	Cycle between base_lr and max_lr repeatedly	Escaping sharp minima
Warmup + cosine decay	Linear warmup, then cosine decay (not a built-in class; typically composed with LambdaLR or SequentialLR)	Large transformers, LLMs
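
The cosine and warmup rows of the table reduce to short closed-form expressions. The sketch below writes them as plain functions of the step count; the function names and the example values are illustrative assumptions, and in PyTorch the same shape would normally be obtained through a scheduler rather than computed by hand.

```python
import math

def cosine_annealing_lr(step, t_max, eta_max, eta_min=0.0):
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * step / t_max))"""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * step / t_max))

def warmup_cosine_lr(step, warmup_steps, total_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay down to eta_min."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps
    return cosine_annealing_lr(step - warmup_steps, total_steps - warmup_steps,
                               eta_max, eta_min)

# Print the schedule at a few steps: rises during warmup, then decays
for s in (0, 50, 100, 550, 1000):
    print(s, f"{warmup_cosine_lr(s, 100, 1000, eta_max=1e-3):.2e}")
```

A function like `warmup_cosine_lr` (divided by the base lr) is the kind of multiplier one would hand to a lambda-based scheduler.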

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# CosineAnnealingLR — smooth decay from max to min lr
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# OneCycleLR — requires total_steps at init
n_epochs, steps_per_epoch = 10, 100
scheduler_1cycle = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    total_steps=n_epochs * steps_per_epoch,
    pct_start=0.3,   # 30% of steps for warmup
)

# ReduceLROnPlateau — triggered by validation metric
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

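# Training loop
# NOTE: the three schedulers above are shown side by side for illustration.
# In real code attach ONE scheduler to an optimizer; several schedulers on the
# same optimizer overwrite each other's lr.
# train_one_epoch, validate, loader and val_loader are placeholders for your own code.
for epoch in range(100):
    train_one_epoch(model, optimizer, loader)
    val_loss = validate(model, val_loader)

    scheduler_cos.step()               # epoch-based scheduler: step once per epoch
    # scheduler_plateau.step(val_loss) # metric-based scheduler: pass the monitored value
    # scheduler_1cycle.step()          # OneCycleLR: step after EVERY batch, not per epoch
    print(f'LR: {optimizer.param_groups[0]["lr"]:.6f}')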

Why is a warmup phase (low → high lr) commonly used at the start of training large models?

Warmup makes the loss decrease faster in the first epoch

✗ Try again.

At initialisation, model weights are random and gradients are noisy — starting with a high lr can produce large, misguided updates; warmup allows the model to first settle into a stable region before applying the full learning rate

✓ Correct! Well done.

Large learning rates cause the batch normalisation statistics to diverge

✗ Try again.

PyTorch requires a warmup phase for AdamW to function correctly

✗ Try again.

When is ReduceLROnPlateau the most appropriate scheduler?

When you know exactly how many epochs training will take

✗ Try again.

When you want the lr to automatically reduce if the validation metric has stopped improving — it adapts the schedule to the actual training dynamics rather than a fixed plan

✓ Correct! Well done.

For fine-tuning pretrained models only

✗ Try again.

When using SGD; it is not compatible with Adam-based optimizers

✗ Try again.

16. What are the most effective regularization strategies for deep learning and how do they differ from classical ML regularization?

Deep neural networks have millions of parameters and can trivially memorise training data. Classical regularisation (L1/L2 on weights) still applies, but modern deep learning has developed additional techniques that often work better or are used in combination.

DL Regularization Techniques
Technique	How it works	Best applied to
L2 (weight decay)	Penalises large weights: adds λ‖w‖² to loss	All DL models; use AdamW for correct implementation
Dropout	Randomly zero neurons during training	Fully-connected layers; also used at low rates (around 0.1) in transformers; less common in conv layers
Data augmentation	Artificially increase diversity of training set	Vision (flips, crop, colour jitter, mixup, cutmix)
Early stopping	Stop training when val loss stops improving	Any model; simple and effective baseline
Label smoothing	Soften one-hot labels to (1-ε+ε/k, ε/k, ...), the convention used by nn.CrossEntropyLoss	Classification; improves calibration
Stochastic depth	Randomly drop entire residual blocks during training	Very deep networks (ResNets, ViTs)
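
As a worked example of the label-smoothing row, the sketch below builds the smoothed target vector by mixing the one-hot label with a uniform distribution over all k classes. This is the convention I understand nn.CrossEntropyLoss(label_smoothing=ε) to follow; `smooth_targets` is a hypothetical helper, not a PyTorch function.

```python
def smooth_targets(num_classes, true_class, eps=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    targets = [uniform] * num_classes
    targets[true_class] += 1.0 - eps
    return targets

t = smooth_targets(num_classes=10, true_class=3, eps=0.1)
print([round(v, 2) for v in t])
# true class: 1 - 0.1 + 0.1/10 = 0.91, every other class: 0.1/10 = 0.01
```

Because no target is exactly 0 or 1, the loss is minimised at finite logits rather than at ever larger ones.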

import torch
import torch.nn as nn
import torchvision.transforms as T

# Data augmentation for images
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(32, padding=4),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Label smoothing: penalises overconfident predictions
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# A 10-class example: true label 3 becomes
# [0.01, 0.01, 0.01, 0.91, 0.01, ...] instead of [0,0,0,1,0,...]

# Mixup augmentation (manual implementation)
def mixup_batch(x, y, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_a, y_b = y, y[idx]
    return x_mix, y_a, y_b, lam

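# Early stopping: track best val loss, restore best weights
# (model, validate and val_loader are placeholders for your own training code)
max_epochs, patience = 100, 10
best_val_loss = float('inf')
patience_count = 0
for epoch in range(max_epochs):
    # ... train for one epoch here ...
    val_loss = validate(model, val_loader)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        patience_count = 0
    else:
        patience_count += 1
    if patience_count >= patience:
        break
model.load_state_dict(torch.load('best_model.pt'))  # restore the best weights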

What does label smoothing achieve compared to using hard one-hot targets?

It reduces the number of classes the model must distinguish

✗ Try again.

It prevents the model from becoming overconfident by using soft targets, which also tends to improve calibration: with hard one-hot labels the loss keeps decreasing as the correct-class logit grows, so the model is pushed toward ever larger logits

✓ Correct! Well done.

It makes the cross-entropy loss equivalent to MSE

✗ Try again.

Label smoothing is required when classes are imbalanced

✗ Try again.

Why is data augmentation one of the most effective regularization strategies for vision models?

It reduces training time by using fewer unique samples

✗ Try again.

It artificially increases the effective size of the training set and teaches the model invariances (e.g. horizontal flips, colour shifts) that the raw dataset may not adequately represent, supplying additional varied training examples rather than just penalising complexity

✓ Correct! Well done.

It is the only method that works without a validation set

✗ Try again.

Data augmentation is applied at inference time to improve accuracy

✗ Try again.

17. What are embedding layers in deep learning and how are they different from one-hot encoding?

An embedding layer is a learnable lookup table that maps discrete tokens (words, categories, user IDs) to dense, low-dimensional real-valued vectors. It is mathematically a matrix E ∈ ℝ^{V×d} (vocabulary size × embedding dimension), and looking up token i simply retrieves row i. This is equivalent to multiplying a one-hot vector by E, but it is implemented as a direct row lookup whose cost does not depend on V (O(d) per token) rather than as an O(V·d) matrix multiply.
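
The equivalence between a row lookup and a one-hot matrix product can be checked on a tiny table. The 4×2 matrix `E` and both helper functions below are made up for illustration; they use plain lists rather than tensors.

```python
def one_hot_matmul(token_id, table):
    """Multiply a one-hot row vector by the embedding matrix: O(V * d) work."""
    vocab_size, dim = len(table), len(table[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][j] for i in range(vocab_size))
            for j in range(dim)]

def lookup(token_id, table):
    """Read row token_id directly: work independent of the vocabulary size."""
    return list(table[token_id])

E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # V=4, d=2
print(one_hot_matmul(2, E))   # [0.5, 0.6]
print(lookup(2, E))           # [0.5, 0.6]
```

Both paths return the same vector; the lookup simply skips the multiplications by zero, which matters when V is in the tens of thousands.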

The key advantage over one-hot encoding is that embeddings are learned — similar tokens (synonyms, related categories) naturally end up with similar embedding vectors because they appear in similar contexts during training. This gives embeddings semantic meaning and enables generalisation: the model can leverage the fact that 'Paris' and 'Berlin' are semantically similar even if 'Berlin' was rare in training data, because their embedding vectors will be nearby.

import torch
import torch.nn as nn

vocab_size  = 10000
embed_dim   = 128

embedding = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=embed_dim,
    padding_idx=0    # token 0 gets a fixed zero vector (PAD token)
)

# Input: integer token IDs
token_ids = torch.tensor([[1, 5, 23, 0], [42, 7, 0, 0]])  # (2, 4)
embedded  = embedding(token_ids)
print(embedded.shape)  # (2, 4, 128) — each token -> 128-dim vector

# Pre-trained embeddings (e.g. GloVe, Word2Vec)
pretrained = torch.randn(vocab_size, embed_dim)  # replace with real vectors
with torch.no_grad():
    embedding.weight.copy_(pretrained)
    embedding.weight[0].zero_()  # the copy overwrote the PAD row; reset it to zeros
# Freeze pretrained embeddings:
# embedding.weight.requires_grad = False

# In a text model:
class TextClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm  = nn.LSTM(embed_dim, 256, batch_first=True)
        self.fc    = nn.Linear(256, 5)
    def forward(self, x):
        e = self.embed(x)          # (B, L, 128)
        _, (h, _) = self.lstm(e)
        return self.fc(h[-1])

What is the computational advantage of using an embedding layer over multiplying a one-hot vector by a weight matrix?

The embedding layer applies L2 normalisation automatically

✗ Try again.

An embedding lookup is a direct table access that retrieves a single row (O(d) work, independent of vocabulary size), whereas multiplying a one-hot vector by a matrix is an O(V×d) operation that scales with vocabulary size

✓ Correct! Well done.

Embedding layers support gradients while matrix multiplication does not

✗ Try again.

One-hot encoding cannot represent tokens with more than 1000 categories

✗ Try again.

Why do similar tokens end up with similar embedding vectors after training?

The embedding initialisation algorithm groups similar words together

✗ Try again.

Tokens appearing in similar contexts receive similar gradient updates during training, causing the model to learn that they are interchangeable — this is the distributional hypothesis underlying word embeddings

✓ Correct! Well done.

The padding_idx parameter causes similar tokens to cluster

✗ Try again.

Similar tokens share the same row index in the embedding table

✗ Try again.

18. How do you save and load PyTorch models correctly, and what is included in a proper checkpoint?

PyTorch provides two main ways to persist a model: saving the full model object (convenient but fragile to class definition changes) or saving only the state dictionary (recommended for production and reproducibility). The state dict is a Python OrderedDict mapping parameter and buffer names (for example BatchNorm running statistics) to tensors; together with the model class, it contains everything needed to recreate the model's learned state.

A proper training checkpoint includes more than just model weights — it must also save the optimizer state (which contains momentum buffers and adaptive learning rate accumulators in Adam), the current epoch and step, the best validation metric, and the random number generator state, so that training can be resumed exactly where it left off without any change in behaviour.

import torch
import torch.nn as nn

model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── Recommended: save/load state_dict ──
torch.save(model.state_dict(), 'model_weights.pt')

model_new = nn.Linear(10, 5)               # same architecture
model_new.load_state_dict(torch.load('model_weights.pt'))
model_new.eval()                            # ALWAYS call eval() for inference

# ── Full training checkpoint ──
def save_checkpoint(path, epoch, model, optimizer, best_val_loss):
    torch.save({
        'epoch':         epoch,
        'model_state':   model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'best_val_loss': best_val_loss,
        'rng_state':     torch.get_rng_state(),
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    torch.set_rng_state(ckpt['rng_state'])  # restore CPU RNG (CUDA RNG state is separate)
    return ckpt['epoch'], ckpt['best_val_loss']

# Loading on a different device: load to CPU first, then move to the target device
# (avoids errors when the GPU the checkpoint was saved from is not available)
model.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu')
)
model = model.to('cuda')

Why is saving model.state_dict() preferred over saving the entire model object with torch.save(model)?

state_dict files are always smaller in size

✗ Try again.

Saving the full model object pickles a reference to the Python class (its module path and name), not the class code itself, so loading fails if the class is refactored, renamed, or moved to a different module; saving only the state_dict decouples the weights from the class definition

✓ Correct! Well done.

PyTorch's torch.save cannot serialise nn.Module objects

✗ Try again.

The full model object cannot be loaded on a different GPU

✗ Try again.

Why is the optimizer state included in a training checkpoint alongside the model weights?

The optimizer state contains the training labels used in the last batch

✗ Try again.

Optimizers like Adam maintain per-parameter momentum and adaptive learning rate accumulators — restoring these allows training to resume with the same convergence dynamics; starting Adam fresh would effectively restart the adaptive rates

✓ Correct! Well done.

The optimizer state is needed to compute validation loss after restoring

✗ Try again.

PyTorch requires optimizer state to load model state_dict correctly

✗ Try again.

19. What is mixed precision training and how does it speed up deep learning with torch.cuda.amp?

Modern GPUs have dedicated hardware (Tensor Cores) for 16-bit floating-point operations (FP16 on Volta and later, BFloat16 on Ampere and later) that can be 2–8× faster than FP32 for matrix multiplications. Mixed precision training runs the forward pass and gradient computations in FP16 (or BF16) for speed, while maintaining a master copy of the weights in FP32 for numerical precision during the optimizer update.

Loss scaling addresses a key challenge: FP16's limited dynamic range (smallest positive ≈ 6×10⁻⁸) can cause small gradient values to underflow to zero. The scaler multiplies the loss by a large scalar before backward (inflating gradients into FP16's representable range), then divides the gradients back before the optimizer step. PyTorch's GradScaler automates this and dynamically adjusts the scale factor.
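
The underflow problem and the effect of scaling can be reproduced with Python's standard library, whose `struct` module supports IEEE half precision through the `'e'` format. The gradient value and the scale factor of 2¹⁶ below are illustrative choices, and `to_fp16` is a hypothetical helper; this mimics the arithmetic only, not GradScaler's dynamic adjustment.

```python
import struct

def to_fp16(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack('e', struct.pack('e', value))[0]

grad = 1e-8        # a small "true" gradient value
scale = 65536.0    # 2**16, an illustrative loss scale

unscaled = to_fp16(grad)          # below FP16's smallest subnormal: becomes 0
scaled = to_fp16(grad * scale)    # about 6.55e-4, representable in FP16
recovered = scaled / scale        # unscale in higher precision

print(unscaled)    # 0.0
print(recovered)   # approximately 1e-08
```

Without scaling the gradient is lost entirely; with scaling it survives the FP16 round trip with only a small relative rounding error.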

import torch
import torch.nn as nn
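from torch import autocast             # torch.autocast accepts device_type
from torch.cuda.amp import GradScaler  # recent releases also provide torch.amp.GradScaler('cuda')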

model     = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler    = GradScaler()           # manages loss scaling automatically

x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()

for step in range(100):
    optimizer.zero_grad()

    # autocast: runs eligible ops in FP16 automatically
    with autocast(device_type='cuda', dtype=torch.float16):
        y_hat = model(x)           # FP16 matrix multiply
        loss  = nn.MSELoss()(y_hat, y)

    # Scale loss -> backward in FP16 -> unscale gradients -> optimizer step
    scaler.scale(loss).backward()  # inflate loss to prevent underflow
    scaler.unscale_(optimizer)     # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip after unscale
    scaler.step(optimizer)         # skip step if gradients are inf/NaN
    scaler.update()                # adjust scale factor for next step

# BFloat16 (bfloat16): available on Ampere-generation GPUs (e.g. A100) and newer
# - Same exponent range as FP32 (no underflow problem -> no scaler needed)
# - Less precision (7-bit mantissa vs 10-bit for FP16)
with autocast(device_type='cuda', dtype=torch.bfloat16):
    y_hat = model(x)  # no scaler needed with BF16

What problem does loss scaling solve in FP16 mixed precision training?

Loss scaling increases the model's convergence speed by amplifying updates

✗ Try again.

FP16's small dynamic range can cause small gradient values to underflow to zero; multiplying the loss by a large scalar inflates gradients into the representable FP16 range before backward, then divides them back before the optimizer update

✓ Correct! Well done.

Loss scaling prevents the gradient from exploding in very deep networks

✗ Try again.

It converts the loss to an integer for faster GPU computation

✗ Try again.

Why does BFloat16 not require a GradScaler while FP16 does?

BFloat16 is always more numerically precise than FP16

✗ Try again.

BFloat16 has the same 8-bit exponent as FP32, giving it the same dynamic range and immunity to the underflow problem — it sacrifices mantissa precision instead, which is less critical for gradient values

✓ Correct! Well done.

PyTorch's autocast automatically handles BF16 scaling internally

✗ Try again.

BFloat16 is only used on CPUs where underflow is not a concern

✗ Try again.

20. What is the difference between model.eval(), torch.no_grad(), and torch.inference_mode()? When do you use each?

These three mechanisms serve different but complementary purposes that are often confused. Understanding the distinction prevents subtle bugs in training, validation, and inference code.

eval vs no_grad vs inference_mode
Mechanism	What it controls	Effect
model.eval()	Layer behaviour (Dropout, BatchNorm)	Disables Dropout; BatchNorm uses running stats instead of batch stats
model.train()	Layer behaviour (Dropout, BatchNorm)	Enables Dropout; BatchNorm uses current batch stats
torch.no_grad()	Gradient tracking	Stops building the computation graph; saves memory; tensors cannot call .backward()
torch.inference_mode()	Gradient tracking + view tracking	Stricter than no_grad; usually somewhat faster because it also skips view and version-counter tracking; returned tensors cannot be used in autograd at all
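
What model.eval() changes can be sketched with two toy layers. `ToyDropout` and `ToyBatchNorm` are hypothetical, simplified stand-ins (plain lists, one feature, no learnable gamma/beta, biased variance); they only illustrate that a `training` flag switches layer behaviour and has nothing to do with gradient tracking.

```python
import random

class ToyDropout:
    """Inverted dropout on a list of floats; `training` mimics train()/eval()."""
    def __init__(self, p=0.5, seed=0):
        self.p, self.training, self.rng = p, True, random.Random(seed)

    def __call__(self, xs):
        if not self.training:
            return list(xs)                      # eval mode: identity
        keep = 1.0 - self.p
        return [x / keep if self.rng.random() < keep else 0.0 for x in xs]

class ToyBatchNorm:
    """One-feature batch norm that tracks running statistics."""
    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps, self.training = momentum, eps, True
        self.running_mean, self.running_var = 0.0, 1.0

    def __call__(self, xs):
        if self.training:
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # exponential moving average of the batch statistics
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in xs]

drop, bn = ToyDropout(), ToyBatchNorm()
batch = [1.0, 2.0, 3.0, 4.0]
bn(batch)                               # train mode: updates the running stats
drop.training = bn.training = False     # the kind of switch model.eval() flips
print(drop(batch))                      # input returned unchanged
print(bn([10.0]))                       # a single sample works: running stats are used
```

In eval mode a batch of one sample is normalised with the stored statistics, whereas in train mode its batch variance would be zero; this is one reason forgetting model.eval() at inference gives unstable predictions.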

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.BatchNorm1d(64), nn.Dropout(0.5),
    nn.Linear(64, 1)
)

# ─── Training ───────────────────────────────────────────────────────
model.train()   # Dropout ACTIVE, BatchNorm uses batch stats
x = torch.randn(32, 10)
out1 = model(x)
out2 = model(x)  # DIFFERENT — Dropout randomly drops each call

# ─── Validation (compute val loss; no backward pass needed) ─────────
model.eval()
with torch.no_grad():
    # Dropout OFF, BatchNorm uses running stats, no computation graph
    val_out = model(x)
    val_loss = nn.MSELoss()(val_out, torch.zeros(32, 1))

# ─── Inference / deployment ─────────────────────────────────────────
model.eval()
with torch.inference_mode():  # fastest; cannot go back to autograd
    pred = model(torch.randn(1, 10))

# COMMON BUG: forgetting model.eval() at inference
# model.eval() and torch.no_grad() are INDEPENDENT — you need BOTH:
# - model.eval() alone: still builds graph (memory waste)
# - torch.no_grad() alone: Dropout still active (wrong predictions)
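
The independence of the two switches can be checked directly. The following is a minimal sketch (the small `net` and the variable names are illustrative assumptions, not part of the original example) showing that eval() alone still records a graph, that no_grad() alone leaves Dropout active, and that inference_mode() returns inference tensors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
x = torch.randn(32, 10)

# Case 1: eval() without no_grad() -> correct layer behaviour, but a graph is built
net.eval()
out_eval_only = net(x)
builds_graph = out_eval_only.requires_grad           # graph is still recorded
deterministic_in_eval = torch.equal(net(x), net(x))  # Dropout is off

# Case 2: no_grad() without eval() -> no graph, but Dropout is still sampling masks
net.train()
with torch.no_grad():
    a, b = net(x), net(x)
no_graph = not a.requires_grad
dropout_still_active = not torch.equal(a, b)

# Case 3: eval() + inference_mode() -> outputs are inference tensors
net.eval()
with torch.inference_mode():
    pred = net(x)
is_inference_tensor = pred.is_inference()

print(builds_graph, deterministic_in_eval, no_graph, dropout_still_active, is_inference_tensor)
```

Each flag isolates one of the two failure modes listed in the comments above, which is why validation and inference code normally sets both.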

What happens if you call model.eval() but forget torch.no_grad() during validation?

Dropout and BatchNorm remain in training mode

✗ Try again.

The model still builds the computation graph (tracking all operations for potential gradients), wasting memory — but layer behaviours like Dropout and BatchNorm are correctly switched to eval mode

✓ Correct! Well done.

The model raises a RuntimeError

✗ Try again.

Gradients flow through the validation batch and corrupt the model weights

✗ Try again.

What does BatchNorm do differently in eval mode vs train mode?

In eval mode, BatchNorm learns faster by using a larger batch

✗ Try again.

In train mode, BatchNorm normalises using the current mini-batch's mean and variance; in eval mode, it uses running statistics accumulated as exponential moving averages over all training batches — providing stable, batch-size-independent normalisation at inference

✓ Correct! Well done.

In eval mode, BatchNorm's learnable parameters gamma and beta are set to 1 and 0

✗ Try again.

In eval mode, BatchNorm is bypassed entirely for speed

✗ Try again.

21. How do you use GPUs in PyTorch and what are the key patterns for writing device-agnostic code?

PyTorch's device abstraction allows the same code to run on CPU, single GPU, or multiple GPUs with minimal changes. The fundamental operations are moving tensors to a device with .to(device) or .cuda(), and ensuring model and data tensors always reside on the same device before any computation.

A critical performance concept: CPU–GPU data transfers are expensive (PCIe bandwidth is limited vs. GPU memory bandwidth). Minimise them by loading data onto the GPU once per batch, pre-computing dataset statistics on CPU, and avoiding frequent tensor transfers inside the training loop.

import torch
import torch.nn as nn

# Device-agnostic code pattern
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using: {device}')  # cuda or cpu

# Move model to device
model = nn.Linear(10, 5).to(device)

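# Move data to device in the training loop
# (non_blocking=True only overlaps the copy when the source tensor is in
#  pinned host memory, e.g. from DataLoader(..., pin_memory=True))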
for X_batch, y_batch in loader:
    X_batch = X_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    y_hat = model(X_batch)
    # ...

# Check which device a tensor is on
t = torch.randn(3)
print(t.device)         # cpu
t_gpu = t.cuda()        # or t.to('cuda:0')
print(t_gpu.device)     # cuda:0

# Apple Silicon
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

# Memory diagnostics
print(torch.cuda.memory_allocated() / 1e9, 'GB allocated')
print(torch.cuda.max_memory_allocated() / 1e9, 'GB peak')
torch.cuda.empty_cache()  # release unused cached GPU memory

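# Multi-GPU: DistributedDataParallel (DDP) preferred over DataParallel.
# DDP runs one process per GPU (e.g. launched with torchrun); each process
# initialises the process group and wraps the model with its own single device id:
# torch.distributed.init_process_group(backend='nccl')
# local_rank = int(os.environ['LOCAL_RANK'])   # set by torchrun; needs `import os`
# model_ddp = nn.parallel.DistributedDataParallel(model.to(local_rank), device_ids=[local_rank])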
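
As a sketch of the device-agnostic pattern described above, the helpers below pick the best available backend and move an arbitrarily nested batch in one call. The names `pick_device` and `to_device` are hypothetical helpers introduced for illustration; they are not PyTorch APIs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pick_device():
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device('cuda')
    mps = getattr(torch.backends, 'mps', None)  # absent on old PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device('mps')
    return torch.device('cpu')

def to_device(batch, device):
    """Recursively move tensors inside lists, tuples, and dicts to `device`."""
    if torch.is_tensor(batch):
        return batch.to(device, non_blocking=True)
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(b, device) for b in batch)
    if isinstance(batch, dict):
        return {k: to_device(v, device) for k, v in batch.items()}
    return batch  # leave non-tensor items (strings, ints) untouched

device = pick_device()
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
# Pinned host memory is what lets non_blocking copies overlap with CPU work on CUDA
loader = DataLoader(dataset, batch_size=16, pin_memory=(device.type == 'cuda'))

X_batch, y_batch = to_device(next(iter(loader)), device)
print(X_batch.device, y_batch.device)
```

Keeping the device decision in one place means the rest of the training code never needs to mention 'cuda' explicitly.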

Why is non_blocking=True in tensor.to(device) beneficial during training?

It prevents the CPU from running out of memory during transfer

✗ Try again.

It allows the CPU-to-GPU transfer to happen asynchronously (provided the source tensor is in pinned host memory), overlapping with other CPU work (like preparing the next batch) rather than blocking until the transfer completes

✓ Correct! Well done.

It automatically pins the memory for faster transfer

✗ Try again.

It tells PyTorch to skip the transfer if the tensor is already on the correct device

✗ Try again.

What is the most important rule to avoid runtime errors when doing GPU computations in PyTorch?

Always call torch.cuda.synchronize() before each forward pass

✗ Try again.

The model and all input tensors must reside on the same device — operations between tensors on different devices (e.g. CPU and GPU) raise a RuntimeError

✓ Correct! Well done.

GPU tensors must always be created with torch.zeros rather than torch.randn

✗ Try again.

The batch size must be a power of 2 for GPU operations

✗ Try again.

22. What are the differences between Batch Norm, Layer Norm, Group Norm, and Instance Norm?

All normalisation variants compute mean and variance and apply the same transformation (x-μ)/√(σ²+ε) — they differ only in which dimensions the mean and variance are computed over. This seemingly small difference has large practical consequences depending on the architecture and batch size.

Normalisation Comparison
Method	Normalises over	Best for	Key limitation
BatchNorm	Batch + spatial dims per channel	CNNs, large batch MLP	Breaks with batch_size=1; train/eval difference
LayerNorm	All features per sample	Transformers, NLP, RNNs	Slower than BN on large spatial dims
InstanceNorm	Spatial dims per channel per sample	Style transfer, GAN	Loses channel statistics
GroupNorm	Spatial dims per group of channels per sample	Object detection, small batch	Requires choosing n_groups

import torch
import torch.nn as nn

# BatchNorm1d: normalise over batch for FC layers
# Input: (N, C) or (N, C, L)
bn = nn.BatchNorm1d(num_features=128)

# LayerNorm: normalise over feature dim(s) — no dependency on batch
# Input: (*, normalized_shape)  — last dims are normalised
ln = nn.LayerNorm(normalized_shape=128)  # used in transformers
ln_2d = nn.LayerNorm([128, 8, 8])        # can normalise spatial too

# GroupNorm: split channels into groups, normalise per group per sample
# Input: (N, C, *)  — C must be divisible by num_groups
gn = nn.GroupNorm(num_groups=8, num_channels=128)

# InstanceNorm: each sample, each channel independently
inst = nn.InstanceNorm2d(num_features=128)

# Example: why LayerNorm is used in transformers
d_model = 512
x = torch.randn(4, 20, d_model)   # (batch, seq_len, d_model)
# BatchNorm would normalise over batch and seq_len per feature dim —
# unstable at inference when batch=1 (as in autoregressive generation)
# LayerNorm normalises over d_model for each (batch, seq) position independently
print(ln(x).shape)  # (4, 20, 512) — each position normalised independently
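
To make the "which dimensions" distinction concrete, the sketch below recomputes each normalisation by hand on an (N, C, H, W) tensor and compares it with the corresponding PyTorch layer. Affine parameters are disabled so that only the statistics differ; the `normalise` helper and the tensor shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8, 5, 5)   # (N, C, H, W)
eps = 1e-5                    # default eps of all four layers

def normalise(t, dims):
    """Apply (t - mean) / sqrt(var + eps) with statistics taken over `dims`."""
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, keepdim=True, unbiased=False)
    return (t - mean) / torch.sqrt(var + eps)

bn_manual = normalise(x, (0, 2, 3))   # batch + spatial, per channel
ln_manual = normalise(x, (1, 2, 3))   # all features, per sample
in_manual = normalise(x, (2, 3))      # spatial, per channel per sample
# 2 groups of 4 channels: reshape to (N, groups, C/groups, H, W)
gn_manual = normalise(x.reshape(4, 2, 4, 5, 5), (2, 3, 4)).reshape(x.shape)

bn = nn.BatchNorm2d(8, affine=False).train()   # train mode -> uses batch statistics
ln = nn.LayerNorm([8, 5, 5], elementwise_affine=False)
inn = nn.InstanceNorm2d(8)                     # affine=False by default
gn = nn.GroupNorm(num_groups=2, num_channels=8, affine=False)

checks = {
    'BatchNorm': torch.allclose(bn(x), bn_manual, atol=1e-4),
    'LayerNorm': torch.allclose(ln(x), ln_manual, atol=1e-4),
    'InstanceNorm': torch.allclose(inn(x), in_manual, atol=1e-4),
    'GroupNorm': torch.allclose(gn(x), gn_manual, atol=1e-4),
}
print(checks)
```

Only the BatchNorm line reduces over dimension 0, which is why it is the only variant whose output for one sample depends on the other samples in the batch.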

Why is LayerNorm preferred over BatchNorm in transformer architectures?

LayerNorm is computationally faster than BatchNorm for all input shapes

✗ Try again.

LayerNorm normalises each sample independently over its feature dimensions — it has no dependency on batch size, making it consistent at training time, validation, and autoregressive inference with batch_size=1

✓ Correct! Well done.

BatchNorm cannot handle the variable sequence lengths common in NLP

✗ Try again.

LayerNorm does not require learnable parameters gamma and beta

✗ Try again.

When is GroupNorm particularly useful compared to BatchNorm?

GroupNorm is always better than BatchNorm

✗ Try again.

GroupNorm is useful in object detection and segmentation where the GPU memory budget forces very small batch sizes — since GroupNorm normalises within each sample rather than across the batch, its estimates are stable even with batch_size=1 or 2

✓ Correct! Well done.

GroupNorm requires larger batch sizes than BatchNorm to function correctly

✗ Try again.

GroupNorm is only applicable to 1D sequence data

✗ Try again.

23. What is an autoencoder and what can a well-trained latent space be used for?

An autoencoder is a neural network trained to reconstruct its input through a bottleneck. The encoder f: X → Z maps inputs to a lower-dimensional latent space Z, and the decoder g: Z → X̂ reconstructs the input. Training minimises the reconstruction loss (e.g. MSE for continuous inputs, binary cross-entropy for binary) without any labels — it is an unsupervised learning technique.

The bottleneck forces the encoder to learn a compressed, information-dense representation. A well-trained latent space can be used for: (1) dimensionality reduction and visualisation (better than PCA for non-linear data); (2) anomaly detection (normal samples reconstruct well; anomalies have high reconstruction error); (3) de-noising (train with noisy input, clean target — denoising autoencoders); (4) generative modelling (Variational Autoencoders / VAEs impose a probabilistic structure on Z that enables generation).

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()  # pixel values in [0,1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

ae = Autoencoder()
optimizer = torch.optim.Adam(ae.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for X_batch, _ in loader:   # labels not used!
    X_flat = X_batch.view(X_batch.size(0), -1)  # flatten images
    X_hat  = ae(X_flat)
    loss   = criterion(X_hat, X_flat)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Anomaly detection at inference:
ae.eval()
with torch.no_grad():
    X_hat = ae(test_samples)
    recon_error = ((test_samples - X_hat) ** 2).mean(dim=1)
# High recon_error => anomalous sample
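
For the generative-modelling use case, a Variational Autoencoder changes two things: the encoder outputs a mean and log-variance instead of a point, and a KL term pulls that distribution towards a standard Gaussian prior. The following is a minimal sketch under assumed layer sizes; the class name `VAE`, the helper `vae_loss`, and the choice of a Bernoulli (BCE) reconstruction term are illustrative assumptions, not the only valid formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),             # logits; sigmoid applied in the loss
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps keeps sampling differentiable
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x_logits, x, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    # Closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)

torch.manual_seed(0)
vae = VAE()
x = torch.rand(8, 784)                     # stand-in for flattened images in [0, 1]
x_logits, mu, logvar = vae(x)
loss = vae_loss(x_logits, x, mu, logvar)
loss.backward()

# Generation: sample z from the prior and decode
with torch.no_grad():
    samples = torch.sigmoid(vae.dec(torch.randn(4, 32)))
print(loss.item(), samples.shape)
```

Because the latent space is regularised towards the prior, decoding random draws from N(0, I) is meaningful, which a plain autoencoder does not guarantee.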

Why can autoencoders detect anomalies based on reconstruction error?

Autoencoders are trained to classify samples as normal or anomalous

✗ Try again.

The encoder-decoder is trained to minimise reconstruction error on normal samples — anomalous inputs differ from the training distribution so the decoder cannot reconstruct them accurately, leading to higher reconstruction error

✓ Correct! Well done.

Anomalies always produce latent vectors with very large magnitude

✗ Try again.

The bottleneck layer compresses anomalies to exactly zero

✗ Try again.

What is the difference between a standard autoencoder and a Variational Autoencoder (VAE)?

VAEs use a larger bottleneck dimension than standard autoencoders

✗ Try again.

A standard autoencoder maps each input to a single point in latent space; a VAE maps each input to a distribution (mean + variance) and samples from it, imposing a prior (usually Gaussian) on the latent space — enabling generation of new samples by sampling from the prior

✓ Correct! Well done.

VAEs use a different reconstruction loss than standard autoencoders

✗ Try again.

VAEs require labelled data while standard autoencoders do not

✗ Try again.

24. How do you diagnose a neural network that is not training correctly from its loss curves?

Reading loss curves is one of the most important practical skills in deep learning. The shape of the training and validation loss over time reveals the failure mode and guides the fix.

Common Training Failure Modes
Loss curve shape	Diagnosis	Likely fix
Loss is NaN from the start	Exploding gradients or bad init	Gradient clipping, lower lr, check data for inf/NaN
Loss doesn't decrease at all	Vanishing gradient, lr too low, dead neurons	Check activations, raise lr, use He init + ReLU
Loss decreases then plateaus early	Learning rate too high or model too small	Reduce lr / lr schedule, increase capacity
Train loss low, val loss high (large gap)	Overfitting	More regularisation: dropout, weight decay, augmentation, early stopping
Both losses plateau at high value	Underfitting (high bias)	Increase model capacity, train longer, reduce regularisation
Loss oscillates wildly	Learning rate too high	Reduce lr, use lr schedule, check batch size

import torch
import torch.nn as nn

# Checking for gradient issues
model = nn.Linear(10, 5)
criterion = nn.MSELoss()   # assumed loss for illustration; `loader` is your DataLoader
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(X), y)

    # Check for NaN/Inf in the loss before backpropagating it
    if not torch.isfinite(loss):
        print(f'Step {step}: non-finite loss = {loss.item()}')
        break
    loss.backward()

    # Monitor the global L2 norm of the gradients
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.norm(2).item() ** 2
    total_norm = total_norm ** 0.5
    if step % 100 == 0:
        print(f'Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}')

    optimizer.step()

# Check dead ReLU neurons: units whose output is zero for EVERY sample in X
def count_dead_neurons(model, X):
    dead_fractions = []
    def hook(m, inp, out):
        flat = out.reshape(out.size(0), -1)   # (batch, units)
        dead_fractions.append((flat <= 0).all(dim=0).float().mean().item())
    handles = [l.register_forward_hook(hook)
               for l in model.modules() if isinstance(l, nn.ReLU)]
    with torch.no_grad(): model(X)
    for h in handles: h.remove()
    return dead_fractions  # fraction of dead units per ReLU layer (use a large batch)
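
The table lists gradient clipping as a fix for exploding gradients. As a minimal sketch, `torch.nn.utils.clip_grad_norm_` both rescales the gradients in place and returns the norm measured before clipping, so it can replace the manual norm loop. The toy model, the inflated targets, and `max_norm = 1.0` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()

X = torch.randn(64, 10)
y = 1000.0 * torch.randn(64, 1)   # deliberately huge targets -> huge gradients

optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()

max_norm = 1.0
# Returns the total gradient norm BEFORE clipping; gradients are rescaled in place
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()   # the update now uses the clipped gradients
print(f'grad norm before={norm_before.item():.1f} after={norm_after.item():.4f}')
```

Logging the returned pre-clip norm over time is a cheap way to see whether clipping fires occasionally (fine) or on every step (usually a sign the learning rate is too high).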

If training loss decreases steadily but validation loss diverges upward, what is the most likely diagnosis?

The learning rate is too low

✗ Try again.

Overfitting — the model is memorising the training data rather than learning generalisable patterns, as evidenced by the growing gap between train and validation performance

✓ Correct! Well done.

The model architecture has too few parameters

✗ Try again.

Batch normalisation is not correctly switched to eval mode

✗ Try again.

What does a training loss that never decreases (stays near its initial value from epoch 1) typically indicate?

The model is overfitting immediately

✗ Try again.

One of several causes: vanishing gradients are preventing updates from reaching early layers, the learning rate is too small, dead neurons are blocking the signal, or there is a bug in the training loop (e.g. forgetting optimizer.step())

✓ Correct! Well done.

The dataset is too small

✗ Try again.

The number of epochs is insufficient

✗ Try again.

25. What is the mathematical setup of a Generative Adversarial Network (GAN) and what training challenges do they have?

A GAN consists of two competing networks: a generator G that maps random noise z ~ p(z) to fake data samples, and a discriminator D that classifies inputs as real or fake. They play a minimax game with objective: min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]. At the Nash equilibrium, G produces samples from the true data distribution and D outputs 0.5 for every input (cannot distinguish real from fake).

In practice, GANs suffer from several well-known training challenges: mode collapse (G learns to produce only a subset of modes of the data distribution); training instability (the minimax game does not converge reliably); and vanishing generator gradient (when D becomes too good early on, it correctly classifies fake samples with near-certainty, giving G near-zero gradient signal). These led to many GAN variants — DCGAN (convolutional architecture), WGAN (Wasserstein distance instead of JS divergence), and progressive growing GANs.

import torch
import torch.nn as nn

latent_dim, img_dim = 100, 784

# Generator: noise -> fake image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, img_dim), nn.Tanh()
)

# Discriminator: image -> real (1) or fake (0)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1)  # raw logit; use BCEWithLogitsLoss
)

criterion = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(generator.parameters(),     lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real_imgs, _ in loader:
    real_imgs = real_imgs.view(-1, img_dim)
    bs = real_imgs.size(0)

    # Train Discriminator
    z = torch.randn(bs, latent_dim)
    fake_imgs = generator(z).detach()  # detach: don't update G here
    loss_D = (criterion(discriminator(real_imgs), torch.ones(bs, 1))
            + criterion(discriminator(fake_imgs), torch.zeros(bs, 1)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train Generator
    z = torch.randn(bs, latent_dim)
    loss_G = criterion(discriminator(generator(z)), torch.ones(bs, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
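
The generator step above labels fake images as real, which corresponds to minimising -log D(G(z)) rather than the original log(1 - D(G(z))). The sketch below compares the gradient of both losses with respect to the discriminator's logit when D confidently rejects a fake; the logit value of -10 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

# Discriminator logit for a fake sample that D rejects with near-certainty:
# D(G(z)) = sigmoid(-10), roughly 4.5e-5
logit = torch.tensor(-10.0, requires_grad=True)

# Original minimax (saturating) generator loss: log(1 - D(G(z)))
saturating = torch.log(1 - torch.sigmoid(logit))
(grad_sat,) = torch.autograd.grad(saturating, logit)

# Non-saturating generator loss: -log D(G(z)), i.e. BCE-with-logits against target 1
non_saturating = -F.logsigmoid(logit)
(grad_ns,) = torch.autograd.grad(non_saturating, logit)

print(f'saturating grad:     {grad_sat.item():.2e}')
print(f'non-saturating grad: {grad_ns.item():.2e}')
```

Analytically the two gradients are -D(G(z)) and -(1 - D(G(z))), so the saturating form vanishes exactly when the generator most needs a signal, while the non-saturating form stays close to -1.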

What is mode collapse in GAN training?

The discriminator loss collapses to zero, preventing the generator from learning

✗ Try again.

The generator learns to produce only a small subset of the modes (variety) of the data distribution — for example, generating only one type of digit when trained on MNIST — because producing that single mode fooled the discriminator

✓ Correct! Well done.

Both the generator and discriminator converge to producing the same output

✗ Try again.

The generator's weights collapse to near-zero values due to vanishing gradients

✗ Try again.

Why does detach() need to be called on generator output when training the discriminator?

detach() prevents the generator from using real images

✗ Try again.

Without detach(), the backward pass for the discriminator loss would unnecessarily compute gradients through the generator's parameters — detach() stops gradient flow at the boundary between the two networks, limiting the discriminator update to its own parameters only

✓ Correct! Well done.

detach() converts the generator output to a leaf tensor for memory efficiency

✗ Try again.

PyTorch requires detach() before passing tensors between different nn.Module objects

✗ Try again.

26. What is torch.compile and how does it speed up PyTorch model execution?

Introduced in PyTorch 2.0, torch.compile applies just-in-time (JIT) compilation to a PyTorch model or function. Rather than executing each operation eagerly (PyTorch's default), it captures the computation as a graph the first time the code runs (via TorchDynamo), optimises it (fusing operations, eliminating redundant memory reads/writes), and compiles it to efficient machine code using a backend (TorchInductor by default, which generates Triton kernels for GPUs and C++ code for CPUs).

The primary benefit is kernel fusion: instead of launching a separate GPU kernel for each operation (e.g. separate kernels for matrix multiply, add bias, and ReLU), the compiler fuses them into a single kernel that reads and writes GPU memory once. GPU memory bandwidth is often the bottleneck for transformer-style models, so reducing memory round-trips directly translates to throughput gains. Speedups in the 10–50% range are often reported for training and inference on modern hardware, though the actual gain depends heavily on the model, batch size, and GPU, so it should be measured rather than assumed.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.GELU(),
    nn.Linear(1024, 512), nn.GELU(),
    nn.Linear(512, 10)
)

# Compile the model — first call triggers compilation (may take 30s+)
compiled_model = torch.compile(model)

# Usage is identical to a regular model
x = torch.randn(256, 1024).cuda()
compiled_model = compiled_model.cuda()
out = compiled_model(x)   # warm-up: triggers compilation
out = compiled_model(x)   # subsequent calls use compiled kernels

# Compilation modes (trade-off between compile time and runtime speed)
model_default = torch.compile(model)                           # default: balanced compile time and speed
model_reduce  = torch.compile(model, mode='reduce-overhead')   # uses CUDA graphs to cut per-call launch overhead
model_max     = torch.compile(model, mode='max-autotune')      # slowest to compile, usually fastest to run

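# Measure speedup (CUDA kernels run asynchronously: synchronise before reading the clock)
import time
x = torch.randn(512, 1024, device='cuda')
for _ in range(5): model(x)   # warm-up
torch.cuda.synchronize()      # make sure warm-up work has finished
t0 = time.perf_counter()
for _ in range(100): model(x)
torch.cuda.synchronize()
print('Eager:', time.perf_counter() - t0)

for _ in range(5): compiled_model(x)   # warm-up; may recompile for the new batch size
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100): compiled_model(x)
torch.cuda.synchronize()
print('Compiled:', time.perf_counter() - t0)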
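
Because GPU kernels are launched asynchronously, timing code is easy to get subtly wrong. The helper below is a small sketch of a reusable timer that warms up, synchronises when the input is on CUDA, and also runs on CPU; `benchmark` is a hypothetical helper name, and the layer sizes are arbitrary.

```python
import time
import torch
import torch.nn as nn

def benchmark(fn, x, warmup=5, iters=50):
    """Return mean seconds per call of fn(x), synchronising if x is on CUDA."""
    def sync():
        if x.is_cuda:
            torch.cuda.synchronize()
    with torch.no_grad():
        for _ in range(warmup):   # warm-up also absorbs any one-time compilation
            fn(x)
        sync()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        sync()
    return (time.perf_counter() - t0) / iters

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 10)).to(device)
x = torch.randn(64, 256, device=device)

eager_s = benchmark(model, x)
print(f'eager: {eager_s * 1e3:.3f} ms/call')
# To compare (requires a working compiler toolchain on the machine):
# compiled_s = benchmark(torch.compile(model), x)
```

The same function can be passed the compiled module unchanged, which keeps the eager and compiled measurements directly comparable.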

What is the primary technique torch.compile uses to accelerate model execution?

It automatically converts FP32 operations to FP16 for speed

✗ Try again.

Kernel fusion — fusing multiple sequential operations (e.g. matmul + bias + activation) into a single GPU kernel that reads and writes GPU memory fewer times, reducing the memory bandwidth bottleneck

✓ Correct! Well done.

It distributes the model across multiple GPUs automatically

✗ Try again.

It converts the Python model to C++ code that runs on the CPU

✗ Try again.

Why does the first call to a torch.compile'd model take much longer than subsequent calls?

PyTorch redownloads the model weights from the internet on first use

✗ Try again.

The first call triggers the actual compilation — graph capture, optimisation, and kernel generation — which is a one-time overhead; subsequent calls execute the pre-compiled, optimised kernels directly

✓ Correct! Well done.

The first call initialises all GPU memory allocations

✗ Try again.

torch.compile runs a validation pass on the first call to check for correctness

✗ Try again.

27. Why do Transformers need positional encodings and how does sinusoidal encoding work?

Self-attention is permutation equivariant — swapping two positions in the input produces the same output with those two positions swapped, because attention treats all positions symmetrically. Without positional information, a transformer cannot distinguish 'The dog bit the man' from 'The man bit the dog'. Positional encodings inject sequence order information into the token embeddings before they enter the transformer.

The original 'Attention is All You Need' paper uses sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}), where pos is the position and i is the dimension index. Each dimension oscillates at a different frequency, giving a unique fingerprint to every position. The key properties: (1) each position has a unique encoding; (2) for any fixed offset k, the encoding at position pos+k is a linear function (a rotation of each sin/cos pair) of the encoding at position pos, which the authors hypothesised would let the model attend by relative position; (3) it is defined for any position, so it can in principle be applied to sequences longer than those seen during training, although good extrapolation is not guaranteed.

import torch
import math

def sinusoidal_positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sin
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cos
    return pe  # (max_len, d_model)

import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = sinusoidal_positional_encoding(max_len, d_model)
        self.register_buffer('pe', pe)  # not a parameter; saved with model

    def forward(self, x):   # x: (batch, seq_len, d_model)
        x = x + self.pe[:x.size(1)]  # add pos encoding to each embedding
        return self.dropout(x)

# Modern alternative: Rotary Position Embeddings (RoPE)
# Used in LLaMA, Mistral — encodes relative rather than absolute position
# Applied directly to Q and K matrices before attention computation
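
The claim that PE(pos+k) is a linear function of PE(pos) can be verified numerically. For each frequency w, the pair (sin(w·pos), cos(w·pos)) is rotated by the angle w·k, so one block-diagonal matrix that depends only on k maps every position to the position k steps later. The sketch below builds that matrix; `sinusoidal_pe` and `shift_matrix` are illustrative helper names, and float64 is used only to make the comparison tight.

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    freqs = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float64) * -(math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(pos * freqs)
    pe[:, 1::2] = torch.cos(pos * freqs)
    return pe, freqs

def shift_matrix(freqs, k):
    """Block-diagonal matrix M_k with PE(pos + k) = M_k @ PE(pos) for every pos."""
    d_model = 2 * freqs.numel()
    M = torch.zeros(d_model, d_model, dtype=torch.float64)
    for i, w in enumerate(freqs.tolist()):
        c, s = math.cos(w * k), math.sin(w * k)
        # Angle-addition formulas for the (sin, cos) pair at dims (2i, 2i+1)
        M[2 * i, 2 * i], M[2 * i, 2 * i + 1] = c, s
        M[2 * i + 1, 2 * i], M[2 * i + 1, 2 * i + 1] = -s, c
    return M

pe, freqs = sinusoidal_pe(max_len=64, d_model=16)
k = 5
M = shift_matrix(freqs, k)

shifted = pe[:-k] @ M.T            # apply M_k to every position at once
max_err = (shifted - pe[k:]).abs().max().item()
print(f'max |M_k PE(pos) - PE(pos+k)| = {max_err:.2e}')
```

Note that M_k does not depend on pos, which is the precise sense in which relative offsets are easy to express in this encoding.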

Why do Transformers need positional encodings when RNNs do not?

Transformers process tokens in parallel and thus lose sequence order information; RNNs process tokens sequentially, so order is inherently encoded in the hidden state updates

✓ Correct! Well done.

RNNs are smaller models that don't need positional information

✗ Try again.

Positional encodings are only needed for text tasks, which RNNs cannot handle

✗ Try again.

Transformers process each batch independently; RNNs share state between batches

✗ Try again.

What is the key advantage of using different sinusoidal frequencies across embedding dimensions for positional encoding?

Higher frequencies are more accurate for long sequences

✗ Try again.

Each frequency oscillates with a different period, giving a unique fingerprint to each position — low frequencies capture coarse position (early vs late in sequence), high frequencies capture fine position differences between adjacent tokens

✓ Correct! Well done.

Different frequencies allow the model to process multiple positions simultaneously

✗ Try again.

It prevents the positional embeddings from interfering with the token embeddings

✗ Try again.

28. What are the most impactful hyperparameters to tune in deep learning and what is the recommended search order?

Deep learning has many hyperparameters, but they are not equally important. Empirical research and practitioner experience have established a rough hierarchy of impact. Tuning in the wrong order wastes compute: finding the optimal dropout rate is pointless if the learning rate is still wildly off.

Hyperparameter Importance Hierarchy
Priority	Hyperparameter	Typical search range
1 (highest)	Learning rate	Log-uniform: 1e-5 to 1e-1
1	Batch size	32, 64, 128, 256, 512
2	Model architecture (depth, width)	Task-specific; start from established baselines
2	Optimizer (Adam vs SGD + momentum)	Usually Adam/AdamW first
3	Weight decay / L2 penalty	Log-uniform: 1e-5 to 1e-1
3	LR schedule and warmup	Cosine with 5-10% warmup steps
4 (lower)	Dropout rate	0.0, 0.1, 0.2, 0.5
4	Batch norm epsilon / momentum	Rarely tuned; defaults usually fine

import optuna
import torch
import torch.nn as nn

def objective(trial):
    # Optuna suggests hyperparameters — log-uniform search for lr
    lr         = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    wd         = trial.suggest_float('weight_decay', 1e-5, 1e-1, log=True)
    n_layers   = trial.suggest_int('n_layers', 2, 6)
    hidden_dim = trial.suggest_categorical('hidden_dim', [128, 256, 512])
    dropout    = trial.suggest_float('dropout', 0.0, 0.5)

    layers = []
    in_dim = 784
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                   nn.Dropout(dropout)]
        in_dim = hidden_dim
    model = nn.Sequential(*layers, nn.Linear(in_dim, 10))

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
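    # train_and_evaluate is a user-supplied helper (not shown) that trains
    # the model and returns its validation accuracy
    val_acc = train_and_evaluate(model, optimizer, n_epochs=10)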
    return val_acc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best params:', study.best_params)
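
The `log=True` argument used for the learning rate matters more than it may appear. As a small sketch using only the standard library, the code below draws samples from the range 1e-5 to 1e-1 both uniformly and log-uniformly and measures how many land in the top decade (1e-2 to 1e-1); the sample count and seed are arbitrary assumptions.

```python
import math
import random

random.seed(0)
low, high, n = 1e-5, 1e-1, 100_000

# Uniform in the value itself
uniform_samples = [random.uniform(low, high) for _ in range(n)]
# Log-uniform: uniform in the exponent, then exponentiate
log_uniform_samples = [
    10 ** random.uniform(math.log10(low), math.log10(high)) for _ in range(n)
]

def fraction_in_top_decade(samples):
    """Share of samples that fall in [1e-2, 1e-1]."""
    return sum(s >= 1e-2 for s in samples) / len(samples)

frac_uniform = fraction_in_top_decade(uniform_samples)
frac_log = fraction_in_top_decade(log_uniform_samples)
print(f'uniform: {frac_uniform:.2f}   log-uniform: {frac_log:.2f}')
```

The uniform sampler puts roughly 90% of its trials in the top decade, while the log-uniform sampler spreads them evenly so each of the four decades receives about a quarter.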

Why should the learning rate be tuned before other hyperparameters like dropout?

Learning rate affects only the first epoch of training

✗ Try again.

Learning rate has the largest single impact on training dynamics — an incorrect lr can prevent any useful learning regardless of other settings; tuning secondary hyperparameters with a bad lr produces misleading results

✓ Correct! Well done.

Optuna always optimises the learning rate first by convention

✗ Try again.

Dropout and learning rate interact multiplicatively, so neither matters independently

✗ Try again.

Why is log-uniform sampling preferred over uniform sampling when searching for learning rates?

Log-uniform sampling is faster to compute

✗ Try again.

Good learning rates span several orders of magnitude (1e-5 to 1e-1). Uniform sampling over that range would place ~90% of trials in the top decade (1e-2 to 1e-1) and almost never explore smaller values; log-uniform sampling gives equal probability to each decade

✓ Correct! Well done.

Uniform sampling does not converge for continuous hyperparameters

✗ Try again.

Log-uniform is required by PyTorch's optimizer implementations

✗ Try again.

29. What is an encoder-decoder architecture and how is it used for sequence-to-sequence tasks?

Encoder-decoder (seq2seq) architectures handle tasks where the input and output are sequences of potentially different lengths — machine translation, summarisation, speech recognition, image captioning. The encoder processes the full input sequence and produces a context representation; the decoder generates the output sequence token by token, conditioning each prediction on the context and all previously generated tokens.

In transformer-based seq2seq, the encoder uses bidirectional self-attention (each position attends to all input positions), while the decoder uses two attention mechanisms: masked self-attention (each output position can only attend to previous output positions, preserving the autoregressive property) and cross-attention (each decoder position attends to all encoder output positions to draw relevant information from the input).

import torch
import torch.nn as nn

# PyTorch's built-in Transformer (encoder-decoder)
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True
)

# Source and target sequences
src = torch.randn(4, 20, 512)   # (batch, src_len, d_model)
tgt = torch.randn(4, 15, 512)   # (batch, tgt_len, d_model)

# Causal mask: prevent decoder from attending to future target tokens
tgt_len = tgt.size(1)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len)

out = transformer(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # (4, 15, 512)

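# Teacher forcing: at training time, feed ground-truth previous tokens
# to the decoder (not its own previous predictions)
# At inference: autoregressive — use model's own previous output.
# nn.Transformer works on embeddings, so token ids must be embedded before the
# decoder and projected to vocabulary logits after it. `embed` (nn.Embedding)
# and `proj` (nn.Linear(d_model, vocab_size)) are assumed to be trained
# alongside the model; `src` is the embedded source, shape (1, src_len, d_model).
def greedy_decode(model, embed, proj, src, max_len, sos_idx, eos_idx):
    memory = model.encoder(src)
    ys = torch.tensor([[sos_idx]], device=src.device)   # (1, 1) token ids
    for _ in range(max_len):
        mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1)).to(src.device)
        out  = model.decoder(embed(ys), memory, tgt_mask=mask)
        next_token = proj(out[:, -1]).argmax(dim=-1, keepdim=True)  # (1, 1)
        ys = torch.cat([ys, next_token], dim=1)
        if next_token.item() == eos_idx: break
    return ys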
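
The effect of the causal mask can be tested directly: if a future token is changed, the outputs at earlier positions must stay the same. The sketch below uses a single `nn.MultiheadAttention` layer with a boolean mask (True marks positions that may not be attended to); the sizes and the perturbation are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 5, 16
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
attn.eval()

# Strict upper triangle = the future; True means "not allowed to attend"
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(1, seq_len, d_model)
x_changed = x.clone()
x_changed[:, -1] += 10.0          # perturb only the LAST token

with torch.no_grad():
    out, _ = attn(x, x, x, attn_mask=causal_mask)
    out_changed, _ = attn(x_changed, x_changed, x_changed, attn_mask=causal_mask)

# Positions 0..3 never see position 4, so their outputs are unaffected
earlier_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
# Position 4 attends to itself, so its output does change
last_changed = not torch.allclose(out[:, -1], out_changed[:, -1], atol=1e-5)
print(earlier_unchanged, last_changed)
```

This is the property that makes teacher forcing valid: all target positions can be trained in parallel because no position can read the tokens it is supposed to predict.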

What is the purpose of the causal (subsequent) mask in the transformer decoder?

It prevents the decoder from attending to padding tokens

✗ Try again.

It prevents each decoder position from attending to future target positions during training — preserving the autoregressive property so the model can't 'cheat' by looking at the answer tokens it hasn't generated yet

✓ Correct! Well done.

It prevents the encoder and decoder from attending to each other

✗ Try again.

It masks padding positions in the source sequence from the cross-attention

✗ Try again.

What is teacher forcing in seq2seq training?

Providing additional labeled data to the decoder during warmup

✗ Try again.

At each decoder step, feeding the ground-truth previous token (not the model's previous prediction) as the decoder input. This speeds up and stabilises training by preventing early mistakes from compounding, but it creates a train/inference mismatch (exposure bias) that techniques such as scheduled sampling try to reduce

✓ Correct! Well done.

Using a pretrained teacher model to generate soft labels for the student

✗ Try again.

Forcing the encoder to attend to only the most relevant source tokens

✗ Try again.

30. What is model quantization in deep learning and how does PyTorch support it?

Quantization reduces model size and inference latency by representing weights and activations in lower-precision integer formats (INT8, INT4, INT2) rather than FP32 or FP16. A 32-bit float weight is replaced by an 8-bit integer plus a scale factor and zero-point: x_float = scale × (x_int - zero_point). This yields 4× memory reduction for INT8, enabling larger models to fit on limited hardware and significantly faster integer arithmetic on CPUs and mobile accelerators.

Three main approaches: (1) static Post-Training Quantization (PTQ): quantize a trained FP32 model without retraining, using a small calibration dataset to determine the activation scale factors and zero-points; (2) Quantization-Aware Training (QAT): simulate quantization noise during training (fake quantization), allowing the model to adapt and typically recovering much of the accuracy lost by PTQ; (3) dynamic quantization: also applied after training, with weights quantized ahead of time and activation ranges computed on the fly at inference, so no calibration data is needed (simplest, good baseline for RNNs).
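
The affine mapping x_float = scale × (x_int - zero_point) can be worked through by hand. The sketch below quantizes a tensor to unsigned 8-bit integers using a scale and zero-point derived from its min and max, then reconstructs it; `affine_quantize` and `dequantize` are illustrative helpers, not PyTorch's quantization API, and real toolchains choose ranges more carefully (e.g. per channel, or with calibration).

```python
import torch

def affine_quantize(x, num_bits=8):
    """Map floats to integers in [0, 2^bits - 1] with a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable
    x_min = min(x.min().item(), 0.0)
    x_max = max(x.max().item(), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0                      # degenerate all-zero input
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return x_int, scale, zero_point

def dequantize(x_int, scale, zero_point):
    # x_float ~= scale * (x_int - zero_point)
    return scale * (x_int.to(torch.float32) - zero_point)

x = torch.linspace(-1.0, 2.0, steps=101)
x_int, scale, zero_point = affine_quantize(x)
x_rec = dequantize(x_int, scale, zero_point)

max_err = (x - x_rec).abs().max().item()
print(f'scale={scale:.5f} zero_point={zero_point} max_err={max_err:.5f}')
print('bytes per element:', x.element_size(), '->', x_int.element_size())
```

The reconstruction error is bounded by about half a quantization step (scale / 2), and each element shrinks from 4 bytes to 1, which is the 4× memory reduction mentioned above.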

import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic, prepare, convert

# ─── Dynamic Quantization (simplest: weights stored as INT8, activations quantized on the fly) ───
model_fp32 = nn.LSTM(input_size=64, hidden_size=128)
model_int8 = quantize_dynamic(
    model_fp32,
    qconfig_spec={nn.Linear, nn.LSTM},
    dtype=torch.qint8
)
print('FP32 size:', sum(p.numel() * 4 for p in model_fp32.parameters()), 'bytes')
# INT8 model is ~4x smaller

# ─── Post-Training Static Quantization ───
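# Eager-mode static quantization needs explicit float<->quantized boundaries
model = nn.Sequential(
    torch.quantization.QuantStub(),      # float -> quantized at the input
    nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10),
    torch.quantization.DeQuantStub()     # quantized -> float at the output
)
model.eval()                             # static quantization expects eval mode
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = prepare(model)  # insert observer modules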

# Calibrate with representative data
model_prepared.eval()
with torch.no_grad():
    for X_cal, _ in calibration_loader:
        model_prepared(X_cal)

model_int8 = convert(model_prepared)  # convert to INT8

# ─── Modern approach: bitsandbytes / llm.int8() for LLMs ───
# 8-bit quantization of LLM weights with minimal accuracy loss
# Allows running 7B+ parameter models on consumer GPUs
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     'gpt2', quantization_config=BitsAndBytesConfig(load_in_8bit=True))

What are the two components needed to convert a quantized INT8 value back to approximate FP32?

A lookup table and an offset

✗ Try again.

A scale factor and a zero-point — the formula x_float ≈ scale × (x_int - zero_point) reconstructs the original value

✓ Correct! Well done.

A bias term and the original weight matrix

✗ Try again.

The layer index and the training step number

✗ Try again.

What is the key advantage of Quantization-Aware Training (QAT) over Post-Training Quantization (PTQ)?

QAT is faster to run than PTQ because no calibration data is needed

✗ Try again.

QAT simulates quantization noise during training, allowing the model to adapt its weights to work well under the integer representation — typically recovering accuracy lost by PTQ, which applies quantization to a model trained entirely in full precision

✓ Correct! Well done.

QAT always produces INT4 models while PTQ only produces INT8

✗ Try again.

QAT can be applied without changing the original training code

✗ Try again.

31. What does a production-quality PyTorch training loop look like, incorporating all best practices?

A well-structured training loop separates concerns cleanly: data loading, forward pass, loss computation, backpropagation, gradient management, metric tracking, and model persistence. Each step has specific pitfalls that silently degrade results.

import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler
from torch.utils.data import DataLoader

def train_epoch(model, loader, optimizer, criterion, device, scaler):
    model.train()
    total_loss, n_correct, n_total = 0.0, 0, 0

    for X, y in loader:
        X, y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)

        # set_to_none=True (the default since PyTorch 2.0) drops the grad
        # tensors instead of filling them with zeros
        optimizer.zero_grad(set_to_none=True)

        # torch.autocast takes device_type; torch.cuda.amp.autocast does not
        with torch.autocast(device_type='cuda', dtype=torch.float16):
            logits = model(X)
            loss   = criterion(logits, y)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item() * X.size(0)
        n_correct  += (logits.argmax(1) == y).sum().item()
        n_total    += X.size(0)

    return total_loss / n_total, n_correct / n_total

@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
    model.eval()
    total_loss, n_correct, n_total = 0.0, 0, 0
    for X, y in loader:
        X, y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)
        logits = model(X)
        loss   = criterion(logits, y)
        total_loss += loss.item() * X.size(0)
        n_correct  += (logits.argmax(1) == y).sum().item()
        n_total    += X.size(0)
    return total_loss / n_total, n_correct / n_total

# Main training loop
best_val_acc = 0
for epoch in range(n_epochs):
    tr_loss, tr_acc = train_epoch(model, train_loader, optimizer,
                                   criterion, device, scaler)
    vl_loss, vl_acc = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    if vl_acc > best_val_acc:
        best_val_acc = vl_acc
        torch.save(model.state_dict(), 'best.pt')
    print(f'Epoch {epoch:3d}: tr={tr_loss:.4f}/{tr_acc:.3f}  '
          f'val={vl_loss:.4f}/{vl_acc:.3f}')
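
One detail in the loop above that is easy to overlook is `loss.item() * X.size(0)`. Assuming the criterion uses mean reduction (the default for `nn.CrossEntropyLoss`), each batch loss is already an average, so the epoch average has to be weighted by batch size. The plain-Python sketch below uses hypothetical batch losses and sizes to show the difference when the last batch is smaller.

```python
def epoch_average(batch_losses, batch_sizes):
    """Sample-weighted mean, matching total_loss / n_total in the loop."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


def mean_of_batch_means(batch_losses):
    """Naive alternative: every batch counts equally, whatever its size."""
    return sum(batch_losses) / len(batch_losses)


# Hypothetical epoch: two full batches of 64 and a final batch of 8
batch_losses = [0.50, 0.40, 1.20]
batch_sizes = [64, 64, 8]

weighted = epoch_average(batch_losses, batch_sizes)
naive = mean_of_batch_means(batch_losses)
print(f'weighted={weighted:.4f}  naive={naive:.4f}')
```

The naive average lets the eight samples in the last batch count as much as 64 samples elsewhere, which inflates the reported loss in this example.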

Why is optimizer.zero_grad(set_to_none=True) preferred over filling the gradients with zeros (set_to_none=False)?

set_to_none=True skips gradient computation for frozen layers

✗ Try again.

Instead of filling gradient tensors with zeros, it sets them to None. This frees the gradient memory between steps, skips a memset per parameter, and lets the next backward pass write gradients directly rather than add to a zero tensor; since PyTorch 2.0 this is the default behaviour of zero_grad()

✓ Correct! Well done.

It makes the backward pass run in parallel across all layers

✗ Try again.

It is required when using mixed precision training with GradScaler

✗ Try again.

Why should loss.item() be called to accumulate running loss rather than loss directly?

loss.item() returns an integer which is easier to sum

✗ Try again.

Calling loss.item() detaches the loss scalar from the computation graph — accumulating loss tensors directly would keep the entire computation graph alive in memory for every batch

✓ Correct! Well done.

loss.item() is required before calling scaler.scale()

✗ Try again.

Summing loss tensors would change the gradient computation for the current batch

✗ Try again.
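
The training loop in this question scales the loss with GradScaler before calling backward. A rough way to see why, sketched in plain Python: the `struct` module's `'e'` format rounds a value to IEEE half precision, so it can stand in for an FP16 tensor element. The gradient value and the scale factor of 2**16 are illustrative assumptions, not measurements.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


tiny_grad = 1e-8          # hypothetical gradient value, below FP16's range
loss_scale = 65536.0      # 2**16, an illustrative scale factor

unscaled = to_fp16(tiny_grad)                 # underflows to zero
scaled = to_fp16(tiny_grad * loss_scale)      # survives in FP16
recovered = scaled / loss_scale               # unscale in higher precision

print(unscaled, scaled, recovered)
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so small gradients are lifted into FP16's representable range and divided back out before the optimizer step.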

32. How does batch size affect deep learning training mathematically and practically?

Batch size controls the trade-off between gradient estimate quality and training speed. With batch size B, the gradient is estimated as the average of the per-sample loss gradients over B samples, and the variance of this estimate is proportional to σ²/B, where σ² is the per-sample gradient variance. Larger batches give lower-variance (more accurate) gradient estimates, but with diminishing returns: doubling the batch size halves the variance (so the standard deviation shrinks only by a factor of √2) while the compute cost per step doubles.
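
The σ²/B relationship can be checked with a quick simulation. This is a sketch in plain Python that assumes per-sample gradients are independent Gaussian noise with σ = 1 around a true gradient of zero; the function name and trial count are arbitrary choices.

```python
import random
import statistics


def batch_gradient_variance(batch_size, trials=4000, sigma=1.0, seed=0):
    """Empirical variance of the mean of `batch_size` noisy per-sample gradients."""
    rng = random.Random(seed)
    batch_means = [
        sum(rng.gauss(0.0, sigma) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    return statistics.pvariance(batch_means)


var_b4 = batch_gradient_variance(4)
var_b16 = batch_gradient_variance(16)
var_b64 = batch_gradient_variance(64)

print(f'B=4: {var_b4:.4f}  B=16: {var_b16:.4f}  B=64: {var_b64:.4f}')
```

Each fourfold increase in batch size should cut the variance by roughly four, while the number of samples processed per step grows by the same factor.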

Generalisation effect: empirically, large batches often lead to sharper minima that generalise worse than the flatter minima found by small batches. The noise in small-batch SGD acts as implicit regularisation: the stochastic gradient trajectory tends to find broader minima, which are more robust to small perturbations. This is the 'large batch training problem'. Mitigations: the linear scaling rule (scale lr proportionally with batch size) and warmup. Gradient accumulation is a related but separate tool: it reaches a large effective batch size when memory is limited, and the accumulated gradient carries the same (low) noise as a true large batch, so it does not restore small-batch noise.
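
The linear scaling rule and warmup mentioned above can be written as two small functions. This is a sketch in plain Python; the function names, the five warmup steps and the linear ramp shape are assumptions for illustration, and other warmup shapes are also used in practice.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: lr grows in proportion to the batch size."""
    return base_lr * new_batch / base_batch


def warmup_lr(step, target_lr, warmup_steps):
    """Linear warmup: ramp from target_lr / warmup_steps up to target_lr."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr


target = scaled_lr(base_lr=1e-3, base_batch=256, new_batch=1024)
schedule = [warmup_lr(s, target, warmup_steps=5) for s in range(8)]

for step, lr in enumerate(schedule):
    print(f'step {step}: lr={lr:.5f}')
```

The warmup keeps the first few updates small, which matters because the scaled learning rate is applied while the weights are still far from any stable region.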

import torch
import torch.nn as nn

model     = nn.Linear(10, 1)
criterion = nn.MSELoss()

# Gradient accumulation: simulate batch_size=1024 with micro_batch=32
accumulation_steps = 32   # effective_batch_size = 32 * 32 = 1024
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for step, (X, y) in enumerate(loader):
    # Forward and backward every micro-batch
    loss = criterion(model(X), y) / accumulation_steps  # scale by 1/K
    loss.backward()  # gradients accumulate, not cleared

    if (step + 1) % accumulation_steps == 0:
        # Clip and step only after accumulating K micro-batches
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

# Linear scaling rule: if you double batch size, double the lr
base_lr    = 1e-3
base_batch = 256
new_batch  = 1024
new_lr     = base_lr * (new_batch / base_batch)  # 4e-3
# But use warmup to stabilise the larger lr at the start
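
To see that dividing each micro-batch loss by the number of accumulation steps reproduces the full-batch gradient, here is a plain-Python check on a one-parameter regression. The data, the weight value and the helper `mse_grad` are hypothetical, and the check assumes equal-sized micro-batches.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical 1-D regression data: 8 samples, split into 4 micro-batches of 2
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0, -2.5, 4.2, 2.9, -1.1, 6.3, 0.2, 1.8]
w = 0.7
micro_batch = 2
accumulation_steps = len(xs) // micro_batch

full_batch_grad = mse_grad(w, xs, ys)

accumulated = 0.0
for k in range(accumulation_steps):
    lo, hi = k * micro_batch, (k + 1) * micro_batch
    # dividing each micro-batch loss by K divides its gradient by K as well
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_batch_grad, accumulated)
```

The two numbers agree up to floating-point rounding, so accumulation changes memory use but not the gradient that the optimizer sees.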

Why do large batch sizes sometimes lead to worse generalization than small batch sizes?

Large batches cause overfitting because more data is processed per update

✗ Try again.

Large batches give lower-variance gradients that consistently point toward sharp minima; small-batch SGD's gradient noise tends to push the optimizer toward broader, flatter minima that generalise better to unseen data

✓ Correct! Well done.

Large batches make it impossible to use learning rate schedules

✗ Try again.

Large batches use more GPU memory, leaving less for the model's parameters

✗ Try again.

What does gradient accumulation achieve, and when is it useful?

It averages gradients across model layers for more stable updates

✗ Try again.

It accumulates gradients over multiple small micro-batches before performing one optimizer step, effectively simulating a large batch size on hardware that cannot fit the large batch in GPU memory at once

✓ Correct! Well done.

It prevents gradient explosion by clipping after each forward pass

✗ Try again.

It is used to implement momentum without additional memory overhead

✗ Try again.

33. How do you choose the right layer type (Linear, Conv, Attention) for a given input modality?

Each layer type encodes different structural assumptions (inductive biases) about the data. Using a layer whose assumptions match the data's structure allows the model to learn faster and with less data than a generic alternative.

Layer Selection by Modality and Structure
Data type	Structure	Recommended layer	Reason
Tabular	No spatial/sequential structure	Linear (MLP)	Feature order is arbitrary; no spatial or sequential structure to exploit
Images	2D spatial locality + translation equivariance	Conv2d	Same pattern anywhere in image; fewer params than FC
Text/sequences	Long-range dependencies, variable length	Transformer (self-attention)	O(1) path length between any two positions
Short sequences / time series	Local temporal patterns	Conv1d or LSTM	Local: Conv1d; long-range: LSTM
Graphs	Irregular node connectivity	Graph Conv (GCN/GAT)	Aggregates neighbor information per node
Point clouds	Permutation invariant 3D	PointNet / sparse conv	Must handle unordered sets

import torch
import torch.nn as nn

# Tabular data: simple MLP
mlp = nn.Sequential(
    nn.Linear(30, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1)
)

# Image: CNN
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global average pooling -> (B, 64, 1, 1)
    nn.Flatten(),
    nn.Linear(64, 10)
)

# Text: embedding + transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=4, dim_feedforward=512,
    dropout=0.1, batch_first=True
)
text_model = nn.Sequential(
    nn.Embedding(10000, 256),
    nn.TransformerEncoder(encoder_layer, num_layers=4)
)

# Time series: Conv1d (local patterns) or LSTM (sequential patterns)
ts_cnn = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=5, padding=2)
ts_rnn = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)
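
The parameter savings from weight sharing are easy to quantify. The sketch below counts parameters in plain Python for a hypothetical case: mapping a 3×224×224 image to 32 feature maps of the same spatial size, once with a 3×3 convolution and once with a fully-connected layer. The helper names are illustrative only.

```python
def conv2d_params(c_in, c_out, kernel_size):
    """Weights plus biases of a square-kernel 2-D convolution."""
    return c_out * (c_in * kernel_size * kernel_size + 1)


def linear_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return out_features * (in_features + 1)


# Mapping a 3x224x224 image to 32 feature maps of the same spatial size
h = w = 224
conv = conv2d_params(c_in=3, c_out=32, kernel_size=3)
dense = linear_params(in_features=3 * h * w, out_features=32 * h * w)
ratio = dense / conv

print(f'conv: {conv:,}  dense: {dense:,}  ratio: {ratio:,.0f}x')
```

The convolution's count does not depend on `h` or `w` at all, whereas the dense layer's count grows with the square of the number of pixels.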

Why is Conv2d more parameter-efficient than a fully-connected layer for image data?

Conv2d layers are always faster due to CUDA optimization

✗ Try again.

Conv2d shares the same filter weights across all spatial positions: a 3×3 filter bank has only 3×3×C_in×C_out weights regardless of image size, while a fully-connected layer needs a separate weight for every input-output unit pair, scaling as H×W×C_in×H'×W'×C_out

✓ Correct! Well done.

Conv2d can only process images of fixed resolution

✗ Try again.

Fully-connected layers cannot be applied to 2D data

✗ Try again.

When would you choose LSTM over a Transformer for sequence modelling?

LSTM always outperforms Transformers for sequences

✗ Try again.

For very long sequences with limited compute or memory, or when processing streaming data token-by-token in production (LSTMs have O(1) inference cost per new token vs O(seq_len) for transformers); also when the training dataset is too small to learn the complex attention patterns

✓ Correct! Well done.

LSTM is required for sequences with more than 512 tokens

✗ Try again.

LSTM is used when the sequence contains numerical values; Transformer for text only

✗ Try again.

34. What evaluation metrics are most commonly used in deep learning tasks and how do you implement them in PyTorch?

The choice of evaluation metric should match the task's real-world objective, not just be the easiest to compute. The training loss and the evaluation metric are often different — models are trained with cross-entropy but evaluated with accuracy, F1, mAP, or BLEU depending on the application.

Metrics by Task
Task	Primary metric	When it falls short
Classification (balanced)	Accuracy	Misleading on imbalanced classes
Classification (imbalanced)	F1 / AUC-ROC / PR-AUC	PR-AUC better than ROC-AUC for severe imbalance
Object detection	mAP (mean Average Precision)	Doesn't account for localisation precision at all scales
Regression	MAE / RMSE / R²	RMSE sensitive to outliers; R² can be negative
Machine translation	BLEU score	Doesn't capture semantic similarity
Language generation	Perplexity / ROUGE / BERTScore	Perplexity measures likelihood under the model, not task quality or factual accuracy
Segmentation	Intersection over Union (IoU / mIoU)	Sensitive to class imbalance
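
The first two rows of the table can be illustrated with a few lines of plain Python. The dataset below is hypothetical (990 negatives, 10 positives) and the "model" simply predicts the majority class; the helper functions are written out by hand rather than taken from a metrics library.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def f1_for_class(preds, labels, cls):
    tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
    fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
    fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(preds, labels, classes):
    return sum(f1_for_class(preds, labels, c) for c in classes) / len(classes)


# Hypothetical dataset: 990 negatives, 10 positives; model always predicts 0
labels = [0] * 990 + [1] * 10
preds = [0] * 1000

acc = accuracy(preds, labels)
f1 = macro_f1(preds, labels, classes=[0, 1])
print(f'accuracy={acc:.3f}  macro-F1={f1:.3f}')
```

Accuracy looks excellent while macro-F1 is below 0.5, because the F1 score for the minority class is zero and macro averaging gives that class equal weight.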

import torch
from torchmetrics import Accuracy, F1Score, AUROC, MeanSquaredError

# torchmetrics: handles accumulation across batches correctly
n_classes = 5
acc  = Accuracy(task='multiclass', num_classes=n_classes)
f1   = F1Score(task='multiclass', num_classes=n_classes, average='macro')
auroc = AUROC(task='multiclass', num_classes=n_classes)

model.eval()
with torch.no_grad():
    for X, y in val_loader:
        logits = model(X)
        preds  = logits.argmax(dim=1)
        probs  = torch.softmax(logits, dim=1)
        acc.update(preds, y)
        f1.update(preds, y)
        auroc.update(probs, y)

print(f'Val Acc:  {acc.compute():.4f}')
print(f'Val F1:   {f1.compute():.4f}')
print(f'Val AUROC:{auroc.compute():.4f}')

# Manual accuracy (without torchmetrics)
all_preds, all_labels = [], []
with torch.no_grad():
    for X, y in val_loader:
        preds = model(X).argmax(1)
        all_preds.append(preds.cpu())
        all_labels.append(y.cpu())
preds  = torch.cat(all_preds)
labels = torch.cat(all_labels)
accuracy = (preds == labels).float().mean()
print(f'Accuracy: {accuracy:.4f}')
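
The reason to accumulate state across batches, rather than average per-batch scores, shows up with any metric that is a ratio of counts. The sketch below is a hand-written, hypothetical `RunningPrecision` class that imitates the update()/compute() pattern in plain Python; it is not the torchmetrics implementation.

```python
class RunningPrecision:
    """Hypothetical binary-precision metric with an update()/compute() API."""

    def __init__(self):
        self.tp = 0
        self.fp = 0

    def update(self, preds, labels):
        # Store only the counts needed for the final ratio
        for p, y in zip(preds, labels):
            if p == 1:
                if y == 1:
                    self.tp += 1
                else:
                    self.fp += 1

    def compute(self):
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0


def batch_precision(preds, labels):
    metric = RunningPrecision()
    metric.update(preds, labels)
    return metric.compute()


# Two hypothetical validation batches as (predictions, labels)
batches = [([1, 1, 1, 1], [1, 1, 1, 0]),
           ([1, 0, 0, 0], [0, 0, 0, 0])]

metric = RunningPrecision()
for preds, labels in batches:
    metric.update(preds, labels)

global_precision = metric.compute()
averaged = sum(batch_precision(p, y) for p, y in batches) / len(batches)
print(f'global={global_precision:.3f}  mean of batches={averaged:.3f}')
```

The accumulated result uses all five positive predictions at once, while the per-batch average lets a single wrong prediction in the second batch count as a full batch with zero precision.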

Why is accuracy a misleading metric for highly imbalanced classification datasets?

Accuracy cannot be computed when classes have different sizes

✗ Try again.

A model predicting only the majority class achieves high accuracy (e.g. 99% if the minority class is 1%) while completely failing to detect the minority class — the metric doesn't reveal this failure

✓ Correct! Well done.

Accuracy does not account for the model's predicted probabilities

✗ Try again.

Accuracy is only valid for binary classification problems

✗ Try again.

What does the torchmetrics library's .update() and .compute() pattern accomplish?

update() trains the metric model; compute() predicts with it

✗ Try again.

update() accumulates predictions and labels across batches without computing the metric; compute() combines all accumulated values to produce the final metric — this is more memory-efficient and numerically correct than computing per-batch metrics and averaging them

✓ Correct! Well done.

update() validates predictions; compute() returns a confidence interval

✗ Try again.

The two methods are interchangeable — update() and compute() perform the same operation

✗ Try again.

35. How do you export a PyTorch model for production deployment using TorchScript or ONNX?

Research-time PyTorch models depend on Python's interpreter and PyTorch's eager execution mode, which can be too slow or carry too many dependencies for some production environments. Two standard serialisation formats allow deploying PyTorch models without Python: TorchScript (PyTorch-native, supports dynamic shapes better) and ONNX (framework-agnostic, runs on TensorRT, OpenVINO, CoreML, ONNX Runtime across many hardware targets).

TorchScript compiles a PyTorch model into an intermediate representation (IR) that can run in C++ via LibTorch, without any Python dependency. It is created either via torch.jit.trace (records operations from a concrete example — doesn't handle data-dependent control flow) or torch.jit.script (analyzes Python source — handles control flow but requires type annotations and a subset of Python). ONNX export traces the model similarly and serialises it to the ONNX protobuf format, which can then be run on any ONNX-compatible runtime.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 5)
).eval()

example_input = torch.randn(1, 10)

# ─── TorchScript: tracing ───────────────────────────────────────────
traced = torch.jit.trace(model, example_input)
traced.save('model_traced.pt')
# Load and run without original Python class:
loaded = torch.jit.load('model_traced.pt')
output = loaded(torch.randn(4, 10))

# ─── TorchScript: scripting (handles if/for in forward) ────────────
@torch.jit.script
def activation(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:
        return torch.relu(x)
    return torch.tanh(x)

# ─── ONNX export ────────────────────────────────────────────────────
torch.onnx.export(
    model,
    example_input,
    'model.onnx',
    input_names=['features'],
    output_names=['logits'],
    dynamic_axes={'features': {0: 'batch_size'},  # variable batch
                  'logits':   {0: 'batch_size'}},
    opset_version=17,
)

# Validate exported ONNX model
import onnx, onnxruntime as ort
onnx.checker.check_model('model.onnx')
sess = ort.InferenceSession('model.onnx')
result = sess.run(None, {'features': example_input.numpy()})

What is the key difference between torch.jit.trace and torch.jit.script?

trace is for CPU models; script is for GPU models

✗ Try again.

trace records the actual operations executed for a specific example input — it cannot capture data-dependent control flow (if/else based on tensor values); script analyzes the Python source code and can handle control flow, but requires type annotations and restricted Python syntax

✓ Correct! Well done.

script produces a larger model file than trace

✗ Try again.

trace only works for nn.Sequential models; script works for all nn.Module subclasses

✗ Try again.

What advantage does ONNX export provide over TorchScript for production deployment?

ONNX models are always smaller than TorchScript models

✗ Try again.

ONNX is a framework-agnostic format that can be executed by many runtimes (TensorRT, OpenVINO, CoreML, ONNX Runtime) across diverse hardware — CPUs, GPUs, mobile, edge devices — without requiring PyTorch or LibTorch to be installed

✓ Correct! Well done.

ONNX models support dynamic computation graphs that TorchScript cannot

✗ Try again.

ONNX automatically optimises the model for the target hardware at export time

✗ Try again.

36. What is knowledge distillation and how does it compress large neural networks into smaller ones?

Knowledge distillation (Hinton et al., 2015) trains a small student network to mimic the output distribution of a large, accurate teacher network. Instead of training only on hard labels (the correct class as a one-hot vector), the student is also trained to match the teacher's soft probabilities — the full output distribution including small probabilities assigned to incorrect classes.

The soft probabilities carry richer information than hard labels: if the teacher assigns 0.7 to 'cat' and 0.25 to 'dog', this communicates that the image looks somewhat cat-like but also dog-like — a nuanced signal the student can learn from. A temperature parameter T sharpens or softens this distribution: p_i = exp(z_i/T) / Σ exp(z_j/T). Higher T produces a softer, more uniform distribution that exposes the teacher's confidence relationships across all classes, giving the student a richer gradient signal. The distillation loss combines the cross-entropy with hard labels and the KL divergence with the teacher's soft targets.
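
The effect of the temperature on the teacher's distribution can be seen with a small plain-Python sketch. The three logits are hypothetical (think cat, dog, car), and the helper names are illustrative.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


teacher_logits = [5.0, 2.0, 0.5]         # hypothetical: cat, dog, car
p_sharp = softmax_with_temperature(teacher_logits, T=1.0)
p_soft = softmax_with_temperature(teacher_logits, T=3.0)

print('T=1:', [round(p, 3) for p in p_sharp])
print('T=3:', [round(p, 3) for p in p_soft])
```

At T = 3 the top class keeps the largest share, but the other classes receive visibly more probability mass, which is the extra signal the student learns from.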

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = BigModel().eval()      # pretrained, frozen
student = SmallModel()           # to be trained

T           = 3.0   # temperature — soften the distributions
alpha       = 0.7   # weight for distillation vs hard-label loss
ce_loss     = nn.CrossEntropyLoss()
kl_div_loss = nn.KLDivLoss(reduction='batchmean')

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

for X, y_hard in loader:
    # Teacher forward (no grad)
    with torch.no_grad():
        teacher_logits = teacher(X)

    # Student forward
    student_logits = student(X)

    # Hard-label cross-entropy
    loss_hard = ce_loss(student_logits, y_hard)

    # Soft-target KL divergence (temperature-scaled)
    student_soft = F.log_softmax(student_logits / T, dim=1)
    teacher_soft = F.softmax(teacher_logits / T, dim=1)
    loss_kl = kl_div_loss(student_soft, teacher_soft) * (T ** 2)
    # T^2 scaling: soft-target gradients shrink as 1/T^2, so this keeps
    # their magnitude comparable to the hard-label term

    loss = alpha * loss_kl + (1 - alpha) * loss_hard
    optimizer.zero_grad(); loss.backward(); optimizer.step()
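
The T² factor in the loop above can be motivated numerically. The plain-Python sketch below uses the analytic gradient of the softened KL term with respect to the student logits, (q - p) / T, where q and p are the student and teacher distributions at temperature T. The logits are hypothetical, and the roughly 1/T² behaviour is an approximation that holds for large T.

```python
import math


def softmax(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_soft(student_logits, teacher_logits, T):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_grad(student_logits, teacher_logits, T):
    """Analytic gradient of kl_soft w.r.t. the student logits: (q - p) / T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return [(qi - pi) / T for pi, qi in zip(p, q)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


student = [1.0, 0.0, -1.0]     # hypothetical logits
teacher = [2.0, 0.5, -2.5]

raw = {T: norm(kl_grad(student, teacher, T)) for T in (1, 4, 50, 100)}
rescaled = {T: g * T ** 2 for T, g in raw.items()}

for T in raw:
    print(f'T={T:3d}  |grad|={raw[T]:.6f}  |grad|*T^2={rescaled[T]:.4f}')
```

The raw gradient norm falls quickly as T grows, while the T²-rescaled norm settles to a roughly constant value, so the soft-target term keeps a comparable weight when the temperature is changed.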

Why are 'soft probabilities' from a teacher network more informative than one-hot hard labels for training a student?

Soft probabilities contain more data per sample than hard labels

✗ Try again.

Soft probabilities encode the teacher's uncertainty and the similarity structure between classes — e.g. a small probability on 'dog' when the true class is 'cat' reveals that the input has dog-like features, providing richer gradient signal than a one-hot label that treats all wrong classes identically

✓ Correct! Well done.

Hard labels cause the cross-entropy loss to become non-convex

✗ Try again.

Soft probabilities allow the student network to have more parameters than the teacher

✗ Try again.

Why is the KL divergence term multiplied by T² in the distillation loss?

T² normalises the loss to have the same scale as cross-entropy

✗ Try again.

Using temperature T in softmax divides the logits by T, which reduces gradient magnitudes by T² via the chain rule — multiplying by T² restores the gradient to the original scale, ensuring the distillation term has appropriate influence relative to the hard-label term

✓ Correct! Well done.

T² prevents the soft targets from summing to more than 1

✗ Try again.

T² is a convention from the original paper with no mathematical justification

✗ Try again.

37. What is self-supervised learning and how do contrastive methods like SimCLR learn representations?

Self-supervised learning (SSL) is a form of unsupervised learning where the model is trained on a pretext task defined entirely from the data itself — no human-provided labels. The learned representations can then be transferred to downstream tasks with few or no labels (linear probe, fine-tuning).

Contrastive methods like SimCLR define a pretext task based on augmentation invariance: for each input, create two random augmented views (crops, colour jitter, flips) and train the model so that representations of the two views of the same image are similar (positive pair), while representations of views from different images are dissimilar (negative pairs). The NT-Xent loss (normalised temperature-scaled cross-entropy) implements this: for a batch of N images (2N views), each view must pick out its single positive among the other 2N-1 views, of which 2(N-1) are negatives.
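
As a reference for the loss described above, here is a loop-based NT-Xent sketch in plain Python. It is written for readability rather than speed, and the 2-D projection vectors and τ = 0.5 are hypothetical values chosen so the result can be inspected by hand.

```python
import math


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def nt_xent(view1, view2, tau=0.5):
    """NT-Xent over N positive pairs; view1[i] and view2[i] share an image."""
    n = len(view1)
    z = [normalize(v) for v in view1 + view2]          # 2N unit vectors
    total = 0.0
    for i in range(2 * n):
        pos = i + n if i < n else i - n                # index of the other view
        sims = [sum(a * b for a, b in zip(z[i], z[k])) / tau
                for k in range(2 * n)]
        # denominator covers every view except i itself
        denom = sum(math.exp(sims[k]) for k in range(2 * n) if k != i)
        total += -math.log(math.exp(sims[pos]) / denom)
    return total / (2 * n)


# Hypothetical 2-D projections for N = 2 images
view1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]        # each view matches its partner
swapped = [[0.1, 0.9], [0.9, 0.1]]        # partners point the wrong way

loss_aligned = nt_xent(view1, aligned)
loss_swapped = nt_xent(view1, swapped)
print(f'aligned={loss_aligned:.4f}  swapped={loss_swapped:.4f}')
```

The loss is lower when each view is closest to its own partner, which is the behaviour the encoder is trained towards.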

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T

# Augmentation pipeline: two random views of the same image
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

class SimCLRLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.tau = temperature

    def forward(self, z1, z2):
        # L2-normalise projections to unit sphere
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        # All 2N representations as rows
        z  = torch.cat([z1, z2], dim=0)   # (2N, d)
        # Pairwise cosine similarities / temperature
        sim_matrix = z @ z.T / self.tau   # (2N, 2N)
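        n = z1.size(0)
        # Positive for row i is its other view: i+n (first half), i-n (second half)
        labels = torch.cat([torch.arange(n, 2*n), torch.arange(n)]).to(z.device)
        # Exclude self-similarity by setting the diagonal to -inf; deleting the
        # diagonal instead would shift column indices and misalign the labels
        self_mask = torch.eye(2*n, dtype=torch.bool, device=z.device)
        sim_matrix = sim_matrix.masked_fill(self_mask, float('-inf'))
        return F.cross_entropy(sim_matrix, labels)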

# After pretraining: linear evaluation
# Freeze backbone, train linear head on downstream task
backbone = resnet50_pretrained
for p in backbone.parameters(): p.requires_grad = False
linear_head = nn.Linear(2048, num_classes)
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

What are positive and negative pairs in contrastive self-supervised learning?

Positive pairs are correctly labelled samples; negative pairs are mislabelled

✗ Try again.

Positive pairs are two different augmented views of the same image (should have similar representations); negative pairs are views from different images (should have dissimilar representations)

✓ Correct! Well done.

Positive pairs are used for training; negative pairs are used for validation

✗ Try again.

Positive pairs have higher cross-entropy loss; negative pairs have lower loss

✗ Try again.

What does the temperature parameter τ (tau) control in the NT-Xent contrastive loss?

It controls how many negative pairs are sampled per positive

✗ Try again.

It scales the cosine similarity before applying softmax — low temperature sharpens the distribution (makes the model more confident, harder negatives matter more); high temperature flattens it (all negatives equally weighted, easier training signal)

✓ Correct! Well done.

It sets the learning rate schedule for the projection head

✗ Try again.

It determines the size of the augmentation crops

✗ Try again.

38. How would you implement and train a simple feedforward neural network in PyTorch from scratch, without using nn.Sequential?

This question tests whether you understand the full PyTorch workflow: defining a custom nn.Module, implementing forward, and running the standard train loop. It is a common practical screen in ML engineering interviews.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ─── 1. Define the model ────────────────────────────────────────────
class FeedForwardNet(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                 dropout: float = 0.1):
        super().__init__()
        self.fc1     = nn.Linear(in_dim, hidden_dim)
        self.bn1     = nn.BatchNorm1d(hidden_dim)
        self.relu    = nn.ReLU()
        self.drop    = nn.Dropout(dropout)
        self.fc2     = nn.Linear(hidden_dim, out_dim)
        self._init_weights()

    def _init_weights(self):
        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.fc1(x)))
        x = self.drop(x)
        return self.fc2(x)

# ─── 2. Create data ──────────────────────────────────────────────────
torch.manual_seed(42)
X = torch.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).long()  # binary label
ds     = TensorDataset(X, y)
loader = DataLoader(ds, batch_size=64, shuffle=True)

# ─── 3. Instantiate model, loss, optimizer ───────────────────────────
device    = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model     = FeedForwardNet(20, 64, 2, dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# ─── 4. Training loop ─────────────────────────────────────────────────
for epoch in range(30):
    model.train()
    epoch_loss = 0.0
    for X_b, y_b in loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch % 5 == 0:
        print(f'Epoch {epoch:3d}: loss={epoch_loss / len(loader):.4f}')

Key interview checkpoints: (1) subclass nn.Module and call super().__init__(); (2) define all layers as attributes in __init__; (3) implement forward; (4) follow the zero-grad → forward → loss → backward → step order; (5) call model.train() before training and model.eval() before evaluation.
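
Checkpoint (2) is easier to remember once the registration mechanism is visible. The following is a simplified, hypothetical model of it in plain Python: `TinyModule` intercepts attribute assignment and records parameters and sub-modules, which is the general idea behind nn.Module, though the real class does considerably more.

```python
class Param:
    """Stand-in for a learnable tensor."""
    def __init__(self, name):
        self.name = name


class TinyModule:
    """Simplified, hypothetical model of how attribute assignment registers
    parameters and sub-modules."""

    def __init__(self):
        object.__setattr__(self, '_params', {})
        object.__setattr__(self, '_children', {})

    def __setattr__(self, key, value):
        if isinstance(value, Param):
            self._params[key] = value
        elif isinstance(value, TinyModule):
            self._children[key] = value
        object.__setattr__(self, key, value)

    def parameters(self):
        yield from self._params.values()
        for child in self._children.values():
            yield from child.parameters()


class TinyLinear(TinyModule):
    def __init__(self, name):
        super().__init__()
        self.weight = Param(f'{name}.weight')
        self.bias = Param(f'{name}.bias')


class Net(TinyModule):
    def __init__(self):
        super().__init__()
        self.fc1 = TinyLinear('fc1')        # registered: assigned as attribute

    def forward(self):
        hidden = TinyLinear('hidden')       # local variable: never registered
        return hidden


net = Net()
net.forward()
registered = [p.name for p in net.parameters()]
print(registered)
```

Only the layer assigned to `self` shows up in `parameters()`; the one built inside `forward` is invisible to anything that walks the module tree, so an optimizer would never update it.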

What is the required order of operations in a standard PyTorch training step?

forward → backward → zero_grad → step

✗ Try again.

zero_grad → forward → loss → backward → step

✓ Correct! Well done.

step → zero_grad → forward → backward

✗ Try again.

backward → forward → zero_grad → loss → step

✗ Try again.

Why must all learnable layers be defined as attributes in nn.Module's __init__ rather than created inside forward()?

forward() is called once; __init__ is called on every batch

✗ Try again.

Attributes defined in __init__ are registered as nn.Parameters or sub-modules, allowing .parameters(), .state_dict(), .to(device), and .train()/.eval() to find and manage them — layers created inside forward() are not registered and would be invisible to these methods

✓ Correct! Well done.

Python garbage collection removes layers created inside forward()

✗ Try again.

PyTorch requires all tensors to be created before the training loop begins

✗ Try again.
