Python / PyTorch Fundamentals Interview Questions

1. What is PyTorch and what are its key advantages over other deep learning frameworks?
2. What is a PyTorch tensor and how does it differ from a NumPy array?
3. What are the most important tensor operations in PyTorch?
4. What are tensor data types (dtypes) in PyTorch and why do they matter?
5. How does broadcasting work in PyTorch and what are the rules?
6. What is autograd in PyTorch and how does it compute gradients?
7. What is the computation graph in PyTorch and how does the dynamic graph differ from a static graph?
8. How do torch.no_grad() and tensor.detach() differ, and when do you use each?
9. What is nn.Module and how do you build a custom neural network in PyTorch?
10. What are nn.Sequential and other container modules in PyTorch?
11. What built-in layers does PyTorch's nn module provide and how do you use the most common ones?
12. What are activation functions in PyTorch and how do you apply them?
13. What are the most important loss functions in PyTorch and when do you use each?
14. What optimizers does PyTorch provide and how do you configure them?
15. What are learning rate schedulers in PyTorch and how do you use them?
16. What are the most common built-in layers in torch.nn and what do they do?
17. How do you initialise weights in a PyTorch model?
18. What loss functions does PyTorch provide and when do you use each?
19. What optimizers does PyTorch provide and how do you choose between them?
20. What are learning rate schedulers in PyTorch and how do you use them?
21. What activation functions are commonly used in PyTorch and how do you choose between them?
22. What loss functions does PyTorch provide and how do you choose the right one?
23. What optimizers does PyTorch provide and what is the difference between SGD, Adam, and AdamW?
24. What is the standard PyTorch training loop and what does each step do?
25. What are Dataset and DataLoader in PyTorch and how do they work together?
26. How do you move tensors and models between CPU and GPU in PyTorch?
27. What is the difference between model.parameters() and model.state_dict() in PyTorch?
28. How do you save and load PyTorch models correctly, including full training checkpoints?
29. What is overfitting and what regularization techniques does PyTorch support to address it?
30. What is the vanishing/exploding gradient problem and how do you detect and fix it in PyTorch?
31. What is weight initialization in PyTorch and why does it matter?
32. What is the difference between nn.Parameter and a regular tensor attribute in nn.Module?
33. How do you implement and use learning rate schedulers in PyTorch?
34. How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?
35. What is the difference between torch.tensor() and torch.Tensor() (capital T) for creating tensors?
36. How does gradient accumulation work in PyTorch and when would you use it?
37. What is mixed precision training in PyTorch and how do you implement it with torch.cuda.amp?
38. What is torch.compile() and how does it speed up PyTorch model execution?
39. What is the difference between batch size, epoch, and iteration in PyTorch training?
40. How do you compute and track evaluation metrics like accuracy during PyTorch training?
41. What is the purpose of torch.manual_seed() and how do you ensure reproducibility in PyTorch?
42. How does PyTorch handle multi-dimensional indexing and slicing of tensors?
43. What is the difference between .view(), .reshape(), and .contiguous() in PyTorch, and why does it matter?
44. How do you freeze layers and perform transfer learning / fine-tuning in PyTorch?
45. What is the purpose of torch.utils.data.random_split() and how do you create train/validation/test splits in PyTorch?
46. What is Batch Normalization in PyTorch and how does it differ from Layer Normalization?
47. How do you implement and use a custom loss function in PyTorch?
48. What is torch.compile() vs TorchScript and how do you export a PyTorch model for production deployment?

1. What is PyTorch and what are its key advantages over other deep learning frameworks?

PyTorch is an open-source deep learning framework developed by Meta AI (Facebook), released in 2016. It is built around two core ideas: tensor computation with GPU acceleration (similar to NumPy but on the GPU) and automatic differentiation via a dynamic computation graph (called define-by-run or eager execution).

PyTorch vs TensorFlow comparison
Feature	PyTorch	TensorFlow 2.x
Graph style	Dynamic (eager by default)	Eager by default (was static in v1)
Debugging	Native Python debugger (pdb, print)	More complex — graph abstractions
Research adoption	Dominant in academia	Strong in production
Deployment	TorchScript, ONNX, TorchServe	TensorFlow Serving, TFLite, TF.js
API feel	Pythonic, NumPy-like	More verbose historically
Community	Fast-growing, most ML papers	Large, enterprise-focused

Key advantages of PyTorch:

Dynamic computation graph — the graph is built at runtime, making debugging with standard Python tools natural
Pythonic API — feels like writing NumPy code; easy to mix with standard Python control flow
Strong GPU support — .cuda() / .to(device) moves tensors to GPU with one call
Rich ecosystem — torchvision, torchaudio, torchtext, HuggingFace Transformers, PyTorch Lightning
Production path — TorchScript, torch.compile, and ONNX export for deployment
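The first three advantages can be seen together in a few lines. The following is a minimal sketch, not a real model: the tensor shapes and the `device` fallback are illustrative assumptions. An ordinary Python `if` sits in the middle of a differentiable computation, and the same code runs on CPU or GPU.

```python
import torch

# Pick a GPU if one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 3, device=device)
w = torch.randn(3, 2, device=device, requires_grad=True)

out = x @ w
# Plain Python control flow: the branch taken is decided at runtime
# and becomes part of the graph recorded for this forward pass.
if out.mean() > 0:
    out = out * 2
loss = out.pow(2).sum()

loss.backward()          # gradients flow through whichever branch ran
print(w.grad.shape)      # torch.Size([3, 2])
```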

Quiz

What type of computation graph does PyTorch use by default?
✗ Static: compiled before execution
✓ Dynamic (define-by-run): built at runtime during the forward pass
✗ Lazy: built only when explicitly evaluated
✗ Symbolic: deferred like SymPy expressions

Which organisation originally developed and open-sourced PyTorch?
✗ Google
✗ Microsoft
✓ Meta AI (Facebook)
✗ OpenAI

2. What is a PyTorch tensor and how does it differ from a NumPy array?

A tensor is PyTorch's core data structure — an n-dimensional array similar to NumPy's ndarray, but with two critical extra capabilities: it can live on a GPU for accelerated computation, and it supports automatic differentiation (autograd) for computing gradients during backpropagation.

import torch
import numpy as np

# Creating tensors
t1 = torch.tensor([1.0, 2.0, 3.0])          # from Python list
t2 = torch.zeros(3, 4)                       # 3×4 zeros
t3 = torch.ones(2, 3)                        # 2×3 ones
t4 = torch.rand(2, 3)                        # uniform random [0,1)
t5 = torch.randn(2, 3)                       # standard normal
t6 = torch.arange(0, 10, 2)                  # [0, 2, 4, 6, 8]
t7 = torch.linspace(0, 1, 5)                 # 5 evenly spaced pts

# Shape, dtype, device
print(t2.shape)     # torch.Size([3, 4])
print(t1.dtype)     # torch.float32
print(t1.device)    # cpu

# NumPy ↔ PyTorch bridge (shares memory on CPU!)
np_array = np.array([1.0, 2.0, 3.0])
torch_from_np = torch.from_numpy(np_array)   # shares memory
np_from_torch = t1.numpy()                   # shares memory

np_array[0] = 99
print(torch_from_np[0])  # tensor(99.) — memory is shared!

Tensor vs NumPy ndarray
Feature	PyTorch Tensor	NumPy ndarray
GPU support	Yes — .to('cuda')	No
Autograd	Yes — requires_grad=True	No
Memory sharing	Yes (CPU tensors)	Yes (via from_numpy)
Default dtype	float32	float64
Broadcasting	Yes (same rules)	Yes
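Two rows of the table are easy to trip over in practice: memory sharing and the default dtype. The sketch below (variable names are illustrative) contrasts `torch.from_numpy`, which shares the NumPy buffer, with `torch.tensor`, which copies it, and shows that a tensor built from a NumPy array inherits float64 rather than PyTorch's float32 default.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])      # NumPy default dtype: float64

shared = torch.from_numpy(arr)       # shares the NumPy buffer
copied = torch.tensor(arr)           # copies the data

arr[0] = 99.0
print(shared[0].item())              # 99.0  (sees the change)
print(copied[0].item())              # 1.0   (independent copy)

# The dtype is inherited from NumPy, not taken from PyTorch's default.
print(shared.dtype)                  # torch.float64
as_f32 = shared.float()              # explicit cast (a new float32 tensor)
print(as_f32.dtype)                  # torch.float32
```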

Quiz

What two capabilities does a PyTorch tensor have that a NumPy array does not?
✗ Indexing and slicing
✓ GPU acceleration and automatic differentiation (autograd)
✗ Broadcasting and matrix multiplication
✗ Integer and float dtypes

What happens to memory when you call torch.from_numpy(arr) on a NumPy array?
✗ PyTorch makes a deep copy of the array
✓ The tensor and the NumPy array share the same memory; modifying one changes the other
✗ The NumPy array is deleted and replaced by the tensor
✗ A read-only view is created; writes to the tensor are blocked

3. What are the most important tensor operations in PyTorch?

PyTorch provides a rich set of tensor operations covering arithmetic, shape manipulation, reduction, and linear algebra. Most have both a functional form (torch.add) and a method form (tensor.add), plus in-place variants with a trailing underscore (tensor.add_).

import torch

a = torch.tensor([[1.,2.,3.],[4.,5.,6.]])
b = torch.tensor([[7.,8.,9.],[10.,11.,12.]])

# ── Arithmetic
print(a + b)          # element-wise add
print(a * b)          # element-wise multiply (Hadamard)
print(torch.matmul(a, b.T))  # matrix multiply  (2×3) @ (3×2) → (2×2)
print(a @ b.T)        # same with @ operator

# ── Shape manipulation
print(a.shape)                    # torch.Size([2, 3])
print(a.reshape(3, 2))            # (3, 2) — new view if possible
print(a.view(6))                  # (6,)   — must be contiguous
print(a.unsqueeze(0).shape)       # (1, 2, 3) — add dim
print(a.unsqueeze(0).squeeze(0).shape)  # (2, 3): squeeze removes a dim of size 1 (no-op otherwise)
print(torch.cat([a, b], dim=0))   # (4, 3) — concatenate rows
print(torch.stack([a, b], dim=0)) # (2, 2, 3) — new dim
print(a.permute(1, 0))            # (3, 2) — transpose

# ── Reduction
print(a.sum())           # scalar sum
print(a.sum(dim=1))      # sum along rows → (2,)
print(a.mean(dim=0))     # mean along columns → (3,)
print(a.max(), a.min())
print(a.argmax())        # index of max (flattened)

# ── In-place (modifies tensor, avoids memory allocation)
a.add_(1)   # a += 1
a.mul_(2)   # a *= 2
# Warning: in-place ops on tensors requiring grad can cause issues!

Key distinction: view never copies. It requires the requested shape to be compatible with the tensor's existing memory layout (in practice, a contiguous tensor) and raises an error otherwise. reshape returns a view when that is possible and silently falls back to a copy when it is not. Use contiguous().view() or just reshape() to be safe.
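A transposed tensor is the usual way to hit this difference. The sketch below (variable names are illustrative) shows `view` failing on a non-contiguous tensor while `reshape` and `contiguous().view()` both succeed.

```python
import torch

a = torch.arange(6).reshape(2, 3)
t = a.t()                       # transpose: same storage, swapped strides
print(t.is_contiguous())        # False

try:
    t.view(6)                   # layout is incompatible with a flat view
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)              # True

flat_reshape = t.reshape(6)             # falls back to a copy
flat_view = t.contiguous().view(6)      # explicit copy, then a view of it
print(flat_reshape.tolist())            # [0, 3, 1, 4, 2, 5]
```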

Quiz

What is the difference between torch.cat and torch.stack?
✗ cat and stack are identical operations
✓ cat concatenates tensors along an existing dimension; stack creates a new dimension and stacks tensors along it
✗ cat requires tensors of the same shape; stack does not
✗ stack is faster than cat for large tensors

What does the trailing underscore in PyTorch method names like tensor.add_() signify?
✗ The method returns a tuple
✓ The operation is performed in-place, modifying the tensor directly without allocating new memory
✗ The method is deprecated
✗ The operation runs on CPU only

4. What are tensor data types (dtypes) in PyTorch and why do they matter?

Every tensor has a dtype that determines the numeric type and precision of its elements. Choosing the right dtype affects memory usage, computation speed, and numeric precision — a critical consideration when training on GPUs.

Common PyTorch dtypes
dtype	Alias	Bits	Use case
torch.float32	torch.float	32	Default for model weights and activations
torch.float64	torch.double	64	High-precision numerical work
torch.float16	torch.half	16	Mixed-precision training (GPU)
torch.bfloat16	—	16	Modern GPUs (A100+); wider exponent than float16
torch.int64	torch.long	64	Indices, class labels, sequence lengths
torch.int32	torch.int	32	General integer computation
torch.bool	—	8	Masks, boolean indexing
torch.uint8	—	8	Image pixel values (0–255)
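The trade-off between the two 16-bit float types in the table can be checked directly with `torch.finfo`. This is a small sketch; the value 70000.0 is just an illustrative number above float16's maximum.

```python
import torch

big = torch.tensor(70000.0)             # float32

print(torch.finfo(torch.float16).max)   # 65504.0
as_half = big.to(torch.float16)         # overflows float16's range
as_bf16 = big.to(torch.bfloat16)        # fits: bfloat16 has float32's exponent range
print(torch.isinf(as_half).item())      # True
print(torch.isfinite(as_bf16).item())   # True

# The price of the wider exponent is fewer mantissa bits (coarser precision).
print(torch.finfo(torch.float16).eps)   # 0.0009765625
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125
```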

import torch

# Creating tensors with specific dtypes
x = torch.tensor([1.0, 2.0], dtype=torch.float32)
y = torch.tensor([1, 2, 3], dtype=torch.long)     # class labels
m = torch.tensor([True, False, True], dtype=torch.bool)

# Casting between dtypes
print(x.dtype)             # torch.float32
x64 = x.double()           # → float64
x16 = x.half()             # → float16
xi  = x.to(torch.int32)   # → int32

# Default dtype (float32 for floats, int64 for ints)
print(torch.tensor([1.0]).dtype)   # torch.float32
print(torch.tensor([1]).dtype)     # torch.int64

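# Change global default (rarely needed). Left commented out here because it
# would make the torch.randn call below return float64 instead of float32.
# torch.set_default_dtype(torch.float64)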

# Why dtype matters for loss computation:
# CrossEntropyLoss expects:
#   input:  float32  (logits)
#   target: int64    (class indices)
loss_fn = torch.nn.CrossEntropyLoss()
logits = torch.randn(4, 10)                 # float32
targets = torch.randint(0, 10, (4,))        # int64
loss = loss_fn(logits, targets)             # works!
# targets_wrong = targets.float()           # would error!

Most common dtype errors: passing float64 weights into a model expecting float32, or passing float targets to a loss function expecting long (e.g. CrossEntropyLoss).
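The first of these errors typically appears when data arrives from NumPy. A minimal sketch, assuming a toy `nn.Linear` layer and an all-ones input chosen purely for illustration:

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)                        # weights are float32 by default
features = torch.from_numpy(np.ones((4, 3)))   # float64, inherited from NumPy

try:
    layer(features)                            # float64 input vs float32 weights
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)                         # True

out = layer(features.float())                  # cast the input to the weights' dtype
print(out.dtype)                               # torch.float32
```

Casting the input with `.float()` is usually preferable to converting the model to float64, which doubles memory use and is slower on most GPUs.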

Quiz

What dtype should class label targets be for PyTorch's CrossEntropyLoss?
✗ torch.float32
✗ torch.float64
✗ torch.int32
✓ torch.int64 (long)

What is the advantage of torch.bfloat16 over torch.float16 for training on modern GPUs?
✗ bfloat16 has more mantissa bits, giving higher precision
✓ bfloat16 has the same 8-bit exponent range as float32, making it less prone to overflow/underflow during training while using half the memory
✗ bfloat16 is twice as fast as float16
✗ bfloat16 is supported on all GPUs, not just NVIDIA

5. How does broadcasting work in PyTorch and what are the rules?

Broadcasting allows PyTorch to perform arithmetic between tensors of different shapes without explicit copying. PyTorch follows the same broadcasting rules as NumPy. Understanding broadcasting is essential to avoid subtle shape bugs.

import torch

# Rule: align shapes from the RIGHT, expand dims of size 1
a = torch.ones(3, 4)     # shape (3, 4)
b = torch.ones(4)        # shape    (4) → treated as (1, 4) → broadcast to (3, 4)
c = a + b                # works! c.shape = (3, 4)

# Adding a bias vector to a batch of activations
batch = torch.randn(32, 128)   # (batch=32, features=128)
bias  = torch.randn(128)       # (128,) broadcasts across the batch dim
out   = batch + bias           # (32, 128) ✓

# Adding column and row vectors → 2D result
col = torch.arange(3).reshape(3, 1)  # (3, 1)
row = torch.arange(4).reshape(1, 4)  # (1, 4)
grid = col + row                      # (3, 4) — outer-sum
print(grid)
# tensor([[0, 1, 2, 3],
#         [1, 2, 3, 4],
#         [2, 3, 4, 5]])

# Common broadcasting errors:
# a = torch.ones(3, 4)
# b = torch.ones(3)        # (3,) aligns to (1, 3) NOT (3, 1)
# a + b  → ERROR: size 4 != size 3 in dimension 1
# Fix: b.reshape(3, 1) to make it (3, 1)

Broadcasting rules (step by step)
Step	Rule
1. Align right	Pad missing leading dimensions with 1
2. Check compatibility	Each dim must be equal, or one of them must be 1
3. Expand size-1 dims	Dimension of size 1 is stretched to match the other tensor
4. Error if incompatible	Raises RuntimeError if no dim is 1 and sizes differ
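The rules in the table can be checked without allocating any data by using `torch.broadcast_shapes`, which returns the resulting shape or raises if the shapes are incompatible. A short sketch using the shapes from the examples above:

```python
import torch

print(torch.broadcast_shapes((3, 1), (1, 4)))      # torch.Size([3, 4])
print(torch.broadcast_shapes((32, 128), (128,)))   # torch.Size([32, 128])

try:
    # (3,) aligns to the LAST dimension, which has size 4: incompatible.
    torch.broadcast_shapes((3, 4), (3,))
    incompatible = False
except RuntimeError:
    incompatible = True
print(incompatible)                                 # True

# Fix: make the vector a column so it aligns with dimension 0.
fixed = torch.ones(3, 4) + torch.ones(3).reshape(3, 1)
print(fixed.shape)                                  # torch.Size([3, 4])
```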

Quiz

Tensors of shape (3,1) and (1,4) are added together. What is the output shape?
✓ (3,4)
✗ (4,3)
✗ Error: incompatible shapes
✗ (1,1)

Tensors of shape (32,128) and (128,) are added. Why does this work?
✗ PyTorch pads the (128,) tensor with zeros to match
✓ The (128,) tensor is treated as (1,128) and broadcast across the batch dimension of 32
✗ PyTorch automatically transposes the smaller tensor
✗ Broadcasting only works for tensors with the same number of dimensions

6. What is autograd in PyTorch and how does it compute gradients?

PyTorch's autograd engine implements automatic differentiation. When you perform operations on tensors with requires_grad=True, PyTorch records every operation in a dynamic computation graph. Calling .backward() on a scalar loss traverses this graph in reverse using the chain rule, accumulating gradients in each tensor's .grad attribute.

import torch

# requires_grad=True tells PyTorch to track this tensor
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Forward pass — operations are recorded
y = x ** 2           # y = [4.0, 9.0]
z = y.sum()          # z = 13.0  (scalar)

# Backward pass — computes dz/dx using chain rule
z.backward()
print(x.grad)        # tensor([4., 6.])  dz/dx = 2x

# Verify: dz/d(x[0]) = d(x[0]^2)/d(x[0]) = 2*x[0] = 4 ✓

# Gradients ACCUMULATE — always zero before next backward!
x.grad.zero_()   # or optimizer.zero_grad()

# Non-leaf tensors (created by ops) have grad_fn
a = torch.tensor(3.0, requires_grad=True)
b = a * 2
print(b.grad_fn)       # <MulBackward0 object>
print(b.requires_grad) # True — inherited from a

# Detach: stop tracking a tensor
c = b.detach()         # c shares data with b but no grad history
print(c.requires_grad) # False

# torch.no_grad(): context manager to disable gradient tracking
with torch.no_grad():
    inference = a * 2   # faster, no graph built
    print(inference.requires_grad)  # False

Key autograd concepts
Concept	What it is
requires_grad=True	Tells autograd to track operations on this tensor
.grad	Accumulated gradient after .backward() — lives on leaf tensors
grad_fn	Reference to the function that created a non-leaf tensor
.backward()	Traverses graph backwards, fills .grad via chain rule
.detach()	Returns tensor with same data but no gradient history
torch.no_grad()	Context: disables gradient tracking (inference, validation)
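The accumulation behaviour of `.grad` is worth seeing once in isolation. In this sketch the same loss is backpropagated repeatedly; without zeroing, the second call adds to the first.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

(x ** 2).sum().backward()
first = x.grad.clone()        # tensor([4., 6.])

(x ** 2).sum().backward()     # no zeroing in between
second = x.grad.clone()       # tensor([8., 12.]): the two passes added up

x.grad.zero_()                # reset, as optimizer.zero_grad() would
(x ** 2).sum().backward()
third = x.grad.clone()        # tensor([4., 6.]) again
print(first, second, third)
```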

Quiz

What method do you call to trigger gradient computation in PyTorch?
✗ loss.forward()
✓ loss.backward()
✗ torch.autograd.compute()
✗ loss.differentiate()

Why must you call optimizer.zero_grad() (or tensor.grad.zero_()) before each backward pass?
✗ backward() raises an error if grads are non-zero
✓ PyTorch accumulates gradients by default; without zeroing, gradients from successive batches add together, corrupting the update
✗ zero_grad() resets the computation graph
✗ It frees GPU memory before the backward pass

7. What is the computation graph in PyTorch and how does the dynamic graph differ from a static graph?

PyTorch builds a dynamic computation graph (also called eager execution or define-by-run). Every time you run the forward pass, a new graph is constructed on-the-fly based on the actual Python code paths executed. This is in contrast to TensorFlow 1.x's static graph, which is compiled once and then executed repeatedly.

import torch

# Dynamic graph: Python control flow works naturally
def dynamic_model(x, use_relu=True):
    h = x @ torch.randn(4, 4)
    if use_relu:           # real Python if — changes the graph!
        h = torch.relu(h)
    else:
        h = torch.tanh(h)
    return h.sum()

x = torch.randn(2, 4, requires_grad=True)

# Each call may build a DIFFERENT graph depending on use_relu
loss1 = dynamic_model(x, use_relu=True)
loss1.backward()   # graph includes ReLU nodes

x.grad.zero_()
loss2 = dynamic_model(x, use_relu=False)
loss2.backward()   # graph includes Tanh nodes

# The graph is discarded after backward() by default
# retain_graph=True keeps it for multiple backward calls
y = (x ** 2).sum()
y.backward(retain_graph=True)   # graph kept
y.backward()                    # can call again

# Inspecting the graph
z = x ** 3
print(z.grad_fn)                # <PowBackward0>
print(z.grad_fn.next_functions) # upstream functions

Dynamic vs Static computation graph
Aspect	Dynamic (PyTorch eager)	Static (TF1 / torch.compile)
When built	At runtime, every forward pass	Once, then reused
Python control flow	Works natively (if/for/while)	Must use special graph ops
Debugging	Use pdb, print anywhere	Harder — graph is opaque
Performance	Slight overhead from graph construction	Faster after compilation
Flexibility	High — easy to change architectures	Low — recompile to change

Quiz

What happens to the computation graph by default after calling loss.backward()?
✗ It is saved for future backward calls
✓ It is destroyed (freed); call backward(retain_graph=True) to keep it
✗ It is converted to a static graph for efficiency
✗ It is moved to the CPU to save GPU memory

What is the main debugging advantage of PyTorch's dynamic computation graph over a static graph?
✗ Dynamic graphs use less memory
✓ You can use standard Python debugging tools (print, pdb, breakpoints) anywhere in the forward pass; the graph is just Python code executing normally
✗ Dynamic graphs are automatically optimised by the compiler
✗ Dynamic graphs support larger batch sizes

8. How do torch.no_grad() and tensor.detach() differ, and when do you use each?

Both torch.no_grad() and .detach() stop gradient tracking, but they work at different levels and serve different purposes.

import torch

model_param = torch.tensor(2.0, requires_grad=True)

# ── torch.no_grad(): context manager — disables ALL grad tracking
# Use for inference and validation loops
with torch.no_grad():
    out = model_param * 3       # no graph built
    print(out.requires_grad)    # False
    print(out.grad_fn)          # None
# Faster + less memory — standard pattern for eval

# ── .detach(): detaches a SPECIFIC tensor from the graph
# The result shares storage with the original but carries no grad history
a = model_param * 4
print(a.requires_grad)          # True  (still tracking)
b = a.detach()                  # b shares data with a
print(b.requires_grad)          # False (disconnected)
print(b.data_ptr() == a.data_ptr())  # True — SAME memory!

# Common use case: compute a "stop gradient" target
# in actor-critic / target networks
target = a.detach()             # stop gradient through target
loss = (a - target) ** 2       # gradient only flows through a, not target

# ── @torch.no_grad() decorator variant
@torch.no_grad()
def predict(x):
    return model_param * x      # no grad even without with block

# Validation loop pattern
def validate(model, loader):
    model.eval()                # dropout off; batchnorm uses running statistics
    with torch.no_grad():       # no gradient computation
        for x, y in loader:
            pred = model(x)
            # compute metrics...

no_grad vs detach comparison
Feature	torch.no_grad()	tensor.detach()
Scope	All ops within the context block	One specific tensor
Memory saved	Yes — no graph built	Partial — graph still exists upstream
Typical use	Inference, validation loops	Target networks, stop-gradient
Output requires_grad	False	False
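The effect of a stop-gradient is easiest to see by comparing gradient values with and without `detach()`. The following sketch uses a made-up scalar parameter `w` and a target defined as half of the prediction; both are illustrative assumptions.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# With a detached target: the target is treated as a constant (4.0).
a = w * 4
target = 0.5 * a.detach()
loss = (a - target) ** 2
loss.backward()
grad_detached = w.grad.clone()   # 2 * (8 - 4) * 4 = 32

# Without detach: the gradient also flows through the target term.
w.grad.zero_()
a = w * 4
loss = (a - 0.5 * a) ** 2        # equals (2w)^2, so d/dw = 8w
loss.backward()
grad_full = w.grad.clone()       # 16
print(grad_detached.item(), grad_full.item())   # 32.0 16.0

# detach() shares storage: an in-place write shows up in the original.
v = torch.ones(3)
d = v.detach()
d[0] = 5.0
print(v)                         # tensor([5., 1., 1.])
```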

Quiz

When should you use torch.no_grad() during model training?
✗ During the forward pass to speed up gradient computation
✓ During validation/inference loops: no gradients are needed, saving memory and computation
✗ When you want to freeze specific layers
✗ When computing the loss function

What is the key difference between tensor.detach() and torch.no_grad()?
✗ detach() is faster than no_grad()
✓ no_grad() disables gradient tracking for all operations in a block; detach() disconnects one specific tensor from the graph, and the detached tensor still shares data with the original
✗ detach() clears existing gradients; no_grad() prevents new ones
✗ no_grad() only works on CPU tensors

9. What is nn.Module and how do you build a custom neural network in PyTorch?

nn.Module is the base class for all neural network components in PyTorch. Subclassing it gives you parameter management, device placement, train/eval mode toggling, state dict serialisation, and hooks — all for free.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()   # MUST call this first!

        # Layers defined as attributes are auto-registered as sub-modules
        self.fc1  = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.3)
        self.fc2  = nn.Linear(hidden, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Define the forward computation."""
        x = self.fc1(x)
        x = self.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x

# Instantiate and inspect
model = MLP(in_features=784, hidden=256, out_features=10)

# Forward pass — calls forward() via __call__
x = torch.randn(32, 784)   # batch of 32
out = model(x)              # shape (32, 10)

# Parameter inspection
for name, param in model.named_parameters():
    print(name, param.shape, param.requires_grad)
# fc1.weight  torch.Size([256, 784])  True
# fc1.bias    torch.Size([256])       True
# fc2.weight  torch.Size([10, 256])   True
# fc2.bias    torch.Size([10])        True

total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")

Critical rules:

Always call super().__init__() in __init__
Define layers as attributes (not local variables) so PyTorch registers them
Implement forward() to define the computation, but do not call it directly
Call model(x) rather than model.forward(x) so that registered pre- and post-forward hooks fire
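The last rule can be verified with a forward hook. In this sketch, `record_shape` is a hypothetical hook function written for the example; it fires when the module is called but not when `forward()` is invoked directly.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
calls = []

def record_shape(module, inputs, output):
    # Forward hooks receive the module, its inputs, and its output.
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_shape)

x = torch.randn(3, 4)
layer(x)            # goes through __call__, so the hook fires
layer.forward(x)    # bypasses __call__, so the hook is skipped
print(calls)        # [(3, 2)]

handle.remove()     # detach the hook when it is no longer needed
```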

Quiz

Why must you call super().__init__() at the start of an nn.Module subclass's __init__?
✗ It imports the required PyTorch libraries
✓ It initialises nn.Module's internal bookkeeping (parameter registry, hooks, state); without it, attributes like named_parameters() and .to(device) will not work
✗ It sets the random seed for weight initialisation
✗ It registers the model with PyTorch's global model registry

What is the difference between calling model.forward(x) and model(x)?
✗ model(x) is a shorthand that is exactly equivalent to model.forward(x)
✓ model(x) invokes __call__ which runs registered forward hooks before and after forward(); always use model(x) in practice
✗ model.forward(x) is the correct API; model(x) is deprecated
✗ model(x) moves tensors to the correct device automatically

10. What are nn.Sequential and other container modules in PyTorch?

PyTorch provides several container modules that compose layers without requiring a custom nn.Module subclass. They are convenient for simple feedforward architectures but less flexible than full subclassing.

import torch
import torch.nn as nn

# ── nn.Sequential: layers applied in order
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
out = model(torch.randn(32, 784))  # (32, 10)

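# Named layers in Sequential (for easier access): pass an OrderedDict,
# since bare (name, module) tuples are not accepted
from collections import OrderedDict

model_named = nn.Sequential(OrderedDict([
    ("fc1",  nn.Linear(784, 256)),
    ("relu", nn.ReLU()),
    ("fc2",  nn.Linear(256, 10)),
]))
print(model_named.fc1.weight.shape)   # torch.Size([256, 784])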

# ── nn.ModuleList: list of modules (for dynamic use)
class ResNet(nn.Module):
    def __init__(self, n_blocks: int):
        super().__init__()
        # ModuleList properly registers all contained modules
        self.blocks = nn.ModuleList([
            nn.Linear(64, 64) for _ in range(n_blocks)
        ])
    def forward(self, x):
        for block in self.blocks:
            x = torch.relu(block(x)) + x  # residual
        return x

# ── nn.ModuleDict: dict of modules (for conditional routing)
class MultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleDict({
            "sentiment": nn.Linear(128, 2),
            "topic":     nn.Linear(128, 10),
        })
    def forward(self, x, task: str):
        return self.heads[task](x)

Container modules
Container	When to use
nn.Sequential	Simple feedforward chains; no branching
nn.ModuleList	Dynamic or variable-length list of modules in a loop
nn.ModuleDict	Named modules selected conditionally (e.g. multi-task)
nn.ParameterList	List of nn.Parameter objects (rare)
nn.ParameterDict	Dict of nn.Parameter objects (rare)
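What "properly registers" means is clearest when a plain Python list is used by mistake. The two classes below are hypothetical and differ only in how they store their layers; counting parameters shows which one `nn.Module` can actually see.

```python
import torch.nn as nn

class PlainListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: the layers exist, but nn.Module never registers them.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: each layer is registered as a sub-module.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

plain_count = len(list(PlainListNet().parameters()))
registered_count = len(list(ModuleListNet().parameters()))
print(plain_count, registered_count)   # 0 6  (3 weights + 3 biases)
```

With zero visible parameters, an optimizer built from `PlainListNet().parameters()` would have nothing to update, and `.to(device)` and `state_dict()` would skip the layers as well.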

Quiz

Why use nn.ModuleList instead of a plain Python list to store layers?
✗ ModuleList is faster than a Python list
✓ nn.ModuleList properly registers contained modules so their parameters appear in model.parameters(), are moved with .to(device), and saved in state_dict(); plain lists are invisible to nn.Module
✗ Python lists cannot hold nn.Module objects
✗ ModuleList supports iteration but Python list does not

When is nn.Sequential NOT a suitable choice for building a model?
✗ When the model has more than 10 layers
✓ When the forward pass requires branching, skip connections, or access to intermediate outputs; Sequential only supports a linear chain of operations
✗ Sequential cannot be used with GPU acceleration
✗ Sequential does not support Dropout layers

11. What built-in layers does PyTorch's nn module provide and how do you use the most common ones?

PyTorch's torch.nn module contains all the standard neural network building blocks. Understanding what each layer does mathematically helps you choose the right component and configure it correctly.

Most common nn layers
Layer	Formula / purpose	Key parameters
nn.Linear	y = xW^T + b — fully connected	in_features, out_features, bias=True
nn.Conv2d	2D cross-correlation — feature extraction	in_channels, out_channels, kernel_size, stride, padding
nn.BatchNorm1d/2d	Normalise over batch; learnable γ, β	num_features, eps, momentum
nn.Dropout	Zero random activations with probability p during training; survivors are scaled by 1/(1−p)	p (dropout probability)
nn.Embedding	Learnable lookup table for integer tokens	num_embeddings, embedding_dim
nn.LSTM	Long Short-Term Memory recurrent layer	input_size, hidden_size, num_layers
nn.MultiheadAttention	Scaled dot-product attention	embed_dim, num_heads
nn.LayerNorm	Normalise over feature dims per sample	normalized_shape
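For nn.Conv2d, the key parameters determine the output's spatial size as floor((size + 2·padding − dilation·(kernel_size − 1) − 1) / stride) + 1. The sketch below wraps that formula in a hypothetical helper, `conv2d_out_size` (not a PyTorch function), and checks it against real layers.

```python
import torch
import torch.nn as nn

def conv2d_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

x = torch.randn(8, 3, 32, 32)
same = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)   # keeps H, W
down = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)   # halves H, W

print(conv2d_out_size(32, 3, stride=1, padding=1), same(x).shape[-1])  # 32 32
print(conv2d_out_size(32, 3, stride=2, padding=1), down(x).shape[-1])  # 16 16
```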

import torch, torch.nn as nn

# nn.Linear
fc = nn.Linear(128, 64)        # (batch, 128) → (batch, 64)
print(fc.weight.shape)          # (64, 128)  — transposed internally
print(fc.bias.shape)            # (64,)

# nn.Conv2d
conv = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3,
    stride=1,
    padding=1,           # "same" padding preserves H, W
)
x_img = torch.randn(8, 3, 32, 32)   # (batch, C, H, W)
print(conv(x_img).shape)             # (8, 32, 32, 32)

# nn.BatchNorm2d
bn = nn.BatchNorm2d(32)       # num_features = channels
# In train mode: normalises over (N, H, W) per channel
# In eval mode:  uses running mean/var from training

# nn.Embedding
emb = nn.Embedding(num_embeddings=10000, embedding_dim=128)
tokens = torch.tensor([5, 23, 100])  # integer token ids
print(emb(tokens).shape)              # (3, 128)

# nn.Dropout — active only in train mode
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)
print(drop(x))   # train mode: ~half zeros, survivors scaled to 2.0; after drop.eval(): all ones
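Dropout does more than zero values: in training mode the surviving activations are scaled by 1/(1 − p) so that the expected activation stays the same. A small sketch with a fixed seed (the seed and tensor size are arbitrary choices for the example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                               # training mode (the default)
y_train = drop(x)
print(sorted(y_train.unique().tolist()))   # [0.0, 2.0]: dropped or scaled by 1/(1-p)

drop.eval()                                # evaluation mode: identity
y_eval = drop(x)
print(torch.equal(y_eval, x))              # True
```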

Quiz

What input shape does nn.Conv2d expect in PyTorch?
✗ (batch, H, W, channels), channels last
✓ (batch, channels, H, W), channels first
✗ (channels, batch, H, W)
✗ (H, W, batch, channels)

When does nn.Dropout zero out neurons?
✗ During both training and inference for regularisation
✓ Only during training (model.train() mode); it passes all values through unchanged during evaluation (model.eval())
✗ Only during inference to reduce computation
✗ Only when p > 0.5

12. What are activation functions in PyTorch and how do you apply them?

Activation functions introduce non-linearity into neural networks, enabling them to learn complex mappings. PyTorch provides them both as nn.Module classes (usable as layers) and as functional forms in torch.nn.functional.

Common activation functions
Function	Formula	Typical use
ReLU	max(0, x)	Default for hidden layers (fast, avoids vanishing grad)
LeakyReLU	max(αx, x), α≈0.01	When dying ReLU is a problem
Sigmoid	1/(1+e^−x) → (0,1)	Binary classification output
Tanh	(e^x−e^−x)/(e^x+e^−x) → (−1,1)	RNNs, zero-centred alternative to sigmoid
Softmax	e^xᵢ/Σe^xⱼ → sums to 1	Multi-class probabilities at inference (for training, pass raw logits to CrossEntropyLoss, or LogSoftmax output to NLLLoss)
GELU	x·Φ(x) smooth	Transformers (BERT, GPT)
SiLU/Swish	x·sigmoid(x)	Modern architectures (EfficientNet)


import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2., -1., 0., 1., 2.])

# ── As nn.Module (use inside nn.Sequential or __init__)
relu = nn.ReLU()
print(relu(x))          # [0, 0, 0, 1, 2]

sigmoid = nn.Sigmoid()
print(sigmoid(x))       # [0.12, 0.27, 0.50, 0.73, 0.88]

# ── As functional (use inside forward())
print(F.relu(x))        # same as nn.ReLU()(x)
print(F.gelu(x))        # x·Φ(x): smooth, slightly negative for small negative x

# Softmax: dim must be specified!
logits = torch.randn(4, 10)   # (batch=4, classes=10)
probs = F.softmax(logits, dim=1)   # dim=1 (classes)
print(probs.sum(dim=1))            # tensor([1., 1., 1., 1.])

# !! Never apply Softmax before CrossEntropyLoss !!
# CrossEntropyLoss = LogSoftmax + NLLLoss internally
# Applying softmax first → double-softmax = wrong!
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, torch.randint(0, 10, (4,)))  # pass raw logits!
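
The warning about softmax and CrossEntropyLoss can be verified numerically. This is a small sketch with random logits (the seed and shapes are arbitrary assumptions): it compares CrossEntropyLoss on raw logits with the explicit LogSoftmax + NLLLoss path, and then with the mistaken "softmax first" path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)               # raw scores for 4 samples, 10 classes
targets = torch.randint(0, 10, (4,))

# Correct: raw logits go straight into CrossEntropyLoss
ce = nn.CrossEntropyLoss()(logits, targets)

# Equivalent: explicit log-softmax followed by NLLLoss
manual = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Mistake: probabilities passed where logits are expected
double = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual))   # the two correct paths agree
print(ce.item(), double.item())     # the double-softmax value differs
```

Because softmax outputs lie in [0, 1], the second softmax sees almost no spread between classes, so the loss stays near log(C) and the gradients carry little signal.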

Take quiz

Why should you NOT apply nn.Softmax before nn.CrossEntropyLoss?

Softmax makes gradients too large

✗ Try again.

CrossEntropyLoss applies LogSoftmax internally — applying Softmax first causes a double-softmax which produces incorrect loss values and gradients

✓ Correct! Well done.

Softmax only works on 1D tensors

✗ Try again.

CrossEntropyLoss expects probabilities summing to greater than 1

✗ Try again.

What is the key advantage of ReLU over sigmoid/tanh as an activation function in deep networks?

ReLU outputs are always positive, making training more stable

✗ Try again.

ReLU's gradient is exactly 1 for positive inputs — it does not saturate in the positive region, which mitigates the vanishing gradient problem that sigmoid and tanh suffer from in deep networks

✓ Correct! Well done.

ReLU is differentiable everywhere

✗ Try again.

ReLU uses less memory than sigmoid

✗ Try again.

13. What are the most important loss functions in PyTorch and when do you use each?

Choosing the right loss function is critical — it defines what the model is optimising for. PyTorch provides loss functions in torch.nn as modules and in torch.nn.functional as functions.

Common PyTorch loss functions
Loss	Use case	Input / Target
nn.MSELoss	Regression — minimise squared error	pred: float, target: float
nn.L1Loss (MAE)	Regression — robust to outliers	pred: float, target: float
nn.CrossEntropyLoss	Multi-class classification	pred: (N,C) logits, target: (N,) long
nn.BCEWithLogitsLoss	Binary classification (numerically stable)	pred: (N,) logits, target: (N,) float 0/1
nn.NLLLoss	Used with log-softmax output	pred: (N,C) log-probs, target: (N,) long
nn.KLDivLoss	Distribution divergence (VAE, distillation)	pred: log-probs, target: probs
nn.HuberLoss	Regression robust to outliers	pred: float, target: float


import torch, torch.nn as nn

batch = 8

# ── Regression
pred = torch.randn(batch, 1)
target = torch.randn(batch, 1)
mse  = nn.MSELoss()(pred, target)
mae  = nn.L1Loss()(pred, target)
print(mse, mae)

# ── Multi-class classification
logits  = torch.randn(batch, 10)        # raw scores, NOT softmax
labels  = torch.randint(0, 10, (batch,)) # class indices, dtype=long
ce_loss = nn.CrossEntropyLoss()(logits, labels)
print(ce_loss)

# Class-weighted cross entropy (handle imbalance)
weights = torch.tensor([1.0]*9 + [5.0])  # upweight class 9
ce_weighted = nn.CrossEntropyLoss(weight=weights)(logits, labels)

# ── Binary classification (single output neuron)
bin_logits = torch.randn(batch)           # single score
bin_labels = torch.randint(0, 2, (batch,)).float()  # 0 or 1, float!
bce_loss   = nn.BCEWithLogitsLoss()(bin_logits, bin_labels)
# BCEWithLogitsLoss = sigmoid + BCE in one numerically stable op

# ── Label smoothing (reduces overconfidence)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, labels)

# reduction parameter
nn.MSELoss(reduction="mean")    # default: mean over batch
nn.MSELoss(reduction="sum")     # sum over batch
nn.MSELoss(reduction="none")    # per-sample loss (no reduction)
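
To see why BCEWithLogitsLoss is described as numerically stable, it helps to feed both paths an extreme logit. The sketch below uses a single, deliberately exaggerated value (200.0 is an illustrative assumption); in float32, sigmoid(200) rounds to 1.0, so the separate sigmoid + BCELoss path can no longer tell how wrong the prediction is.

```python
import torch
import torch.nn as nn

logit = torch.tensor([200.0])   # extreme, confidently wrong prediction
target = torch.tensor([0.0])

# Fused path: works on the logit directly
stable = nn.BCEWithLogitsLoss()(logit, target)

# Separate path: sigmoid saturates before the loss ever sees the value
naive = nn.BCELoss()(torch.sigmoid(logit), target)

print(stable.item())   # grows with the size of the logit
print(naive.item())    # capped, because BCELoss clamps its log terms
```

The fused loss keeps growing linearly with the logit, while the separate path is limited by the clamp that BCELoss applies to avoid log(0).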

Take quiz

What input format does nn.CrossEntropyLoss expect?

Softmax probabilities as input, one-hot targets

✗ Try again.

Raw logits (N, C) as input and integer class indices (N,) of dtype long as targets

✓ Correct! Well done.

Log-softmax probabilities and float targets

✗ Try again.

Normalised predictions and binary targets

✗ Try again.

Why use nn.BCEWithLogitsLoss instead of applying sigmoid first and then nn.BCELoss?

BCEWithLogitsLoss is faster

✗ Try again.

BCEWithLogitsLoss combines sigmoid and BCE in a single numerically stable operation using the log-sum-exp trick — applying sigmoid first can cause underflow/overflow for extreme logit values

✓ Correct! Well done.

BCELoss does not support GPU tensors

✗ Try again.

BCEWithLogitsLoss handles multi-class problems

✗ Try again.

14. What optimizers does PyTorch provide and how do you configure them?

Optimizers update model parameters based on computed gradients. PyTorch provides all major optimizers in torch.optim. Choosing and configuring the right optimizer significantly affects training speed and final performance.

PyTorch optimizers
Optimizer	Key feature	Typical use
SGD	Simple, supports momentum and weight decay	Computer vision with lr scheduling
Adam	Adaptive lr per param; momentum + RMSProp	Default for NLP, general purpose
AdamW	Adam with decoupled weight decay	Transformers, fine-tuning (recommended over Adam)
RMSprop	Adaptive lr from a running average of squared gradients (momentum optional, off by default)	RNNs
Adagrad	Accumulates squared gradients; rare today	Sparse features
LBFGS	Second-order quasi-Newton; very slow	Small networks, physics-informed NNs


import torch, torch.nn as nn, torch.optim as optim

model = nn.Linear(128, 10)

# ── SGD with momentum and weight decay
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,       # momentum coefficient
    weight_decay=1e-4,  # L2 regularisation
    nesterov=True,      # use Nesterov momentum (requires momentum > 0)
)

# ── Adam
opt_adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999), # (β1, β2): decay rates for the first- and second-moment estimates
    eps=1e-8,
    weight_decay=0,     # Adam + L2 is suboptimal — use AdamW!
)

# ── AdamW (preferred for transformers)
opt_adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01,  # decoupled from gradient update
)

# ── Per-layer learning rates
opt_layerwise = optim.Adam([
    {"params": model.weight, "lr": 1e-4},  # slower for weight
    {"params": model.bias,   "lr": 1e-3},  # faster for bias
])

# ── Standard training step
opt = optim.AdamW(model.parameters(), lr=1e-3)
for x, y in [(torch.randn(32,128), torch.randint(0,10,(32,)))]:
    opt.zero_grad()               # 1. clear old gradients
    loss = nn.CrossEntropyLoss()(model(x), y)  # 2. forward
    loss.backward()              # 3. backward
    opt.step()                   # 4. update parameters
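
Step 1 (zero_grad) matters because PyTorch adds new gradients to whatever is already stored in .grad. A minimal sketch with a single hypothetical parameter w makes the accumulation visible:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

(2 * w).sum().backward()
first = w.grad.clone()          # d(2w)/dw = 2

(2 * w).sum().backward()        # no zero_grad(): the new gradient is ADDED
accumulated = w.grad.clone()

opt.zero_grad()                 # clear stale gradients
(2 * w).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

Skipping zero_grad is occasionally intentional (gradient accumulation over several small batches), but in a plain training step it silently mixes gradients from different batches.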

Take quiz

What is the key difference between Adam and AdamW?

AdamW uses a different momentum calculation

✗ Try again.

AdamW decouples weight decay from the gradient update — in Adam, weight decay is incorrectly folded into the adaptive learning rate scaling, weakening its regularisation effect

✓ Correct! Well done.

AdamW is faster than Adam

✗ Try again.

AdamW does not use momentum

✗ Try again.

What is the correct order of operations in a PyTorch training step?

forward → backward → zero_grad → step

✗ Try again.

zero_grad → forward → backward → step

✓ Correct! Well done.

backward → zero_grad → forward → step

✗ Try again.

forward → zero_grad → step → backward

✗ Try again.

15. What are learning rate schedulers in PyTorch and how do you use them?

A learning rate scheduler adjusts the learning rate during training — typically starting high for fast initial progress and decaying for fine-grained convergence. Schedulers wrap an optimizer and must be stepped after each epoch (or each batch for some schedulers).

Common LR schedulers
Scheduler	Behaviour	Step
StepLR	Multiply lr by gamma every step_size epochs	Per epoch
MultiStepLR	Decay at specified milestone epochs	Per epoch
ExponentialLR	lr *= gamma every epoch	Per epoch
CosineAnnealingLR	Cosine decay from lr to eta_min	Per epoch
OneCycleLR	Warmup then cosine decay (superconvergence)	Per batch
ReduceLROnPlateau	Reduce lr when metric stops improving	Per epoch (with metric)
CosineAnnealingWarmRestarts	Cosine with periodic restarts	Per epoch


import torch, torch.optim as optim
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# ── StepLR: multiply lr by 0.1 every 30 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# ── CosineAnnealingLR: smooth cosine decay
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# ── OneCycleLR: requires total_steps at init
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=100 * 1000,  # epochs * batches_per_epoch (example values)
)

# ── ReduceLROnPlateau: triggered by validation loss
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5  # `verbose` is deprecated in recent PyTorch releases
)

# ── Training loop integration
for epoch in range(100):
    train_loss = 0.0  # ... train (optimizer.step() per batch) ...
    val_loss   = 0.0  # ... validate ...

    # `scheduler` is the last one assigned above (ReduceLROnPlateau),
    # which needs the monitored metric:
    scheduler.step(val_loss)
    # scheduler.step()            # StepLR, CosineAnnealingLR etc. take no argument

    # Check current lr
    current_lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: lr={current_lr:.6f}")
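
To make the "multiply lr by gamma every step_size epochs" rule concrete, the sketch below traces StepLR over six epochs. The tiny model and the values step_size=2, gamma=0.1 are illustrative assumptions; the bare optimizer.step() stands in for a full epoch of training.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])   # lr used during this epoch
    optimizer.step()      # placeholder for one epoch of training
    scheduler.step()      # epoch-based scheduler: one call per epoch

print(lrs)   # two epochs at 0.1, two at 0.01, two at 0.001 (up to float rounding)
```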

Take quiz

When should you call scheduler.step() for most epoch-based learning rate schedulers?

Before optimizer.step() in each batch

✗ Try again.

After each training epoch, once per epoch

✓ Correct! Well done.

Before each forward pass

✗ Try again.

Inside the loss.backward() call

✗ Try again.

Which PyTorch scheduler is most suitable when you want to reduce learning rate only if validation loss stops improving?

StepLR

✗ Try again.

CosineAnnealingLR

✗ Try again.

ReduceLROnPlateau

✓ Correct! Well done.

OneCycleLR

✗ Try again.

16. What are the most common built-in layers in torch.nn and what do they do?

PyTorch's torch.nn module provides all the standard building blocks for neural networks. Understanding what each layer does mathematically and when to use it is fundamental to building effective models.

Common nn layers
Layer	Formula / behaviour	Typical use
nn.Linear(in, out)	y = xW^T + b	Fully connected / dense layer
nn.Conv2d(in, out, k)	2D convolution with kernel k×k	Image feature extraction
nn.BatchNorm1d/2d	Normalise per feature/channel over batch	After linear/conv, before activation
nn.LayerNorm	Normalise over feature dim per sample	Transformers, NLP
nn.Dropout(p)	Zeros random fraction p during train	Regularisation
nn.Embedding(V,d)	Lookup table V vocab × d dim	Word/token embeddings
nn.ReLU/GELU/Tanh	Element-wise activations	After linear/conv layers
nn.Softmax(dim)	exp(x)/Σexp(x) along dim	Output probabilities (use LogSoftmax+NLLLoss or CrossEntropyLoss directly)
nn.MaxPool2d	Takes max over kernel window	Spatial downsampling in CNNs
nn.LSTM/GRU	Gated recurrent cells	Sequence modelling


import torch, torch.nn as nn

# Linear layer internals
fc = nn.Linear(4, 8)
print(fc.weight.shape)   # (8, 4) — note: output × input
print(fc.bias.shape)     # (8,)

# Embedding
emb = nn.Embedding(num_embeddings=10000, embedding_dim=128,
                   padding_idx=0)   # index 0 gets a zero vector
tokens = torch.tensor([1, 42, 7])   # shape (3,)
out = emb(tokens)                   # shape (3, 128)

# BatchNorm vs LayerNorm
bn  = nn.BatchNorm1d(64)   # input (N, 64) — normalises across N
ln  = nn.LayerNorm(64)     # input (N, 64) — normalises across 64 features

x = torch.randn(16, 64)
print(bn(x).shape)   # (16, 64)
print(ln(x).shape)   # (16, 64)

# Dropout only active during training
drop = nn.Dropout(p=0.5)
model = nn.Sequential(nn.Linear(32,32), drop, nn.ReLU())
model.train();  x_tr = model(torch.randn(4,32))  # Dropout zeroes ~50% of activations (rest scaled ×2)
model.eval();   x_ev = model(torch.randn(4,32))  # Dropout is a no-op; only ReLU zeroes values
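
The difference in normalisation axis between BatchNorm and LayerNorm can be confirmed by looking at which means come out as zero. A short sketch, assuming freshly constructed layers (training mode, default affine parameters) and an arbitrary shifted input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 64) * 3 + 5      # (batch, features), shifted and scaled

bn = nn.BatchNorm1d(64)              # new modules start in training mode
ln = nn.LayerNorm(64)

bn_out = bn(x)
ln_out = ln(x)

bn_col_mean = bn_out.mean(dim=0)     # mean of each feature across the batch
ln_row_mean = ln_out.mean(dim=1)     # mean of each sample across its features

print(bn_col_mean.abs().max().item())   # ~0: BatchNorm centres every feature column
print(ln_row_mean.abs().max().item())   # ~0: LayerNorm centres every sample row
```

Since LayerNorm never looks across the batch, it behaves identically for batch size 1 and needs no running statistics.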

Take quiz

What are the weight dimensions of nn.Linear(in_features=4, out_features=8)?

(4, 8)

✗ Try again.

(8, 4)

✓ Correct! Well done.

(4,)

✗ Try again.

(8,)

✗ Try again.

What is the key difference between BatchNorm and LayerNorm?

BatchNorm is faster; LayerNorm is more accurate

✗ Try again.

BatchNorm normalises across the batch dimension per feature; LayerNorm normalises across the feature dimension per sample — LayerNorm works with batch size 1 and is standard in Transformers

✓ Correct! Well done.

BatchNorm works on images; LayerNorm works on text

✗ Try again.

They are identical but have different API signatures

✗ Try again.

17. How do you initialise weights in a PyTorch model?

PyTorch uses sensible default initialisations (Kaiming uniform for Linear and Conv layers), but custom initialisation is often needed to match a paper or improve convergence. The torch.nn.init module provides all standard schemes.

import torch, torch.nn as nn

# Default initialisation:
# nn.Linear  → Kaiming uniform (He init) for weight, uniform for bias
# nn.Conv2d  → Kaiming uniform
# nn.Embedding → Normal(0, 1)

# Custom initialisation using apply()
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)   # Xavier/Glorot
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight,
                                mode="fan_out",
                                nonlinearity="relu")  # He init
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)  # GPT-style

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64,  10)
)
model.apply(init_weights)   # recursively applies to all sub-modules

# Direct initialisation with torch.no_grad()
with torch.no_grad():
    model[0].weight.fill_(0.01)
    model[0].bias.zero_()

Common init schemes
Scheme	API	Best for
Xavier / Glorot uniform	nn.init.xavier_uniform_()	Sigmoid / Tanh activations
Xavier / Glorot normal	nn.init.xavier_normal_()	Sigmoid / Tanh activations
Kaiming / He uniform	nn.init.kaiming_uniform_()	ReLU (PyTorch default)
Kaiming / He normal	nn.init.kaiming_normal_()	ReLU (normal-distributed variant; used e.g. in torchvision ResNets)
Normal	nn.init.normal_(mean, std)	Embeddings (std=0.02 GPT-style)
Zeros / Ones	nn.init.zeros_() / ones_()	Biases, gates
Orthogonal	nn.init.orthogonal_()	RNNs
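
The schemes in the table differ mainly in the variance they target. As a rough check, the sketch below fills a weight matrix with Kaiming normal and Xavier normal values and compares the empirical standard deviation with the textbook formulas √(2/fan_in) and √(2/(fan_in+fan_out)). The matrix size and seed are arbitrary assumptions.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_out, fan_in = 256, 512
w = torch.empty(fan_out, fan_in)     # same layout as nn.Linear.weight

# Kaiming / He normal: std = sqrt(2 / fan_in) for ReLU
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
kaiming_std = w.std().item()
kaiming_expected = math.sqrt(2.0 / fan_in)

# Xavier / Glorot normal: std = sqrt(2 / (fan_in + fan_out)) with gain 1
nn.init.xavier_normal_(w)
xavier_std = w.std().item()
xavier_expected = math.sqrt(2.0 / (fan_in + fan_out))

print(f"Kaiming: {kaiming_std:.4f} vs {kaiming_expected:.4f}")
print(f"Xavier:  {xavier_std:.4f} vs {xavier_expected:.4f}")
```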


Take quiz

What initialisation scheme does PyTorch use by default for nn.Linear weights?

Xavier uniform

✗ Try again.

Standard normal (mean=0, std=1)

✗ Try again.

Kaiming uniform (He initialisation)

✓ Correct! Well done.

All zeros

✗ Try again.

Why is Kaiming (He) initialisation preferred over Xavier when using ReLU activations?

Kaiming init produces smaller initial weights

✗ Try again.

He initialisation accounts for ReLU zeroing half its inputs on average — it uses a larger scale factor (√(2/fan_in)) to maintain activation variance across layers; Xavier assumes symmetric activations and underestimates the needed scale

✓ Correct! Well done.

Kaiming init converges faster regardless of activation function

✗ Try again.

Xavier init is not supported for convolutional layers

✗ Try again.

18. What loss functions does PyTorch provide and when do you use each?

Loss functions (criteria) measure the difference between predictions and targets. PyTorch provides them in torch.nn. Choosing the right one for your task is critical — using the wrong loss gives poor training signal even if the architecture is correct.

Common PyTorch loss functions
Loss	Class	Task	Target dtype
Cross-entropy	nn.CrossEntropyLoss	Multi-class classification	Long (class indices)
Binary cross-entropy + logits	nn.BCEWithLogitsLoss	Binary / multi-label	Float
MSE	nn.MSELoss	Regression	Float
MAE / L1	nn.L1Loss	Robust regression	Float
Huber / Smooth L1	nn.HuberLoss / nn.SmoothL1Loss	Robust regression	Float
NLL Loss	nn.NLLLoss	After log-softmax	Long
KL Divergence	nn.KLDivLoss	Distribution matching	Float
Triplet Margin	nn.TripletMarginLoss	Metric learning	Float


import torch, torch.nn as nn

# Multi-class classification: CrossEntropyLoss
# Input: (N, C) logits — raw, before softmax
# Target: (N,) class indices — dtype=long
ce = nn.CrossEntropyLoss()
logits  = torch.randn(4, 10)              # 4 samples, 10 classes
targets = torch.tensor([2, 5, 0, 9])      # true class indices
loss = ce(logits, targets)

# Binary classification: BCEWithLogitsLoss
# Numerically stable (fuses sigmoid + BCE)
bce = nn.BCEWithLogitsLoss()
preds = torch.randn(4)       # logits, NOT sigmoid output
true  = torch.tensor([1.,0.,1.,0.])
loss_b = bce(preds, true)

# Class weighting for imbalanced datasets
weights = torch.tensor([1.0]*9 + [10.0])   # class 9 is rare
ce_w = nn.CrossEntropyLoss(weight=weights)

# Label smoothing (reduces overconfidence)
ce_ls = nn.CrossEntropyLoss(label_smoothing=0.1)

# Regression: MSE vs Huber
mse_loss   = nn.MSELoss()
huber_loss = nn.HuberLoss(delta=1.0)   # L2 near 0, L1 for large errors
pred_r = torch.randn(4)
true_r = torch.randn(4)
print(mse_loss(pred_r, true_r))
print(huber_loss(pred_r, true_r))
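
The "L2 near 0, L1 for large errors" behaviour of Huber loss can be written out by hand. The sketch below uses two hand-picked errors (0.5 and 3.0, chosen only for illustration) with delta=1.0 and compares a manual piecewise formula with nn.HuberLoss.

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.5, 3.0])
target = torch.tensor([0.0, 0.0])
delta = 1.0

err = (pred - target).abs()
manual = torch.where(
    err <= delta,
    0.5 * err ** 2,                  # quadratic zone: small errors
    delta * (err - 0.5 * delta),     # linear zone: large errors
).mean()

builtin = nn.HuberLoss(delta=delta)(pred, target)
mse = nn.MSELoss()(pred, target)

print(manual.item(), builtin.item())   # manual and built-in agree
print(mse.item())                      # MSE is dominated by the large error
```

The large error contributes linearly to Huber loss but quadratically to MSE, which is what makes Huber less sensitive to outliers.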

Critical gotcha: nn.CrossEntropyLoss expects raw logits (before softmax), not probabilities. It internally applies log-softmax, so applying softmax first leads to double-softmax and incorrect training.

Take quiz

What dtype must the target tensor be for nn.CrossEntropyLoss?

torch.float32

✗ Try again.

torch.float64

✗ Try again.

torch.int32

✗ Try again.

torch.int64 (long)

✓ Correct! Well done.

Why is nn.BCEWithLogitsLoss preferred over applying sigmoid then nn.BCELoss?

BCEWithLogitsLoss is faster because it skips the sigmoid

✗ Try again.

BCEWithLogitsLoss fuses sigmoid and BCE into a single numerically stable computation using the log-sum-exp trick; applying sigmoid first and then BCELoss loses precision once the sigmoid saturates to exactly 0 or 1 (BCELoss has to clamp log(0) to stay finite)

✓ Correct! Well done.

BCELoss is deprecated in PyTorch

✗ Try again.

BCEWithLogitsLoss works with multi-class; BCELoss does not

✗ Try again.

19. What optimizers does PyTorch provide and how do you choose between them?

An optimizer updates model parameters based on computed gradients. PyTorch provides all major optimizers in torch.optim. Choosing the right optimizer and tuning its hyperparameters has a large impact on training speed and final performance.

Common PyTorch optimizers
Optimizer	Class	Key parameters	Best for
SGD	optim.SGD	lr, momentum, weight_decay, nesterov	Image classification (with momentum); can generalise better than Adam
SGD + Momentum	optim.SGD(momentum=0.9)	momentum=0.9 standard	Most vision tasks
Adam	optim.Adam	lr=1e-3, betas=(0.9,0.999), eps=1e-8	Default choice; fast convergence
AdamW	optim.AdamW	lr, weight_decay (decoupled)	Fine-tuning transformers; decoupled weight decay
RMSprop	optim.RMSprop	lr, alpha=0.99	RNNs
Adagrad	optim.Adagrad	lr	Sparse features, NLP


import torch, torch.nn as nn, torch.optim as optim

model = nn.Linear(10, 1)

# SGD with momentum (common for vision)
sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4,   # L2 regularisation
    nesterov=True,
)

# Adam (default for most tasks)
adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,      # NOTE: in Adam this is L2 added to the gradient (coupled); prefer AdamW for decay
)

# AdamW — decoupled weight decay (correct implementation)
adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01,   # decoupled from gradient update
)

# Per-layer learning rates (useful for fine-tuning)
optimizer = optim.AdamW([
    {"params": model.weight, "lr": 1e-4},   # lower lr for pretrained
    {"params": model.bias,   "lr": 1e-3},   # higher lr for new head
], weight_decay=0.01)

# Standard training step
optimizer.zero_grad()
loss = nn.MSELoss()(model(torch.randn(8,10)), torch.randn(8,1))
loss.backward()
optimizer.step()

Adam vs AdamW: In standard Adam, weight_decay is added to the gradient as an L2 penalty, so it is rescaled by the adaptive per-parameter learning rate, which distorts and usually weakens its effect. AdamW instead shrinks the parameters directly, separately from the gradient-based update. This decoupled weight decay is not equivalent to L2 regularisation under adaptive optimizers, and it is now the standard choice for transformer fine-tuning.
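
The coupling can be isolated with a toy case in which the loss gradient is exactly zero, so only weight decay moves the parameter. This is a sketch under that artificial assumption (single parameter w = 1.0, lr = 0.01, weight_decay = 0.1); the helper one_step is hypothetical and exists only for this comparison.

```python
import torch
import torch.optim as optim

def one_step(opt_cls):
    """Run a single optimizer step where the loss gradient is zero."""
    w = torch.tensor([1.0], requires_grad=True)
    opt = opt_cls([w], lr=0.01, weight_decay=0.1)
    (w * 0.0).sum().backward()      # gradient of the loss w.r.t. w is 0
    opt.step()
    return w.item()

w_adam = one_step(optim.Adam)       # decay enters the gradient, then gets normalised
w_adamw = one_step(optim.AdamW)     # decay applied directly: w *= (1 - lr * weight_decay)

print(w_adam, w_adamw)
```

In Adam the decay term is the only gradient, so the adaptive normalisation turns it into a step of roughly lr regardless of how small weight_decay is. In AdamW the shrinkage is proportional to lr × weight_decay, as intended.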

Take quiz

What is the key difference between Adam and AdamW?

AdamW is faster than Adam

✗ Try again.

AdamW decouples weight decay from the gradient update — in Adam, weight decay is incorrectly scaled by the adaptive learning rate; AdamW applies it directly to parameters

✓ Correct! Well done.

Adam uses a different momentum formula

✗ Try again.

AdamW is only available for PyTorch models, not custom modules

✗ Try again.

When might SGD with momentum outperform Adam for a vision model?

SGD always outperforms Adam

✗ Try again.

Never — Adam always converges to a better solution

✗ Try again.

SGD with careful tuning can find flatter minima that generalise better on i.i.d. image datasets — several papers show SGD beats Adam on CIFAR and ImageNet despite Adam's faster early convergence

✓ Correct! Well done.

SGD with momentum is only useful when training on CPU

✗ Try again.

20. What are learning rate schedulers in PyTorch and how do you use them?

A learning rate (LR) scheduler adjusts the learning rate during training. Starting with a high LR enables fast early progress; decaying it later allows finer convergence. PyTorch provides many schedulers in torch.optim.lr_scheduler.

Common LR schedulers
Scheduler	Behaviour	Use case
StepLR	Multiply lr by gamma every step_size epochs	Simple decay; quick experiments
MultiStepLR	Decay at specific epoch milestones	ResNet training schedules
CosineAnnealingLR	Cosine curve from lr to eta_min	Most modern training runs
OneCycleLR	Warmup to max_lr then cosine decay	Super-convergence; fast training
ReduceLROnPlateau	Reduce lr when metric stops improving	When training time is unknown
LinearLR	Linear warm-up	Transformer fine-tuning
CosineAnnealingWarmRestarts	Cosine + periodic restarts (SGDR)	Ensemble-style training


import torch, torch.nn as nn, torch.optim as optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# CosineAnnealingLR — most popular modern choice
scheduler_cos = lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# OneCycleLR — great for fast training
scheduler_1c = lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=100,   # batches per epoch
    epochs=10,
)

# ReduceLROnPlateau — metric-driven
scheduler_plat = lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5  # `verbose` is deprecated in recent PyTorch releases
)

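# Standard usage in training loop
# train_one_epoch() and validate() are placeholders for your own functions.
# The schedulers are shown side by side; in practice attach one per optimizer.
for epoch in range(100):
    train_one_epoch(model, optimizer)   # forward + backward + optimizer.step()

    # --- Epoch-based schedulers ---
    scheduler_cos.step()    # call AFTER the epoch's optimizer.step() calls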

    # --- Metric-based scheduler ---
    val_loss = validate(model)
    scheduler_plat.step(val_loss)

    # --- OneCycleLR is per-batch ---
    # for batch in dataloader:
    #     optimizer.step()
    #     scheduler_1c.step()

    print(f"lr: {optimizer.param_groups[0]['lr']:.6f}")

Key rule: call scheduler.step() after optimizer.step(). For OneCycleLR and other per-batch schedulers, call scheduler.step() inside the batch loop, not the epoch loop.
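
The per-batch rule for OneCycleLR can be illustrated by recording the learning rate at every step. The following is a minimal sketch with a deliberately tiny schedule (total_steps=10 and the model are assumptions for illustration), relying on the default OneCycleLR settings for warm-up fraction and annealing.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)
# SGD with momentum, since OneCycleLR also cycles momentum by default
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
total_steps = 10
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=total_steps
)

lrs = []
for step in range(total_steps):          # one iteration == one batch
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()                     # placeholder for a real training step
    scheduler.step()                     # per-batch scheduler: step every batch

peak_step = lrs.index(max(lrs))
print(peak_step)   # index of the batch with the highest lr
print(lrs)         # rises to max_lr, then decays below the starting lr
```

Stepping this scheduler once per epoch instead would leave it stuck in the warm-up phase for most of training, which is the usual symptom of putting the call in the wrong loop.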

Take quiz

When should scheduler.step() be called relative to optimizer.step()?

Before optimizer.step()

✗ Try again.

After optimizer.step()

✓ Correct! Well done.

Once at the start of training

✗ Try again.

Only when validation loss plateaus

✗ Try again.

Which scheduler is well-suited when you don't know how many epochs you'll train for?

StepLR

✗ Try again.

CosineAnnealingLR

✗ Try again.

ReduceLROnPlateau — it reduces lr automatically when a metric stops improving, adapting to actual training dynamics

✓ Correct! Well done.

OneCycleLR

✗ Try again.

21. What activation functions are commonly used in PyTorch and how do you choose between them?

Activation functions introduce non-linearity, allowing networks to model complex functions. PyTorch provides them as both nn.Module classes (for use in nn.Sequential) and functional calls in torch.nn.functional.

Common PyTorch activations
Activation	nn class	Range	Typical use
ReLU	nn.ReLU()	[0, ∞)	Default for hidden layers — fast, avoids vanishing gradient for x>0
LeakyReLU	nn.LeakyReLU(0.01)	(-∞, ∞)	Fixes ReLU's dying neuron problem
Sigmoid	nn.Sigmoid()	(0, 1)	Binary classification output layer
Tanh	nn.Tanh()	(-1, 1)	RNN hidden states (zero-centred)
Softmax	nn.Softmax(dim=-1)	(0,1), sums to 1	Multi-class probabilities at inference (for training, pass raw logits to CrossEntropyLoss, or use LogSoftmax with NLLLoss)
GELU	nn.GELU()	(-∞, ∞)	Transformers (BERT, GPT)


import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Module form — for use inside nn.Sequential / __init__
relu = nn.ReLU()
print(relu(x))   # tensor([0.0, 0.0, 0.0, 0.5, 2.0])

# Functional form — for use directly inside forward()
print(F.relu(x))
print(F.leaky_relu(x, negative_slope=0.01))
print(F.gelu(x))

# Using inside a model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # functional — common in forward()
        return torch.sigmoid(self.fc2(x))  # probability in (0, 1); pair with nn.BCELoss (return raw logits for BCEWithLogitsLoss)

# IMPORTANT: never apply softmax before CrossEntropyLoss
# CrossEntropyLoss = LogSoftmax + NLLLoss internally
logits = torch.randn(4, 10)             # raw scores, NOT softmaxed
loss_fn = nn.CrossEntropyLoss()
targets = torch.randint(0, 10, (4,))
loss = loss_fn(logits, targets)         # correct — pass raw logits!

Common mistake: applying Softmax before CrossEntropyLoss — the loss function already applies LogSoftmax internally, so double-softmaxing produces incorrect gradients and degraded training.
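
The "dying neuron" issue that LeakyReLU addresses comes down to gradients. As a small illustration (the input values -2.0 and 3.0 are arbitrary), autograd shows that ReLU passes no gradient for a negative input, while LeakyReLU passes a small one equal to negative_slope.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 3.0], requires_grad=True)

F.relu(x).sum().backward()
relu_grad = x.grad.clone()          # gradient of ReLU w.r.t. each input

x.grad = None                       # reset before the second backward pass
F.leaky_relu(x, negative_slope=0.01).sum().backward()
leaky_grad = x.grad.clone()         # gradient of LeakyReLU w.r.t. each input

print(relu_grad)    # zero for the negative input, one for the positive input
print(leaky_grad)   # negative_slope for the negative input, one for the positive
```

A unit whose pre-activation is negative for every training example receives no gradient under ReLU and cannot recover; the small slope in LeakyReLU keeps a learning signal alive.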

Take quiz

Why should you NOT apply nn.Softmax before passing logits to nn.CrossEntropyLoss?

Softmax is too computationally expensive

✗ Try again.

CrossEntropyLoss internally applies LogSoftmax + NLLLoss — applying Softmax beforehand double-applies the transformation, producing incorrect gradients

✓ Correct! Well done.

Softmax only works with binary classification

✗ Try again.

CrossEntropyLoss requires integer inputs, not probabilities

✗ Try again.

Which activation function commonly used in transformer models like BERT and GPT is smoother than ReLU?

Sigmoid

✗ Try again.

Tanh

✗ Try again.

GELU

✓ Correct! Well done.

Softmax

✗ Try again.

22. What loss functions does PyTorch provide and how do you choose the right one?

The loss function defines the training objective. PyTorch's torch.nn module provides loss classes for classification, regression, and more specialised tasks. Choosing the wrong loss for your task is one of the most common beginner mistakes.

Common PyTorch loss functions
Loss	Class	Input shape	Use case
MSELoss	nn.MSELoss()	pred & target same shape	Regression
L1Loss	nn.L1Loss()	pred & target same shape	Regression, robust to outliers
CrossEntropyLoss	nn.CrossEntropyLoss()	logits (N,C), target (N,) int64	Multi-class classification
BCELoss	nn.BCELoss()	probabilities (N,), target (N,) float	Binary classification (after sigmoid)
BCEWithLogitsLoss	nn.BCEWithLogitsLoss()	raw logits (N,), target (N,) float	Binary classification (numerically stable)
NLLLoss	nn.NLLLoss()	log-probabilities (N,C)	Used after LogSoftmax manually


import torch
import torch.nn as nn

# ── Regression: MSE
mse = nn.MSELoss()
pred = torch.tensor([2.5, 3.0, 4.1])
target = torch.tensor([3.0, 3.0, 4.0])
loss = mse(pred, target)   # mean((pred-target)^2)

# ── Multi-class classification: CrossEntropyLoss
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)         # batch=8, 5 classes — RAW logits
targets = torch.randint(0, 5, (8,))  # class indices, dtype long
loss = ce(logits, targets)

# ── Binary classification: BCEWithLogitsLoss (preferred over BCELoss)
bce = nn.BCEWithLogitsLoss()       # combines Sigmoid + BCE, numerically stable
logits_binary = torch.randn(8, 1)
targets_binary = torch.randint(0, 2, (8, 1)).float()
loss = bce(logits_binary, targets_binary)

# ── Class-weighted CrossEntropy for imbalanced data
class_weights = torch.tensor([1.0, 1.0, 5.0, 1.0, 1.0])  # upweight class 2
ce_weighted = nn.CrossEntropyLoss(weight=class_weights)

# ── Custom loss function
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma
        self.ce = nn.CrossEntropyLoss(reduction="none")

    def forward(self, logits, targets):
        ce_loss = self.ce(logits, targets)
        pt = torch.exp(-ce_loss)
        focal = ((1 - pt) ** self.gamma * ce_loss).mean()
        return focal
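
For the class-weighted variant, it is worth knowing how the weights enter the average. The sketch below (random logits, arbitrary seed) reproduces nn.CrossEntropyLoss(weight=...) by hand using reduction="none": with the default "mean" reduction the result is a weighted mean, divided by the sum of the weights of the target classes rather than by the batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
class_weights = torch.tensor([1.0, 1.0, 5.0, 1.0, 1.0])

builtin = nn.CrossEntropyLoss(weight=class_weights)(logits, targets)

# Manual version: unweighted per-sample losses, then a weighted mean
per_sample = nn.CrossEntropyLoss(reduction="none")(logits, targets)
w = class_weights[targets]                  # weight of each sample's true class
manual = (w * per_sample).sum() / w.sum()

print(torch.allclose(builtin, manual))
```

The same reduction="none" pattern is what the FocalLoss class relies on: compute per-sample losses first, then rescale them before averaging.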

Take quiz

Why is nn.BCEWithLogitsLoss preferred over manually applying nn.Sigmoid() followed by nn.BCELoss()?

BCEWithLogitsLoss is faster because it skips the sigmoid computation entirely

✗ Try again.

BCEWithLogitsLoss combines sigmoid and BCE in a numerically stable way using the log-sum-exp trick, avoiding overflow/underflow that can occur with separate sigmoid + BCE

✓ Correct! Well done.

BCELoss does not support batched inputs

✗ Try again.

BCEWithLogitsLoss is required for multi-class problems

✗ Try again.

What dtype and shape does CrossEntropyLoss expect for the target tensor in a 10-class classification problem with batch size 16?

Shape (16, 10), dtype float32 — one-hot encoded

✗ Try again.

Shape (16,), dtype int64 — class indices

✓ Correct! Well done.

Shape (16, 1), dtype float32

✗ Try again.

Shape (10,), dtype int64

✗ Try again.

23. What optimizers does PyTorch provide and what is the difference between SGD, Adam, and AdamW?

Optimizers update model parameters based on computed gradients. PyTorch's torch.optim module provides many algorithms; understanding their differences helps you choose the right one and tune hyperparameters effectively.

Common PyTorch optimizers
Optimizer	Key idea	Typical lr	Best for
SGD	Plain gradient descent, optional momentum	0.01–0.1	Image classification (with momentum + schedule)
SGD + momentum	Accumulates velocity to smooth updates	0.01–0.1	Often best final generalisation
Adam	Adaptive per-parameter learning rates + momentum	1e-3	Fast convergence, good default
AdamW	Adam with decoupled weight decay	1e-3 to 5e-5	Fine-tuning transformers, modern default
RMSprop	Adaptive lr based on recent gradient magnitude	1e-3	RNNs (historically popular)


import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# ── SGD with momentum
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,        # accelerates in consistent gradient directions
    weight_decay=1e-4,   # L2 regularisation
)

# ── Adam — adaptive learning rate per parameter
opt_adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),  # decay rates for the first- and second-moment estimates
    eps=1e-8,
)

# ── AdamW — decoupled weight decay (recommended for fine-tuning)
opt_adamw = optim.AdamW(
    model.parameters(),
    lr=2e-5,              # typical for fine-tuning pretrained models
    weight_decay=0.01,
)

# ── Standard training step
x, y = torch.randn(16, 10), torch.randn(16, 1)
loss_fn = nn.MSELoss()

opt_adamw.zero_grad()       # 1. clear old gradients
pred = model(x)              # 2. forward pass
loss = loss_fn(pred, y)      # 3. compute loss
loss.backward()              # 4. backpropagate
opt_adamw.step()             # 5. update parameters

# ── Different learning rates per parameter group
optimizer = optim.AdamW([
    {"params": model.weight, "lr": 1e-3},
    {"params": model.bias,   "lr": 1e-4},
])
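
The "decoupled weight decay" row in the table can be observed directly. The sketch below is a toy experiment, not a benchmark: it sets the loss gradient to zero so that weight decay is the only thing moving the weights. The helper name step_with_zero_grad is hypothetical.

```python
import torch

def step_with_zero_grad(optimizer_cls, lr=0.1, weight_decay=0.01):
    """Run one optimizer step on a toy parameter whose loss gradient is zero."""
    w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
    optimizer = optimizer_cls([w], lr=lr, weight_decay=weight_decay)
    w.grad = torch.zeros_like(w)      # pretend the loss contributes no gradient
    optimizer.step()
    return w.detach().clone()

after_adam = step_with_zero_grad(torch.optim.Adam)
after_adamw = step_with_zero_grad(torch.optim.AdamW)

print(after_adam)    # roughly [0.9, -1.9]: the decay term is rescaled into a full lr-sized step
print(after_adamw)   # roughly [0.999, -1.998]: each weight is multiplied by (1 - lr * weight_decay)
```

With Adam, the decay term is added to the gradient and then normalised by the adaptive denominator, so its effective strength no longer depends on weight_decay in a simple way. AdamW applies the shrinkage outside that normalisation.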

Take quiz

What is the key difference between Adam and AdamW?

AdamW uses a different momentum formula than Adam

✗ Try again.

AdamW decouples weight decay from the gradient-based adaptive update — Adam applies L2 regularisation through the gradient, which interacts poorly with its adaptive learning rates; AdamW applies decay directly to the weights

✓ Correct! Well done.

AdamW is always faster to converge than Adam

✗ Try again.

AdamW does not use momentum, only Adam does

✗ Try again.

What is the correct order of operations in a standard PyTorch training step?

forward → backward → zero_grad → step

✗ Try again.

zero_grad → forward → loss.backward() → optimizer.step()

✓ Correct! Well done.

optimizer.step() → zero_grad → forward → backward

✗ Try again.

backward → zero_grad → forward → step

✗ Try again.

24. What is the standard PyTorch training loop and what does each step do?

The PyTorch training loop follows a fixed five-step pattern repeated for every batch. Understanding exactly what each line does — and what happens if you skip or reorder a step — is essential for debugging training issues.

import torch
import torch.nn as nn
import torch.optim as optim

model     = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
loss_fn   = nn.CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    model.train()                       # 0. enables Dropout, BatchNorm train mode
    total_loss = 0.0

    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        optimizer.zero_grad()           # 1. clear gradients from previous step
        logits = model(X_batch)         # 2. forward pass
        loss   = loss_fn(logits, y_batch)  # 3. compute loss
        loss.backward()                 # 4. backpropagate — fills .grad
        optimizer.step()                # 5. update weights using gradients

        total_loss += loss.item() * X_batch.size(0)

    return total_loss / len(loader.dataset)

@torch.no_grad()                       # disable gradient tracking for eval
def validate(model, loader, loss_fn, device):
    model.eval()                        # disables Dropout, BatchNorm uses running stats
    total_loss, correct = 0.0, 0

    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        logits = model(X_batch)
        loss   = loss_fn(logits, y_batch)
        total_loss += loss.item() * X_batch.size(0)
        correct    += (logits.argmax(1) == y_batch).sum().item()

    return total_loss / len(loader.dataset), correct / len(loader.dataset)

# Full training loop
for epoch in range(10):
    train_loss        = train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    val_loss, val_acc  = validate(model, val_loader, loss_fn, device)
    print(f"Epoch {epoch}: train_loss={train_loss:.4f} val_loss={val_loss:.4f} val_acc={val_acc:.4f}")
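
The code above uses both model.eval() and torch.no_grad() for validation because they control different things. The following small sketch, with a made-up two-layer model, separates the two effects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(8, 4)

model.train()
train_a, train_b = model(x), model(x)     # a fresh random dropout mask on each call

model.eval()
eval_a, eval_b = model(x), model(x)       # dropout is off, so the outputs repeat

with torch.no_grad():
    no_grad_out = model(x)

print(torch.equal(train_a, train_b))      # almost surely False
print(torch.equal(eval_a, eval_b))        # True
print(eval_a.requires_grad)               # True: eval() alone does not stop autograd
print(no_grad_out.requires_grad)          # False
```

In short, eval() changes layer behaviour and no_grad() changes autograd bookkeeping; validation code normally wants both.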

Training loop steps
Step	Call	Purpose
0	model.train()	Enable Dropout, set BatchNorm to use batch statistics
1	optimizer.zero_grad()	Clear accumulated gradients from the previous step
2	model(x)	Forward pass — compute predictions
3	loss_fn(pred, target)	Compute scalar loss
4	loss.backward()	Backpropagate — populate .grad on each parameter
5	optimizer.step()	Update parameters using gradients and the optimizer's rule
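
Step 1 in the table exists because backward() adds to .grad instead of overwriting it. A minimal sketch on a single toy tensor shows the accumulation:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(3 * w).sum().backward()
after_first = w.grad.clone()      # tensor([3., 3.])

(3 * w).sum().backward()          # no reset in between
after_second = w.grad.clone()     # tensor([6., 6.]): the two gradients were summed

w.grad = None                     # same effect as clearing gradients with zero_grad()
(3 * w).sum().backward()
after_reset = w.grad.clone()      # tensor([3., 3.])

print(after_first, after_second, after_reset)
```

The same behaviour is what makes deliberate gradient accumulation over several small batches possible.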

Take quiz

What would happen if you forgot to call optimizer.zero_grad() before loss.backward() in every training iteration?

Training would fail with a RuntimeError

✗ Try again.

Gradients from previous batches would accumulate with the current batch's gradients, corrupting the parameter updates

✓ Correct! Well done.

The model would train faster since no memory is wasted clearing gradients

✗ Try again.

Nothing — zero_grad() is purely optional for correctness

✗ Try again.

What is the difference between model.train() and model.eval()?

train() enables gradient computation; eval() disables it entirely

✗ Try again.

train() puts Dropout and BatchNorm into training behaviour (dropout active, batch statistics used); eval() switches them to inference behaviour (dropout off, running statistics used) — neither affects gradient tracking directly

✓ Correct! Well done.

eval() is required to use the GPU; train() runs only on CPU

✗ Try again.

model.eval() deletes the optimizer's state

✗ Try again.

25. What are Dataset and DataLoader in PyTorch and how do they work together?

PyTorch's data pipeline follows a clean two-class design: Dataset defines how to access a single sample (index → data), and DataLoader wraps a Dataset to handle batching, shuffling, and parallel loading.

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class TabularDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        # Convert once at construction — not inside __getitem__!
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self) -> int:
        """Required — tells DataLoader how many samples exist."""
        return len(self.X)

    def __getitem__(self, idx: int):
        """Required — return a single (features, label) sample."""
        return self.X[idx], self.y[idx]

# Synthetic data
X = np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 3, size=1000)

dataset = TabularDataset(X, y)
print(len(dataset))          # 1000
print(dataset[0])            # (tensor of 20 features, tensor scalar label)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,            # shuffle each epoch — essential for training
    num_workers=4,           # parallel data loading subprocesses
    pin_memory=True,         # faster CPU→GPU transfer
    drop_last=True,          # drop incomplete final batch
)

# Iterate over batches
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)  # (32, 20) (32,)
    break

# torchvision pre-built datasets
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
mnist_loader = DataLoader(mnist, batch_size=64, shuffle=True)

Dataset and DataLoader responsibilities
Component	Responsibility	Required methods
Dataset	Defines how to access ONE sample by index	__len__, __getitem__
DataLoader	Batches samples, shuffles, parallelises loading	Wraps any Dataset object
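
As a quick check on how batch_size and drop_last interact, this sketch counts batches for the same 1000-sample shape used above. It uses TensorDataset, a built-in Dataset that wraps tensors, in place of the custom class.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 3, (1000,)))

keep_last = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=False)
drop_last = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=True)

print(len(keep_last))   # 32 batches: 31 full ones plus a final batch of 8
print(len(drop_last))   # 31 batches: the 8 leftover samples are skipped

last_batch_size = [xb.shape[0] for xb, _ in keep_last][-1]
print(last_batch_size)  # 8
```

Dropping the last batch is mostly useful in training, for example when BatchNorm would otherwise see a very small batch; for evaluation it would silently skip samples.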

Take quiz

What two methods must a custom PyTorch Dataset class implement?

__init__ and __call__

✗ Try again.

__len__ (number of samples) and __getitem__ (return one sample by index)

✓ Correct! Well done.

__iter__ and __next__

✗ Try again.

load() and fetch()

✗ Try again.

Why is shuffle=True important when creating a DataLoader for training (but typically False for validation)?

Shuffling speeds up data loading

✗ Try again.

Without shuffling, the model sees data in the same fixed order every epoch, which can cause it to learn spurious patterns related to data ordering rather than the underlying signal — shuffling each epoch prevents this; validation order doesn't affect learning so it's left False for reproducibility

✓ Correct! Well done.

Shuffling reduces GPU memory usage

✗ Try again.

DataLoader requires shuffle=True to support batching

✗ Try again.

26. How do you move tensors and models between CPU and GPU in PyTorch?

PyTorch's device abstraction allows the same code to run on CPU or GPU with minimal changes. The fundamental rule: a model and its input tensors must reside on the same device before any computation, or PyTorch raises a RuntimeError.

import torch
import torch.nn as nn

# Device-agnostic pattern — always write code this way
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move a model to the device
model = nn.Linear(10, 1).to(device)

# Move data to the same device, every batch, inside the loop
for X_batch, y_batch in loader:
    X_batch = X_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    pred = model(X_batch)   # works — both on same device

# WRONG — mismatched devices raises RuntimeError
# model_cpu = nn.Linear(10, 1)              # stays on CPU
# x_gpu = torch.randn(4, 10).to("cuda")
# model_cpu(x_gpu)  # RuntimeError: Expected all tensors on same device

# Checking tensor device
t = torch.randn(3)
print(t.device)            # cpu
t_gpu = t.cuda()           # or t.to("cuda:0")
print(t_gpu.device)        # cuda:0

# GPU memory diagnostics
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
    print(torch.cuda.max_memory_allocated() / 1e9, "GB peak")
    torch.cuda.empty_cache()   # release unused cached memory

# Moving a tensor back to CPU (required before .numpy())
result = t_gpu.cpu().numpy()   # numpy() requires a CPU tensor

# Apple Silicon (M1/M2/M3) GPU support
mps_device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

Device transfer methods
Method	Effect
tensor.to(device)	Moves to specified device — most flexible, recommended
tensor.cuda()	Shorthand for .to('cuda')
tensor.cpu()	Moves back to CPU (required before .numpy())
model.to(device)	Moves all model parameters and buffers
non_blocking=True	Allows async transfer when paired with pin_memory=True
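
Real batches are often dicts or tuples, not a single tensor. The sketch below shows one possible way to handle that; pick_device and to_device are hypothetical helper names, not PyTorch APIs.

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)   # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def to_device(batch, device):
    """Recursively move tensors inside dicts, lists, and tuples to `device`."""
    if isinstance(batch, torch.Tensor):
        return batch.to(device, non_blocking=True)
    if isinstance(batch, dict):
        return {k: to_device(v, device) for k, v in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(v, device) for v in batch)
    return batch   # leave non-tensor values (ints, strings) untouched

device = pick_device()
batch = {
    "features": torch.randn(4, 10),
    "labels": torch.tensor([0, 1, 1, 0]),
    "ids": ["a", "b", "c", "d"],
}
moved = to_device(batch, device)
print(moved["features"].device, moved["labels"].device)
```

Calling a helper like this once per batch keeps the training loop itself free of device-specific branches.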

Take quiz

What happens if you try to run a forward pass with a model on the GPU but input tensors still on the CPU?

PyTorch automatically moves the input tensors to match the model

✗ Try again.

PyTorch raises a RuntimeError because operations require all tensors to be on the same device

✓ Correct! Well done.

The computation runs on the CPU instead, ignoring the GPU model

✗ Try again.

PyTorch silently moves the model back to the CPU

✗ Try again.

Why must you call .cpu() on a tensor before calling .numpy() on it?

numpy() is only defined for CPU tensors: NumPy has no concept of GPU memory, so GPU tensors must be copied back to host memory first

✓ Correct! Well done.

cpu() converts the tensor's dtype to be NumPy-compatible

✗ Try again.

numpy() automatically calls cpu() internally, so this step is unnecessary

✗ Try again.

GPU tensors do not support indexing required by numpy()

✗ Try again.

27. What is the difference between model.parameters() and model.state_dict() in PyTorch?

Both expose a model's tensors, but they serve different purposes. parameters() returns an iterator over the nn.Parameter objects, which is what the optimizer consumes; state_dict() returns an OrderedDict mapping names to tensors, covering both parameters and registered buffers, and is used for saving, loading, and inspection.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
)

# ── parameters(): iterator of Parameter tensors (no names)
for p in model.parameters():
    print(p.shape, p.requires_grad)
# torch.Size([20, 10]) True
# torch.Size([20])     True
# torch.Size([1, 20])  True
# torch.Size([1])      True

# Used to construct optimizers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── named_parameters(): iterator of (name, Parameter) tuples
for name, p in model.named_parameters():
    print(name, p.shape)
# 0.weight torch.Size([20, 10])
# 0.bias   torch.Size([20])
# 2.weight torch.Size([1, 20])
# 2.bias   torch.Size([1])

# ── state_dict(): OrderedDict for save/load
sd = model.state_dict()
print(type(sd))           # <class 'collections.OrderedDict'>
print(sd.keys())          # odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])

# Saving and loading via state_dict (the recommended pattern)
torch.save(model.state_dict(), "model_weights.pt")

new_model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
new_model.load_state_dict(torch.load("model_weights.pt"))
new_model.eval()           # ALWAYS call after loading for inference

# Total parameter count
total_params = sum(p.numel() for p in model.parameters())
trainable    = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable}")
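
The Sequential model above has no buffers, so its state_dict() and named_parameters() list the same names. A layer with running statistics shows where the two diverge; this sketch uses BatchNorm1d purely as an illustration.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(4)

param_names = [name for name, _ in bn.named_parameters()]
buffer_names = [name for name, _ in bn.named_buffers()]
state_keys = list(bn.state_dict().keys())

print(param_names)    # ['weight', 'bias']
print(buffer_names)   # ['running_mean', 'running_var', 'num_batches_tracked']
print(state_keys)     # parameters first, then buffers
```

This is why checkpoints should be built from state_dict() and not from parameters(): the running statistics would otherwise be lost.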

Take quiz

What is the primary use of model.state_dict() compared to model.parameters()?

state_dict() is used to construct the optimizer; parameters() is used for saving the model

✗ Try again.

state_dict() returns a named, serialisable OrderedDict used for saving/loading model weights; parameters() returns an unnamed iterator used to construct the optimizer

✓ Correct! Well done.

They are functionally identical — state_dict() is just a renamed alias

✗ Try again.

state_dict() only includes biases; parameters() includes weights and biases

✗ Try again.

What should you always call after loading weights into a model with load_state_dict(), before running inference?

model.train()

✗ Try again.

model.eval() — to set Dropout and BatchNorm to inference mode

✓ Correct! Well done.

model.zero_grad()

✗ Try again.

model.reset_parameters()

✗ Try again.

28. How do you save and load PyTorch models correctly, including full training checkpoints?

PyTorch supports saving either the full model object or just its weights (state_dict). Saving only the state_dict is the recommended approach because it decouples weights from the Python class definition. A full training checkpoint includes the optimizer state too, so training can resume exactly where it left off.

import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── RECOMMENDED: save/load state_dict only
torch.save(model.state_dict(), "weights.pt")

model_new = nn.Linear(10, 5)        # must define the SAME architecture first
model_new.load_state_dict(torch.load("weights.pt"))
model_new.eval()                     # always call before inference

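# ── NOT recommended: save the entire model object
# Fragile: the pickle stores a reference to the class, so loading breaks
# if the class definition moves or changes
torch.save(model, "full_model.pt")
# weights_only=False unpickles arbitrary objects; use it only on trusted files
loaded_model = torch.load("full_model.pt", weights_only=False)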

# ── Full training checkpoint — for resuming training
def save_checkpoint(path, epoch, model, optimizer, best_val_loss):
    torch.save({
        "epoch":            epoch,
        "model_state":      model.state_dict(),
        "optimizer_state":  optimizer.state_dict(),  # Adam momentum buffers etc.
        "best_val_loss":    best_val_loss,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")  # always load to CPU first
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"], ckpt["best_val_loss"]

save_checkpoint("ckpt.pt", epoch=5, model=model, optimizer=optimizer, best_val_loss=0.42)
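# The optimizer must be built over the parameters of the model being restored
optimizer_new = torch.optim.Adam(model_new.parameters(), lr=1e-3)
epoch, best_loss = load_checkpoint("ckpt.pt", model_new, optimizer_new)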

# ── Loading on a different device than it was saved
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.load_state_dict(
    torch.load("weights.pt", map_location="cpu")  # works even if the file was saved from a GPU
)
model = model.to(device)  # then move to the desired device
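
A checkpoint routine is easy to get subtly wrong, so it can help to run a round trip once. The sketch below is one way to do that under simple assumptions (a tiny Linear model, a temporary directory, the same dictionary keys as above); it checks that both the weights and Adam's internal buffers survive.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Take one step so that Adam actually has state worth saving
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ckpt.pt")
    torch.save({
        "epoch": 3,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)
    ckpt = torch.load(path, map_location="cpu")

restored = nn.Linear(4, 2)
restored_opt = torch.optim.Adam(restored.parameters(), lr=1e-3)
restored.load_state_dict(ckpt["model_state"])
restored_opt.load_state_dict(ckpt["optimizer_state"])

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
moments_match = torch.equal(
    optimizer.state_dict()["state"][0]["exp_avg"],
    restored_opt.state_dict()["state"][0]["exp_avg"],
)
print(ckpt["epoch"], weights_match, moments_match)
```

Note that the restored optimizer is constructed over the restored model's parameters before its state is loaded.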

Take quiz

Why is saving model.state_dict() preferred over saving the entire model object with torch.save(model)?

state_dict files are always smaller in size

✗ Try again.

Saving the full model object pickles a reference to the Python class (its module path and name) rather than the code itself, so if the class is refactored or moved, loading later can fail; saving only the state_dict decouples weights from code structure

✓ Correct! Well done.

torch.save() cannot serialise nn.Module objects directly

✗ Try again.

Full model objects cannot be loaded onto a different device

✗ Try again.

Why does a full training checkpoint include the optimizer's state_dict, not just the model's?

The optimizer state contains the training data used in the last batch

✗ Try again.

Optimizers like Adam maintain per-parameter momentum and adaptive learning rate buffers — restoring these allows training to resume with the same convergence dynamics, rather than restarting the adaptive estimates from scratch

✓ Correct! Well done.

PyTorch requires the optimizer state to correctly load the model's state_dict

✗ Try again.

The optimizer state is needed to compute validation accuracy

✗ Try again.

29. What is overfitting and what regularization techniques does PyTorch support to address it?

Overfitting occurs when a model memorises the training data instead of learning generalisable patterns — visible as low training loss but high validation loss. PyTorch provides several built-in tools to combat overfitting.

PyTorch regularization techniques
Technique	How to apply	Effect
Dropout	nn.Dropout(p=0.5) layer	Randomly zeroes activations during training, preventing co-adaptation
Weight decay (L2)	optimizer weight_decay= parameter	Penalises large weights, encourages simpler models
Early stopping	Manual: track val_loss, stop when it plateaus	Prevents training past the point of generalisation
Data augmentation	torchvision.transforms	Increases effective dataset size and diversity
Batch Normalization	nn.BatchNorm1d/2d	Stabilises training; has a mild regularising side effect
Label smoothing	CrossEntropyLoss(label_smoothing=0.1)	Prevents overconfident predictions

import torch
import torch.nn as nn

class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1  = nn.Linear(784, 256)
        self.bn1  = nn.BatchNorm1d(256)
        self.drop = nn.Dropout(p=0.5)        # 50% dropout
        self.fc2  = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = self.drop(x)                      # active in train(), off in eval()
        return self.fc2(x)

model = RegularizedNet()

# Weight decay: AdamW applies decoupled decay, shrinking the weights directly on each step
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,    # penalise large weights
)

# Label smoothing — softens hard one-hot targets
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Early stopping pattern
best_val_loss = float("inf")
patience, patience_counter = 5, 0

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, _ = validate(model, val_loader, criterion, device)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model.pt")  # save best checkpoint
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break
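
The early stopping bookkeeping above can be wrapped in a small helper so the training loop stays short. The class below is a hypothetical sketch, not a PyTorch API, and the validation losses are made up to show when it triggers.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss and return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_losses = [1.00, 0.80, 0.90, 0.85, 0.70]   # made-up validation curve
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)   # 3 0.8
```

The example also shows the cost of a short patience: training stops at epoch 3, one epoch before the loss would have improved to 0.70.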

Take quiz

How does Dropout help prevent overfitting?

It removes the weakest neurons permanently from the network

✗ Try again.

It randomly zeroes a fraction of activations during training, preventing neurons from co-adapting and forcing the network to learn redundant, more robust representations

✓ Correct! Well done.

It reduces the learning rate during training

✗ Try again.

It adds noise directly to the input data only

✗ Try again.

What is the purpose of weight_decay in an optimizer like AdamW?

It decays the learning rate over time

✗ Try again.

It shrinks every weight toward zero by a small factor on each step (in AdamW this decay is applied directly to the weights rather than through the gradient), discouraging the model from relying on very large weights and encouraging simpler, more generalisable solutions

✓ Correct! Well done.

It removes weights below a certain threshold

✗ Try again.

It controls how quickly the optimizer's momentum buffer decays

✗ Try again.

30. What is the vanishing/exploding gradient problem and how do you detect and fix it in PyTorch?

During backpropagation, gradients are computed via repeated multiplication through the chain rule. In deep networks, this can cause gradients to shrink toward zero (vanishing) or grow toward infinity (exploding) as they propagate backward through many layers, preventing effective training.

import torch
import torch.nn as nn

model     = nn.LSTM(input_size=10, hidden_size=128, num_layers=3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 20, 10)
output, _ = model(x)
loss = output.sum()

optimizer.zero_grad()
loss.backward()

# ── Detect: monitor gradient norms
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.detach().norm(2).item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm: {total_norm:.4f}")
# Very small (~1e-6) → vanishing; very large (~1e3+) → exploding

# ── Fix 1: Gradient clipping — caps the gradient norm before the step
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# ── Fix 2: Better weight initialisation (He init for ReLU networks)
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

mlp = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
mlp.apply(init_weights)

# ── Fix 3: Batch Normalization — stabilises layer input distributions
class StableNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

# ── Fix 4: Residual / skip connections — gradient highway
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.net(x)   # gradient flows through x directly
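
The manual norm loop and the clipping call above can be combined, because clip_grad_norm_ returns the total norm it measured before clipping. The sketch below deliberately scales up a toy loss to produce large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)

# Inflate the loss on purpose so the gradients are far larger than max_norm
loss = (model(torch.randn(32, 10)) * 1000).pow(2).mean()
loss.backward()

pre_clip = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
post_clip = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"before clipping: {pre_clip.item():.1f}")
print(f"after clipping:  {post_clip.item():.4f}")   # at most 1.0
```

Logging the returned value every step gives a gradient-norm curve for free, which is often the quickest way to spot an instability.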

Take quiz

What does nn.utils.clip_grad_norm_() do and when should it be called?

It removes parameters with zero gradient; called before forward()

✗ Try again.

It rescales gradients so their total norm does not exceed a threshold, preventing exploding gradients — called after loss.backward() and before optimizer.step()

✓ Correct! Well done.

It clips the loss value to prevent NaN; called before backward()

✗ Try again.

It limits the learning rate; called once at the start of training

✗ Try again.

How do residual (skip) connections help mitigate vanishing gradients in very deep networks?

They reduce the total number of layers the gradient must pass through

✗ Try again.

They add a direct additive path from input to output of a block, so the gradient can flow through this identity shortcut largely unchanged, bypassing layers that might otherwise shrink it

✓ Correct! Well done.

They automatically increase the learning rate for deep layers

✗ Try again.

They replace backpropagation with a faster approximation

✗ Try again.

31. What is weight initialization in PyTorch and why does it matter?

How a network's weights are initialised at the start of training significantly affects whether training converges quickly, slowly, or not at all. PyTorch's default initialisation for Linear/Conv layers (a Kaiming-uniform variant that samples from U(-1/sqrt(fan_in), 1/sqrt(fan_in))) works well in most cases, but understanding the principles helps when debugging training issues.

import torch
import torch.nn as nn

# PyTorch default: Linear layers call kaiming_uniform_ with a=sqrt(5),
# which samples weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in))
layer = nn.Linear(256, 128)
print(layer.weight.std().item())   # approximately 1/sqrt(3*256) ≈ 0.036

# Explicit initialisation methods
def init_weights(m):
    if isinstance(m, nn.Linear):
        # Xavier/Glorot — good for Tanh/Sigmoid activations
        nn.init.xavier_uniform_(m.weight)

        # He/Kaiming: good for ReLU-family activations
        # (the built-in default is also Kaiming uniform, but with a smaller gain)
        # nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")

        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
model.apply(init_weights)   # applies init_weights to every sub-module

# Why initialisation matters: too small → vanishing activations
# too large → exploding activations, especially in deep nets
x = torch.randn(100, 784)
for layer in model:
    x = layer(x)
    if hasattr(layer, "weight"):
        print(f"{layer}: activation std={x.std().item():.4f}")
# With good init, std should stay roughly stable across layers

# Custom initialisation from scratch
with torch.no_grad():
    layer.weight.normal_(mean=0.0, std=0.02)  # common for transformer init
    layer.bias.zero_()

Initialization strategies
Method	Formula (roughly)	Best for
Xavier/Glorot	Var = 2/(fan_in+fan_out)	Tanh, Sigmoid activations
Kaiming/He (PyTorch's Linear default is a variant with Var = 1/(3·fan_in))	Var = 2/fan_in	ReLU, LeakyReLU activations
Zero init	All weights = 0	NEVER for weights (fails to break symmetry); OK for biases
Small normal (std≈0.02)	N(0, 0.02²)	Transformer architectures (BERT, GPT)
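
The effect of the weight variance on signal propagation can be checked empirically. The sketch below builds a 10-layer ReLU MLP twice, once with a deliberately tiny normal init and once with Kaiming normal, and compares the standard deviation of the final activations. The helper name, depth, and width are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

def final_activation_std(init_fn, depth: int = 10, width: int = 256) -> float:
    """Build a deep ReLU MLP with the given weight init and measure its output spread."""
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        init_fn(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.ReLU()]
    net = nn.Sequential(*layers)
    with torch.no_grad():
        return net(torch.randn(512, width)).std().item()

tiny = final_activation_std(lambda w: nn.init.normal_(w, std=0.01))
kaiming = final_activation_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))

print(f"std=0.01 init: {tiny:.3e}")     # collapses toward zero
print(f"Kaiming init:  {kaiming:.3f}")  # stays on the order of 1
```

With std=0.01 each layer shrinks the signal by roughly a factor of ten, so after ten layers almost nothing is left for the gradient to work with.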

Take quiz

Why should weights never be initialised to all zeros in a neural network?

Zero weights cause the model to output zero always

✗ Try again.

All neurons in a layer would compute identical gradients and update identically — they would never differentiate from each other, defeating the purpose of having multiple neurons (the symmetry problem)

✓ Correct! Well done.

Zero initialisation causes a division-by-zero error in backpropagation

✗ Try again.

PyTorch does not allow zero-initialised weight tensors

✗ Try again.

Why does Kaiming/He initialisation use variance 2/fan_in specifically, tuned for ReLU?

It is an arbitrary convention with no mathematical basis

✗ Try again.

ReLU zeroes out roughly half its inputs on average, halving the output variance — doubling the initial weight variance (using 2/fan_in instead of 1/fan_in) compensates for this so activations don't shrink layer by layer

✓ Correct! Well done.

It matches the number of training examples

✗ Try again.

It minimises the number of training epochs required

✗ Try again.

32. What is the difference between nn.Parameter and a regular tensor attribute in nn.Module?

nn.Parameter is a special tensor subclass that, when assigned as an attribute of an nn.Module, is automatically registered in the module's parameter list — meaning it appears in model.parameters(), gets moved by .to(device), and is saved in state_dict(). A plain tensor attribute does none of this.

import torch
import torch.nn as nn

class CustomLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()

        # nn.Parameter — automatically registered, tracked, trained
        self.weight = nn.Parameter(torch.randn(dim, dim))
        self.bias   = nn.Parameter(torch.zeros(dim))

        # Plain tensor — NOT registered, NOT trained, invisible to optimizer
        self.scale  = torch.tensor(2.0)  # WRONG if meant to be learnable!

        # register_buffer — for non-trainable state that SHOULD move with
        # the model and be saved (e.g. BatchNorm running mean/var)
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x):
        return x @ self.weight + self.bias

layer = CustomLayer(10)

# Check what appears in parameters()
for name, p in layer.named_parameters():
    print(name, p.shape)
# weight torch.Size([10, 10])
# bias   torch.Size([10])
# scale and running_mean do NOT appear here!

# Check state_dict — includes parameters AND buffers, but not plain tensors
print(layer.state_dict().keys())
# odict_keys(['weight', 'bias', 'running_mean'])

# .to(device) moves Parameters and registered buffers, but NOT plain tensor attrs
if torch.cuda.is_available():
    layer.to("cuda")
# layer.scale would STILL be on CPU — a common silent bug!

Attribute types in nn.Module
Attribute type	In parameters()?	In state_dict()?	Moved by .to(device)?	Trained by optimizer?
nn.Parameter	Yes	Yes	Yes	Yes
register_buffer tensor	No	Yes	Yes	No
Plain tensor attribute	No	No	No (silent bug risk!)	No
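
The "Moved by .to(device)?" column can be verified without a GPU, because dtype casts via .to() follow the same registration rules as device moves. The module below is a made-up example for that purpose.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))        # registered and trainable
        self.register_buffer("offset", torch.zeros(3))   # registered, not trainable
        self.plain = torch.zeros(3)                      # not registered at all

    def forward(self, x):
        return x * self.weight + self.offset

m = Scaler()
m.to(torch.float64)   # casts registered floating-point tensors only

print(m.weight.dtype)               # torch.float64
print(m.offset.dtype)               # torch.float64
print(m.plain.dtype)                # torch.float32: left behind
print(list(m.state_dict().keys()))  # ['weight', 'offset']
```

If a tensor attribute needs to follow the module but should not be trained, register_buffer is usually the right tool.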

Take quiz

What happens if you assign a plain torch.Tensor (not nn.Parameter) as an attribute of an nn.Module meant to be learnable?

PyTorch automatically converts it to nn.Parameter

✗ Try again.

The tensor is invisible to model.parameters() and the optimizer — it will never be updated during training, and .to(device) will not move it, causing potential device-mismatch bugs

✓ Correct! Well done.

Training raises an immediate error

✗ Try again.

The tensor is treated identically to nn.Parameter

✗ Try again.

When should you use register_buffer() instead of nn.Parameter?

When the tensor needs gradients computed for it

✗ Try again.

For non-trainable state that should still be moved with .to(device) and saved in state_dict() — like BatchNorm's running mean and variance

✓ Correct! Well done.

Buffers and Parameters are interchangeable

✗ Try again.

Only when working with convolutional layers

✗ Try again.

33. How do you implement and use learning rate schedulers in PyTorch?

A fixed learning rate throughout training is rarely optimal — too high late in training prevents fine convergence, while too low early on wastes time. PyTorch's torch.optim.lr_scheduler module adjusts the learning rate systematically as training progresses.

import torch
import torch.nn as nn
import torch.optim as optim

model     = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# ── StepLR: multiply lr by gamma every step_size epochs
scheduler_step = optim.lr_scheduler.StepLR(
    optimizer, step_size=10, gamma=0.1
)

# ── CosineAnnealingLR: smooth decay following a cosine curve
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# ── ReduceLROnPlateau: reduce lr when a metric stops improving
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

# ── OneCycleLR: warmup then decay — fast convergence ("super-convergence")
n_epochs, steps_per_epoch = 10, 100
scheduler_1cycle = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    total_steps=n_epochs * steps_per_epoch,
    pct_start=0.3,    # 30% of steps used for warmup
)

# ── Training loop with scheduler
# (train_one_epoch, validate, the loaders, loss_fn and device come from question 24;
#  attach only ONE scheduler to an optimizer in practice)
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    val_loss, _ = validate(model, val_loader, loss_fn, device)

    scheduler_cos.step()                # epoch-based scheduler: call once per epoch
    # scheduler_plateau.step(val_loss)  # metric-based alternative: pass the metric value

    current_lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: lr={current_lr:.6f}")

# Note: OneCycleLR and some schedulers are called PER BATCH, not per epoch
# for step in range(total_steps):
#     train_step(...)
#     scheduler_1cycle.step()  # called inside the batch loop

Common LR schedulers
Scheduler	Behaviour	Call frequency
StepLR	Multiply lr by gamma every N epochs	Per epoch
CosineAnnealingLR	Smooth cosine decay	Per epoch
ReduceLROnPlateau	Reduce lr when validation metric plateaus	Per epoch, after computing metric
OneCycleLR	Warmup then decay in one cycle	Per batch/step
LinearLR / warmup schedules	Linear ramp from low to target lr	Per step, common for transformers
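
To see the difference between a fixed schedule and a metric-driven one, it can help to record the learning rate without any real training. The sketch below uses a dummy parameter and a validation loss that never improves; each scheduler gets its own optimizer, and make_optimizer is a hypothetical helper.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR

def make_optimizer():
    """Fresh optimizer over a dummy parameter, so schedulers do not interfere."""
    param = torch.nn.Parameter(torch.zeros(1))
    return SGD([param], lr=0.1)

# StepLR: the trajectory is fixed in advance
opt = make_optimizer()
step_sched = StepLR(opt, step_size=2, gamma=0.1)
step_lrs = []
for epoch in range(6):
    step_lrs.append(opt.param_groups[0]["lr"])
    opt.step()            # a real loop would run the training batches here
    step_sched.step()
print(step_lrs)           # roughly [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]

# ReduceLROnPlateau: the trajectory depends on the metric passed to step()
opt = make_optimizer()
plateau_sched = ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=2)
for _ in range(10):
    opt.step()
    plateau_sched.step(1.0)   # a validation loss that never improves
plateau_lr = opt.param_groups[0]["lr"]
print(plateau_lr)             # lower than the initial 0.1
```

Plotting such a dry run before a long training job is a cheap way to confirm the schedule does what was intended.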

Take quiz

What is the key difference between ReduceLROnPlateau and most other PyTorch schedulers?

It does not require an optimizer to be passed in

✗ Try again.

It requires a metric value (like validation loss) to be passed to .step(), and only reduces the learning rate when that metric stops improving — most other schedulers follow a fixed, predetermined schedule

✓ Correct! Well done.

It cannot be used with Adam-based optimizers

✗ Try again.

It is the only scheduler compatible with multi-GPU training

✗ Try again.

Why is OneCycleLR typically stepped once per batch rather than once per epoch?

Stepping per batch is required by PyTorch's API for all schedulers

✗ Try again.

OneCycleLR's warmup and decay schedule is designed at the granularity of individual training steps, allowing finer control over the learning rate trajectory within and across epochs

✓ Correct! Well done.

Stepping per epoch would cause a RuntimeError

✗ Try again.

Per-batch stepping uses less GPU memory

✗ Try again.

34. How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?

Diagnosing a stuck or diverging training loop is one of the most valuable practical PyTorch skills. The shape of the loss curve and a few targeted checks usually reveal the root cause.

Common training failure modes
Symptom	Likely cause	Fix
Loss is NaN from step 1	Bad data (inf/NaN inputs) or an invalid operation such as log(0); if NaN appears only after a few updates, lr too high or exploding gradients	Check input data, add gradient clipping, lower lr
Loss never decreases	Vanishing gradients, lr too low, forgot optimizer.step()	Check gradient norms, raise lr, verify training loop order
Loss decreases then plateaus high	Model too small, lr too high for fine convergence	Increase capacity, add lr scheduler
Train loss low, val loss high	Overfitting	Add dropout, weight decay, more data, early stopping
Loss oscillates wildly	lr too high, batch size too small	Lower lr, increase batch size, use lr warmup

import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# `loader` is assumed to be an existing DataLoader yielding (X, y) batches
for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)

    # ── Check 1: is the loss finite?
    if not torch.isfinite(loss):
        print(f"Step {step}: non-finite loss = {loss.item()}")
        print("Input contains NaN:", torch.isnan(X).any().item())
        print("Input contains Inf:", torch.isinf(X).any().item())
        break

    loss.backward()

    # ── Check 2: gradient norms — are gradients flowing at all?
    total_norm = sum(
        p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
    ) ** 0.5
    if step % 50 == 0:
        print(f"Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}")

    # ── Check 3: are any gradients None? (means that param was unused!)
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"WARNING: {name} has no gradient — is it used in forward()?")

    optimizer.step()

# ── Check 4: verify model output shape and range make sense
with torch.no_grad():
    sample_out = model(X[:1])
    print("Output range:", sample_out.min().item(), sample_out.max().item())

# ── Check 5: overfit a tiny batch — sanity check the architecture
# If the model cannot drive loss near zero on 5 examples, there is a bug
tiny_X, tiny_y = X[:5], y[:5]
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(tiny_X), tiny_y)
    loss.backward()
    optimizer.step()
print(f"Tiny-batch overfit loss: {loss.item():.6f}")  # should approach 0
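
To make Checks 2 and 3 concrete, here is a small self-contained sketch. The `Net` class (with a layer that is deliberately never called) and the `grad_report` helper are hypothetical names invented for this illustration; the idea is only to show what an unused parameter looks like after `backward()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(10, 5)
        self.unused = nn.Linear(10, 5)   # deliberately never called in forward()

    def forward(self, x):
        return self.used(x)

def grad_report(model):
    """Return (per-parameter gradient norms, names of parameters without a gradient)."""
    norms, missing = {}, []
    for name, p in model.named_parameters():
        if p.grad is None:
            missing.append(name)
        else:
            norms[name] = p.grad.norm().item()
    return norms, missing

model = Net()
X, y = torch.randn(8, 10), torch.randint(0, 5, (8,))
nn.CrossEntropyLoss()(model(X), y).backward()

norms, missing = grad_report(model)
print("gradient norms:", norms)
print("no gradient:", missing)   # the two parameters of the unused layer
```

Norms that are zero or vanishingly small point towards vanishing gradients, while names in the second list usually mean a layer is not wired into `forward()`.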

Take quiz

If a training loss is NaN starting from the very first step, what should you check first?

Whether the model has enough layers

✗ Try again.

Whether the input data itself contains NaN or Inf values, and whether the learning rate is too high — both are the most common immediate causes of NaN loss

✓ Correct! Well done.

Whether the batch size is too large

✗ Try again.

Whether the validation set is correctly split

✗ Try again.

What does the 'overfit a tiny batch' sanity check (training on 5 examples until loss ≈ 0) verify?

That the model will generalise well to the full dataset

✗ Try again.

That the model architecture, loss function, and training loop are wired correctly — if a model cannot memorise even 5 examples, there's a bug somewhere in the implementation, not the data or hyperparameters

✓ Correct! Well done.

That the learning rate is optimal for the full dataset

✗ Try again.

That the model is not overfitting

✗ Try again.

35. What is the difference between torch.tensor() and torch.Tensor() (capital T) for creating tensors?

This is a subtle but important PyTorch gotcha. torch.tensor() (lowercase, a function) infers dtype from the input data and copies it, which makes it the recommended way to create tensors from data. torch.Tensor() (uppercase, a class constructor) always produces a tensor of the default dtype (float32 unless changed with torch.set_default_dtype) and behaves inconsistently depending on the argument type.

import torch

# ── torch.tensor() — RECOMMENDED, infers dtype, copies data
a = torch.tensor([1, 2, 3])
print(a.dtype)   # torch.int64 — inferred from Python ints

b = torch.tensor([1.0, 2.0, 3.0])
print(b.dtype)   # torch.float32 — inferred from Python floats

c = torch.tensor([1, 2, 3], dtype=torch.float32)  # explicit override
print(c.dtype)   # torch.float32

# ── torch.Tensor() — confusing, AVOID for creating tensors from data
d = torch.Tensor([1, 2, 3])
print(d.dtype)   # torch.float32: uses the default dtype, ignores the int input!

e = torch.Tensor(3, 4)   # interprets ints as a SHAPE, not data!
print(e.shape)    # torch.Size([3, 4]) — uninitialised memory, random values

# Common gotcha: these look similar but behave VERY differently
f1 = torch.tensor(3)      # scalar tensor with value 3
f2 = torch.Tensor(3)      # tensor of SHAPE (3,) with garbage/uninitialised values!
print(f1)   # tensor(3)
print(f2)   # tensor([4.6e-41, 0.0, 1.4e-45])  — random uninitialised memory!

# Recommended explicit constructors for empty/typed tensors:
g = torch.empty(3, 4)               # uninitialised, explicit intent
h = torch.zeros(3, 4, dtype=torch.float32)
i = torch.ones(3, 4, dtype=torch.int64)

Rule of thumb: always use lowercase torch.tensor() when creating a tensor from existing data (a list, NumPy array, or scalar). Use torch.zeros(), torch.ones(), torch.empty(), or torch.rand() when you want a new tensor of a given shape. Avoid torch.Tensor() entirely in new code.
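
One consequence of the "copies it" behaviour is worth seeing directly. The sketch below contrasts torch.tensor() with torch.from_numpy(), which shares memory with the source array instead of copying; the variable names are made up for the example.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

copied = torch.tensor(arr)        # independent copy of the data
shared = torch.from_numpy(arr)    # shares memory with the NumPy array

arr[0] = 100.0                    # mutate the source array in place

print(copied)   # first element is still 1.0
print(shared)   # first element is now 100.0
```

The copy is usually the safer default; the shared variant can be useful when avoiding an extra allocation matters and the aliasing is understood.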

Take quiz

What is the critical difference between torch.tensor(3) and torch.Tensor(3)?

They produce identical scalar tensors with value 3

✗ Try again.

torch.tensor(3) creates a scalar tensor with value 3; torch.Tensor(3) interprets 3 as a SHAPE argument, creating a 1-D tensor of length 3 filled with uninitialised (garbage) memory

✓ Correct! Well done.

torch.Tensor(3) is faster because it skips data validation

✗ Try again.

torch.tensor(3) only works for floats, not integers

✗ Try again.

Which function should you use to create a PyTorch tensor from an existing Python list or NumPy array?

torch.Tensor() (uppercase)

✗ Try again.

torch.tensor() (lowercase) — it correctly infers dtype from the data and is the documented, recommended approach

✓ Correct! Well done.

Either works identically — it is purely a style preference

✗ Try again.

torch.from_list()

✗ Try again.

36. How does gradient accumulation work in PyTorch and when would you use it?

Gradient accumulation simulates a larger effective batch size than fits in GPU memory by summing gradients over several smaller forward/backward passes before calling optimizer.step(). This is useful when training large models on limited GPU memory.

import torch
import torch.nn as nn

model     = nn.Linear(100, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Simulate effective batch_size=128 using micro_batch=32 (4 accumulation steps)
accumulation_steps = 4

optimizer.zero_grad()
for step, (X_micro, y_micro) in enumerate(loader):  # loader yields micro-batches
    logits = model(X_micro)
    loss = criterion(logits, y_micro)

    # CRITICAL: scale loss by 1/accumulation_steps before backward
    # so the accumulated gradient matches what a single large-batch
    # backward pass would have produced
    loss = loss / accumulation_steps
    loss.backward()           # gradients ACCUMULATE (not cleared)

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()                          # update only every N micro-batches
        optimizer.zero_grad(set_to_none=True)     # clear for next accumulation cycle

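# Effective batch size = micro_batch_size * accumulation_steps
# Total compute per update is about the same as one large batch; the work is
# split into smaller passes, which lowers peak memory usage
# If len(loader) is not a multiple of accumulation_steps, the last few
# micro-batches never trigger optimizer.step() unless handled separately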

Gradient accumulation trade-offs
Aspect	Effect
GPU memory	Stays at micro-batch level — much lower peak usage
Wall-clock time	Slightly slower than one large batch (more Python overhead)
Effective batch size	micro_batch_size × accumulation_steps
BatchNorm caveat	Statistics computed per micro-batch, not the full effective batch — can behave differently than true large-batch training
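
The claim that scaling the loss by 1/accumulation_steps reproduces the large-batch gradient can be checked numerically. The following is a minimal sketch on random data, assuming a mean-reduced loss and equally sized micro-batches; the tensors and sizes are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 4)
criterion = nn.CrossEntropyLoss()          # default reduction="mean"
X = torch.randn(128, 20)
y = torch.randint(0, 4, (128,))

# Reference: one backward pass over the full batch of 128
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 32, each loss divided by 4
accumulation_steps = 4
model.zero_grad()
for X_micro, y_micro in zip(X.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(X_micro), y_micro) / accumulation_steps
    loss.backward()                        # adds into .grad
accum_grad = model.weight.grad.clone()

max_diff = (full_grad - accum_grad).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two gradients should agree up to floating-point rounding. With unequal micro-batch sizes or layers such as BatchNorm, the match is no longer exact.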

Take quiz

Why must the loss be divided by accumulation_steps before calling backward() in gradient accumulation?

To prevent the loss from becoming NaN

✗ Try again.

Because gradients accumulate (sum) across the micro-batches; dividing the loss ensures the accumulated gradient matches the average gradient a single large-batch backward pass would have produced, rather than being accumulation_steps times too large

✓ Correct! Well done.

To make the loss value easier to read in logs

✗ Try again.

Division is not actually required — it's only a convention

✗ Try again.

What is the main motivation for using gradient accumulation in PyTorch?

It speeds up training compared to a single large batch

✗ Try again.

It allows simulating a large effective batch size on hardware that cannot fit that batch size in GPU memory at once, by accumulating gradients over several smaller forward/backward passes before stepping the optimizer

✓ Correct! Well done.

It eliminates the need for a learning rate scheduler

✗ Try again.

It is required when using BatchNorm layers

✗ Try again.

37. What is mixed precision training in PyTorch and how do you implement it with torch.cuda.amp?

Mixed precision training runs most operations in FP16 (or BF16) for speed while keeping a master copy of weights in FP32 for numerical stability. Modern GPUs (Volta and later) have dedicated hardware (Tensor Cores) that make FP16 matrix multiplication significantly faster than FP32.

import torch
import torch.nn as nn
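from torch import autocast              # torch.cuda.amp.autocast does not accept device_type
from torch.cuda.amp import GradScaler   # newer releases also offer torch.amp.GradScaler("cuda")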

model     = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler    = GradScaler()   # manages loss scaling to prevent FP16 underflow

x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()

for step in range(100):
    optimizer.zero_grad()

    # autocast: automatically runs eligible ops in FP16/BF16
    with autocast(device_type="cuda", dtype=torch.float16):
        y_hat = model(x)                  # matmul runs in FP16 — faster!
        loss  = nn.MSELoss()(y_hat, y)

    # Loss scaling: inflate loss before backward to prevent small
    # gradients from underflowing to zero in FP16's limited range
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)            # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                # skips the step if grads are inf/NaN
    scaler.update()                       # adjusts scale factor for next iteration

# BFloat16: no GradScaler needed (same exponent range as FP32)
with autocast(device_type="cuda", dtype=torch.bfloat16):
    y_hat = model(x)   # no underflow risk — scaling unnecessary
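
The underflow that GradScaler guards against can be reproduced on CPU with plain dtype conversions. This is only a sketch of the numeric effect, not of what GradScaler does internally; the value `1e-8` is a hypothetical small gradient, and `65536.0` is used as an illustrative scale factor.

```python
import torch

tiny_grad = 1e-8                                     # hypothetical small gradient value

as_fp16 = torch.tensor(tiny_grad, dtype=torch.float16)
as_bf16 = torch.tensor(tiny_grad, dtype=torch.bfloat16)
print(as_fp16.item())          # 0.0: below FP16's smallest representable magnitude
print(as_bf16.item() > 0)      # True: BF16 shares FP32's exponent range

# Loss scaling in miniature: scale up, store in FP16, unscale in FP32
scale = 65536.0
scaled_fp16 = torch.tensor(tiny_grad * scale, dtype=torch.float16)
recovered = scaled_fp16.float() / scale
print(recovered.item())        # close to 1e-8 again

# The trade-off: BF16 keeps fewer mantissa bits than FP16
bf16_one = torch.tensor(1.001, dtype=torch.bfloat16).item()
fp16_one = torch.tensor(1.001, dtype=torch.float16).item()
print(bf16_one)                # 1.0: the small increment is rounded away
print(fp16_one)                # slightly above 1.0
```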

Take quiz

What problem does GradScaler solve in FP16 mixed precision training?

It increases the model's overall accuracy

✗ Try again.

FP16's limited dynamic range can cause small gradient values to underflow to zero; GradScaler inflates the loss before backward (pushing gradients into FP16's representable range) and then unscales them before the optimizer step

✓ Correct! Well done.

It prevents the loss from exceeding a maximum value

✗ Try again.

It automatically converts all model weights to FP16 permanently

✗ Try again.

Why doesn't BFloat16 require a GradScaler while Float16 does?

BFloat16 is always more numerically precise than Float16

✗ Try again.

BFloat16 has the same 8-bit exponent range as Float32, giving it the same dynamic range and immunity to the underflow problem that affects Float16 — it sacrifices mantissa precision instead

✓ Correct! Well done.

BFloat16 is only used during inference, never training

✗ Try again.

GradScaler only works with NVIDIA GPUs, and BFloat16 is for other hardware

✗ Try again.

38. What is torch.compile() and how does it speed up PyTorch model execution?

Introduced in PyTorch 2.0, torch.compile() performs just-in-time compilation of a model. Instead of executing each tensor operation eagerly (PyTorch's default), it captures the computation graph, fuses operations, and generates optimised kernels — primarily reducing GPU memory round-trips.

import torch
import torch.nn as nn
import time

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.GELU(),
    nn.Linear(1024, 512),  nn.GELU(),
    nn.Linear(512, 10),
).cuda()

# Compile the model — wraps it, does NOT change the API
compiled_model = torch.compile(model)

x = torch.randn(256, 1024).cuda()

# First call triggers compilation (slow — may take 10-60 seconds)
out = compiled_model(x)

# Subsequent calls use the compiled, optimised kernels (fast)
for _ in range(5):
    out = compiled_model(x)

# Compilation modes — trade compile time for runtime speed
model_default = torch.compile(model)                              # balanced
model_reduce  = torch.compile(model, mode="reduce-overhead")      # less Python overhead
model_max     = torch.compile(model, mode="max-autotune")         # slowest compile, fastest run

# Benchmark comparison
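def benchmark(fn, x, n=100):
    for _ in range(5):                  # warmup
        fn(x)
    torch.cuda.synchronize()            # wait for queued GPU work before timing
    start = time.perf_counter()         # monotonic, higher resolution than time.time()
    for _ in range(n):
        fn(x)
    torch.cuda.synchronize()
    return time.perf_counter() - start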

eager_time    = benchmark(model, x)
compiled_time = benchmark(compiled_model, x)
print(f"Eager: {eager_time:.3f}s, Compiled: {compiled_time:.3f}s")
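
Because compilation is lazy and can fail on some platforms, one possible defensive pattern is to trigger it once on a sample input and fall back to eager mode on error. The helper name `compile_or_fallback` is hypothetical, and `backend="eager"` is chosen here only so the sketch runs quickly without generating kernels.

```python
import torch
import torch.nn as nn

def compile_or_fallback(model, example_input, **compile_kwargs):
    """Hypothetical helper: return a compiled model, or the eager one if compilation fails."""
    if not hasattr(torch, "compile"):          # PyTorch < 2.0
        return model
    try:
        compiled = torch.compile(model, **compile_kwargs)
        with torch.no_grad():
            compiled(example_input)            # compilation is lazy: force it now
        return compiled
    except Exception as exc:                   # e.g. unsupported platform or missing compiler
        print(f"torch.compile unavailable ({type(exc).__name__}), using eager mode")
        return model

model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 4))
x = torch.randn(8, 16)

# backend="eager" skips code generation, which keeps this sketch quick to run;
# drop the argument to use the default backend
fast_model = compile_or_fallback(model, x, backend="eager")

with torch.no_grad():
    same = torch.allclose(fast_model(x), model(x), atol=1e-6)
print("outputs match:", same)
```

Comparing compiled and eager outputs on a sample input, as in the last lines, is a cheap sanity check before benchmarking.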

Take quiz

What is the primary technique torch.compile() uses to speed up model execution?

It automatically converts the model weights to FP16

✗ Try again.

Kernel fusion — combining multiple sequential operations (e.g. matmul + bias + activation) into a single GPU kernel, reducing the number of times data must be read from and written to GPU memory

✓ Correct! Well done.

It distributes computation across multiple GPUs automatically

✗ Try again.

It converts the Python model into compiled C++ that runs entirely on CPU

✗ Try again.

Why does the first call to a torch.compile()'d model take significantly longer than subsequent calls?

PyTorch downloads additional dependencies on first use

✗ Try again.

The first call triggers the actual graph capture, optimisation, and kernel compilation — a one-time cost; subsequent calls reuse the compiled, optimised kernels directly

✓ Correct! Well done.

The first call validates the model architecture for correctness

✗ Try again.

GPU memory must be allocated fresh on the first call only

✗ Try again.

39. What is the difference between batch size, epoch, and iteration in PyTorch training?

These three terms are fundamental to understanding any training loop, and confusing them is a common source of bugs when computing metrics or setting up learning rate schedules.

Training terminology
Term	Definition	Example
Batch size	Number of samples processed together in one forward/backward pass	32
Iteration (step)	One forward + backward + optimizer.step() call — processes one batch	1 step = 1 batch processed
Epoch	One complete pass through the entire training dataset	1 epoch = ceil(dataset_size / batch_size) iterations (floor if drop_last=True)

import torch
from torch.utils.data import DataLoader, TensorDataset

# Example: 1000 training samples, batch size 32
X = torch.randn(1000, 20)
y = torch.randint(0, 5, (1000,))
dataset = TensorDataset(X, y)
loader  = DataLoader(dataset, batch_size=32, shuffle=True)

iterations_per_epoch = len(loader)   # = ceil(1000 / 32) = 32
print(f"Iterations per epoch: {iterations_per_epoch}")

n_epochs = 10
total_iterations = n_epochs * iterations_per_epoch
print(f"Total training iterations: {total_iterations}")  # 320

global_step = 0
for epoch in range(n_epochs):
    for batch_idx, (X_batch, y_batch) in enumerate(loader):
        # This inner loop body executes once PER ITERATION
        # X_batch.shape[0] == batch_size (32, except possibly the last batch)
        global_step += 1
        if global_step % 10 == 0:
            print(f"Epoch {epoch}, iteration {batch_idx}, global step {global_step}")

    print(f"--- Completed epoch {epoch} ---")  # runs once PER EPOCH

# Common pitfall: confusing scheduler.step() granularity
# Some schedulers (StepLR) expect ONE call per epoch
# Others (OneCycleLR) expect ONE call per iteration/step
# Mixing these up silently breaks the intended learning rate schedule

Take quiz

If a dataset has 10,000 samples and the batch size is 50, how many iterations occur in one epoch?

50

✗ Try again.

10,000

✗ Try again.

200 (10,000 / 50)

✓ Correct! Well done.

It depends on the number of epochs

✗ Try again.

Why is it important to know whether a learning rate scheduler should be stepped per epoch or per iteration?

It only affects logging output, not actual training

✗ Try again.

Stepping a scheduler at the wrong granularity silently breaks the intended learning rate trajectory; e.g. calling a OneCycleLR-style scheduler once per epoch instead of once per batch would make the learning rate change far too slowly

✓ Correct! Well done.

PyTorch raises an error if you step at the wrong granularity

✗ Try again.

It only matters when using multiple GPUs

✗ Try again.

40. How do you compute and track evaluation metrics like accuracy during PyTorch training?

Tracking metrics correctly requires accumulating values across all batches (not just averaging per-batch metrics naively, which can be biased if the last batch has a different size) and ensuring computations happen without gradient tracking.

import torch
import torch.nn as nn

@torch.no_grad()   # disable gradient tracking for the entire evaluation function
def evaluate(model, loader, criterion, device):
    model.eval()                       # disable dropout, use BN running stats

    total_loss    = 0.0
    total_correct = 0
    total_samples = 0

    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        batch_size = X_batch.size(0)

        logits = model(X_batch)
        loss   = criterion(logits, y_batch)

        # Weight by batch_size — correct even if the last batch is smaller
        total_loss += loss.item() * batch_size

        preds = logits.argmax(dim=1)
        total_correct += (preds == y_batch).sum().item()
        total_samples += batch_size

    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples
    return avg_loss, accuracy

# WRONG pattern — naively averaging per-batch averages
# is biased if batch sizes are unequal (e.g. last batch is smaller)
def evaluate_wrong(model, loader, criterion):
    losses = []
    for X_batch, y_batch in loader:
        loss = criterion(model(X_batch), y_batch)
        losses.append(loss.item())   # all batches weighted EQUALLY — wrong!
    return sum(losses) / len(losses)  # biased if last batch has fewer samples

# Using torchmetrics for more complex metrics (F1, precision, AUROC)
# pip install torchmetrics
from torchmetrics import Accuracy, F1Score

# Assumes `model`, `loader`, and `device` are already defined
acc_metric = Accuracy(task="multiclass", num_classes=5).to(device)
f1_metric  = F1Score(task="multiclass", num_classes=5, average="macro").to(device)

model.eval()
with torch.no_grad():
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        preds = model(X_batch).argmax(dim=1)
        acc_metric.update(preds, y_batch)   # accumulates internally across batches
        f1_metric.update(preds, y_batch)

print(f"Accuracy: {acc_metric.compute():.4f}")  # final correct aggregate
print(f"F1: {f1_metric.compute():.4f}")
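
The size of the bias from naive averaging is easy to see on a tiny made-up example. The per-sample losses below are hypothetical values chosen so that the short final batch has a much higher loss than the others.

```python
import torch

# Hypothetical per-sample losses for 10 samples; batch_size=4 gives batches of 4, 4, 2
per_sample_loss = torch.tensor([1., 1., 1., 1., 1., 1., 1., 1., 4., 4.])
batches = per_sample_loss.split(4)

# Naive: every batch mean counts equally, whatever its size
naive = sum(b.mean().item() for b in batches) / len(batches)

# Weighted: each batch mean is multiplied by the number of samples it covers
weighted = sum(b.mean().item() * len(b) for b in batches) / len(per_sample_loss)

print(naive)                          # 2.0
print(weighted)                       # 1.6
print(per_sample_loss.mean().item())  # about 1.6, matches the weighted result
```

Here the naive average overstates the true mean loss by 25% because the two-sample batch is given the same weight as the four-sample ones.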

Take quiz

Why is naively averaging per-batch loss values across an epoch potentially biased?

Loss values are always biased regardless of batching

✗ Try again.

If the last batch has fewer samples than the others (a common occurrence), simple averaging weights it equally with full batches, skewing the overall average — weighting each batch's loss by its actual sample count gives the correct epoch-level average

✓ Correct! Well done.

Averaging causes numerical overflow

✗ Try again.

PyTorch automatically handles this correctly, so it is never an issue

✗ Try again.

Why is the @torch.no_grad() decorator applied to an evaluation function?

It speeds up the forward pass by skipping layer computations

✗ Try again.

Evaluation does not need gradients, so disabling gradient tracking saves memory and computation that would otherwise be wasted building an unused computation graph

✓ Correct! Well done.

It is required for model.eval() to function correctly

✗ Try again.

It prevents the model's weights from being accidentally modified

✗ Try again.

41. What is the purpose of torch.manual_seed() and how do you ensure reproducibility in PyTorch?

PyTorch uses pseudo-random number generators for weight initialisation, dropout masks, data shuffling, and more. Setting seeds explicitly ensures experiments are reproducible — critical for debugging, comparing model variants fairly, and scientific rigor.

import torch
import numpy as np
import random
import os

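def set_seed(seed: int = 42):
    """Seed the common sources of randomness for reproducible runs."""
    random.seed(seed)                       # Python's random module
    np.random.seed(seed)                    # NumPy
    torch.manual_seed(seed)                 # PyTorch RNGs (CPU and CUDA devices)
    torch.cuda.manual_seed_all(seed)        # explicit for all GPUs (redundant but harmless)

    # Ask cuDNN for deterministic algorithms (may be slower!)
    # torch.use_deterministic_algorithms(True) is stricter and covers non-cuDNN ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False   # disable auto-tuner (non-deterministic)

    # Only affects subprocesses started afterwards; to fix hash randomisation
    # for this process, set PYTHONHASHSEED before launching Python
    os.environ["PYTHONHASHSEED"] = str(seed)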

set_seed(42)

# Verify reproducibility
model1 = torch.nn.Linear(10, 5)
set_seed(42)
model2 = torch.nn.Linear(10, 5)
print(torch.equal(model1.weight, model2.weight))  # True — identical init

# DataLoader reproducibility — also needs a worker_init_fn for num_workers > 0
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

generator = torch.Generator()
generator.manual_seed(42)

from torch.utils.data import DataLoader
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,   # seeds each worker process
    generator=generator,           # seeds the shuffling order
)

Reproducibility checklist
Source of randomness	How to control it
Weight initialisation	torch.manual_seed(seed)
Dropout masks	Covered by torch.manual_seed (same RNG stream)
Data shuffling	DataLoader(generator=torch.Generator().manual_seed(seed))
Multi-worker DataLoader	worker_init_fn to seed each subprocess
GPU non-determinism	torch.backends.cudnn.deterministic = True (cuDNN ops only; torch.use_deterministic_algorithms(True) is stricter)
cuDNN auto-tuner	torch.backends.cudnn.benchmark = False
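
The "Data shuffling" row can be verified in a few lines. In this sketch the `epoch_order` helper is a hypothetical name; it builds a fresh seeded generator each time and returns the order in which samples come out of one shuffled epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

def epoch_order(seed):
    """Return the sample order of one shuffled epoch for a given generator seed."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=g)
    return torch.cat([batch[0] for batch in loader]).tolist()

order_a = epoch_order(42)
order_b = epoch_order(42)
print(order_a == order_b)        # True: same seed, same shuffle
print(sorted(order_a))           # every sample appears exactly once
```

Passing a dedicated generator keeps the shuffle order independent of how many other random calls happen elsewhere in the program.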

Take quiz

Why is torch.backends.cudnn.deterministic = True sometimes necessary in addition to torch.manual_seed()?

manual_seed() does not work on GPU at all

✗ Try again.

Some cuDNN GPU algorithms (especially convolutions) select among multiple implementation variants non-deterministically for performance — disabling this ensures bit-for-bit reproducible results, though potentially at the cost of speed

✓ Correct! Well done.

deterministic mode is required for the model to converge

✗ Try again.

It is purely a deprecated legacy setting with no current effect

✗ Try again.

Why does a multi-worker DataLoader (num_workers > 0) require special handling for reproducibility beyond just calling torch.manual_seed()?

DataLoader workers ignore the main process's manual_seed entirely

✗ Try again.

Each worker process runs in a separate process with its own RNG state inherited at fork time — without an explicit worker_init_fn to reseed them, different worker processes may produce different, non-reproducible random behaviour (like augmentation)

✓ Correct! Well done.

num_workers > 0 disables shuffling entirely

✗ Try again.

Workers always use the same random seed regardless of configuration

✗ Try again.

42. How does PyTorch handle multi-dimensional indexing and slicing of tensors?

PyTorch tensor indexing follows NumPy-style conventions, including basic slicing, advanced (fancy) indexing with integer/boolean tensors, and the powerful ... (ellipsis) operator for indexing high-dimensional tensors concisely.

import torch

x = torch.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)

# ── Basic slicing — same as Python lists/NumPy
print(x[0])           # shape (3, 4) — first "batch"
print(x[0, 1])         # shape (4,)  — first batch, second row
print(x[0, 1, 2])      # scalar — single element
print(x[:, 0, :])      # shape (2, 4) — all batches, first row, all cols
print(x[..., 0])       # shape (2, 3) — ellipsis: all leading dims, last dim index 0
print(x[0:1, :, -1])    # shape (1, 3) — slice + negative index

# ── Boolean (mask) indexing
mask = x > 10
print(x[mask])         # 1D tensor of all elements > 10
x_clamped = x.clone()
x_clamped[x_clamped > 10] = 0   # zero out values > 10

# ── Fancy (advanced) integer indexing
idx = torch.tensor([0, 2])
print(x[:, idx, :])    # shape (2, 2, 4) — select specific indices along dim 1

# ── torch.gather: select elements using an index tensor
scores = torch.tensor([[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]])  # (2, 3)
top_idx = scores.argmax(dim=1, keepdim=True)                 # (2, 1)
top_val = scores.gather(dim=1, index=top_idx)                 # (2, 1)
print(top_val)   # tensor([[0.7], [0.4]])

# ── torch.where: conditional element selection
result = torch.where(x > 10, x, torch.zeros_like(x))   # keep if >10, else 0

# ── Important: most slicing returns a VIEW, not a copy!
y = x[0]
y[0, 0] = 999
print(x[0, 0, 0])      # 999 — x was modified too! (shared memory)
# Use x[0].clone() to get an independent copy

Indexing patterns
Pattern	Example	Returns
Basic slicing	x[:, 0]	View (shares memory)
Boolean mask	x[x > 0]	Copy (1D, new memory)
Fancy indexing	x[:, [0,2]]	Copy (new memory)
Ellipsis	x[..., 0]	View — skips middle dims
gather	x.gather(dim, index)	Copy — selects per index
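
The view/copy column of the table can be checked by writing into each result and seeing whether the original tensor changes. A small sketch on a zero tensor, with variable names chosen for the example:

```python
import torch

x = torch.zeros(3, 4)

basic = x[:, 0]          # basic slicing: a view
basic[0] = 1.0
print(x[0, 0].item())    # 1.0: the write went through to x

fancy = x[:, [1, 2]]     # advanced (integer-list) indexing: a copy
fancy[0, 0] = 5.0
print(x[0, 1].item())    # 0.0: x is untouched

masked = x[x > 0]        # boolean mask: also a copy
masked[0] = -1.0
print(x[0, 0].item())    # 1.0: still the earlier value

# Assigning THROUGH an index expression does write into x
x[x > 0] = 7.0
print(x[0, 0].item())    # 7.0
```

The last case is why `x_clamped[x_clamped > 10] = 0` modifies the tensor in place even though `x[mask]` on its own returns a copy.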

Take quiz

What happens when you modify a slice obtained via basic indexing, like y = x[0]; y[0] = 999?

Only y is modified; x remains unchanged because slicing always copies

✗ Try again.

x is also modified because basic slicing returns a VIEW that shares the same underlying memory as the original tensor — use .clone() to get an independent copy

✓ Correct! Well done.

PyTorch raises an error preventing modification of slices

✗ Try again.

The behaviour depends on whether the tensor requires gradients

✗ Try again.

What does the ellipsis (...) do in PyTorch tensor indexing like x[..., 0]?

It selects every element in the tensor

✗ Try again.

It represents as many full slices (:) as needed to match the tensor's number of dimensions, letting you index the last dimension without specifying every preceding dimension explicitly

✓ Correct! Well done.

It raises a syntax error — ellipsis is not valid in tensor indexing

✗ Try again.

It only works on 2-dimensional tensors

✗ Try again.

43. What is the difference between .view(), .reshape(), and .contiguous() in PyTorch, and why does it matter?

These three methods deal with how a tensor's underlying memory is interpreted as a different shape. Understanding the difference prevents a class of confusing runtime errors related to tensor memory layout.

import torch

x = torch.arange(12).reshape(3, 4)    # shape (3, 4), contiguous memory

# ── .view(): ALWAYS returns a view (no copy); the new shape must be compatible
# with the existing strides, which is always true for contiguous tensors
y = x.view(4, 3)     # works — x is contiguous
print(y.shape)        # (4, 3)

# ── Transpose breaks contiguity — the data is NOT rearranged in memory,
# only the strides describing how to read it change
xt = x.t()             # transpose — x.t() is a VIEW with different strides
print(xt.is_contiguous())  # False!

# This FAILS — view() cannot reinterpret non-contiguous memory
try:
    xt.view(3, 4)
except RuntimeError as e:
    print(f"Error: {e}")
# RuntimeError: view size is not compatible with input tensor's size and stride

# ── .reshape(): tries view() first; falls back to copying if needed
z = xt.reshape(3, 4)   # WORKS — automatically copies if necessary
print(z.shape)          # (3, 4)

# ── .contiguous(): explicitly forces a contiguous copy in memory
xt_contig = xt.contiguous()
print(xt_contig.is_contiguous())  # True
xt_contig.view(3, 4)   # now works, since it is contiguous

# Strides explain WHY this happens
print(x.stride())   # (4, 1) — contiguous: move 1 step = 1 memory address
print(xt.stride())  # (1, 4) — transposed: strides reflect the swap, no copy made

.view() vs .reshape() vs .contiguous()
Method	Copies data?	Requires contiguous input?	Safety
.view()	Never, always a view	Needs a stride-compatible input (contiguous always qualifies); raises RuntimeError otherwise	Fails loudly on incompatible layouts
.reshape()	Only if necessary	No — handles either case automatically	Safer general-purpose choice
.contiguous()	Yes, if not already contiguous	N/A — this is what fixes it	Use before .view() on a transposed/permuted tensor
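
Whether .reshape() returned a view or a copy is not visible from the result's shape, so the sketch below checks it by writing into the result. It also includes one case where .view() succeeds on a non-contiguous tensor because the requested shape happens to fit the existing strides; all variable names are illustrative.

```python
import torch

x = torch.arange(12).reshape(3, 4)

# Contiguous input: reshape() returns a view, so writes reach x
r_view = x.reshape(4, 3)
r_view[0, 0] = -1
print(x[0, 0].item())           # -1

# Transposed input: reshape() has to copy, so writes do NOT reach x
r_copy = x.t().reshape(3, 4)
r_copy[0, 0] = 99
print(x[0, 0].item())           # still -1

# A non-contiguous tensor can sometimes still be viewed:
# every other row, with the last dimension split in two
rows = x[::2]                    # shape (2, 4), strides (8, 1)
print(rows.is_contiguous())      # False
print(rows.view(2, 2, 2).shape)  # torch.Size([2, 2, 2])
```

Code that relies on aliasing (or on the absence of it) is therefore safer with an explicit .view() or .clone() than with .reshape().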

Take quiz

Why does calling .view() on a transposed tensor often raise a RuntimeError?

Transposed tensors cannot be reshaped under any circumstances

✗ Try again.

Transpose changes the tensor's strides (how memory is read) without physically rearranging the underlying data — the result is non-contiguous, and .view() requires contiguous memory to safely reinterpret shape without copying

✓ Correct! Well done.

view() only works on tensors with an even number of elements

✗ Try again.

Transposed tensors lose their gradient history

✗ Try again.

What is the practical advantage of .reshape() over .view() in most code?

reshape() is always faster than view()

✗ Try again.

reshape() automatically falls back to copying the data when the tensor is not contiguous, so it works in both cases without raising an error — making it the safer default choice

✓ Correct! Well done.

reshape() supports more dimensions than view()

✗ Try again.

view() is deprecated in favour of reshape()

✗ Try again.

44. How do you freeze layers and perform transfer learning / fine-tuning in PyTorch?

Transfer learning reuses a model pretrained on a large dataset and adapts it to a new task. Freezing layers (setting requires_grad=False) prevents their weights from updating during backpropagation — useful when you want to keep pretrained features fixed and only train a new task-specific head.

import torch
import torch.nn as nn
import torchvision.models as models

# Load a pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# ── Strategy 1: Feature extraction — freeze ALL pretrained layers
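for param in backbone.parameters():
    param.requires_grad = False     # excluded from gradient computation
# Note: BatchNorm running statistics still update in train() mode even when
# frozen; put the frozen BatchNorm layers in eval() mode to keep them fixed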

# Replace the final classification layer for our task (e.g. 5 classes)
in_features = backbone.fc.in_features    # 2048 for ResNet-50
backbone.fc = nn.Linear(in_features, 5)  # NEW layer — requires_grad=True by default

# Only backbone.fc parameters will be updated by the optimizer
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, backbone.parameters()),  # only trainable params
    lr=1e-3,
)

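# ── Strategy 2: Full fine-tuning with layer-wise (discriminative) learning rates
backbone2 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone2.fc = nn.Linear(backbone2.fc.in_features, 5)

# Every module that should train needs a param group: parameters left out
# of the optimizer are never updated, even with requires_grad=True
stem_params = list(backbone2.conv1.parameters()) + list(backbone2.bn1.parameters())
optimizer2 = torch.optim.AdamW([
    {"params": stem_params,                   "lr": 1e-5},  # stem: smallest lr
    {"params": backbone2.layer1.parameters(), "lr": 1e-5},  # earliest layers: smallest lr
    {"params": backbone2.layer2.parameters(), "lr": 3e-5},
    {"params": backbone2.layer3.parameters(), "lr": 3e-5},
    {"params": backbone2.layer4.parameters(), "lr": 1e-4},  # later layers: bigger lr
    {"params": backbone2.fc.parameters(),     "lr": 1e-3},  # new head: largest lr
])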

# ── Verify which parameters are trainable
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total     = sum(p.numel() for p in backbone.parameters())
print(f"Trainable: {trainable:,} / Total: {total:,} ({100*trainable/total:.1f}%)")

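# ── Common pattern: train head first, then unfreeze and fine-tune everything
# Phase 1: only train backbone.fc for a few epochs
# Phase 2: unfreeze all layers, train with a small lr to fine-tune end-to-end
for param in backbone.parameters():
    param.requires_grad = True   # unfreeze for phase 2
# The phase-1 optimizer only holds backbone.fc, so build a new one for phase 2
# (1e-5 is an illustrative small lr)
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)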
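
One detail the feature-extraction strategy above leaves out: requires_grad=False stops weight updates, but BatchNorm layers inside a frozen backbone still update their running statistics whenever the module is in train mode. The sketch below illustrates this with a small hypothetical TinyBackbone (an assumption standing in for ResNet-50, so that it runs without downloading pretrained weights).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained backbone (not a torchvision model)
class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU())
        self.fc = nn.Linear(16, 3)   # task-specific head

    def forward(self, x):
        return self.fc(self.features(x))

model = TinyBackbone()

# Freeze everything except the head
for p in model.features.parameters():
    p.requires_grad = False

bn = model.features[1]
weights_before = model.features[0].weight.clone()
stats_before = bn.running_mean.clone()

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2
)

# One ordinary training step with the whole model in train mode
model.train()
x, y = torch.randn(32, 8), torch.randint(0, 3, (32,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

weights_frozen = torch.equal(weights_before, model.features[0].weight)
stats_changed = not torch.equal(stats_before, bn.running_mean)

# Keep the frozen part in eval mode so its BatchNorm statistics stay fixed too
model.train()
model.features.eval()
stats_mid = bn.running_mean.clone()
model(x)
stats_fixed = torch.equal(stats_mid, bn.running_mean)

print(weights_frozen, stats_changed, stats_fixed)
```

Note that a later call to model.train() puts the frozen sub-module back into train mode, so the features.eval() call has to be repeated after it (for example at the start of every epoch).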


What does setting param.requires_grad = False accomplish for a layer in a pretrained model?

The parameter is permanently deleted from the model

✗ Try again.

The parameter is excluded from gradient computation during backward() and will not be updated by the optimizer — effectively freezing that layer's weights

✓ Correct! Well done.

The parameter's value is reset to its initial pretrained value before each forward pass

✗ Try again.

It makes the parameter shared across multiple layers of the model

✗ Try again.

Why might you use a smaller learning rate for early (pretrained) layers and a larger one for the new task-specific head during fine-tuning?

Early layers always have more parameters and need slower updates for computational reasons

✗ Try again.

Early layers encode general, broadly useful features learned from a large pretraining dataset — large updates could overwrite this useful knowledge; the new head has random initialisation and needs larger updates to learn quickly

✓ Correct! Well done.

PyTorch requires layer-wise learning rates for any pretrained model

✗ Try again.

Smaller learning rates prevent the model from running out of GPU memory

✗ Try again.

45. What is the purpose of torch.utils.data.random_split() and how do you create train/validation/test splits in PyTorch?

Splitting a dataset into training, validation, and test subsets is a fundamental step before training. PyTorch's random_split() creates non-overlapping random subsets from a single Dataset, while preserving the lazy-loading behaviour of the original Dataset.

import torch
from torch.utils.data import Dataset, DataLoader, random_split

class MyDataset(Dataset):
    def __init__(self, n=1000):
        self.data   = torch.randn(n, 20)
        self.labels = torch.randint(0, 3, (n,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

full_dataset = MyDataset(n=1000)

# Split: 70% train, 15% val, 15% test
train_size = int(0.7 * len(full_dataset))
val_size   = int(0.15 * len(full_dataset))
test_size  = len(full_dataset) - train_size - val_size  # remainder, avoids rounding loss

# Use a generator for reproducible splits
generator = torch.Generator().manual_seed(42)
train_ds, val_ds, test_ds = random_split(
    full_dataset,
    [train_size, val_size, test_size],
    generator=generator,
)

print(len(train_ds), len(val_ds), len(test_ds))   # 700 150 150

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=32, shuffle=False)  # no shuffle needed
test_loader  = DataLoader(test_ds,  batch_size=32, shuffle=False)

# IMPORTANT GOTCHA: if your Dataset applies transforms internally
# (e.g. data augmentation that should run only for training), random_split
# alone does NOT let you use different transforms per split, because
# all splits reference the SAME underlying Dataset object.
# Common workaround: split INDICES, then wrap separate Dataset
# instances (built with different transforms) in Subset.
# Note: this permutation is a new split, not the one random_split produced above.
from torch.utils.data import Subset
indices = torch.randperm(len(full_dataset), generator=generator).tolist()
train_idx = indices[:train_size]
val_idx   = indices[train_size:train_size+val_size]
test_idx  = indices[train_size+val_size:]
# train_dataset_aug  = Subset(MyDatasetWithAugmentation(...), train_idx)
# val_dataset_plain  = Subset(MyDatasetPlain(...), val_idx)
# test_dataset_plain = Subset(MyDatasetPlain(...), test_idx)
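
Another way to get per-split transforms is to keep the underlying Dataset transform-free and apply the transform in a thin wrapper around each split. The sketch below shows one possible shape for this. TransformedSubset, RawDataset and add_noise are hypothetical names introduced for illustration, and the additive noise only stands in for a real augmentation pipeline.

```python
import torch
from torch.utils.data import Dataset, random_split

class TransformedSubset(Dataset):
    """Hypothetical helper: wraps one split and applies its own transform per sample."""
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        x, y = self.subset[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

class RawDataset(Dataset):
    """Toy dataset that returns untransformed samples."""
    def __init__(self, n=100):
        self.data = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.labels = torch.zeros(n, dtype=torch.long)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

def add_noise(x):
    # Stand-in "augmentation": small additive Gaussian noise
    return x + 0.01 * torch.randn_like(x)

raw = RawDataset(100)
gen = torch.Generator().manual_seed(0)
train_split, val_split = random_split(raw, [80, 20], generator=gen)

train_aug = TransformedSubset(train_split, transform=add_noise)  # augmented
val_plain = TransformedSubset(val_split)                         # untouched

x_val, _ = val_plain[0]
x_val_raw, _ = val_split[0]
x_train, _ = train_aug[0]
x_train_raw, _ = train_split[0]

print(len(train_aug), len(val_plain))
print(torch.equal(x_val, x_val_raw))       # validation samples pass through unchanged
print(torch.equal(x_train, x_train_raw))   # training samples are perturbed
```

Both wrappers can be passed to DataLoader exactly like the splits returned by random_split().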


Why is shuffle=False typically used for validation and test DataLoaders, while shuffle=True is used for training?

Shuffling is technically impossible for validation data

✗ Try again.

Shuffle order does not affect what the model learns or how it is evaluated for validation/test — keeping it deterministic (False) makes results easier to reproduce and debug; training shuffling, by contrast, prevents the model from learning spurious order-dependent patterns

✓ Correct! Well done.

Validation data is always smaller, so shuffling provides no benefit

✗ Try again.

PyTorch requires shuffle=False whenever drop_last=False

✗ Try again.

What is a key limitation of torch.utils.data.random_split() when you want different data augmentation for the train vs validation split?

random_split() cannot create more than 2 splits

✗ Try again.

All resulting splits reference the SAME underlying Dataset object and its single set of transforms — you cannot directly apply different augmentation pipelines per split without restructuring (e.g. splitting indices and wrapping with separate Dataset instances)

✓ Correct! Well done.

random_split() does not support reproducible seeding

✗ Try again.

random_split() only works with TensorDataset, not custom Dataset classes

✗ Try again.

46. What is Batch Normalization in PyTorch and how does it differ from Layer Normalization?

Normalization layers stabilise training by re-centring and re-scaling activations. PyTorch provides several variants; Batch Normalization (BatchNorm) and Layer Normalization (LayerNorm) are the two most widely used, but they normalise over different dimensions and suit different architectures.

BatchNorm vs LayerNorm
Feature	BatchNorm (nn.BatchNorm1d/2d)	LayerNorm (nn.LayerNorm)
Normalises over	Batch dimension, plus spatial dimensions for BatchNorm2d (per-feature / per-channel statistics)	Trailing feature dimension(s) given by normalized_shape (per-sample statistics)
Statistics at train	Computed from current mini-batch	Computed from current sample's features
Statistics at eval	Uses running mean/var accumulated during training	Always computed fresh from current input
Batch size dependency	Noisy with very small batches (< 8)	Independent of batch size — works with batch=1
Best for	CNNs (image models)	Transformers, RNNs, NLP models
Parameters	gamma (scale), beta (shift) per feature / channel	gamma and beta, one per element of normalized_shape (shared across samples)
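
The "normalises over" row can be checked by hand. The sketch below, assuming a plain 2-D input of shape (batch, features) and with the learnable scale and shift switched off, reproduces both layers with explicit mean and variance computations along the relevant dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)  # (batch, features)

bn = nn.BatchNorm1d(4, affine=False)            # train mode by default: uses batch statistics
ln = nn.LayerNorm(4, elementwise_affine=False)

# BatchNorm: one mean/variance per feature, computed across the batch (dim=0)
bn_manual = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + bn.eps)

# LayerNorm: one mean/variance per sample, computed across the features (dim=-1)
ln_manual = (x - x.mean(dim=-1, keepdim=True)) / torch.sqrt(
    x.var(dim=-1, unbiased=False, keepdim=True) + ln.eps
)

bn_matches = torch.allclose(bn(x), bn_manual, atol=1e-5)
ln_matches = torch.allclose(ln(x), ln_manual, atol=1e-5)
print(bn_matches, ln_matches)
```

Both layers use the biased variance estimate (dividing by N, not N - 1) for normalisation, which is why unbiased=False appears in the manual version.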


import torch
import torch.nn as nn

# ── BatchNorm — for feedforward / CNN models
class BNModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)   # 64 features
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        return self.fc2(x)

# BatchNorm behaves differently in train vs eval mode!
# train: normalise using batch mean/var, update running stats
# eval:  use accumulated running_mean / running_var
model = BNModel()
model.train()   # must be in train mode during training!

# ── LayerNorm — for transformers and sequence models
class LNModel(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.fc1 = nn.Linear(20, d_model)
        self.ln1 = nn.LayerNorm(d_model)   # normalise over last dim
        self.fc2 = nn.Linear(d_model, 10)

    def forward(self, x):
        x = torch.relu(self.ln1(self.fc1(x)))
        return self.fc2(x)

# LayerNorm produces the SAME result at train and eval
ln_model = LNModel()
ln_model.train()
x = torch.randn(8, 20)
out_train = ln_model(x)
ln_model.eval()
out_eval  = ln_model(x)
print(torch.allclose(out_train, out_eval))  # True — LayerNorm is mode-independent!

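Common bug: forgetting to switch modes when using BatchNorm. Call model.train() before training and model.eval() before validation or inference. If the model stays in train mode during evaluation, BatchNorm normalises with noisy per-batch statistics and keeps updating its running statistics from the validation data. If it stays in eval mode during training, running_mean and running_var are never updated from their initial values (0 and 1), so the layer does not normalise with the statistics of your data.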
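
The effect of the mode is easy to observe on a single layer. The sketch below (the input scale and shift are arbitrary illustrative values) runs the same batch through a fresh BatchNorm1d in eval mode and then in train mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(16, 3) * 5 + 10   # batch with mean around 10 and std around 5

# Fresh layer: running_mean is 0 and running_var is 1
bn.eval()
out_eval_untrained = bn(x)   # normalises with (0, 1), so the output is roughly x itself

bn.train()
out_train = bn(x)            # normalises with batch statistics and updates running stats

print(out_eval_untrained.mean().item())   # stays near the input mean: not normalised
print(out_train.mean().item())            # approximately zero
print(bn.running_mean)                    # moved from 0 toward the batch mean
```

With the default momentum of 0.1, a single train-mode forward pass moves running_mean only a tenth of the way to the batch mean, which is why the running statistics need many training batches before eval-mode outputs become well normalised.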


Why does BatchNorm produce different outputs depending on whether the model is in train() or eval() mode?

eval() mode disables the learnable gamma and beta parameters

✗ Try again.

In train mode, BatchNorm normalises using the current mini-batch's mean and variance; in eval mode, it uses accumulated running_mean and running_var from training — without calling model.eval(), inference uses noisy batch statistics rather than stable running statistics

✓ Correct! Well done.

BatchNorm applies dropout in train mode but not eval mode

✗ Try again.

eval() mode doubles the batch size internally to compute more stable statistics

✗ Try again.

For which type of architecture is LayerNorm preferred over BatchNorm, and why?

CNNs: LayerNorm handles spatial dimensions better

✗ Try again.

Transformers and sequence models — LayerNorm normalises per-sample (independent of batch size), making it well-suited for variable-length sequences and tasks where batch size may be very small (e.g. language model fine-tuning)

✓ Correct! Well done.

Generative models — LayerNorm generates better image quality

✗ Try again.

Any model with more than 5 layers — BatchNorm is only stable in shallow networks

✗ Try again.

47. How do you implement and use a custom loss function in PyTorch?

When built-in loss functions do not fit your task, you can write a custom loss as either a plain function or an nn.Module subclass. As long as the loss is computed from PyTorch tensor operations with requires_grad=True parameters, autograd handles differentiation automatically.

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Option 1: Plain function (simple, no learnable parameters)
def smooth_l1_custom(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss (Huber loss divided by beta): quadratic where |diff| < beta, linear elsewhere."""
    diff = torch.abs(pred - target)
    loss = torch.where(
        diff < beta,
        0.5 * diff ** 2 / beta,       # quadratic region
        diff - 0.5 * beta,            # linear region
    )
    return loss.mean()

# ── Option 2: nn.Module subclass (recommended when loss has hyper-parameters
# or learnable parameters you want saved in state_dict)
class FocalLoss(nn.Module):
    """Focal loss for class-imbalanced multi-class problems."""
    def __init__(self, gamma: float = 2.0, weight: torch.Tensor | None = None):
        super().__init__()
        self.gamma = gamma
        # Buffer (not a plain attribute): follows .to(device) and is saved in state_dict
        self.register_buffer("weight", weight)   # class weights (optional)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (N, C)  targets: (N,) int64
        ce_loss = F.cross_entropy(logits, targets, reduction="none")  # unweighted: -log(pt)
        pt      = torch.exp(-ce_loss)         # probability of correct class
        focal   = (1 - pt) ** self.gamma * ce_loss
        if self.weight is not None:
            focal = focal * self.weight[targets]   # apply class weights after computing pt
        return focal.mean()

# Usage — identical to built-in loss functions
model     = nn.Linear(10, 5)
focal_fn  = FocalLoss(gamma=2.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X      = torch.randn(16, 10)
target = torch.randint(0, 5, (16,))

optimizer.zero_grad()
logits = model(X)
loss   = focal_fn(logits, target)   # custom loss used exactly like nn.CrossEntropyLoss
loss.backward()                      # autograd differentiates through our custom ops
optimizer.step()

print(f"Focal loss: {loss.item():.4f}")

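# ── Combining multiple losses (VAE-style; the tensors below are placeholders
# standing in for a real model's outputs)
output, target_img = torch.rand(16, 784, requires_grad=True), torch.rand(16, 784)
mu, log_var = torch.zeros(16, 8, requires_grad=True), torch.zeros(16, 8, requires_grad=True)
rec_loss  = F.mse_loss(output, target_img)  # reconstruction
kl_loss   = -0.5 * (1 + log_var - mu**2 - log_var.exp()).mean()  # KL divergence
total_loss = rec_loss + 0.001 * kl_loss   # weighted combination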

Key insight: any PyTorch computation graph built from differentiable operations is automatically differentiable via autograd — you do not need to manually derive or implement gradients for custom losses. If you use standard PyTorch operations (torch.*, F.*), autograd takes care of the rest.
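
To make this concrete, the sketch below defines a small custom loss (log-cosh, chosen here purely as an illustration) and compares the gradient autograd produces with the one derived by hand.

```python
import torch

def log_cosh_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Custom loss: mean of log(cosh(pred - target))."""
    return torch.log(torch.cosh(pred - target)).mean()

torch.manual_seed(0)
pred = torch.randn(6, requires_grad=True)
target = torch.randn(6)

loss = log_cosh_loss(pred, target)
loss.backward()   # autograd computes d(loss)/d(pred)

# Hand-derived gradient: d/dx log(cosh(x)) = tanh(x), divided by N because of the mean
analytic_grad = torch.tanh(pred.detach() - target) / pred.numel()

grads_match = torch.allclose(pred.grad, analytic_grad, atol=1e-6)
print(grads_match)
```

No backward() method was written for log_cosh_loss; the gradient comes entirely from the recorded torch.log, torch.cosh, subtraction and mean operations.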


Do you need to manually implement the backward() gradient computation for a custom PyTorch loss function built from standard tensor operations?

Yes, all custom losses must implement a backward() method

✗ Try again.

No — PyTorch's autograd engine automatically differentiates through any combination of standard PyTorch tensor operations, as long as the tensors being differentiated have requires_grad=True

✓ Correct! Well done.

Yes — but only for loss functions with more than one term

✗ Try again.

Only if the loss uses in-place operations

✗ Try again.

What is the advantage of implementing a custom loss as an nn.Module subclass instead of a plain function?

nn.Module subclasses are always faster than plain functions

✗ Try again.

Subclassing lets you store hyperparameters and learnable parameters (if any) as proper attributes, makes the loss inspectable via repr(), and integrates naturally with model checkpointing via state_dict()

✓ Correct! Well done.

Plain functions cannot be used with autograd

✗ Try again.

PyTorch only accepts nn.Module instances as loss functions in the training loop

✗ Try again.

48. What is torch.compile() vs TorchScript and how do you export a PyTorch model for production deployment?

PyTorch offers several ways to speed up or export a trained model. torch.compile() JIT-compiles the model for speed but still runs inside the Python interpreter, whereas TorchScript serialises the model to an intermediate representation that can be loaded without Python (for example from C++). For cross-language/cross-framework deployment, ONNX export is also widely used.

PyTorch deployment options
Method	Best for	Requires Python runtime?	Portable across languages?
torch.compile()	Faster training and inference within Python; minimal code changes (one wrapper call)	Yes	No
TorchScript (trace)	Production servers; models with fixed control flow	No	Yes (C++ API)
TorchScript (script)	Models with data-dependent control flow (if/loops)	No	Yes
ONNX export	Cross-framework deployment (TensorRT, ONNX Runtime, CoreML)	No	Yes (many runtimes)


import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model.eval()   # ALWAYS call eval() before exporting

# ── 1. torch.compile() — fast Python-based inference (PyTorch 2.0+)
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(4, 10))

# ── 2. TorchScript trace — captures a concrete execution trace
#       Works best when control flow does NOT depend on input data
example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model_traced.pt")

# Load and run without the original Python class
loaded_traced = torch.jit.load("model_traced.pt")
out = loaded_traced(torch.randn(4, 10))

# ── 3. TorchScript script — handles dynamic control flow
class DynamicModel(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.mean() > 0:        # data-dependent branch — trace would miss this!
            return torch.relu(x)
        return torch.tanh(x)

scripted = torch.jit.script(DynamicModel())
torch.jit.save(scripted, "model_scripted.pt")

# ── 4. ONNX export — deploy with ONNX Runtime, TensorRT, CoreML
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch_size"}, "logits": {0: "batch_size"}},  # variable batch size on input and output
    opset_version=17,
)

# Inference with ONNX Runtime (no PyTorch dependency on deployment host!)
# import numpy as np
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx")
# x    = np.random.randn(4, 10).astype(np.float32)   # ONNX Runtime takes NumPy arrays
# out  = sess.run(["logits"], {"features": x})
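
The comment "trace would miss this" in the DynamicModel example can be demonstrated directly. The sketch below traces and scripts the same module, then feeds both an input that should take the branch the tracing example never exercised. The input values are arbitrary, and the warning filter is only there to silence the TracerWarning that tracing emits for the data-dependent branch.

```python
import warnings
import torch
import torch.nn as nn

class DynamicModel(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.mean() > 0:
            return torch.relu(x)
        return torch.tanh(x)

model = DynamicModel().eval()
pos = torch.tensor([1.0, 2.0, 3.0])      # mean > 0: takes the relu branch
neg = torch.tensor([-1.0, -2.0, -3.0])   # mean < 0: should take the tanh branch

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # hide the TracerWarning about the branch
    traced = torch.jit.trace(model, pos)  # only the relu path is recorded
scripted = torch.jit.script(model)        # both paths are compiled

out_traced = traced(neg)
out_scripted = scripted(neg)
out_eager = model(neg)

print(out_traced)     # relu was baked in, so negative inputs become zeros
print(out_scripted)   # tanh, matching eager mode
print(out_eager)
```

The traced module returns silently wrong results here, with no error raised, so it is worth comparing traced and eager outputs on inputs that cover every branch before deploying.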


What is the key difference between torch.jit.trace() and torch.jit.script() for TorchScript export?

trace() is faster at runtime; script() is faster to compile

✗ Try again.

trace() records a single concrete execution path through the model — data-dependent branches (if/else based on tensor values) that weren't triggered during tracing are not captured; script() analyses the full Python source code and handles all control flow correctly

✓ Correct! Well done.

script() supports GPU; trace() only works on CPU

✗ Try again.

trace() and script() are identical — they only differ in the input format they accept

✗ Try again.

Why should you always call model.eval() before exporting or scripting a PyTorch model for production?

eval() compresses the model weights for smaller file size

✗ Try again.

eval() switches Dropout to pass-through (no random zeroing) and BatchNorm to use stable running statistics — without this, the exported model would include randomness (Dropout) or require batch-level statistics (BatchNorm) that are inappropriate and incorrect during single-sample inference

✓ Correct! Well done.

eval() is required by torch.jit.trace() as a prerequisite

✗ Try again.

eval() disables gradient computation, making the export faster

✗ Try again.

About Javapedia.net: Javapedia.net is for Java and J2EE developers, technologists, and college students preparing for interviews. The site also includes many practical examples. It was developed using J2EE technologies by Steve Antony, a senior developer/lead at a logistics company.
contact: javatutorials2016[at]gmail[dot]com
	Copyright © 2026, javapedia.net, all rights reserved. privacy policy.