Python / PyTorch Fundamentals Interview Questions
1. What is PyTorch and what are its key advantages over other deep learning frameworks?
PyTorch is an open-source deep learning framework developed by Meta AI (Facebook), released in 2016. It is built around two core ideas: tensor computation with GPU acceleration (similar to NumPy but on the GPU) and automatic differentiation via a dynamic computation graph (called define-by-run or eager execution).
| Feature | PyTorch | TensorFlow 2.x |
|---|---|---|
| Graph style | Dynamic (eager by default) | Eager by default (was static in v1) |
| Debugging | Native Python debugger (pdb, print) | More complex — graph abstractions |
| Research adoption | Dominant in academia | Strong in production |
| Deployment | TorchScript, ONNX, TorchServe | TensorFlow Serving, TFLite, TF.js |
| API feel | Pythonic, NumPy-like | More verbose historically |
| Community | Fast-growing, most ML papers | Large, enterprise-focused |
Key advantages of PyTorch:
- Dynamic computation graph: the graph is built at runtime, so standard Python debugging tools work naturally
- Pythonic API: feels like writing NumPy code; easy to mix with standard Python control flow
- Strong GPU support: .cuda() / .to(device) moves tensors to the GPU with one call
- Rich ecosystem: torchvision, torchaudio, torchtext, HuggingFace Transformers, PyTorch Lightning
- Production path: TorchScript, torch.compile, and ONNX export for deployment
2. What is a PyTorch tensor and how does it differ from a NumPy array?
A tensor is PyTorch's core data structure — an n-dimensional array similar to NumPy's ndarray, but with two critical extra capabilities: it can live on a GPU for accelerated computation, and it supports automatic differentiation (autograd) for computing gradients during backpropagation.
import torch
import numpy as np
# Creating tensors
t1 = torch.tensor([1.0, 2.0, 3.0]) # from Python list
t2 = torch.zeros(3, 4) # 3×4 zeros
t3 = torch.ones(2, 3) # 2×3 ones
t4 = torch.rand(2, 3) # uniform random [0,1)
t5 = torch.randn(2, 3) # standard normal
t6 = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
t7 = torch.linspace(0, 1, 5) # 5 evenly spaced pts
# Shape, dtype, device
print(t2.shape) # torch.Size([3, 4])
print(t1.dtype) # torch.float32
print(t1.device) # cpu
# NumPy ↔ PyTorch bridge (shares memory on CPU!)
np_array = np.array([1.0, 2.0, 3.0])
torch_from_np = torch.from_numpy(np_array) # shares memory
np_from_torch = t1.numpy() # shares memory
np_array[0] = 99
print(torch_from_np[0]) # tensor(99., dtype=torch.float64): memory is shared!
| Feature | PyTorch Tensor | NumPy ndarray |
|---|---|---|
| GPU support | Yes — .to('cuda') | No |
| Autograd | Yes — requires_grad=True | No |
| Memory sharing | Yes (CPU tensors) | Yes (via from_numpy) |
| Default dtype | float32 | float64 |
| Broadcasting | Yes (same rules) | Yes |
3. What are the most important tensor operations in PyTorch?
PyTorch provides a rich set of tensor operations covering arithmetic, shape manipulation, reduction, and linear algebra. Most have both a functional form (torch.add) and a method form (tensor.add), plus in-place variants with a trailing underscore (tensor.add_).
import torch
a = torch.tensor([[1.,2.,3.],[4.,5.,6.]])
b = torch.tensor([[7.,8.,9.],[10.,11.,12.]])
# ── Arithmetic
print(a + b) # element-wise add
print(a * b) # element-wise multiply (Hadamard)
print(torch.matmul(a, b.T)) # matrix multiply (2×3) @ (3×2) → (2×2)
print(a @ b.T) # same with @ operator
# ── Shape manipulation
print(a.shape) # torch.Size([2, 3])
print(a.reshape(3, 2)) # (3, 2) — new view if possible
print(a.view(6)) # (6,) — must be contiguous
print(a.unsqueeze(0).shape) # (1, 2, 3) — add dim
print(a.unsqueeze(0).squeeze(0).shape) # (2, 3): squeeze removes a dim of size 1 (no-op on other dims)
print(torch.cat([a, b], dim=0)) # (4, 3) — concatenate rows
print(torch.stack([a, b], dim=0)) # (2, 2, 3) — new dim
print(a.permute(1, 0)) # (3, 2) — transpose
# ── Reduction
print(a.sum()) # scalar sum
print(a.sum(dim=1)) # sum along rows → (2,)
print(a.mean(dim=0)) # mean along columns → (3,)
print(a.max(), a.min())
print(a.argmax()) # index of max (flattened)
# ── In-place (modifies tensor, avoids memory allocation)
a.add_(1) # a += 1
a.mul_(2) # a *= 2
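# Warning: in-place ops on tensors that autograd needs for backward can raise errors!
Key distinction: reshape returns a view when possible (no copy) and falls back to a copy when the memory layout does not allow one. view never copies: it raises a RuntimeError if the requested shape is incompatible with the tensor's strides, which typically happens on non-contiguous tensors (for example after a transpose or permute). Use contiguous().view() or just reshape() to be safe.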
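To make the contiguity requirement concrete, the following is a minimal pure-Python sketch of how row-major strides relate to a shape. The helper names `contiguous_strides` and `is_contiguous` are illustrative, not PyTorch functions, and the check is simplified (the real one also tolerates size-1 dimensions).

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a freshly allocated array of this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """Simplified check: the strides match the row-major layout for the shape."""
    return tuple(strides) == contiguous_strides(shape)


shape = (2, 3)
strides = contiguous_strides(shape)               # (3, 1)

# A transpose swaps shape and strides without moving any data.
t_shape, t_strides = shape[::-1], strides[::-1]   # (3, 2) and (1, 3)

print(is_contiguous(shape, strides))       # True:  a view-style reshape is possible
print(is_contiguous(t_shape, t_strides))   # False: flattening needs a copy first
```

Under this model, the transposed layout is exactly the case where view(6) fails and reshape(6) silently copies.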
4. What are tensor data types (dtypes) in PyTorch and why do they matter?
Every tensor has a dtype that determines the numeric type and precision of its elements. Choosing the right dtype affects memory usage, computation speed, and numeric precision — a critical consideration when training on GPUs.
| dtype | Alias | Bits | Use case |
|---|---|---|---|
| torch.float32 | torch.float | 32 | Default for model weights and activations |
| torch.float64 | torch.double | 64 | High-precision numerical work |
| torch.float16 | torch.half | 16 | Mixed-precision training (GPU) |
| torch.bfloat16 | — | 16 | Modern GPUs (A100+); wider exponent than float16 |
| torch.int64 | torch.long | 64 | Indices, class labels, sequence lengths |
| torch.int32 | torch.int | 32 | General integer computation |
| torch.bool | — | 8 | Masks, boolean indexing |
| torch.uint8 | — | 8 | Image pixel values (0–255) |
import torch
# Creating tensors with specific dtypes
x = torch.tensor([1.0, 2.0], dtype=torch.float32)
y = torch.tensor([1, 2, 3], dtype=torch.long) # class labels
m = torch.tensor([True, False, True], dtype=torch.bool)
# Casting between dtypes
print(x.dtype) # torch.float32
x64 = x.double() # → float64
x16 = x.half() # → float16
xi = x.to(torch.int32) # → int32
# Default dtype (float32 for floats, int64 for ints)
print(torch.tensor([1.0]).dtype) # torch.float32
print(torch.tensor([1]).dtype) # torch.int64
# Change global default (rarely needed; left commented out so the tensors below stay float32)
# torch.set_default_dtype(torch.float64)
# Why dtype matters for loss computation:
# CrossEntropyLoss expects:
# input: float32 (logits)
# target: int64 (class indices)
loss_fn = torch.nn.CrossEntropyLoss()
logits = torch.randn(4, 10) # float32
targets = torch.randint(0, 10, (4,)) # int64
loss = loss_fn(logits, targets) # works!
# targets_wrong = targets.float() # would error!
Most common dtype errors: passing float64 inputs (for example, converted from NumPy) to a model with float32 weights, or passing float targets to a loss function expecting long (e.g. CrossEntropyLoss).
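The memory side of the dtype choice is simple arithmetic. The sketch below uses the bit widths from the table above; the `tensor_bytes` helper and the 1024×1024 shape are hypothetical, chosen only for illustration.

```python
# Bytes per element, derived from the bit widths in the dtype table.
BYTES_PER_ELEMENT = {
    "float64": 8, "float32": 4, "float16": 2, "bfloat16": 2,
    "int64": 8, "int32": 4, "bool": 1, "uint8": 1,
}


def tensor_bytes(shape, dtype):
    """Storage needed for a dense tensor of the given shape and dtype."""
    numel = 1
    for size in shape:
        numel *= size
    return numel * BYTES_PER_ELEMENT[dtype]


shape = (1024, 1024)  # hypothetical weight matrix
for dtype in ("float64", "float32", "float16"):
    mib = tensor_bytes(shape, dtype) / 2**20
    print(f"{dtype}: {mib:.0f} MiB")
# float64: 8 MiB
# float32: 4 MiB
# float16: 2 MiB
```

Halving the precision halves the footprint, which is one reason mixed-precision training is attractive on GPUs.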
5. How does broadcasting work in PyTorch and what are the rules?
Broadcasting allows PyTorch to perform arithmetic between tensors of different shapes without explicit copying. PyTorch follows the same broadcasting rules as NumPy. Understanding broadcasting is essential to avoid subtle shape bugs.
import torch
# Rule: align shapes from the RIGHT, expand dims of size 1
a = torch.ones(3, 4) # shape (3, 4)
b = torch.ones(4) # shape (4) → treated as (1, 4) → broadcast to (3, 4)
c = a + b # works! c.shape = (3, 4)
# Adding a bias vector to a batch of activations
batch = torch.randn(32, 128) # (batch=32, features=128)
bias = torch.randn(128) # (128,) broadcasts across the batch dim
out = batch + bias # (32, 128) ✓
# Adding column and row vectors → 2D result
col = torch.arange(3).reshape(3, 1) # (3, 1)
row = torch.arange(4).reshape(1, 4) # (1, 4)
grid = col + row # (3, 4) — outer-sum
print(grid)
# tensor([[0, 1, 2, 3],
# [1, 2, 3, 4],
# [2, 3, 4, 5]])
# Common broadcasting errors:
# a = torch.ones(3, 4)
# b = torch.ones(3) # (3,) aligns to (1, 3) NOT (3, 1)
# a + b → ERROR: size 4 != size 3 in dimension 1
# Fix: b.reshape(3, 1) to make it (3, 1)
| Step | Rule |
|---|---|
| 1. Align right | Pad missing leading dimensions with 1 |
| 2. Check compatibility | Each dim must be equal, or one of them must be 1 |
| 3. Expand size-1 dims | Dimension of size 1 is stretched to match the other tensor |
| 4. Error if incompatible | Raises RuntimeError if no dim is 1 and sizes differ |
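The four steps in the table can be sketched as a small pure-Python function that computes the broadcast result shape. `broadcast_shape` is an illustrative helper, not a PyTorch API, and it raises ValueError where PyTorch would raise RuntimeError.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or raise if they are incompatible."""
    # Step 1: align right by padding the shorter shape with leading 1s.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        # Steps 2 and 3: equal sizes pass through; a size-1 dim stretches to match.
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            # Step 4: sizes differ and neither is 1.
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


print(broadcast_shape((3, 4), (4,)))       # (3, 4)
print(broadcast_shape((3, 1), (1, 4)))     # (3, 4)
print(broadcast_shape((32, 128), (128,)))  # (32, 128)
```

Calling `broadcast_shape((3, 4), (3,))` raises, which mirrors the error case in the listing above: (3,) aligns as (1, 3), and 3 does not match 4.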
6. What is autograd in PyTorch and how does it compute gradients?
PyTorch's autograd engine implements automatic differentiation. When you perform operations on tensors with requires_grad=True, PyTorch records every operation in a dynamic computation graph. Calling .backward() on a scalar loss traverses this graph in reverse using the chain rule, accumulating gradients in each tensor's .grad attribute.
import torch
# requires_grad=True tells PyTorch to track this tensor
x = torch.tensor([2.0, 3.0], requires_grad=True)
# Forward pass — operations are recorded
y = x ** 2 # y = [4.0, 9.0]
z = y.sum() # z = 13.0 (scalar)
# Backward pass — computes dz/dx using chain rule
z.backward()
print(x.grad) # tensor([4., 6.]) dz/dx = 2x
# Verify: dz/d(x[0]) = d(x[0]^2)/d(x[0]) = 2*x[0] = 4 ✓
# Gradients ACCUMULATE — always zero before next backward!
x.grad.zero_() # or optimizer.zero_grad()
# Non-leaf tensors (created by ops) have grad_fn
a = torch.tensor(3.0, requires_grad=True)
b = a * 2
print(b.grad_fn) # <MulBackward0 object>
print(b.requires_grad) # True — inherited from a
# Detach: stop tracking a tensor
c = b.detach() # c shares data with b but no grad history
print(c.requires_grad) # False
# torch.no_grad(): context manager to disable gradient tracking
with torch.no_grad():
    inference = a * 2 # faster, no graph built
print(inference.requires_grad) # False
| Concept | What it is |
|---|---|
| requires_grad=True | Tells autograd to track operations on this tensor |
| .grad | Accumulated gradient after .backward() — lives on leaf tensors |
| grad_fn | Reference to the function that created a non-leaf tensor |
| .backward() | Traverses graph backwards, fills .grad via chain rule |
| .detach() | Returns tensor with same data but no gradient history |
| torch.no_grad() | Context: disables gradient tracking (inference, validation) |
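To show what "recording operations and replaying them in reverse" means, here is a deliberately tiny scalar autodiff sketch in pure Python. The `Value` class is hypothetical and far simpler than PyTorch's engine; it only supports `+` and `*` and is meant to mirror the z = x² summed example above.

```python
class Value:
    """A scalar that remembers how it was computed, for reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes this node's grad to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Order nodes so that every node comes after its parents,
        # then apply the chain rule from the output back to the leaves.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


x0, x1 = Value(2.0), Value(3.0)
z = x0 * x0 + x1 * x1            # z = x0^2 + x1^2
z.backward()
print(z.data, x0.grad, x1.grad)  # 13.0 4.0 6.0
```

The leaf gradients match the `tensor([4., 6.])` result in the listing, since dz/dx = 2x.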
7. What is the computation graph in PyTorch and how does the dynamic graph differ from a static graph?
PyTorch builds a dynamic computation graph (also called eager execution or define-by-run). Every time you run the forward pass, a new graph is constructed on-the-fly based on the actual Python code paths executed. This is in contrast to TensorFlow 1.x's static graph, which is compiled once and then executed repeatedly.
import torch
# Dynamic graph: Python control flow works naturally
def dynamic_model(x, use_relu=True):
    h = x @ torch.randn(4, 4)
    if use_relu: # real Python if: changes the graph!
        h = torch.relu(h)
    else:
        h = torch.tanh(h)
    return h.sum()
x = torch.randn(2, 4, requires_grad=True)
# Each call may build a DIFFERENT graph depending on use_relu
loss1 = dynamic_model(x, use_relu=True)
loss1.backward() # graph includes ReLU nodes
x.grad.zero_()
loss2 = dynamic_model(x, use_relu=False)
loss2.backward() # graph includes Tanh nodes
# The graph is discarded after backward() by default
# retain_graph=True keeps it for multiple backward calls
y = (x ** 2).sum()
y.backward(retain_graph=True) # graph kept
y.backward() # can call again
# Inspecting the graph
z = x ** 3
print(z.grad_fn) # <PowBackward0>
print(z.grad_fn.next_functions) # upstream functions
| Aspect | Dynamic (PyTorch eager) | Static (TF1 / torch.compile) |
|---|---|---|
| When built | At runtime, every forward pass | Once, then reused |
| Python control flow | Works natively (if/for/while) | Must use special graph ops |
| Debugging | Use pdb, print anywhere | Harder — graph is opaque |
| Performance | Slight overhead from graph construction | Faster after compilation |
| Flexibility | High — easy to change architectures | Low — recompile to change |
8. How do torch.no_grad() and tensor.detach() differ, and when do you use each?
Both torch.no_grad() and .detach() stop gradient tracking, but they work at different levels and serve different purposes.
import torch
model_param = torch.tensor(2.0, requires_grad=True)
# ── torch.no_grad(): context manager — disables ALL grad tracking
# Use for inference and validation loops
with torch.no_grad():
    out = model_param * 3 # no graph built
    print(out.requires_grad) # False
    print(out.grad_fn) # None
# Faster + less memory — standard pattern for eval
# ── .detach(): detaches a SPECIFIC tensor from the graph
# The tensor still knows about grad, but is cut off from history
a = model_param * 4
print(a.requires_grad) # True (still tracking)
b = a.detach() # b shares data with a
print(b.requires_grad) # False (disconnected)
print(b.data_ptr() == a.data_ptr()) # True — SAME memory!
# Common use case: compute a "stop gradient" target
# in actor-critic / target networks
target = a.detach() # stop gradient through target
loss = (a - target) ** 2 # gradient only flows through a, not target
# ── @torch.no_grad() decorator variant
@torch.no_grad()
def predict(x):
    return model_param * x # no grad even without with block
# Validation loop pattern
def validate(model, loader):
    model.eval() # turns off dropout, batchnorm train mode
    with torch.no_grad(): # no gradient computation
        for x, y in loader:
            pred = model(x)
            # compute metrics...
| Feature | torch.no_grad() | tensor.detach() |
|---|---|---|
| Scope | All ops within the context block | One specific tensor |
| Memory saved | Yes — no graph built | Partial — graph still exists upstream |
| Typical use | Inference, validation loops | Target networks, stop-gradient |
| Output requires_grad | False | False |
9. What is nn.Module and how do you build a custom neural network in PyTorch?
nn.Module is the base class for all neural network components in PyTorch. Subclassing it gives you parameter management, device placement, train/eval mode toggling, state dict serialisation, and hooks — all for free.
import torch
import torch.nn as nn
class MLP(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__() # MUST call this first!
        # Layers defined as attributes are auto-registered as sub-modules
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.3)
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Define the forward computation."""
        x = self.fc1(x)
        x = self.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x
# Instantiate and inspect
model = MLP(in_features=784, hidden=256, out_features=10)
# Forward pass — calls forward() via __call__
x = torch.randn(32, 784) # batch of 32
out = model(x) # shape (32, 10)
# Parameter inspection
for name, param in model.named_parameters():
    print(name, param.shape, param.requires_grad)
# fc1.weight torch.Size([256, 784]) True
# fc1.bias torch.Size([256]) True
# fc2.weight torch.Size([10, 256]) True
# fc2.bias torch.Size([10]) True
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")
Critical rules:
- Always call super().__init__() in __init__
- Define layers as attributes (not local variables) so PyTorch registers them
- Implement the forward() method, but never call it directly
- Use model(x), not model.forward(x), so that pre- and post-forward hooks fire
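The parameter total printed for the MLP above (784 → 256 → 10) can be checked by hand. This is a small pure-Python sketch; `linear_params` and `mlp_params` are illustrative helpers, not PyTorch functions.

```python
def linear_params(in_features, out_features, bias=True):
    """Weight matrix (out x in) plus an optional bias vector (out)."""
    return in_features * out_features + (out_features if bias else 0)


def mlp_params(layer_sizes):
    """Total parameters of a stack of Linear layers with the given widths."""
    return sum(linear_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:]))


print(linear_params(784, 256))     # 200960  (fc1: 200704 weights + 256 biases)
print(linear_params(256, 10))      # 2570    (fc2: 2560 weights + 10 biases)
print(mlp_params([784, 256, 10]))  # 203530
```

ReLU and Dropout contribute no parameters, so the total comes entirely from the two Linear layers.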
10. What are nn.Sequential and other container modules in PyTorch?
PyTorch provides several container modules that compose layers without requiring a custom nn.Module subclass. They are convenient for simple feedforward architectures but less flexible than full subclassing.
import torch
import torch.nn as nn
# ── nn.Sequential: layers applied in order
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
out = model(torch.randn(32, 784)) # (32, 10)
# Named layers in Sequential (pass an OrderedDict; bare (name, module) tuples raise a TypeError)
from collections import OrderedDict
model_named = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(784, 256)),
    ("relu", nn.ReLU()),
    ("fc2", nn.Linear(256, 10)),
]))
print(model_named.fc1.weight.shape) # torch.Size([256, 784])
# ── nn.ModuleList: list of modules (for dynamic use)
class ResNet(nn.Module):
    def __init__(self, n_blocks: int):
        super().__init__()
        # ModuleList properly registers all contained modules
        self.blocks = nn.ModuleList([
            nn.Linear(64, 64) for _ in range(n_blocks)
        ])

    def forward(self, x):
        for block in self.blocks:
            x = torch.relu(block(x)) + x # residual
        return x
# ── nn.ModuleDict: dict of modules (for conditional routing)
class MultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleDict({
            "sentiment": nn.Linear(128, 2),
            "topic": nn.Linear(128, 10),
        })

    def forward(self, x, task: str):
        return self.heads[task](x)
| Container | When to use |
|---|---|
| nn.Sequential | Simple feedforward chains; no branching |
| nn.ModuleList | Dynamic or variable-length list of modules in a loop |
| nn.ModuleDict | Named modules selected conditionally (e.g. multi-task) |
| nn.ParameterList | List of nn.Parameter objects (rare) |
| nn.ParameterDict | Dict of nn.Parameter objects (rare) |
11. What built-in layers does PyTorch's nn module provide and how do you use the most common ones?
PyTorch's torch.nn module contains all the standard neural network building blocks. Understanding what each layer does mathematically helps you choose the right component and configure it correctly.
| Layer | Formula / purpose | Key parameters |
|---|---|---|
| nn.Linear | y = xW^T + b — fully connected | in_features, out_features, bias=True |
| nn.Conv2d | 2D cross-correlation — feature extraction | in_channels, out_channels, kernel_size, stride, padding |
| nn.BatchNorm1d/2d | Normalise over batch; learnable γ, β | num_features, eps, momentum |
| nn.Dropout | Zero random neurons with prob p during train | p (dropout probability) |
| nn.Embedding | Learnable lookup table for integer tokens | num_embeddings, embedding_dim |
| nn.LSTM | Long Short-Term Memory recurrent layer | input_size, hidden_size, num_layers |
| nn.MultiheadAttention | Scaled dot-product attention | embed_dim, num_heads |
| nn.LayerNorm | Normalise over feature dims per sample | normalized_shape |
import torch, torch.nn as nn
# nn.Linear
fc = nn.Linear(128, 64) # (batch, 128) → (batch, 64)
print(fc.weight.shape) # (64, 128) — transposed internally
print(fc.bias.shape) # (64,)
# nn.Conv2d
conv = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3,
    stride=1,
    padding=1, # "same" padding preserves H, W
)
x_img = torch.randn(8, 3, 32, 32) # (batch, C, H, W)
print(conv(x_img).shape) # (8, 32, 32, 32)
# nn.BatchNorm2d
bn = nn.BatchNorm2d(32) # num_features = channels
# In train mode: normalises over (N, H, W) per channel
# In eval mode: uses running mean/var from training
# nn.Embedding
emb = nn.Embedding(num_embeddings=10000, embedding_dim=128)
tokens = torch.tensor([5, 23, 100]) # integer token ids
print(emb(tokens).shape) # (3, 128)
# nn.Dropout — active only in train mode
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)
print(drop(x)) # train mode: ~half zeros, survivors scaled to 2.0 (= 1/(1-p)); all ones after drop.eval()
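The (8, 32, 32, 32) output shape in the Conv2d example follows from the standard output-size formula, floor((size + 2·padding − dilation·(kernel_size − 1) − 1) / stride) + 1, applied to each spatial dimension. A minimal sketch, with `conv_output_size` as an illustrative helper name:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


print(conv_output_size(32, kernel_size=3, stride=1, padding=1))  # 32 (size preserved)
print(conv_output_size(32, kernel_size=3, stride=2, padding=1))  # 16 (halved)
print(conv_output_size(32, kernel_size=5))                       # 28 (no padding)
```

With kernel_size=3, stride=1, and padding=1 the size is unchanged, which is why that combination is often described as "same" padding.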
12. What are activation functions in PyTorch and how do you apply them?
Activation functions introduce non-linearity into neural networks, enabling them to learn complex mappings. PyTorch provides them both as nn.Module classes (usable as layers) and as functional forms in torch.nn.functional.
| Function | Formula | Typical use |
|---|---|---|
| ReLU | max(0, x) | Default for hidden layers (fast, avoids vanishing grad) |
| LeakyReLU | max(αx, x), α≈0.01 | When dying ReLU is a problem |
| Sigmoid | 1/(1+e^−x) → (0,1) | Binary classification output |
| Tanh | (e^x−e^−x)/(e^x+e^−x) → (−1,1) | RNNs, zero-centred alternative to sigmoid |
| Softmax | e^xᵢ/Σe^xⱼ → sums to 1 | Multi-class probabilities at inference (for training, pair LogSoftmax with NLLLoss or pass raw logits to CrossEntropyLoss) |
| GELU | x·Φ(x) smooth | Transformers (BERT, GPT) |
| SiLU/Swish | x·sigmoid(x) | Modern architectures (EfficientNet) |
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2., -1., 0., 1., 2.])
# ── As nn.Module (use inside nn.Sequential or __init__)
relu = nn.ReLU()
print(relu(x)) # [0, 0, 0, 1, 2]
sigmoid = nn.Sigmoid()
print(sigmoid(x)) # [0.12, 0.27, 0.50, 0.73, 0.88]
# ── As functional (use inside forward())
print(F.relu(x)) # same as nn.ReLU()(x)
print(F.gelu(x)) # x·Φ(x): a smooth alternative to ReLU
# Softmax: dim must be specified!
logits = torch.randn(4, 10) # (batch=4, classes=10)
probs = F.softmax(logits, dim=1) # dim=1 (classes)
print(probs.sum(dim=1)) # tensor([1., 1., 1., 1.])
# !! Never apply Softmax before CrossEntropyLoss !!
# CrossEntropyLoss = LogSoftmax + NLLLoss internally
# Applying softmax first → double-softmax = wrong!
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, torch.randint(0, 10, (4,))) # pass raw logits!
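The double-softmax warning is easier to see with the arithmetic written out. The sketch below reimplements log-softmax and per-sample cross-entropy in pure Python (it is not PyTorch's implementation, and the logits are made-up values), then compares the loss on raw logits with the loss after an extra softmax.

```python
import math


def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]


def cross_entropy(logits, target):
    """Negative log-probability of the target class for one sample."""
    return -log_softmax(logits)[target]


logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for 3 classes
target = 0

correct = cross_entropy(logits, target)          # raw logits in
double = cross_entropy(softmax(logits), target)  # softmax applied twice (the bug)
print(round(correct, 3), round(double, 3))
```

Because probabilities all lie in [0, 1], the second softmax flattens the distribution; in this example the loss for the already-correct class comes out larger than it should, and the gradients are correspondingly distorted.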
13. What are the most important loss functions in PyTorch and when do you use each?
Choosing the right loss function is critical — it defines what the model is optimising for. PyTorch provides loss functions in torch.nn as modules and in torch.nn.functional as functions.
| Loss | Use case | Input / Target |
|---|---|---|
| nn.MSELoss | Regression — minimise squared error | pred: float, target: float |
| nn.L1Loss (MAE) | Regression — robust to outliers | pred: float, target: float |
| nn.CrossEntropyLoss | Multi-class classification | pred: (N,C) logits, target: (N,) long |
| nn.BCEWithLogitsLoss | Binary classification (numerically stable) | pred: (N,) logits, target: (N,) float 0/1 |
| nn.NLLLoss | Used with log-softmax output | pred: (N,C) log-probs, target: (N,) long |
| nn.KLDivLoss | Distribution divergence (VAE, distillation) | pred: log-probs, target: probs |
| nn.HuberLoss | Regression robust to outliers | pred: float, target: float |
import torch, torch.nn as nn
batch = 8
# ── Regression
pred = torch.randn(batch, 1)
target = torch.randn(batch, 1)
mse = nn.MSELoss()(pred, target)
mae = nn.L1Loss()(pred, target)
print(mse, mae)
# ── Multi-class classification
logits = torch.randn(batch, 10) # raw scores, NOT softmax
labels = torch.randint(0, 10, (batch,)) # class indices, dtype=long
ce_loss = nn.CrossEntropyLoss()(logits, labels)
print(ce_loss)
# Class-weighted cross entropy (handle imbalance)
weights = torch.tensor([1.0]*9 + [5.0]) # upweight class 9
ce_weighted = nn.CrossEntropyLoss(weight=weights)(logits, labels)
# ── Binary classification (single output neuron)
bin_logits = torch.randn(batch) # single score
bin_labels = torch.randint(0, 2, (batch,)).float() # 0 or 1, float!
bce_loss = nn.BCEWithLogitsLoss()(bin_logits, bin_labels)
# BCEWithLogitsLoss = sigmoid + BCE in one numerically stable op
# ── Label smoothing (reduces overconfidence)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, labels)
# reduction parameter
nn.MSELoss(reduction="mean") # default: mean over batch
nn.MSELoss(reduction="sum") # sum over batch
nn.MSELoss(reduction="none") # per-sample loss (no reduction)
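Why BCEWithLogitsLoss is described as numerically stable can be illustrated in pure Python. The sketch below uses a commonly cited stable formulation, max(x, 0) − x·y + log(1 + e^(−|x|)); PyTorch's internal implementation may differ in detail, and both function names are illustrative.

```python
import math


def bce_with_logits(x, y):
    """Stable binary cross-entropy on a raw logit x with target y in {0, 1}."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


def bce_naive(x, y):
    """Sigmoid followed by BCE: mathematically equal, numerically fragile."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# For moderate logits the two forms agree.
print(abs(bce_with_logits(0.5, 1.0) - bce_naive(0.5, 1.0)) < 1e-9)  # True

# For a confident wrong prediction, sigmoid saturates to exactly 1.0
# and the naive form ends up taking log(0).
print(bce_with_logits(100.0, 0.0))  # 100.0
try:
    bce_naive(100.0, 0.0)
except ValueError as err:
    print("naive form failed:", err)
```

The fused form never computes the saturated probability, so the loss stays finite and roughly linear in the logit.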
14. What optimizers does PyTorch provide and how do you configure them?
Optimizers update model parameters based on computed gradients. PyTorch provides all major optimizers in torch.optim. Choosing and configuring the right optimizer significantly affects training speed and final performance.
| Optimizer | Key feature | Typical use |
|---|---|---|
| SGD | Simple, supports momentum and weight decay | Computer vision with lr scheduling |
| Adam | Adaptive lr per param; momentum + RMSProp | Default for NLP, general purpose |
| AdamW | Adam with decoupled weight decay | Transformers, fine-tuning (recommended over Adam) |
| RMSprop | Adaptive lr without momentum | RNNs |
| Adagrad | Accumulates squared gradients; rare today | Sparse features |
| LBFGS | Second-order quasi-Newton; very slow | Small networks, physics-informed NNs |
import torch, torch.nn as nn, torch.optim as optim
model = nn.Linear(128, 10)
# ── SGD with momentum and weight decay
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9, # momentum coefficient
    weight_decay=1e-4, # L2 regularisation
    nesterov=True, # use Nesterov momentum instead of classical momentum
)
# ── Adam
opt_adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999), # (β1, β2): decay rates for the first- and second-moment estimates
    eps=1e-8,
    weight_decay=0, # Adam folds weight decay into the gradient as L2; prefer AdamW when decay is needed
)
# ── AdamW (preferred for transformers)
opt_adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01, # decoupled from gradient update
)
# ── Per-layer learning rates
opt_layerwise = optim.Adam([
    {"params": [model.weight], "lr": 1e-4}, # slower for weight
    {"params": [model.bias], "lr": 1e-3}, # faster for bias
])
# ── Standard training step
opt = optim.AdamW(model.parameters(), lr=1e-3)
for x, y in [(torch.randn(32, 128), torch.randint(0, 10, (32,)))]:
    opt.zero_grad() # 1. clear old gradients
    loss = nn.CrossEntropyLoss()(model(x), y) # 2. forward
    loss.backward() # 3. backward
    opt.step() # 4. update parameters
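To make the update step less of a black box, here is a pure-Python sketch of SGD with classical momentum on a single scalar parameter. It follows the commonly documented rule (buffer = momentum · buffer + grad, then param −= lr · buffer) without dampening or Nesterov; `sgd_momentum_step` and the toy objective f(p) = p² are assumptions for illustration.

```python
def sgd_momentum_step(param, grad, buf, lr=0.1, momentum=0.9, weight_decay=0.0):
    """One SGD-with-momentum update for a scalar parameter."""
    grad = grad + weight_decay * param            # L2 penalty folded into the gradient
    buf = grad if buf is None else momentum * buf + grad
    return param - lr * buf, buf


p, buf = 1.0, None
for _ in range(3):
    grad = 2 * p  # gradient of the toy objective f(p) = p^2
    p, buf = sgd_momentum_step(p, grad, buf)
    print(round(p, 4))
# 0.8
# 0.46
# 0.062
```

The steps grow from 0.2 to 0.34 to 0.398 even though the gradient shrinks, which is the acceleration effect of the momentum buffer.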
15. What are learning rate schedulers in PyTorch and how do you use them?
A learning rate scheduler adjusts the learning rate during training — typically starting high for fast initial progress and decaying for fine-grained convergence. Schedulers wrap an optimizer and must be stepped after each epoch (or each batch for some schedulers).
| Scheduler | Behaviour | Step |
|---|---|---|
| StepLR | Multiply lr by gamma every step_size epochs | Per epoch |
| MultiStepLR | Decay at specified milestone epochs | Per epoch |
| ExponentialLR | lr *= gamma every epoch | Per epoch |
| CosineAnnealingLR | Cosine decay from lr to eta_min | Per epoch |
| OneCycleLR | Warmup then cosine decay (superconvergence) | Per batch |
| ReduceLROnPlateau | Reduce lr when metric stops improving | Per epoch (with metric) |
| CosineAnnealingWarmRestarts | Cosine with periodic restarts | Per epoch |
import torch, torch.optim as optim
import torch.nn as nn
model = nn.Linear(128, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Pick ONE scheduler per optimizer; the alternatives are shown commented out.
# ── StepLR: multiply lr by 0.1 every 30 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# ── CosineAnnealingLR: smooth cosine decay
# scheduler = optim.lr_scheduler.CosineAnnealingLR(
#     optimizer, T_max=100, eta_min=1e-6
# )
# ── OneCycleLR: requires total_steps at init; step it once per batch
# scheduler = optim.lr_scheduler.OneCycleLR(
#     optimizer,
#     max_lr=0.1,
#     total_steps=100 * 1000, # epochs * batches_per_epoch
# )
# ── ReduceLROnPlateau: triggered by validation loss
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(
#     optimizer, mode="min", factor=0.5, patience=5
# )
# ── Training loop integration
for epoch in range(100):
    train_loss = 0.0 # ... train (calls optimizer.step() per batch) ...
    val_loss = 0.0 # ... validate ...
    # Most schedulers: step once per epoch, after the optimizer updates
    scheduler.step() # for StepLR, CosineAnnealingLR etc.
    # scheduler.step(val_loss) # for ReduceLROnPlateau (needs metric)
    # Check current lr
    current_lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: lr={current_lr:.6f}")
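The schedules in the table are simple closed-form functions of the epoch. The sketch below writes out the step-decay and cosine-annealing formulas in pure Python; the function names are illustrative, and the cosine form is the standard closed-form expression for epochs up to T_max rather than a copy of PyTorch's implementation.

```python
import math


def step_lr(base_lr, epoch, step_size, gamma):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Cosine decay from base_lr (epoch 0) down to eta_min (epoch t_max)."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


for epoch in (0, 29, 30, 60):
    print(epoch, f"{step_lr(0.1, epoch, step_size=30, gamma=0.1):.4f}")
# 0 0.1000
# 29 0.1000
# 30 0.0100
# 60 0.0010

for epoch in (0, 50, 100):
    print(epoch, f"{cosine_annealing_lr(0.1, epoch, t_max=100):.4f}")
# 0 0.1000
# 50 0.0500
# 100 0.0000
```

Step decay changes the rate abruptly at the milestones, while the cosine schedule passes through the midpoint value halfway to T_max.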
16. What are the most common built-in layers in torch.nn and what do they do?
PyTorch's torch.nn module provides all the standard building blocks for neural networks. Understanding what each layer does mathematically and when to use it is fundamental to building effective models.
| Layer | Formula / behaviour | Typical use |
|---|---|---|
| nn.Linear(in, out) | y = xW^T + b | Fully connected / dense layer |
| nn.Conv2d(in, out, k) | 2D convolution with kernel k×k | Image feature extraction |
| nn.BatchNorm1d/2d | Normalise per feature/channel over batch | After linear/conv, before activation |
| nn.LayerNorm | Normalise over feature dim per sample | Transformers, NLP |
| nn.Dropout(p) | Zeros random fraction p during train | Regularisation |
| nn.Embedding(V,d) | Lookup table V vocab × d dim | Word/token embeddings |
| nn.ReLU/GELU/Tanh | Element-wise activations | After linear/conv layers |
| nn.Softmax(dim) | exp(x)/Σexp(x) along dim | Output probabilities (use LogSoftmax+NLLLoss or CrossEntropyLoss directly) |
| nn.MaxPool2d | Takes max over kernel window | Spatial downsampling in CNNs |
| nn.LSTM/GRU | Gated recurrent cells | Sequence modelling |
import torch, torch.nn as nn
# Linear layer internals
fc = nn.Linear(4, 8)
print(fc.weight.shape) # (8, 4) — note: output × input
print(fc.bias.shape) # (8,)
# Embedding
emb = nn.Embedding(num_embeddings=10000, embedding_dim=128,
                   padding_idx=0) # index 0 gets a zero vector
tokens = torch.tensor([1, 42, 7]) # shape (3,)
out = emb(tokens) # shape (3, 128)
# BatchNorm vs LayerNorm
bn = nn.BatchNorm1d(64) # input (N, 64) — normalises across N
ln = nn.LayerNorm(64) # input (N, 64) — normalises across 64 features
x = torch.randn(16, 64)
print(bn(x).shape) # (16, 64)
print(ln(x).shape) # (16, 64)
# Dropout only active during training
drop = nn.Dropout(p=0.5)
model = nn.Sequential(nn.Linear(32,32), drop, nn.ReLU())
model.train(); x_tr = model(torch.randn(4,32)) # dropout active: ~50% of activations zeroed, the rest scaled by 2
model.eval(); x_ev = model(torch.randn(4,32)) # dropout disabled: acts as the identity
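The train/eval difference in Dropout comes from "inverted dropout": surviving activations are scaled by 1/(1 − p) during training so that the expected value matches eval mode. A pure-Python sketch, where the `dropout` function and its `rng` argument are illustrative rather than PyTorch's API:

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of floats (sketch; requires 0 <= p < 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("this sketch requires 0 <= p < 1")
    if not training or p == 0.0:
        return list(values)  # eval mode: identity
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)  # seeded for repeatability
x = [1.0] * 10000
y = dropout(x, p=0.5, rng=rng)

print(set(y) <= {0.0, 2.0})            # True: each value is dropped or doubled
print(abs(sum(y) / len(y) - 1.0) < 0.1)  # True: the mean stays close to 1.0
print(dropout(x[:3], training=False))  # [1.0, 1.0, 1.0]
```

Scaling at training time means no correction is needed at inference, which is why eval mode can simply pass values through.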
17. How do you initialise weights in a PyTorch model?
PyTorch uses sensible default initialisations (Kaiming uniform for Linear and Conv layers), but custom initialisation is often needed to match a paper or improve convergence. The torch.nn.init module provides all standard schemes.
import torch, torch.nn as nn
# Default initialisation:
# nn.Linear → Kaiming uniform (He init) for weight, uniform for bias
# nn.Conv2d → Kaiming uniform
# nn.Embedding → Normal(0, 1)
# Custom initialisation using apply()
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight) # Xavier/Glorot
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight,
                                mode="fan_out",
                                nonlinearity="relu") # He init
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02) # GPT-style
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10)
)
model.apply(init_weights) # recursively applies to all sub-modules
# Direct initialisation with torch.no_grad()
with torch.no_grad():
    model[0].weight.fill_(0.01)
    model[0].bias.zero_()
| Scheme | API | Best for |
|---|---|---|
| Xavier / Glorot uniform | nn.init.xavier_uniform_() | Sigmoid / Tanh activations |
| Xavier / Glorot normal | nn.init.xavier_normal_() | Sigmoid / Tanh activations |
| Kaiming / He uniform | nn.init.kaiming_uniform_() | ReLU (PyTorch default) |
| Kaiming / He normal | nn.init.kaiming_normal_() | ReLU (often better than uniform) |
| Normal | nn.init.normal_(mean, std) | Embeddings (std=0.02 GPT-style) |
| Zeros / Ones | nn.init.zeros_() / ones_() | Biases, gates |
| Orthogonal | nn.init.orthogonal_() | RNNs |
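The schemes in the table differ mainly in how they scale the random values by layer width. This pure-Python sketch computes the Xavier uniform bound, gain · sqrt(6 / (fan_in + fan_out)), and the Kaiming normal standard deviation, gain / sqrt(fan) with gain = sqrt(2) for ReLU. The helper names are illustrative, and the 128 → 64 layer size is taken from the example model above.

```python
import math


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Weights are drawn from U(-bound, +bound)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def kaiming_normal_std(fan, gain=math.sqrt(2.0)):
    """Weights are drawn from N(0, std^2); gain sqrt(2) corresponds to ReLU."""
    return gain / math.sqrt(fan)


# Linear(128, 64): fan_in = 128, fan_out = 64
print(round(xavier_uniform_bound(128, 64), 4))  # 0.1768
print(round(kaiming_normal_std(128), 4))        # 0.125
```

Wider layers get smaller initial weights in both schemes, which keeps activation variance roughly stable from layer to layer.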
18. What loss functions does PyTorch provide and when do you use each?
Loss functions (criteria) measure the difference between predictions and targets. PyTorch provides them in torch.nn. Choosing the right one for your task is critical — using the wrong loss gives poor training signal even if the architecture is correct.
| Loss | Class | Task | Target dtype |
|---|---|---|---|
| Cross-entropy | nn.CrossEntropyLoss | Multi-class classification | Long (class indices) |
| Binary cross-entropy + logits | nn.BCEWithLogitsLoss | Binary / multi-label | Float |
| MSE | nn.MSELoss | Regression | Float |
| MAE / L1 | nn.L1Loss | Robust regression | Float |
| Huber / Smooth L1 | nn.HuberLoss / nn.SmoothL1Loss | Robust regression | Float |
| NLL Loss | nn.NLLLoss | After log-softmax | Long |
| KL Divergence | nn.KLDivLoss | Distribution matching | Float |
| Triplet Margin | nn.TripletMarginLoss | Metric learning | Float |
import torch, torch.nn as nn
# Multi-class classification: CrossEntropyLoss
# Input: (N, C) logits — raw, before softmax
# Target: (N,) class indices — dtype=long
ce = nn.CrossEntropyLoss()
logits = torch.randn(4, 10) # 4 samples, 10 classes
targets = torch.tensor([2, 5, 0, 9]) # true class indices
loss = ce(logits, targets)
# Binary classification: BCEWithLogitsLoss
# Numerically stable (fuses sigmoid + BCE)
bce = nn.BCEWithLogitsLoss()
preds = torch.randn(4) # logits, NOT sigmoid output
true = torch.tensor([1.,0.,1.,0.])
loss_b = bce(preds, true)
# Class weighting for imbalanced datasets
weights = torch.tensor([1.0]*9 + [10.0]) # class 9 is rare
ce_w = nn.CrossEntropyLoss(weight=weights)
# Label smoothing (reduces overconfidence)
ce_ls = nn.CrossEntropyLoss(label_smoothing=0.1)
# Regression: MSE vs Huber
mse_loss = nn.MSELoss()
huber_loss = nn.HuberLoss(delta=1.0) # quadratic for |error| <= delta, linear beyond
pred_r = torch.randn(4)
true_r = torch.randn(4)
print(mse_loss(pred_r, true_r))
print(huber_loss(pred_r, true_r))
Critical gotcha: nn.CrossEntropyLoss expects raw logits (before softmax), not probabilities. It applies log-softmax internally, so applying softmax first means softmax is applied twice: the code still runs, but the scores are squashed into a narrow range, gradients shrink, and training degrades.
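The gotcha can be checked numerically. The sketch below uses arbitrary illustrative logits and compares three values: nn.CrossEntropyLoss on raw logits, the manual log-softmax + NLLLoss form, and the mistaken softmax-then-CrossEntropyLoss form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10) * 3      # raw scores (illustrative values)
targets = torch.tensor([2, 5, 0, 9])

ce = nn.CrossEntropyLoss()
loss_correct = ce(logits, targets)

# Equivalent two-step form: log-softmax, then negative log-likelihood
loss_manual = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# The mistake: softmax applied before the loss, so softmax runs twice
loss_double = ce(F.softmax(logits, dim=1), targets)

print(loss_correct.item(), loss_manual.item(), loss_double.item())
```

The first two values agree. The third is a different number, because probabilities in [0, 1] fed through another softmax become nearly uniform, which pins the loss close to log(number of classes) regardless of how confident the model is.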
19. What optimizers does PyTorch provide and how do you choose between them?
An optimizer updates model parameters based on computed gradients. PyTorch provides all major optimizers in torch.optim. Choosing the right optimizer and tuning its hyperparameters has a large impact on training speed and final performance.
| Optimizer | Class | Key parameters | Best for |
|---|---|---|---|
| SGD | optim.SGD | lr, momentum, weight_decay, nesterov | Image classification (with momentum); can generalise better than Adam |
| SGD + Momentum | optim.SGD(momentum=0.9) | momentum=0.9 standard | Most vision tasks |
| Adam | optim.Adam | lr=1e-3, betas=(0.9,0.999), eps=1e-8 | Default choice; fast convergence |
| AdamW | optim.AdamW | lr, weight_decay (decoupled) | Fine-tuning transformers; weight decay is applied directly to the weights rather than through the gradient |
| RMSprop | optim.RMSprop | lr, alpha=0.99 | RNNs |
| Adagrad | optim.Adagrad | lr | Sparse features, NLP |
import torch, torch.nn as nn, torch.optim as optim
model = nn.Linear(10, 1)
# SGD with momentum (common for vision)
sgd = optim.SGD(
model.parameters(),
lr=0.01,
momentum=0.9,
weight_decay=1e-4, # L2 regularisation
nesterov=True,
)
# Adam (default for most tasks)
adam = optim.Adam(
model.parameters(),
lr=1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0, # NOTE: in Adam, weight_decay is added to the gradient (coupled L2), so the adaptive step rescales it
)
# AdamW: weight decay decoupled from the adaptive gradient update
adamw = optim.AdamW(
model.parameters(),
lr=1e-3,
weight_decay=0.01, # decoupled from gradient update
)
# Per-layer learning rates (useful for fine-tuning)
optimizer = optim.AdamW([
{"params": model.weight, "lr": 1e-4}, # lower lr for pretrained
{"params": model.bias, "lr": 1e-3}, # higher lr for new head
], weight_decay=0.01)
# Standard training step
optimizer.zero_grad()
loss = nn.MSELoss()(model(torch.randn(8,10)), torch.randn(8,1))
loss.backward()
optimizer.step()
Adam vs AdamW: In standard Adam, weight_decay is added to the gradient as an L2 penalty, so it is rescaled by the adaptive per-parameter step size and parameters with a large gradient history are regularised less. AdamW instead shrinks the parameters directly, separately from the gradient-based update. This decoupled weight decay is not equivalent to L2 regularisation under Adam, and it is now the standard choice for transformer fine-tuning.
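A minimal sketch of the difference, assuming a single scalar parameter and a deliberately all-zero gradient so that any change in the parameter comes from weight decay alone. The learning rate and decay values are illustrative, not recommendations.

```python
import torch
import torch.optim as optim

def one_step(opt_cls):
    # Scalar parameter with a zero gradient: only weight decay can move it.
    p = torch.nn.Parameter(torch.tensor([1.0]))
    opt = opt_cls([p], lr=0.1, weight_decay=0.1)
    p.grad = torch.zeros_like(p)
    opt.step()
    return p.item()

p_adam = one_step(optim.Adam)    # decay term enters the gradient, then gets normalised
p_adamw = one_step(optim.AdamW)  # parameter is multiplied by (1 - lr * weight_decay)

print(f"Adam:  {p_adam:.4f}")
print(f"AdamW: {p_adamw:.4f}")
```

Under Adam the decay term is divided by its own running magnitude, so the parameter moves by roughly lr whatever weight_decay is. Under AdamW it shrinks by the factor 1 - lr * weight_decay, which is the behaviour the hyperparameter name suggests.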
20. What are learning rate schedulers in PyTorch and how do you use them?
A learning rate (LR) scheduler adjusts the learning rate during training. Starting with a high LR enables fast early progress; decaying it later allows finer convergence. PyTorch provides many schedulers in torch.optim.lr_scheduler.
| Scheduler | Behaviour | Use case |
|---|---|---|
| StepLR | Multiply lr by gamma every step_size epochs | Simple decay; quick experiments |
| MultiStepLR | Decay at specific epoch milestones | ResNet training schedules |
| CosineAnnealingLR | Cosine curve from lr to eta_min | Most modern training runs |
| OneCycleLR | Warmup to max_lr then cosine decay | Super-convergence; fast training |
| ReduceLROnPlateau | Reduce lr when metric stops improving | When a good decay schedule is not known in advance |
| LinearLR | Linear warm-up | Transformer fine-tuning |
| CosineAnnealingWarmRestarts | Cosine + periodic restarts (SGDR) | Ensemble-style training |
import torch, torch.nn as nn, torch.optim as optim
from torch.optim import lr_scheduler
model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
# CosineAnnealingLR — most popular modern choice
scheduler_cos = lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100, eta_min=1e-6
)
# OneCycleLR — great for fast training
scheduler_1c = lr_scheduler.OneCycleLR(
optimizer,
max_lr=1e-2,
steps_per_epoch=100, # batches per epoch
epochs=10,
)
# ReduceLROnPlateau — metric-driven
scheduler_plat = lr_scheduler.ReduceLROnPlateau(
optimizer, mode="min", factor=0.5, patience=5 # the verbose argument is deprecated in recent PyTorch releases
)
# Standard usage in training loop
# (train_one_epoch and validate are placeholder helpers; in real code attach a
# single scheduler to an optimizer, several are stepped here only for illustration)
for epoch in range(100):
    train_one_epoch(model, optimizer) # forward + backward + optimizer.step()
    # --- Epoch-based schedulers ---
    scheduler_cos.step() # call AFTER optimizer.step()
    # --- Metric-based scheduler ---
    val_loss = validate(model)
    scheduler_plat.step(val_loss)
    # --- OneCycleLR is per-batch ---
    # for batch in dataloader:
    #     optimizer.step()
    #     scheduler_1c.step()
    print(f"lr: {optimizer.param_groups[0]['lr']:.6f}")
Key rule: call scheduler.step() after optimizer.step(). For OneCycleLR and other per-batch schedulers, call scheduler.step() inside the batch loop, not the epoch loop.
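To make the cosine schedule concrete, the sketch below compares CosineAnnealingLR with the closed-form expression eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max)). The helper name cosine_lr and the numeric settings are illustrative assumptions.

```python
import math
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

def cosine_lr(t, base_lr, eta_min, t_max):
    """Closed-form cosine annealing value after t scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.001)

observed = [scheduler.get_last_lr()[0]]
for _ in range(10):
    optimizer.step()      # no gradients here, so nothing changes; keeps the call order right
    scheduler.step()      # scheduler steps AFTER the optimizer
    observed.append(scheduler.get_last_lr()[0])

expected = [cosine_lr(t, 0.1, 0.001, 10) for t in range(11)]
for t, (o, e) in enumerate(zip(observed, expected)):
    print(f"step {t:2d}: scheduler={o:.5f} formula={e:.5f}")
```

The learning rate starts at the optimizer's initial value, passes the midpoint halfway through, and reaches eta_min after T_max steps.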
21. What activation functions are commonly used in PyTorch and how do you choose between them?
Activation functions introduce non-linearity, allowing networks to model complex functions. PyTorch provides them as both nn.Module classes (for use in nn.Sequential) and functional calls in torch.nn.functional.
| Activation | nn class | Range | Typical use |
|---|---|---|---|
| ReLU | nn.ReLU() | [0, ∞) | Default for hidden layers — fast, avoids vanishing gradient for x>0 |
| LeakyReLU | nn.LeakyReLU(0.01) | (-∞, ∞) | Fixes ReLU's dying neuron problem |
| Sigmoid | nn.Sigmoid() | (0, 1) | Binary classification output layer |
| Tanh | nn.Tanh() | (-1, 1) | RNN hidden states (zero-centred) |
| Softmax | nn.Softmax(dim=-1) | (0,1), sums to 1 | Multi-class probabilities at inference; for training, pass raw logits to CrossEntropyLoss (NLLLoss needs LogSoftmax, not Softmax) |
| GELU | nn.GELU() | (-∞, ∞) | Transformers (BERT, GPT) |
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
# Module form — for use inside nn.Sequential / __init__
relu = nn.ReLU()
print(relu(x)) # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
# Functional form — for use directly inside forward()
print(F.relu(x))
print(F.leaky_relu(x, negative_slope=0.01))
print(F.gelu(x))
# Using inside a model
class Net(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(10, 20)
self.fc2 = nn.Linear(20, 1)
def forward(self, x):
x = F.relu(self.fc1(x)) # functional — common in forward()
return torch.sigmoid(self.fc2(x)) # binary output
# IMPORTANT: never apply softmax before CrossEntropyLoss
# CrossEntropyLoss = LogSoftmax + NLLLoss internally
logits = torch.randn(4, 10) # raw scores, NOT softmaxed
loss_fn = nn.CrossEntropyLoss()
targets = torch.randint(0, 10, (4,))
loss = loss_fn(logits, targets) # correct: pass raw logits!
Common mistake: applying Softmax before CrossEntropyLoss. The loss already applies LogSoftmax internally, so the softmax ends up applied twice; the code still runs, but the gradients are flattened and training degrades.
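The activations in the table are simple element-wise formulas. As a sketch, the hand-written versions below (variable names are illustrative) can be compared against the built-in functional calls.

```python
import math
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# LeakyReLU: identity for x >= 0, small slope for x < 0
leaky_manual = torch.where(x >= 0, x, 0.01 * x)

# GELU (exact form): x * Phi(x), where Phi is the standard normal CDF
gelu_manual = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Sigmoid: 1 / (1 + exp(-x))
sigmoid_manual = 1.0 / (1.0 + torch.exp(-x))

print(torch.allclose(leaky_manual, F.leaky_relu(x, negative_slope=0.01)))
print(torch.allclose(gelu_manual, F.gelu(x)))
print(torch.allclose(sigmoid_manual, torch.sigmoid(x)))
```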
22. What loss functions does PyTorch provide and how do you choose the right one?
The loss function defines the training objective. PyTorch's torch.nn module provides loss classes for classification, regression, and more specialised tasks. Choosing the wrong loss for your task is one of the most common beginner mistakes.
| Loss | Class | Input shape | Use case |
|---|---|---|---|
| MSELoss | nn.MSELoss() | pred & target same shape | Regression |
| L1Loss | nn.L1Loss() | pred & target same shape | Regression, robust to outliers |
| CrossEntropyLoss | nn.CrossEntropyLoss() | logits (N,C), target (N,) int64 | Multi-class classification |
| BCELoss | nn.BCELoss() | probabilities (N,), target (N,) float | Binary classification (after sigmoid) |
| BCEWithLogitsLoss | nn.BCEWithLogitsLoss() | raw logits (N,), target (N,) float | Binary classification (numerically stable) |
| NLLLoss | nn.NLLLoss() | log-probabilities (N,C) | Used after LogSoftmax manually |
import torch
import torch.nn as nn
# ── Regression: MSE
mse = nn.MSELoss()
pred = torch.tensor([2.5, 3.0, 4.1])
target = torch.tensor([3.0, 3.0, 4.0])
loss = mse(pred, target) # mean((pred-target)^2)
# ── Multi-class classification: CrossEntropyLoss
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5) # batch=8, 5 classes — RAW logits
targets = torch.randint(0, 5, (8,)) # class indices, dtype long
loss = ce(logits, targets)
# ── Binary classification: BCEWithLogitsLoss (preferred over BCELoss)
bce = nn.BCEWithLogitsLoss() # combines Sigmoid + BCE, numerically stable
logits_binary = torch.randn(8, 1)
targets_binary = torch.randint(0, 2, (8, 1)).float()
loss = bce(logits_binary, targets_binary)
# ── Class-weighted CrossEntropy for imbalanced data
class_weights = torch.tensor([1.0, 1.0, 5.0, 1.0, 1.0]) # upweight class 2
ce_weighted = nn.CrossEntropyLoss(weight=class_weights)
# ── Custom loss function
class FocalLoss(nn.Module):
def __init__(self, gamma=2.0):
super().__init__()
self.gamma = gamma
self.ce = nn.CrossEntropyLoss(reduction="none")
def forward(self, logits, targets):
ce_loss = self.ce(logits, targets)
pt = torch.exp(-ce_loss)
focal = ((1 - pt) ** self.gamma * ce_loss).mean()
return focal
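A quick sanity check for a custom loss like this one is to confirm that it reduces to a known loss in a limiting case. The sketch below repeats the FocalLoss definition so it runs on its own, then checks that gamma=0 reproduces plain cross-entropy; the random inputs are illustrative.

```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma
        self.ce = nn.CrossEntropyLoss(reduction="none")

    def forward(self, logits, targets):
        ce_loss = self.ce(logits, targets)   # per-sample -log(p_t)
        pt = torch.exp(-ce_loss)             # probability of the true class
        return ((1 - pt) ** self.gamma * ce_loss).mean()

torch.manual_seed(0)
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))

plain_ce = nn.CrossEntropyLoss()(logits, targets)
focal_g0 = FocalLoss(gamma=0.0)(logits, targets)   # (1 - pt)^0 == 1, so this is plain CE
focal_g2 = FocalLoss(gamma=2.0)(logits, targets)   # well-classified examples are down-weighted

print(plain_ce.item(), focal_g0.item(), focal_g2.item())
```

Because the modulating factor (1 - pt)^gamma never exceeds 1, the focal value with gamma > 0 is never larger than the cross-entropy value on the same batch.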
23. What optimizers does PyTorch provide and what is the difference between SGD, Adam, and AdamW?
Optimizers update model parameters based on computed gradients. PyTorch's torch.optim module provides many algorithms; understanding their differences helps you choose the right one and tune hyperparameters effectively.
| Optimizer | Key idea | Typical lr | Best for |
|---|---|---|---|
| SGD | Plain gradient descent, optional momentum | 0.01–0.1 | Image classification (with momentum + schedule) |
| SGD + momentum | Accumulates velocity to smooth updates | 0.01–0.1 | Often best final generalisation |
| Adam | Adaptive per-parameter learning rates + momentum | 1e-3 | Fast convergence, good default |
| AdamW | Adam with decoupled weight decay | 1e-3 to 5e-5 | Fine-tuning transformers, modern default |
| RMSprop | Adaptive lr based on recent gradient magnitude | 1e-3 | RNNs (historically popular) |
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(10, 1)
# ── SGD with momentum
opt_sgd = optim.SGD(
model.parameters(),
lr=0.01,
momentum=0.9, # accelerates in consistent gradient directions
weight_decay=1e-4, # L2 regularisation
)
# ── Adam — adaptive learning rate per parameter
opt_adam = optim.Adam(
model.parameters(),
lr=1e-3,
betas=(0.9, 0.999), # momentum decay rates
eps=1e-8,
)
# ── AdamW — decoupled weight decay (recommended for fine-tuning)
opt_adamw = optim.AdamW(
model.parameters(),
lr=2e-5, # typical for fine-tuning pretrained models
weight_decay=0.01,
)
# ── Standard training step
x, y = torch.randn(16, 10), torch.randn(16, 1)
loss_fn = nn.MSELoss()
opt_adamw.zero_grad() # 1. clear old gradients
pred = model(x) # 2. forward pass
loss = loss_fn(pred, y) # 3. compute loss
loss.backward() # 4. backpropagate
opt_adamw.step() # 5. update parameters
# ── Different learning rates per parameter group
optimizer = optim.AdamW([
{"params": model.weight, "lr": 1e-3},
{"params": model.bias, "lr": 1e-4},
])
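The momentum row in the table can be made concrete by replaying the update rule by hand. This sketch assumes a constant gradient of 1 and default SGD settings (no dampening, no Nesterov), and tracks the rule buf = momentum * buf + grad, p = p - lr * buf next to optim.SGD.

```python
import torch
import torch.optim as optim

lr, momentum = 0.1, 0.9
p = torch.nn.Parameter(torch.tensor([1.0]))
opt = optim.SGD([p], lr=lr, momentum=momentum)

# Hand-rolled copy of the same rule
p_manual, buf = 1.0, 0.0
history = []
for _ in range(3):
    opt.zero_grad()
    p.grad = torch.ones_like(p)      # pretend the gradient is always 1
    opt.step()

    buf = momentum * buf + 1.0       # velocity accumulates past gradients
    p_manual -= lr * buf
    history.append((p.item(), p_manual))

for step, (torch_value, manual_value) in enumerate(history, start=1):
    print(f"step {step}: optim.SGD={torch_value:.4f} manual={manual_value:.4f}")
```

With a consistent gradient direction the effective step grows each iteration (1.0, 1.9, 2.71 times lr here), which is the acceleration that momentum provides.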
24. What is the standard PyTorch training loop and what does each step do?
The PyTorch training loop follows a fixed five-step pattern repeated for every batch. Understanding exactly what each line does — and what happens if you skip or reorder a step — is essential for debugging training issues.
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
def train_one_epoch(model, loader, optimizer, loss_fn, device):
model.train() # 0. enables Dropout, BatchNorm train mode
total_loss = 0.0
for X_batch, y_batch in loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
optimizer.zero_grad() # 1. clear gradients from previous step
logits = model(X_batch) # 2. forward pass
loss = loss_fn(logits, y_batch) # 3. compute loss
loss.backward() # 4. backpropagate — fills .grad
optimizer.step() # 5. update weights using gradients
total_loss += loss.item() * X_batch.size(0)
return total_loss / len(loader.dataset)
@torch.no_grad() # disable gradient tracking for eval
def validate(model, loader, loss_fn, device):
model.eval() # disables Dropout, BatchNorm uses running stats
total_loss, correct = 0.0, 0
for X_batch, y_batch in loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
logits = model(X_batch)
loss = loss_fn(logits, y_batch)
total_loss += loss.item() * X_batch.size(0)
correct += (logits.argmax(1) == y_batch).sum().item()
return total_loss / len(loader.dataset), correct / len(loader.dataset)
# Full training loop
for epoch in range(10):
train_loss = train_one_epoch(model, train_loader, optimizer, loss_fn, device)
val_loss, val_acc = validate(model, val_loader, loss_fn, device)
print(f"Epoch {epoch}: train_loss={train_loss:.4f} val_loss={val_loss:.4f} val_acc={val_acc:.4f}")
| Step | Call | Purpose |
|---|---|---|
| 0 | model.train() | Enable Dropout, set BatchNorm to use batch statistics |
| 1 | optimizer.zero_grad() | Clear accumulated gradients from the previous step |
| 2 | model(x) | Forward pass — compute predictions |
| 3 | loss_fn(pred, target) | Compute scalar loss |
| 4 | loss.backward() | Backpropagate — populate .grad on each parameter |
| 5 | optimizer.step() | Update parameters using gradients and the optimizer's rule |
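Step 1 exists because backward() adds to .grad rather than overwriting it. A small sketch with an illustrative toy model and random data shows what happens when zero_grad() is skipped.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
x, y = torch.randn(4, 3), torch.randn(4, 1)
loss_fn = nn.MSELoss()

loss_fn(model(x), y).backward()
grad_once = model.weight.grad.clone()

# Second backward WITHOUT zero_grad(): the new gradient is added to the old one
loss_fn(model(x), y).backward()
grad_twice = model.weight.grad.clone()

# Clearing first restores the single-step gradient
model.zero_grad()
loss_fn(model(x), y).backward()
grad_cleared = model.weight.grad.clone()

print(grad_once, grad_twice, grad_cleared, sep="\n")
```

The same accumulation behaviour is what makes deliberate gradient accumulation over several small batches possible; it only becomes a bug when it is unintended.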
25. What are Dataset and DataLoader in PyTorch and how do they work together?
PyTorch's data pipeline follows a clean two-class design: Dataset defines how to access a single sample (index → data), and DataLoader wraps a Dataset to handle batching, shuffling, and parallel loading.
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
class TabularDataset(Dataset):
def __init__(self, X: np.ndarray, y: np.ndarray):
# Convert once at construction — not inside __getitem__!
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.long)
def __len__(self) -> int:
"""Required — tells DataLoader how many samples exist."""
return len(self.X)
def __getitem__(self, idx: int):
"""Required — return a single (features, label) sample."""
return self.X[idx], self.y[idx]
# Synthetic data
X = np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 3, size=1000)
dataset = TabularDataset(X, y)
print(len(dataset)) # 1000
print(dataset[0]) # (tensor of 20 features, tensor scalar label)
loader = DataLoader(
dataset,
batch_size=32,
shuffle=True, # shuffle each epoch — essential for training
num_workers=4, # parallel data loading subprocesses
pin_memory=True, # faster CPU→GPU transfer
drop_last=True, # drop incomplete final batch
)
# Iterate over batches
for X_batch, y_batch in loader:
print(X_batch.shape, y_batch.shape) # torch.Size([32, 20]) torch.Size([32])
break
# torchvision pre-built datasets
from torchvision import datasets, transforms
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,)),
])
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
mnist_loader = DataLoader(mnist, batch_size=64, shuffle=True)
| Component | Responsibility | Required methods |
|---|---|---|
| Dataset | Defines how to access ONE sample by index | __len__, __getitem__ |
| DataLoader | Batches samples, shuffles, parallelises loading | Wraps any Dataset object |
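The effect of drop_last is easiest to see by counting batches. This sketch uses the built-in TensorDataset with synthetic tensors (sizes chosen to match the 1000-sample example) and keeps num_workers at its default of 0 so it runs anywhere.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(1000, 20)
y = torch.randint(0, 3, (1000,))
dataset = TensorDataset(X, y)   # built-in Dataset that indexes tensors along dim 0

keep = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=False)
drop = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=True)

print(len(keep), len(drop))            # ceil(1000 / 32) versus floor(1000 / 32)
last_batch_x, _ = list(keep)[-1]
print(last_batch_x.shape[0])           # size of the leftover final batch
```

A short final batch is harmless for most models, but it can destabilise BatchNorm statistics or break code that assumes a fixed batch size, which is when drop_last=True is worth setting.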
26. How do you move tensors and models between CPU and GPU in PyTorch?
PyTorch's device abstraction allows the same code to run on CPU or GPU with minimal changes. The fundamental rule: a model and its input tensors must reside on the same device before any computation, or PyTorch raises a RuntimeError.
import torch
import torch.nn as nn
# Device-agnostic pattern — always write code this way
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Move a model to the device
model = nn.Linear(10, 1).to(device)
# Move data to the same device, every batch, inside the loop
for X_batch, y_batch in loader:
X_batch = X_batch.to(device, non_blocking=True)
y_batch = y_batch.to(device, non_blocking=True)
pred = model(X_batch) # works — both on same device
# WRONG — mismatched devices raises RuntimeError
# model_cpu = nn.Linear(10, 1) # stays on CPU
# x_gpu = torch.randn(4, 10).to("cuda")
# model_cpu(x_gpu) # RuntimeError: Expected all tensors on same device
# Checking tensor device
t = torch.randn(3)
print(t.device) # cpu
t_gpu = t.cuda() # or t.to("cuda:0")
print(t_gpu.device) # cuda:0
# GPU memory diagnostics
if torch.cuda.is_available():
print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
print(torch.cuda.max_memory_allocated() / 1e9, "GB peak")
torch.cuda.empty_cache() # release unused cached memory
# Moving a tensor back to CPU (required before .numpy())
result = t_gpu.cpu().numpy() # numpy() requires a CPU tensor
# Apple Silicon (M1/M2/M3) GPU support
mps_device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
| Method | Effect |
|---|---|
| tensor.to(device) | Moves to specified device — most flexible, recommended |
| tensor.cuda() | Shorthand for .to('cuda') |
| tensor.cpu() | Moves back to CPU (required before .numpy()) |
| model.to(device) | Moves all model parameters and buffers |
| non_blocking=True | Allows async transfer when paired with pin_memory=True |
27. What is the difference between model.parameters() and model.state_dict() in PyTorch?
Both expose a model's learnable values, but they serve different purposes. parameters() returns an iterator of nn.Parameter tensor objects (used by the optimizer); state_dict() returns an OrderedDict mapping layer names to tensors (used for saving/loading and inspection).
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(10, 20),
nn.ReLU(),
nn.Linear(20, 1),
)
# ── parameters(): iterator of Parameter tensors (no names)
for p in model.parameters():
print(p.shape, p.requires_grad)
# torch.Size([20, 10]) True
# torch.Size([20]) True
# torch.Size([1, 20]) True
# torch.Size([1]) True
# Used to construct optimizers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ── named_parameters(): iterator of (name, Parameter) tuples
for name, p in model.named_parameters():
print(name, p.shape)
# 0.weight torch.Size([20, 10])
# 0.bias torch.Size([20])
# 2.weight torch.Size([1, 20])
# 2.bias torch.Size([1])
# ── state_dict(): OrderedDict for save/load
sd = model.state_dict()
print(type(sd)) # <class 'collections.OrderedDict'>
print(sd.keys()) # odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])
# Saving and loading via state_dict (the recommended pattern)
torch.save(model.state_dict(), "model_weights.pt")
new_model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
new_model.load_state_dict(torch.load("model_weights.pt"))
new_model.eval() # ALWAYS call after loading for inference
# Total parameter count
total_params = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable}")
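The two views diverge as soon as a module has buffers. A sketch with an illustrative Linear + BatchNorm1d model shows that state_dict() contains entries that parameters() never yields.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
state_keys = list(model.state_dict().keys())

print(param_names)    # learnable tensors only
print(buffer_names)   # BatchNorm running statistics and step counter
print(state_keys)     # parameters and buffers together
```

This is why saving only the tensors from parameters() would silently lose BatchNorm's running statistics, while saving state_dict() keeps them.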
28. How do you save and load PyTorch models correctly, including full training checkpoints?
PyTorch supports saving either the full model object or just its weights (state_dict). Saving only the state_dict is the recommended approach because it decouples weights from the Python class definition. A full training checkpoint includes the optimizer state too, so training can resume exactly where it left off.
import torch
import torch.nn as nn
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ── RECOMMENDED: save/load state_dict only
torch.save(model.state_dict(), "weights.pt")
model_new = nn.Linear(10, 5) # must define the SAME architecture first
model_new.load_state_dict(torch.load("weights.pt"))
model_new.eval() # always call before inference
# ── NOT recommended: save the entire model object
# Fragile — breaks if the class definition moves or changes
torch.save(model, "full_model.pt")
loaded_model = torch.load("full_model.pt", weights_only=False)
# ── Full training checkpoint — for resuming training
def save_checkpoint(path, epoch, model, optimizer, best_val_loss):
torch.save({
"epoch": epoch,
"model_state": model.state_dict(),
"optimizer_state": optimizer.state_dict(), # Adam momentum buffers etc.
"best_val_loss": best_val_loss,
}, path)
def load_checkpoint(path, model, optimizer):
ckpt = torch.load(path, map_location="cpu") # always load to CPU first
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
return ckpt["epoch"], ckpt["best_val_loss"]
save_checkpoint("ckpt.pt", epoch=5, model=model, optimizer=optimizer, best_val_loss=0.42)
epoch, best_loss = load_checkpoint("ckpt.pt", model_new, optimizer)
# ── Loading on a different device than it was saved
model.load_state_dict(
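torch.load("weights.pt", map_location="cpu") # remaps tensors saved on a GPU so loading also works on a CPU-only machine
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device) # then move to the desired device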
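A checkpoint round trip can be tested without touching the disk. The sketch below writes to an in-memory io.BytesIO buffer (standing in for a file path, an assumption made so the example is side-effect free) and restores into a fresh model with its own optimizer.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so the optimizer has state worth saving
model(torch.randn(8, 10)).sum().backward()
optimizer.step()

buffer = io.BytesIO()          # stands in for a file path
torch.save({
    "epoch": 5,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, buffer)
buffer.seek(0)

ckpt = torch.load(buffer, map_location="cpu")
restored = nn.Linear(10, 5)
restored.load_state_dict(ckpt["model_state"])
# The optimizer must be built on the NEW model's parameters before loading its state
restored_opt = torch.optim.Adam(restored.parameters(), lr=1e-3)
restored_opt.load_state_dict(ckpt["optimizer_state"])

print(torch.equal(model.weight, restored.weight), ckpt["epoch"])
```

When resuming, create the optimizer from the restored model's parameters first; an optimizer that still points at a different model's parameters will load its state without error but update the wrong tensors.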
29. What is overfitting and what regularization techniques does PyTorch support to address it?
Overfitting occurs when a model memorises the training data instead of learning generalisable patterns — visible as low training loss but high validation loss. PyTorch provides several built-in tools to combat overfitting.
| Technique | How to apply | Effect |
|---|---|---|
| Dropout | nn.Dropout(p=0.5) layer | Randomly zeroes activations during training, preventing co-adaptation |
| Weight decay (L2) | optimizer weight_decay= parameter | Penalises large weights, encourages simpler models |
| Early stopping | Manual: track val_loss, stop when it plateaus | Prevents training past the point of generalisation |
| Data augmentation | torchvision.transforms | Increases effective dataset size and diversity |
| Batch Normalization | nn.BatchNorm1d/2d | Stabilises training; has a mild regularising side effect |
| Label smoothing | CrossEntropyLoss(label_smoothing=0.1) | Prevents overconfident predictions |
import torch
import torch.nn as nn
class RegularizedNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(784, 256)
self.bn1 = nn.BatchNorm1d(256)
self.drop = nn.Dropout(p=0.5) # 50% dropout
self.fc2 = nn.Linear(256, 10)
def forward(self, x):
x = torch.relu(self.bn1(self.fc1(x)))
x = self.drop(x) # active in train(), off in eval()
return self.fc2(x)
model = RegularizedNet()
# Weight decay: AdamW shrinks the weights directly each step (decoupled from the gradient)
optimizer = torch.optim.AdamW(
model.parameters(),
lr=1e-3,
weight_decay=1e-2, # penalise large weights
)
# Label smoothing — softens hard one-hot targets
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Early stopping pattern
best_val_loss = float("inf")
patience, patience_counter = 5, 0
for epoch in range(100):
train_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
val_loss, _ = validate(model, val_loader, criterion, device)
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
torch.save(model.state_dict(), "best_model.pt") # save best checkpoint
else:
patience_counter += 1
if patience_counter >= patience:
print(f"Early stopping at epoch {epoch}")
break
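The early-stopping bookkeeping can be factored into a small reusable object. The EarlyStopping class below is a hypothetical helper, not part of PyTorch, and the validation losses are simulated to show when it triggers.

```python
class EarlyStopping:
    """Hypothetical helper: signals a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses: improvement stalls after the third epoch
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best val loss {stopper.best}")
```

Matching the best loss again does not count as an improvement under the strict comparison used here; min_delta raises that bar further when the metric is noisy.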
30. What is the vanishing/exploding gradient problem and how do you detect and fix it in PyTorch?
During backpropagation, gradients are computed via repeated multiplication through the chain rule. In deep networks, this can cause gradients to shrink toward zero (vanishing) or grow toward infinity (exploding) as they propagate backward through many layers, preventing effective training.
import torch
import torch.nn as nn
model = nn.LSTM(input_size=10, hidden_size=128, num_layers=3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 20, 10)
output, _ = model(x)
loss = output.sum()
optimizer.zero_grad()
loss.backward()
# ── Detect: monitor gradient norms
total_norm = 0.0
for p in model.parameters():
if p.grad is not None:
total_norm += p.grad.norm(2).item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm: {total_norm:.4f}")
# Very small (~1e-6) → vanishing; very large (~1e3+) → exploding
# ── Fix 1: Gradient clipping — caps the gradient norm before the step
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
# ── Fix 2: Better weight initialisation (He init for ReLU networks)
def init_weights(m):
if isinstance(m, nn.Linear):
nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
nn.init.zeros_(m.bias)
mlp = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
mlp.apply(init_weights)
# ── Fix 3: Batch Normalization — stabilises layer input distributions
class StableNet(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
)
def forward(self, x):
return self.net(x)
# ── Fix 4: Residual / skip connections — gradient highway
class ResidualBlock(nn.Module):
def __init__(self, dim):
super().__init__()
self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
def forward(self, x):
return x + self.net(x) # gradient flows through x directly
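Gradient clipping is worth seeing in numbers. In this sketch the loss is deliberately scaled up to produce a large gradient, and grad_norm is a small helper introduced here for illustration; clip_grad_norm_ returns the norm measured before clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
x = torch.randn(32, 10)
loss = (model(x) * 1000).pow(2).mean()   # deliberately huge loss scale
loss.backward()

def grad_norm(module):
    """Global L2 norm over all parameter gradients."""
    return torch.sqrt(sum(p.grad.pow(2).sum() for p in module.parameters())).item()

before = grad_norm(model)
returned = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
after = grad_norm(model)

print(f"before={before:.1f} returned={returned:.1f} after={after:.4f}")
```

Clipping rescales every gradient by the same factor, so the update direction is preserved and only its length is capped. Logging the returned value each step gives a cheap monitor for exploding gradients.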
31. What is weight initialization in PyTorch and why does it matter?
How a network's weights are initialised at the start of training significantly affects whether training converges quickly, slowly, or not at all. PyTorch's default initialisation for Linear/Conv layers (a Kaiming-uniform variant that draws weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in))) works well in most cases, but understanding the principles helps when debugging training issues.
import torch
import torch.nn as nn
# PyTorch default: Linear layers use kaiming_uniform_ with a=sqrt(5),
# i.e. weights ~ U(-1/sqrt(fan_in), 1/sqrt(fan_in))
layer = nn.Linear(256, 128)
print(layer.weight.std().item()) # approximately 1/sqrt(3*256) ≈ 0.036
# Explicit initialisation methods
def init_weights(m):
if isinstance(m, nn.Linear):
# Xavier/Glorot — good for Tanh/Sigmoid activations
nn.init.xavier_uniform_(m.weight)
# He/Kaiming with the ReLU gain: good for ReLU-family activations
# nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
nn.init.zeros_(m.bias)
model = nn.Sequential(
nn.Linear(784, 256), nn.ReLU(),
nn.Linear(256, 128), nn.ReLU(),
nn.Linear(128, 10),
)
model.apply(init_weights) # applies init_weights to every sub-module
# Why initialisation matters: too small → vanishing activations
# too large → exploding activations, especially in deep nets
x = torch.randn(100, 784)
for layer in model:
x = layer(x)
if hasattr(layer, "weight"):
print(f"{layer}: activation std={x.std().item():.4f}")
# With good init, std should stay roughly stable across layers
# Custom initialisation from scratch
with torch.no_grad():
layer.weight.normal_(mean=0.0, std=0.02) # common for transformer init
layer.bias.zero_()
| Method | Formula (roughly) | Best for |
|---|---|---|
| Xavier/Glorot | Var = 2/(fan_in+fan_out) | Tanh, Sigmoid activations |
| Kaiming/He | Var = 2/fan_in | ReLU, LeakyReLU activations (nn.Linear's default is a Kaiming-uniform variant with Var = 1/(3·fan_in)) |
| Zero init | All weights = 0 | NEVER for weights: every unit stays identical, so symmetry is never broken; OK for biases |
| Small normal (std≈0.02) | N(0, 0.02²) | Transformer architectures (BERT, GPT) |
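The variance formulas in the table can be checked empirically. The sketch below assumes a 256-to-128 Linear layer (an illustrative size) and compares the measured weight standard deviation with the theoretical value for the default init, He init with the ReLU gain, and Xavier init.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 128

default_layer = nn.Linear(fan_in, fan_out)          # PyTorch's built-in default init
default_std = default_layer.weight.std().item()

he_layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_normal_(he_layer.weight, nonlinearity="relu")
he_std = he_layer.weight.std().item()

xavier_layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(xavier_layer.weight)
xavier_std = xavier_layer.weight.std().item()

print(f"default: {default_std:.4f}  theory {1 / math.sqrt(3 * fan_in):.4f}")
print(f"He:      {he_std:.4f}  theory {math.sqrt(2 / fan_in):.4f}")
print(f"Xavier:  {xavier_std:.4f}  theory {math.sqrt(2 / (fan_in + fan_out)):.4f}")
```

The default init is noticeably smaller than He init with the ReLU gain, which is one reason explicit He initialisation is sometimes applied to deep ReLU networks.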
32. What is the difference between nn.Parameter and a regular tensor attribute in nn.Module?
nn.Parameter is a special tensor subclass that, when assigned as an attribute of an nn.Module, is automatically registered in the module's parameter list — meaning it appears in model.parameters(), gets moved by .to(device), and is saved in state_dict(). A plain tensor attribute does none of this.
import torch
import torch.nn as nn
class CustomLayer(nn.Module):
def __init__(self, dim: int):
super().__init__()
# nn.Parameter — automatically registered, tracked, trained
self.weight = nn.Parameter(torch.randn(dim, dim))
self.bias = nn.Parameter(torch.zeros(dim))
# Plain tensor — NOT registered, NOT trained, invisible to optimizer
self.scale = torch.tensor(2.0) # WRONG if meant to be learnable!
# register_buffer — for non-trainable state that SHOULD move with
# the model and be saved (e.g. BatchNorm running mean/var)
self.register_buffer("running_mean", torch.zeros(dim))
def forward(self, x):
return x @ self.weight + self.bias
layer = CustomLayer(10)
# Check what appears in parameters()
for name, p in layer.named_parameters():
print(name, p.shape)
# weight torch.Size([10, 10])
# bias torch.Size([10])
# scale and running_mean do NOT appear here!
# Check state_dict — includes parameters AND buffers, but not plain tensors
print(layer.state_dict().keys())
# odict_keys(['weight', 'bias', 'running_mean'])
# .to(device) moves Parameters and registered buffers, but NOT plain tensor attrs
if torch.cuda.is_available():
    layer.to("cuda")
# layer.scale would STILL be on CPU, a common silent bug!
| Attribute type | In parameters()? | In state_dict()? | Moved by .to(device)? | Trained by optimizer? |
|---|---|---|---|---|
| nn.Parameter | Yes | Yes | Yes | Yes |
| register_buffer tensor | No | Yes | Yes | No |
| Plain tensor attribute | No | No | No (silent bug risk!) | No |
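The three rows of the table can be verified on a CPU-only machine by using a dtype conversion in place of a device move, since Module.to() treats both the same way for registered tensors. The Scaled class is an illustrative toy, not part of the original example.

```python
import torch
import torch.nn as nn

class Scaled(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))          # trainable
        self.register_buffer("scale", torch.tensor(2.0))   # saved and moved, not trained
        self.plain = torch.tensor(2.0)                     # neither saved nor moved

    def forward(self, x):
        return x * self.weight * self.scale

m = Scaled()
keys = list(m.state_dict().keys())
n_trainable = sum(p.numel() for p in m.parameters())

# .to() converts registered tensors; the plain attribute is left behind
m.to(torch.float64)
dtypes = (m.weight.dtype, m.scale.dtype, m.plain.dtype)

print(keys, n_trainable, dtypes)
```

Registering a constant with register_buffer is usually the right fix for the "plain tensor left on the CPU" bug when the value should not be trained.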
33. How do you implement and use learning rate schedulers in PyTorch?
A fixed learning rate throughout training is rarely optimal — too high late in training prevents fine convergence, while too low early on wastes time. PyTorch's torch.optim.lr_scheduler module adjusts the learning rate systematically as training progresses.
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
# ── StepLR: multiply lr by gamma every step_size epochs
scheduler_step = optim.lr_scheduler.StepLR(
optimizer, step_size=10, gamma=0.1
)
# ── CosineAnnealingLR: smooth decay following a cosine curve
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100, eta_min=1e-6
)
# ── ReduceLROnPlateau: reduce lr when a metric stops improving
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode="min", factor=0.5, patience=5
)
# ── OneCycleLR: warmup then decay — fast convergence ("super-convergence")
n_epochs, steps_per_epoch = 10, 100
scheduler_1cycle = optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=1e-2,
total_steps=n_epochs * steps_per_epoch,
pct_start=0.3, # 30% of steps used for warmup
)
# ── Training loop with scheduler
# (train_one_epoch, validate, the loaders, loss_fn and device are assumed to be
# defined as in the training-loop question; in real code attach a single
# scheduler to an optimizer, two are stepped here only for illustration)
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    val_loss, _ = validate(model, val_loader, loss_fn, device)
    scheduler_cos.step() # epoch-based scheduler: call once per epoch
    scheduler_plateau.step(val_loss) # metric-based: pass the metric value
    current_lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: lr={current_lr:.6f}")
# Note: OneCycleLR and some schedulers are called PER BATCH, not per epoch
# for step in range(total_steps):
#     train_step(...)
#     scheduler_1cycle.step() # called inside the batch loop
| Scheduler | Behaviour | Call frequency |
|---|---|---|
| StepLR | Multiply lr by gamma every N epochs | Per epoch |
| CosineAnnealingLR | Smooth cosine decay | Per epoch |
| ReduceLROnPlateau | Reduce lr when validation metric plateaus | Per epoch, after computing metric |
| OneCycleLR | Warmup then decay in one cycle | Per batch/step |
| LinearLR / warmup schedules | Linear ramp from low to target lr | Per step, common for transformers |
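A per-step warmup like the last row can be sketched with LambdaLR, which multiplies the base learning rate by a user-supplied factor. The warmup_factor function and the five-step warmup length are illustrative assumptions.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 5   # illustrative value

def warmup_factor(step):
    """Multiplier on the base lr: ramps linearly to 1.0 over warmup_steps, then stays flat."""
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)

lrs = []
for step in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()     # nothing to update here (no gradients); keeps the call order right
    scheduler.step()     # per-step scheduler: called once per batch

print([f"{lr:.1e}" for lr in lrs])
```

The same lambda can be extended with a decay phase after warmup, which is a common shape for transformer fine-tuning schedules.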
34. How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?
Diagnosing a stuck or diverging training loop is one of the most valuable practical PyTorch skills. The shape of the loss curve and a few targeted checks usually reveal the root cause.
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss is NaN from step 1 | Exploding gradients, bad data (inf/NaN inputs), lr too high | Check input data, add gradient clipping, lower lr |
| Loss never decreases | Vanishing gradients, lr too low, forgot optimizer.step() | Check gradient norms, raise lr, verify training loop order |
| Loss decreases then plateaus high | Model too small, lr too high for fine convergence | Increase capacity, add lr scheduler |
| Train loss low, val loss high | Overfitting | Add dropout, weight decay, more data, early stopping |
| Loss oscillates wildly | lr too high, batch size too small | Lower lr, increase batch size, use lr warmup |
import torch
import torch.nn as nn
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for step, (X, y) in enumerate(loader):  # loader: any DataLoader yielding (X, y)
    optimizer.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)
    # ── Check 1: is the loss finite?
    if not torch.isfinite(loss):
        print(f"Step {step}: non-finite loss = {loss.item()}")
        print("Input contains NaN:", torch.isnan(X).any().item())
        print("Input contains Inf:", torch.isinf(X).any().item())
        break
    loss.backward()
    # ── Check 2: gradient norms, are gradients flowing at all?
    total_norm = sum(
        p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
    ) ** 0.5
    if step % 50 == 0:
        print(f"Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}")
    # ── Check 3: are any gradients None? (means that param was unused!)
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"WARNING: {name} has no gradient; is it used in forward()?")
    optimizer.step()
# ── Check 4: verify model output shape and range make sense
# (X and y below are the last batch seen by the loop above)
with torch.no_grad():
    sample_out = model(X[:1])
print("Output range:", sample_out.min().item(), sample_out.max().item())
# ── Check 5: overfit a tiny batch as a sanity check of the architecture
# If the model cannot drive loss near zero on 5 examples, there is a bug
tiny_X, tiny_y = X[:5], y[:5]
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(tiny_X), tiny_y)
    loss.backward()
    optimizer.step()
print(f"Tiny-batch overfit loss: {loss.item():.6f}")  # should approach 0
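When the loss turns NaN it often helps to know which parameters received the bad gradients. The following is a minimal sketch, not part of the original answer: `nonfinite_grad_names` is a hypothetical helper name, and the NaN is injected artificially just to show what the check reports.

```python
import torch
import torch.nn as nn

def nonfinite_grad_names(model: nn.Module) -> list[str]:
    """Return names of parameters whose gradient contains NaN or Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

torch.manual_seed(0)
model = nn.Linear(4, 2)
x = torch.randn(3, 4)

# Healthy backward pass: every gradient is finite
model(x).sum().backward()
healthy = nonfinite_grad_names(model)

# Simulate a diverged loss by multiplying the output with NaN (demo only)
model.zero_grad()
(model(x) * float("nan")).sum().backward()
broken = nonfinite_grad_names(model)

print("healthy:", healthy)
print("broken:", broken)
```

Calling such a helper right after `loss.backward()` narrows the search to specific layers; `torch.autograd.set_detect_anomaly(True)` is a heavier built-in alternative for locating the operation that produced the NaN.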
35. What is the difference between torch.tensor() and torch.Tensor() (capital T) for creating tensors?
This is a subtle but important PyTorch gotcha. torch.tensor() (lowercase, a function) infers dtype from the input data and copies it — the recommended way to create tensors from data. torch.Tensor() (uppercase, a class constructor) is an alias for torch.FloatTensor and behaves inconsistently depending on the argument type.
import torch
# ── torch.tensor() — RECOMMENDED, infers dtype, copies data
a = torch.tensor([1, 2, 3])
print(a.dtype) # torch.int64 — inferred from Python ints
b = torch.tensor([1.0, 2.0, 3.0])
print(b.dtype) # torch.float32 — inferred from Python floats
c = torch.tensor([1, 2, 3], dtype=torch.float32) # explicit override
print(c.dtype) # torch.float32
# ── torch.Tensor() — confusing, AVOID for creating tensors from data
d = torch.Tensor([1, 2, 3])
print(d.dtype) # torch.float32: uses the default float dtype and ignores the int input!
e = torch.Tensor(3, 4) # interprets ints as a SHAPE, not data!
print(e.shape) # torch.Size([3, 4]) — uninitialised memory, random values
# Common gotcha: these look similar but behave VERY differently
f1 = torch.tensor(3) # scalar tensor with value 3
f2 = torch.Tensor(3) # tensor of SHAPE (3,) with garbage/uninitialised values!
print(f1) # tensor(3)
print(f2) # tensor([4.6e-41, 0.0, 1.4e-45]) — random uninitialised memory!
# Recommended explicit constructors for empty/typed tensors:
g = torch.empty(3, 4) # uninitialised, explicit intent
h = torch.zeros(3, 4, dtype=torch.float32)
i = torch.ones(3, 4, dtype=torch.int64)
Rule of thumb: always use lowercase torch.tensor() when creating a tensor from existing data (a list, NumPy array, or scalar). Use torch.zeros(), torch.ones(), torch.empty(), or torch.rand() when you want a new tensor of a given shape. Avoid torch.Tensor() entirely in new code.
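The "copies it" part of torch.tensor() is easiest to see next to torch.from_numpy(), which shares memory with the source array instead. A short sketch (the variable names are illustrative):

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])   # float64 NumPy array
copied = torch.tensor(arr)        # copies the data, keeps the NumPy dtype
shared = torch.from_numpy(arr)    # shares memory with arr, no copy

arr[0] = 100.0                    # mutate the NumPy array in place

print(copied[0].item())   # the copy still holds the old value
print(shared[0].item())   # the shared tensor sees the new value
print(copied.dtype)
```

If the copy is not wanted, torch.from_numpy() or torch.as_tensor() avoids it, at the cost of the tensor and the array aliasing each other.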
36. How does gradient accumulation work in PyTorch and when would you use it?
Gradient accumulation simulates a larger effective batch size than fits in GPU memory by summing gradients over several smaller forward/backward passes before calling optimizer.step(). This is useful when training large models on limited GPU memory.
import torch
import torch.nn as nn
model = nn.Linear(100, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# Simulate effective batch_size=128 using micro_batch=32 (4 accumulation steps)
accumulation_steps = 4
optimizer.zero_grad()
for step, (X_micro, y_micro) in enumerate(loader):  # loader yields micro-batches
    logits = model(X_micro)
    loss = criterion(logits, y_micro)
    # CRITICAL: scale loss by 1/accumulation_steps before backward
    # so the accumulated gradient matches what a single large-batch
    # backward pass would have produced
    loss = loss / accumulation_steps
    loss.backward()  # gradients ACCUMULATE (not cleared)
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()                       # update only every N micro-batches
        optimizer.zero_grad(set_to_none=True)  # clear for next accumulation cycle
# Effective batch size = micro_batch_size * accumulation_steps
# This trades extra forward/backward compute for lower peak memory usage
| Aspect | Effect |
|---|---|
| GPU memory | Stays at micro-batch level — much lower peak usage |
| Wall-clock time | Slightly slower than one large batch (more Python overhead) |
| Effective batch size | micro_batch_size × accumulation_steps |
| BatchNorm caveat | Statistics computed per micro-batch, not the full effective batch — can behave differently than true large-batch training |
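The claim that scaling by 1/accumulation_steps reproduces the large-batch gradient can be checked numerically. This is a small sketch under two assumptions that are not stated in the original: the loss uses mean reduction and all micro-batches have the same size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(6, 1)
criterion = nn.MSELoss()                      # mean reduction
X, y = torch.randn(8, 6), torch.randn(8, 1)

# Reference: one backward pass over the full batch of 8 samples
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 equal micro-batches of 2, scaling each loss by 1/4
accumulation_steps = 4
model.zero_grad()
for X_micro, y_micro in zip(X.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    (criterion(model(X_micro), y_micro) / accumulation_steps).backward()
accumulated_grad = model.weight.grad

grads_match = torch.allclose(full_grad, accumulated_grad, atol=1e-6)
print("Gradients match:", grads_match)
```

With unequal micro-batch sizes the simple 1/N scaling is only approximate, because each micro-batch mean then covers a different number of samples.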
37. What is mixed precision training in PyTorch and how do you implement it with torch.cuda.amp?
Mixed precision training runs most operations in FP16 (or BF16) for speed while keeping a master copy of weights in FP32 for numerical stability. Modern GPUs (Volta and later) have dedicated hardware (Tensor Cores) that make FP16 matrix multiplication significantly faster than FP32.
import torch
import torch.nn as nn
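from torch import autocast  # device-agnostic autocast; this one accepts device_type=...
from torch.cuda.amp import GradScaler  # torch.cuda.amp.autocast does NOT accept device_type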
model = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler() # manages loss scaling to prevent FP16 underflow
x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()
for step in range(100):
    optimizer.zero_grad()
    # autocast: automatically runs eligible ops in FP16/BF16
    with autocast(device_type="cuda", dtype=torch.float16):
        y_hat = model(x)  # matmul runs in FP16, which is faster
        loss = nn.MSELoss()(y_hat, y)
    # Loss scaling: inflate loss before backward to prevent small
    # gradients from underflowing to zero in FP16's limited range
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)      # skips the step if grads are inf/NaN
    scaler.update()             # adjusts scale factor for next iteration
# BFloat16: no GradScaler needed (same exponent range as FP32)
with autocast(device_type="cuda", dtype=torch.bfloat16):
    y_hat = model(x)  # no underflow risk, so scaling is unnecessary
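The "master weights stay in FP32" point can be observed without a GPU. The sketch below uses CPU autocast with BF16 purely for illustration (an assumption made so the example runs anywhere; the answer above targets CUDA): the parameters keep their float32 dtype while the output of the linear layer inside the autocast region comes back in bfloat16.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)     # parameters are created and stored in float32
x = torch.randn(2, 16)

# Inside the autocast region, eligible ops such as linear run in bfloat16
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_autocast = model(x)

y_plain = model(x)           # outside the region everything is float32 again

print(model.weight.dtype, y_autocast.dtype, y_plain.dtype)
```

Autocast only changes the dtype used for computation inside the context manager; it never converts the stored parameters, which is why the optimizer keeps updating full-precision weights.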
38. What is torch.compile() and how does it speed up PyTorch model execution?
Introduced in PyTorch 2.0, torch.compile() performs just-in-time compilation of a model. Instead of executing each tensor operation eagerly (PyTorch's default), it captures the computation graph, fuses operations, and generates optimised kernels — primarily reducing GPU memory round-trips.
import torch
import torch.nn as nn
import time
model = nn.Sequential(
nn.Linear(1024, 1024), nn.GELU(),
nn.Linear(1024, 512), nn.GELU(),
nn.Linear(512, 10),
).cuda()
# Compile the model — wraps it, does NOT change the API
compiled_model = torch.compile(model)
x = torch.randn(256, 1024).cuda()
# First call triggers compilation (slow — may take 10-60 seconds)
out = compiled_model(x)
# Subsequent calls use the compiled, optimised kernels (fast)
for _ in range(5):
    out = compiled_model(x)
# Compilation modes: trade compile time for runtime speed
model_default = torch.compile(model)                          # balanced
model_reduce = torch.compile(model, mode="reduce-overhead")   # less Python overhead
model_max = torch.compile(model, mode="max-autotune")         # slowest compile, fastest run
# Benchmark comparison
def benchmark(fn, x, n=100):
    for _ in range(5):
        fn(x)  # warmup
    torch.cuda.synchronize()  # wait for queued GPU work before starting the clock
    start = time.time()
    for _ in range(n):
        fn(x)
    torch.cuda.synchronize()  # wait for all timed kernels to finish
    return time.time() - start
eager_time = benchmark(model, x)
compiled_time = benchmark(compiled_model, x)
print(f"Eager: {eager_time:.3f}s, Compiled: {compiled_time:.3f}s")
39. What is the difference between batch size, epoch, and iteration in PyTorch training?
These three terms are fundamental to understanding any training loop, and confusing them is a common source of bugs when computing metrics or setting up learning rate schedules.
| Term | Definition | Example |
|---|---|---|
| Batch size | Number of samples processed together in one forward/backward pass | 32 |
| Iteration (step) | One forward + backward + optimizer.step() call — processes one batch | 1 step = 1 batch processed |
| Epoch | One complete pass through the entire training dataset | 1 epoch = ceil(dataset_size / batch_size) iterations (floor if drop_last=True) |
import torch
from torch.utils.data import DataLoader, TensorDataset
# Example: 1000 training samples, batch size 32
X = torch.randn(1000, 20)
y = torch.randint(0, 5, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
iterations_per_epoch = len(loader) # = ceil(1000 / 32) = 32
print(f"Iterations per epoch: {iterations_per_epoch}")
n_epochs = 10
total_iterations = n_epochs * iterations_per_epoch
print(f"Total training iterations: {total_iterations}") # 320
global_step = 0
for epoch in range(n_epochs):
    for batch_idx, (X_batch, y_batch) in enumerate(loader):
        # This inner loop body executes once PER ITERATION
        # X_batch.shape[0] == batch_size (32, except possibly the last batch)
        global_step += 1
        if global_step % 10 == 0:
            print(f"Epoch {epoch}, iteration {batch_idx}, global step {global_step}")
    print(f"--- Completed epoch {epoch} ---")  # runs once PER EPOCH
# Common pitfall: confusing scheduler.step() granularity
# Some schedulers (StepLR) expect ONE call per epoch
# Others (OneCycleLR) expect ONE call per iteration/step
# Mixing these up silently breaks the intended learning rate schedule
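How many iterations make up an epoch depends on what happens to the final, partial batch. A quick sketch with the same 1000-sample, batch-size-32 setup (the loader names are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 5, (1000,)))

keep_last = DataLoader(dataset, batch_size=32, drop_last=False)  # keeps the partial batch
drop_last = DataLoader(dataset, batch_size=32, drop_last=True)   # discards it

batch_sizes = [xb.shape[0] for xb, _ in keep_last]

print(len(keep_last))    # ceil(1000 / 32)
print(len(drop_last))    # floor(1000 / 32)
print(batch_sizes[-1])   # size of the final, partial batch
```

Because 1000 = 31 * 32 + 8, the last batch holds only 8 samples, which matters when a per-step scheduler is configured from len(loader) or when per-batch metrics are averaged.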
40. How do you compute and track evaluation metrics like accuracy during PyTorch training?
Tracking metrics correctly requires accumulating values across all batches (not just averaging per-batch metrics naively, which can be biased if the last batch has a different size) and ensuring computations happen without gradient tracking.
import torch
import torch.nn as nn
@torch.no_grad()  # disable gradient tracking for the entire evaluation function
def evaluate(model, loader, criterion, device):
    model.eval()  # disable dropout, use BN running stats
    total_loss = 0.0
    total_correct = 0
    total_samples = 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        batch_size = X_batch.size(0)
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        # Weight by batch_size: correct even if the last batch is smaller
        total_loss += loss.item() * batch_size
        preds = logits.argmax(dim=1)
        total_correct += (preds == y_batch).sum().item()
        total_samples += batch_size
    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples
    return avg_loss, accuracy
# WRONG pattern: naively averaging per-batch averages
# is biased if batch sizes are unequal (e.g. last batch is smaller)
def evaluate_wrong(model, loader, criterion):
    losses = []
    for X_batch, y_batch in loader:
        loss = criterion(model(X_batch), y_batch)
        losses.append(loss.item())  # all batches weighted EQUALLY: wrong!
    return sum(losses) / len(losses)  # biased if last batch has fewer samples
# Using torchmetrics for more complex metrics (F1, precision, AUROC)
# pip install torchmetrics
from torchmetrics import Accuracy, F1Score
# model, loader and device are assumed to be defined as in the training code
acc_metric = Accuracy(task="multiclass", num_classes=5).to(device)
f1_metric = F1Score(task="multiclass", num_classes=5, average="macro").to(device)
model.eval()
with torch.no_grad():
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        preds = model(X_batch).argmax(dim=1)
        acc_metric.update(preds, y_batch)  # accumulates internally across batches
        f1_metric.update(preds, y_batch)
print(f"Accuracy: {acc_metric.compute():.4f}")  # final correct aggregate
print(f"F1: {f1_metric.compute():.4f}")
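The size of the bias from averaging per-batch means is easy to see with made-up numbers. In this sketch (the values are invented for illustration) five per-sample losses are split into one batch of four and one batch of one:

```python
import torch

per_sample_loss = torch.tensor([1.0, 1.0, 1.0, 1.0, 6.0])
batches = per_sample_loss.split(4)   # two batches, of sizes 4 and 1

# Naive: average the per-batch means, so every batch counts equally
naive = sum(b.mean().item() for b in batches) / len(batches)

# Correct: weight each batch mean by its size, then divide by total samples
weighted = sum(b.mean().item() * len(b) for b in batches) / len(per_sample_loss)

print(naive, weighted)
```

The naive average gives the single sample in the small batch as much weight as the four samples in the full batch, which is why it lands well above the true per-sample mean.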
41. What is the purpose of torch.manual_seed() and how do you ensure reproducibility in PyTorch?
PyTorch uses pseudo-random number generators for weight initialisation, dropout masks, data shuffling, and more. Setting seeds explicitly ensures experiments are reproducible — critical for debugging, comparing model variants fairly, and scientific rigor.
import torch
import numpy as np
import random
import os
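def set_seed(seed: int = 42):
    """Seed the common RNG sources; this alone does not guarantee bit-exact runs."""
    random.seed(seed)                 # Python's random module
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # PyTorch default generators (CPU and CUDA)
    torch.cuda.manual_seed_all(seed)  # explicit for all GPUs (safe no-op without CUDA)
    # Force deterministic cuDNN algorithms (may be slower!)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # disable auto-tuner (non-deterministic)
    # Hash randomisation is fixed when the interpreter starts, so this only
    # affects subprocesses; export PYTHONHASHSEED before launching Python
    # to control the current process
    os.environ["PYTHONHASHSEED"] = str(seed)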
set_seed(42)
# Verify reproducibility
model1 = torch.nn.Linear(10, 5)
set_seed(42)
model2 = torch.nn.Linear(10, 5)
print(torch.equal(model1.weight, model2.weight)) # True — identical init
# DataLoader reproducibility — also needs a worker_init_fn for num_workers > 0
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
generator = torch.Generator()
generator.manual_seed(42)
from torch.utils.data import DataLoader
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # seeds each worker process
    generator=generator,         # seeds the shuffling order
)
| Source of randomness | How to control it |
|---|---|
| Weight initialisation | torch.manual_seed(seed) |
| Dropout masks | Covered by torch.manual_seed (same RNG stream) |
| Data shuffling | DataLoader(generator=torch.Generator().manual_seed(seed)) |
| Multi-worker DataLoader | worker_init_fn to seed each subprocess |
| GPU non-determinism | torch.backends.cudnn.deterministic = True |
| cuDNN auto-tuner | torch.backends.cudnn.benchmark = False |
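Two smaller tools are useful alongside global seeding: a local torch.Generator, which isolates one source of randomness from the global stream, and torch.get_rng_state() / torch.set_rng_state(), which replay a draw exactly. A minimal sketch (`draw` is a hypothetical helper name):

```python
import torch

def draw(seed: int) -> torch.Tensor:
    """Random permutation from a local generator; the global RNG is untouched."""
    g = torch.Generator().manual_seed(seed)
    return torch.randperm(10, generator=g)

same_permutation = torch.equal(draw(42), draw(42))

# Save and restore the global CPU RNG state to replay a random draw
state = torch.get_rng_state()
first = torch.rand(3)
torch.set_rng_state(state)
replayed = torch.rand(3)
same_replay = torch.equal(first, replayed)

print(same_permutation, same_replay)
```

Saving the RNG state in a checkpoint is one way to make a resumed run continue with the same random sequence it would have produced without the interruption.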
42. How does PyTorch handle multi-dimensional indexing and slicing of tensors?
PyTorch tensor indexing follows NumPy-style conventions, including basic slicing, advanced (fancy) indexing with integer/boolean tensors, and the powerful ... (ellipsis) operator for indexing high-dimensional tensors concisely.
import torch
x = torch.arange(24).reshape(2, 3, 4) # shape (2, 3, 4)
# ── Basic slicing — same as Python lists/NumPy
print(x[0]) # shape (3, 4) — first "batch"
print(x[0, 1]) # shape (4,) — first batch, second row
print(x[0, 1, 2]) # scalar — single element
print(x[:, 0, :]) # shape (2, 4) — all batches, first row, all cols
print(x[..., 0]) # shape (2, 3) — ellipsis: all leading dims, last dim index 0
print(x[0:1, :, -1]) # shape (1, 3) — slice + negative index
# ── Boolean (mask) indexing
mask = x > 10
print(x[mask]) # 1D tensor of all elements > 10
x_clamped = x.clone()
x_clamped[x_clamped > 10] = 0 # zero out values > 10
# ── Fancy (advanced) integer indexing
idx = torch.tensor([0, 2])
print(x[:, idx, :]) # shape (2, 2, 4) — select specific indices along dim 1
# ── torch.gather: select elements using an index tensor
scores = torch.tensor([[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]]) # (2, 3)
top_idx = scores.argmax(dim=1, keepdim=True) # (2, 1)
top_val = scores.gather(dim=1, index=top_idx) # (2, 1)
print(top_val) # tensor([[0.7], [0.4]])
# ── torch.where: conditional element selection
result = torch.where(x > 10, x, torch.zeros_like(x)) # keep if >10, else 0
# ── Important: most slicing returns a VIEW, not a copy!
y = x[0]
y[0, 0] = 999
print(x[0, 0, 0]) # 999 — x was modified too! (shared memory)
# Use x[0].clone() to get an independent copy
| Pattern | Example | Returns |
|---|---|---|
| Basic slicing | x[:, 0] | View (shares memory) |
| Boolean mask | x[x > 0] | Copy (1D, new memory) |
| Fancy indexing | x[:, [0,2]] | Copy (new memory) |
| Ellipsis | x[..., 0] | View — skips middle dims |
| gather | x.gather(dim, index) | Copy — selects per index |
43. What is the difference between .view(), .reshape(), and .contiguous() in PyTorch, and why does it matter?
These three methods deal with how a tensor's underlying memory is interpreted as a different shape. Understanding the difference prevents a class of confusing runtime errors related to tensor memory layout.
import torch
x = torch.arange(12).reshape(3, 4) # shape (3, 4), contiguous memory
# ── .view(): ALWAYS returns a view (no copy), but requires contiguous memory
y = x.view(4, 3) # works — x is contiguous
print(y.shape) # (4, 3)
# ── Transpose breaks contiguity — the data is NOT rearranged in memory,
# only the strides describing how to read it change
xt = x.t() # transpose — x.t() is a VIEW with different strides
print(xt.is_contiguous()) # False!
# This FAILS — view() cannot reinterpret non-contiguous memory
try:
    xt.view(3, 4)
except RuntimeError as e:
    print(f"Error: {e}")
# RuntimeError: view size is not compatible with input tensor's size and stride
# ── .reshape(): tries view() first; falls back to copying if needed
z = xt.reshape(3, 4) # WORKS — automatically copies if necessary
print(z.shape) # (3, 4)
# ── .contiguous(): explicitly forces a contiguous copy in memory
xt_contig = xt.contiguous()
print(xt_contig.is_contiguous()) # True
xt_contig.view(3, 4) # now works, since it is contiguous
# Strides explain WHY this happens
print(x.stride()) # (4, 1) — contiguous: move 1 step = 1 memory address
print(xt.stride()) # (1, 4): transposed, strides reflect the swap, no copy made
| Method | Copies data? | Requires contiguous input? | Safety |
|---|---|---|---|
| .view() | Never — always a view | Yes — raises RuntimeError otherwise | Fails loudly on non-contiguous tensors |
| .reshape() | Only if necessary | No — handles either case automatically | Safer general-purpose choice |
| .contiguous() | Yes, if not already contiguous | N/A — this is what fixes it | Use before .view() on a transposed/permuted tensor |
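Whether .reshape() returned a view or a copy can be checked by comparing data_ptr(), the address of the tensor's first element. A short sketch using the same 3x4 example:

```python
import torch

x = torch.arange(12).reshape(3, 4)   # contiguous
xt = x.t()                           # transposed view, non-contiguous

# Contiguous input: reshape returns a view onto the same storage
shares_contiguous = x.reshape(4, 3).data_ptr() == x.data_ptr()

# Non-contiguous input: reshape has to copy into fresh memory
reshaped_t = xt.reshape(3, 4)
shares_transposed = reshaped_t.data_ptr() == x.data_ptr()

print(shares_contiguous, shares_transposed)
print(reshaped_t[0].tolist())        # elements follow xt's logical (row-major) order
```

The practical consequence is that an in-place write to the result of .reshape() may or may not affect the original tensor, so code that relies on aliasing should use .view() and let it fail loudly.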
44. How do you freeze layers and perform transfer learning / fine-tuning in PyTorch?
Transfer learning reuses a model pretrained on a large dataset and adapts it to a new task. Freezing layers (setting requires_grad=False) prevents their weights from updating during backpropagation — useful when you want to keep pretrained features fixed and only train a new task-specific head.
import torch
import torch.nn as nn
import torchvision.models as models
# Load a pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# ── Strategy 1: Feature extraction — freeze ALL pretrained layers
for param in backbone.parameters():
    param.requires_grad = False  # excluded from gradient computation
# Replace the final classification layer for our task (e.g. 5 classes)
in_features = backbone.fc.in_features # 2048 for ResNet-50
backbone.fc = nn.Linear(in_features, 5) # NEW layer — requires_grad=True by default
# Only backbone.fc parameters will be updated by the optimizer
optimizer = torch.optim.AdamW(
filter(lambda p: p.requires_grad, backbone.parameters()), # only trainable params
lr=1e-3,
)
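# ── Strategy 2: Full fine-tuning with layer-wise (discriminative) learning rates
backbone2 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone2.fc = nn.Linear(backbone2.fc.in_features, 5)
# Every block must appear in a param group: parameters left out of the
# optimizer are never updated, which would not be "full" fine-tuning.
# The lr values for the stem, layer2 and layer3 are illustrative choices.
stem = list(backbone2.conv1.parameters()) + list(backbone2.bn1.parameters())
optimizer2 = torch.optim.AdamW([
    {"params": stem, "lr": 1e-5},
    {"params": backbone2.layer1.parameters(), "lr": 1e-5},  # earliest layers: smallest lr
    {"params": backbone2.layer2.parameters(), "lr": 3e-5},
    {"params": backbone2.layer3.parameters(), "lr": 3e-5},
    {"params": backbone2.layer4.parameters(), "lr": 1e-4},  # later layers: bigger lr
    {"params": backbone2.fc.parameters(), "lr": 1e-3},      # new head: largest lr
])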
# ── Verify which parameters are trainable
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"Trainable: {trainable:,} / Total: {total:,} ({100*trainable/total:.1f}%)")
# ── Common pattern: train head first, then unfreeze and fine-tune everything
# Phase 1: only train backbone.fc for a few epochs
# Phase 2: unfreeze all layers, train with a small lr to fine-tune end-to-end
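for param in backbone.parameters():
    param.requires_grad = True  # unfreeze for phase 2
# The phase-1 optimizer only holds the head's parameters, so build a new one
# for phase 2 (the lr here is an illustrative value)
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)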
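The effect of freezing can be verified directly by comparing weights before and after an optimizer step. The sketch below swaps in a tiny stand-in backbone instead of ResNet-50 (an assumption made so it runs without downloading pretrained weights); the freezing mechanics are the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())   # stand-in for a pretrained backbone
head = nn.Linear(16, 3)                                 # new task-specific head
model = nn.Sequential(backbone, head)

for p in backbone.parameters():
    p.requires_grad = False                             # freeze the backbone

backbone_before = backbone[0].weight.clone()
head_before = head.weight.clone()

# Pass only the trainable parameters to the optimizer
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.1)

X, y = torch.randn(32, 8), torch.randint(0, 3, (32,))
loss = nn.functional.cross_entropy(model(X), y)
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(backbone[0].weight, backbone_before)
head_changed = not torch.equal(head.weight, head_before)
print(backbone_unchanged, head_changed)
```

Frozen parameters never receive a .grad at all, so the backward pass skips their weight gradients entirely, which also saves memory and compute.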
45. What is the purpose of torch.utils.data.random_split() and how do you create train/validation/test splits in PyTorch?
Splitting a dataset into training, validation, and test subsets is a fundamental step before training. PyTorch's random_split() creates non-overlapping random subsets from a single Dataset, while preserving the lazy-loading behaviour of the original Dataset.
import torch
from torch.utils.data import Dataset, DataLoader, random_split
class MyDataset(Dataset):
    def __init__(self, n=1000):
        self.data = torch.randn(n, 20)
        self.labels = torch.randint(0, 3, (n,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
full_dataset = MyDataset(n=1000)
# Split: 70% train, 15% val, 15% test
train_size = int(0.7 * len(full_dataset))
val_size = int(0.15 * len(full_dataset))
test_size = len(full_dataset) - train_size - val_size # remainder, avoids rounding loss
# Use a generator for reproducible splits
generator = torch.Generator().manual_seed(42)
train_ds, val_ds, test_ds = random_split(
full_dataset,
[train_size, val_size, test_size],
generator=generator,
)
print(len(train_ds), len(val_ds), len(test_ds)) # 700 150 150
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False) # no shuffle needed
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)
# IMPORTANT GOTCHA: if your Dataset applies different transforms
# (e.g. data augmentation only for training), random_split alone
# does NOT let you apply different transforms per split, because
# all splits reference the SAME underlying Dataset object.
# Common workaround: split INDICES, then wrap with two separate
# Dataset instances using different transforms
from torch.utils.data import Subset
indices = torch.randperm(len(full_dataset), generator=generator).tolist()
train_idx = indices[:train_size]
val_idx = indices[train_size:train_size+val_size]
# train_dataset_aug = Subset(MyDatasetWithAugmentation(...), train_idx)
# val_dataset_plain = Subset(MyDatasetPlain(...), val_idx)
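The index-splitting workaround can be made concrete with a toy dataset. In this sketch `ToyDataset` and its `transform` argument are hypothetical stand-ins for a real dataset with augmentation; adding 0.5 plays the role of the training-only transform.

```python
import torch
from torch.utils.data import Dataset, Subset

class ToyDataset(Dataset):
    """Hypothetical dataset whose transform is chosen per instance."""
    def __init__(self, n=100, transform=None):
        self.data = torch.arange(n, dtype=torch.float32)
        self.transform = transform
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        x = self.data[idx]
        return self.transform(x) if self.transform else x

# Split INDICES once, reproducibly
g = torch.Generator().manual_seed(42)
indices = torch.randperm(100, generator=g).tolist()
train_idx, val_idx = indices[:80], indices[80:]

# Two dataset instances over the same data, each with its own transform
train_ds = Subset(ToyDataset(transform=lambda x: x + 0.5), train_idx)  # "augmented"
val_ds = Subset(ToyDataset(transform=None), val_idx)                   # plain

print(len(train_ds), len(val_ds))
print(train_ds[0].item(), val_ds[0].item())
```

Because both Subset objects draw from one shared permutation, the splits stay disjoint even though they wrap different Dataset instances.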
46. What is Batch Normalization in PyTorch and how does it differ from Layer Normalization?
Normalization layers stabilise training by re-centring and re-scaling activations. PyTorch provides several variants; Batch Normalization (BatchNorm) and Layer Normalization (LayerNorm) are the two most widely used, but they normalise over different dimensions and suit different architectures.
| Feature | BatchNorm (nn.BatchNorm1d/2d) | LayerNorm (nn.LayerNorm) |
|---|---|---|
| Normalises over | Batch dimension (per-feature statistics) | Feature dimension (per-sample statistics) |
| Statistics at train | Computed from current mini-batch | Computed from current sample's features |
| Statistics at eval | Uses running mean/var accumulated during training | Always computed fresh from current input |
| Batch size dependency | Noisy with very small batches (< 8) | Independent of batch size — works with batch=1 |
| Best for | CNNs (image models) | Transformers, RNNs, NLP models |
| Parameters | gamma (scale), beta (shift) per feature | Same, but normalised per sample |
import torch
import torch.nn as nn
# ── BatchNorm — for feedforward / CNN models
class BNModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)  # 64 features
        self.fc2 = nn.Linear(64, 10)
    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        return self.fc2(x)
# BatchNorm behaves differently in train vs eval mode!
# train: normalise using batch mean/var, update running stats
# eval: use accumulated running_mean / running_var
model = BNModel()
model.train() # must be in train mode during training!
# ── LayerNorm — for transformers and sequence models
class LNModel(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.fc1 = nn.Linear(20, d_model)
        self.ln1 = nn.LayerNorm(d_model)  # normalise over last dim
        self.fc2 = nn.Linear(d_model, 10)
    def forward(self, x):
        x = torch.relu(self.ln1(self.fc1(x)))
        return self.fc2(x)
# LayerNorm produces the SAME result at train and eval
ln_model = LNModel()
ln_model.train()
x = torch.randn(8, 20)
out_train = ln_model(x)
ln_model.eval()
out_eval = ln_model(x)
print(torch.allclose(out_train, out_eval)) # True: LayerNorm is mode-independent!
Common bug: forgetting to call model.train() before training and model.eval() before validation when using BatchNorm. At eval, it uses accumulated running statistics, and if these were never updated (because the model was always in eval mode), predictions will be incorrect.
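The "normalises over" row of the comparison can be checked numerically: after BatchNorm in train mode each feature has mean close to zero across the batch, while after LayerNorm each sample has mean close to zero across its features. A small sketch with arbitrary input values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 8) * 3 + 5        # (batch, features), deliberately off-centre

bn = nn.BatchNorm1d(8).train()        # train mode: uses statistics of this batch
ln = nn.LayerNorm(8)

bn_out, ln_out = bn(x), ln(x)

bn_feature_means = bn_out.mean(dim=0)   # one mean per feature, taken across the batch
ln_sample_means = ln_out.mean(dim=1)    # one mean per sample, taken across features

print(bn_feature_means.abs().max().item())
print(ln_sample_means.abs().max().item())
print(bn.running_mean)                  # updated as a side effect of the train-mode call
```

The last line shows the side effect that makes BatchNorm mode-dependent: the train-mode forward pass moved running_mean away from its zero initialisation, and eval mode would use that buffer instead of the batch statistics.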
47. How do you implement and use a custom loss function in PyTorch?
When built-in loss functions do not fit your task, you can write a custom loss as either a plain function or an nn.Module subclass. As long as the loss is computed from PyTorch tensor operations with requires_grad=True parameters, autograd handles differentiation automatically.
import torch
import torch.nn as nn
import torch.nn.functional as F
# ── Option 1: Plain function (simple, no learnable parameters)
def smooth_l1_custom(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss (Huber loss divided by beta): quadratic for |diff| < beta, linear otherwise."""
    diff = torch.abs(pred - target)
    loss = torch.where(
        diff < beta,
        0.5 * diff ** 2 / beta,  # quadratic region
        diff - 0.5 * beta,       # linear region
    )
    return loss.mean()
# ── Option 2: nn.Module subclass (recommended when loss has hyper-parameters
# or learnable parameters you want saved in state_dict)
class FocalLoss(nn.Module):
    """Focal loss for class-imbalanced multi-class problems."""
    def __init__(self, gamma: float = 2.0, weight: torch.Tensor | None = None):
        super().__init__()
        self.gamma = gamma
        # Buffer (may be None): optional class weights follow .to(device)
        self.register_buffer("weight", weight)
    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (N, C)   targets: (N,) int64
        # Use the UNWEIGHTED cross-entropy to recover pt; with class weights
        # folded in, exp(-ce) would no longer be a probability
        ce_loss = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce_loss)  # probability of correct class
        focal = (1 - pt) ** self.gamma * ce_loss
        if self.weight is not None:
            focal = focal * self.weight[targets]  # apply per-class weights
        return focal.mean()
# Usage — identical to built-in loss functions
model = nn.Linear(10, 5)
focal_fn = FocalLoss(gamma=2.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(16, 10)
target = torch.randint(0, 5, (16,))
optimizer.zero_grad()
logits = model(X)
loss = focal_fn(logits, target) # custom loss used exactly like nn.CrossEntropyLoss
loss.backward() # autograd differentiates through our custom ops
optimizer.step()
print(f"Focal loss: {loss.item():.4f}")
# ── Combining multiple losses (VAE-style sketch: output, target_img, mu and
#    log_var are assumed to come from the model's forward pass)
rec_loss = F.mse_loss(output, target_img)                       # reconstruction
kl_loss = -0.5 * (1 + log_var - mu**2 - log_var.exp()).mean()   # KL divergence
total_loss = rec_loss + 0.001 * kl_loss                         # weighted combination
Key insight: any PyTorch computation graph built from differentiable operations is automatically differentiable via autograd; you do not need to manually derive or implement gradients for custom losses. If you use standard PyTorch operations (torch.*, F.*), autograd takes care of the rest.
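A cheap sanity check for any custom loss is to compare it against a built-in loss in a special case where the two must agree. For focal loss that case is gamma = 0, where the modulating factor becomes 1 and the loss reduces to plain cross-entropy. The function below is a standalone sketch (`focal_loss` is an illustrative name, written without class weights):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss without class weights."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # probability assigned to the true class
    return ((1 - pt) ** gamma * ce).mean()

torch.manual_seed(0)
logits = torch.randn(16, 5)
targets = torch.randint(0, 5, (16,))

ce_reference = F.cross_entropy(logits, targets)
focal_gamma0 = focal_loss(logits, targets, gamma=0.0)
focal_gamma2 = focal_loss(logits, targets, gamma=2.0)

print(ce_reference.item(), focal_gamma0.item(), focal_gamma2.item())
```

With gamma > 0 every sample's term is multiplied by a factor in [0, 1], so the focal value can never exceed the cross-entropy value; that gives a second property to assert on.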
48. What is torch.compile() vs TorchScript and how do you export a PyTorch model for production deployment?
PyTorch offers two main paths for production deployment beyond running the Python interpreter: TorchScript (serialises the model as a language-independent IR) and torch.compile() (JIT compiles for speed within Python). For cross-language/cross-framework deployment, ONNX export is also widely used.
| Method | Best for | Requires Python runtime? | Portable across languages? |
|---|---|---|---|
| torch.compile() | Fastest Python inference; no code changes | Yes | No |
| TorchScript (trace) | Production servers; models with fixed control flow | No | Yes (C++ API) |
| TorchScript (script) | Models with data-dependent control flow (if/loops) | No | Yes |
| ONNX export | Cross-framework deployment (TensorRT, ONNX Runtime, CoreML) | No | Yes (many runtimes) |
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model.eval() # ALWAYS call eval() before exporting
# ── 1. torch.compile() — fast Python-based inference (PyTorch 2.0+)
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(4, 10))
# ── 2. TorchScript trace — captures a concrete execution trace
# Works best when control flow does NOT depend on input data
example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model_traced.pt")
# Load and run without the original Python class
loaded_traced = torch.jit.load("model_traced.pt")
out = loaded_traced(torch.randn(4, 10))
# ── 3. TorchScript script — handles dynamic control flow
class DynamicModel(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.mean() > 0:  # data-dependent branch: trace would miss this!
            return torch.relu(x)
        return torch.tanh(x)
scripted = torch.jit.script(DynamicModel())
torch.jit.save(scripted, "model_scripted.pt")
# ── 4. ONNX export — deploy with ONNX Runtime, TensorRT, CoreML
torch.onnx.export(
model,
example_input,
"model.onnx",
input_names=["features"],
output_names=["logits"],
dynamic_axes={"features": {0: "batch_size"}}, # variable batch size
opset_version=17,
)
# Inference with ONNX Runtime (no PyTorch dependency on deployment host!)
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx")
# out = sess.run(["logits"], {"features": x.numpy()})
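Why tracing "would miss" a data-dependent branch is worth seeing once. In the sketch below (the function name `branchy` is illustrative) the function is traced with an input that takes the relu branch; the traced graph then replays relu for every later input, even when eager Python would have chosen tanh.

```python
import warnings
import torch

def branchy(x: torch.Tensor) -> torch.Tensor:
    if x.mean() > 0:              # data-dependent branch
        return torch.relu(x)
    return torch.tanh(x)

positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # tracing emits a TracerWarning about the branch
    traced = torch.jit.trace(branchy, positive)

eager_neg = branchy(negative)     # eager Python evaluates the condition: tanh branch
traced_neg = traced(negative)     # traced graph replays the recorded relu branch

print(eager_neg)
print(traced_neg)
```

The TracerWarning suppressed here is the only hint that something was lost, which is why torch.jit.script (or a check of traced outputs against eager outputs on several inputs) is the safer choice for models with input-dependent control flow.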