
Python / PyTorch Fundamentals Interview Questions

How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?

Diagnosing a stuck or diverging training loop is one of the most valuable practical PyTorch skills. The shape of the loss curve and a few targeted checks usually reveal the root cause.

Common training failure modes
Symptom | Likely cause | Fix
Loss is NaN from step 1 | Exploding gradients, bad data (inf/NaN inputs), lr too high | Check input data, add gradient clipping, lower lr
Loss never decreases | Vanishing gradients, lr too low, forgot optimizer.step() | Check gradient norms, raise lr, verify training loop order
Loss decreases then plateaus high | Model too small, lr too high for fine convergence | Increase capacity, add lr scheduler
Train loss low, val loss high | Overfitting | Add dropout, weight decay, more data, early stopping
Loss oscillates wildly | lr too high, batch size too small | Lower lr, increase batch size, use lr warmup
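The "lr warmup" fix in the last row can be sketched with `torch.optim.lr_scheduler.LambdaLR`. This is a minimal illustration, not a recommended schedule: the `warmup_steps` value and the `warmup_factor` helper are assumptions chosen for the example, and the loop omits the forward and backward passes.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 5  # hypothetical value; tune for your own run


def warmup_factor(step):
    # Linear ramp from 1/warmup_steps up to 1.0, then constant
    return min(1.0, (step + 1) / warmup_steps)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

lrs = []
for step in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])
    # In a real loop: zero_grad(), forward pass, loss.backward() come first
    optimizer.step()
    scheduler.step()  # call after optimizer.step()

print([f"{lr:.1e}" for lr in lrs])
```

The learning rate starts at a fraction of the base value and reaches the full `lr=1e-3` after `warmup_steps` steps, which can damp the large early updates that make a loss oscillate.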
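import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data so the snippet runs as-is (assumed: 10 features, 5 classes);
# replace with your real DataLoader.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 5, (256,)))
loader  = DataLoader(dataset, batch_size=32, shuffle=True)

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()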

for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)

    # ── Check 1: is the loss finite?
    if not torch.isfinite(loss):
        print(f"Step {step}: non-finite loss = {loss.item()}")
        print("Input contains NaN:", torch.isnan(X).any().item())
        print("Input contains Inf:", torch.isinf(X).any().item())
        break

    loss.backward()

    # ── Check 2: gradient norms — are gradients flowing at all?
    total_norm = sum(
        p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
    ) ** 0.5
    if step % 50 == 0:
        print(f"Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}")

    # ── Check 3: are any gradients None? (means that param was unused!)
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"WARNING: {name} has no gradient — is it used in forward()?")

    optimizer.step()

# ── Check 4: verify model output shape and range make sense
with torch.no_grad():
    sample_out = model(X[:1])
    print("Output range:", sample_out.min().item(), sample_out.max().item())

# ── Check 5: overfit a tiny batch — sanity check the architecture
# If the model cannot drive loss near zero on 5 examples, there is likely a bug
# in the model, the loss, or the labels (rather than a data or capacity problem)
tiny_X, tiny_y = X[:5], y[:5]
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(tiny_X), tiny_y)
    loss.backward()
    optimizer.step()
print(f"Tiny-batch overfit loss: {loss.item():.6f}")  # should trend toward 0; use more steps or a larger lr if it stalls
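The table lists gradient clipping as a fix for exploding gradients. A minimal sketch of one clipped step using `torch.nn.utils.clip_grad_norm_`, assuming deliberately oversized synthetic inputs and an illustrative `max_norm` of 1.0 (neither comes from the original example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Oversized inputs (assumption for the demo) to provoke large gradients
X = torch.randn(32, 10) * 1e3
y = torch.randint(0, 5, (32,))

max_norm = 1.0  # illustrative threshold, not a recommendation

optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most max_norm.
# The return value is the total norm measured BEFORE clipping.
norm_before = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm))

# Recompute the norm to confirm the clip took effect
norm_after = float(torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters())))

optimizer.step()
print(f"grad norm before clip: {norm_before:.2f}, after clip: {norm_after:.4f}")
```

Clipping belongs between `loss.backward()` and `optimizer.step()`. Because the call returns the pre-clip norm, logging it can stand in for the manual norm computation in Check 2. Clipping limits large but finite gradients; it does not repair gradients that are already NaN.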
If a training loss is NaN starting from the very first step, what should you check first?
What does the 'overfit a tiny batch' sanity check (training on 5 examples until loss ≈ 0) verify?
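When the inputs look clean but the loss or gradients still turn NaN, autograd's anomaly detection can name the backward function that first produced a NaN. The sketch below assumes a deliberately broken toy loss. The helper `run_with_anomaly_detection` is hypothetical, not a PyTorch API.

```python
import torch


def run_with_anomaly_detection(make_loss):
    """Run forward and backward under anomaly detection.

    Returns the error message if autograd reports a NaN in backward, else None.
    """
    try:
        with torch.autograd.detect_anomaly():
            loss = make_loss()
            loss.backward()
    except RuntimeError as err:
        return str(err)
    return None


# Broken case: d/dx sqrt(x) at x = 0 is inf, and inf * 0 (upstream grad) is NaN
x_bad = torch.zeros(3, requires_grad=True)
bad_report = run_with_anomaly_detection(lambda: (torch.sqrt(x_bad) * 0.0).sum())

# Healthy case: gradients are finite, so nothing is reported
x_ok = torch.ones(3, requires_grad=True)
ok_report = run_with_anomaly_detection(lambda: (x_ok ** 2).sum())

print("broken loss :", bad_report)
print("healthy loss:", ok_report)
```

Anomaly detection slows training noticeably, so it is best enabled only while hunting for the failing operation and removed afterwards.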

