Python / PyTorch Fundamentals Interview Questions
How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?
Diagnosing a stuck or diverging training loop is one of the most valuable practical PyTorch skills. The shape of the loss curve and a few targeted checks usually reveal the root cause.
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss is NaN from the first steps | Bad data (inf/NaN inputs), lr too high, exploding gradients | Check input data, lower lr, add gradient clipping |
| Loss never decreases | Vanishing gradients, lr too low, forgot optimizer.step() | Check gradient norms, raise lr, verify training loop order |
| Loss decreases then plateaus high | Model too small, lr too high for fine convergence | Increase capacity, add lr scheduler |
| Train loss low, val loss high | Overfitting | Add dropout, weight decay, more data, early stopping |
| Loss oscillates wildly | lr too high, batch size too small | Lower lr, increase batch size, use lr warmup |
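Two of the fixes in the table, gradient clipping and lr warmup, can be sketched as follows. This is a minimal illustration rather than a tuned recipe: the toy model, the deliberately large synthetic inputs, and the values `warmup_steps = 100` and `max_grad_norm = 1.0` are assumptions chosen for the example. It relies on `torch.nn.utils.clip_grad_norm_` and `torch.optim.lr_scheduler.LambdaLR`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup, used only to make the sketch runnable.
model = nn.Linear(10, 5)
criterion = nn.CrossEntropyLoss()
base_lr = 1e-3
warmup_steps = 100      # assumed value
max_grad_norm = 1.0     # assumed value

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

# Linear warmup: the lr multiplier grows from 1/warmup_steps to 1.0, then stays there.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

X = torch.randn(32, 10) * 100.0   # large inputs to provoke large gradients
y = torch.randint(0, 5, (32,))

for step in range(warmup_steps):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    # Rescales gradients in place; returns the total norm measured before clipping.
    pre_clip_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=max_grad_norm
    )
    optimizer.step()
    scheduler.step()

# Total gradient norm after the last clipping call.
post_clip_norm = torch.sqrt(
    sum(p.grad.norm() ** 2 for p in model.parameters())
).item()
final_lr = scheduler.get_last_lr()[0]
print(f"pre-clip norm={pre_clip_norm.item():.3f} post-clip norm={post_clip_norm:.3f}")
print(f"lr after warmup={final_lr:.2e}")
```

Clipping belongs between `loss.backward()` and `optimizer.step()`. Logging `pre_clip_norm` over time shows how often clipping is actually active, which helps when choosing the threshold.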
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset (an assumption so the snippet runs on its own):
# 256 samples, 10 features, 5 classes.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 5, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)

    # ── Check 1: is the loss finite?
    if not torch.isfinite(loss):
        print(f"Step {step}: non-finite loss = {loss.item()}")
        print("Input contains NaN:", torch.isnan(X).any().item())
        print("Input contains Inf:", torch.isinf(X).any().item())
        break

    loss.backward()

    # ── Check 2: gradient norms. Are gradients flowing at all?
    total_norm = sum(
        p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
    ) ** 0.5
    if step % 50 == 0:
        print(f"Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}")

    # ── Check 3: are any gradients None? (means that param was unused!)
    # Frozen parameters (requires_grad=False) are skipped; they never get a gradient.
    for name, p in model.named_parameters():
        if p.requires_grad and p.grad is None:
            print(f"WARNING: {name} has no gradient. Is it used in forward()?")

    optimizer.step()

# ── Check 4: verify model output shape and range make sense
with torch.no_grad():
    sample_out = model(X[:1])
print("Output shape:", tuple(sample_out.shape))
print("Output range:", sample_out.min().item(), sample_out.max().item())

# ── Check 5: overfit a tiny batch as a sanity check of the architecture
# If the loss does not keep falling toward zero on 5 examples, there is likely a bug.
# Note: this reuses (and overfits) the model above, so run it on a throwaway copy
# when the trained weights matter.
tiny_X, tiny_y = X[:5], y[:5]
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(tiny_X), tiny_y)
    loss.backward()
    optimizer.step()
# Should fall steadily; use more steps or a higher lr if it is still far from 0.
print(f"Tiny-batch overfit loss: {loss.item():.6f}")
```
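The single total norm in Check 2 can hide where gradients vanish. A per-parameter breakdown, sketched below, may help locate the affected layer. The helper name `grad_norms_by_parameter` and the deep sigmoid MLP are hypothetical, chosen only because stacked sigmoids tend to shrink gradients toward the input side of the network.

```python
import torch
import torch.nn as nn


def grad_norms_by_parameter(model: nn.Module) -> dict:
    """Map each parameter name to the L2 norm of its gradient.

    Parameters that have no gradient yet are reported as None.
    """
    return {
        name: (p.grad.norm().item() if p.grad is not None else None)
        for name, p in model.named_parameters()
    }


torch.manual_seed(0)

# Hypothetical 6-layer sigmoid MLP followed by a linear classifier head.
layers = []
for _ in range(6):
    layers += [nn.Linear(10, 10), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(10, 5))

# Synthetic batch (an assumption): 32 samples, 10 features, 5 classes.
X = torch.randn(32, 10)
y = torch.randint(0, 5, (32,))

loss = nn.CrossEntropyLoss()(model(X), y)
loss.backward()

norms = grad_norms_by_parameter(model)
for name, norm in norms.items():
    shown = "None" if norm is None else f"{norm:.3e}"
    print(f"{name:12s} grad_norm={shown}")
```

In a network like this, the earliest layers would typically show much smaller norms than the final layer. Norms that sit near zero for the early layers across many steps suggest vanishing gradients rather than a learning rate problem.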
