Python / Python Deep Learning and Neural Networks Interview Questions

What is mixed precision training and how does it speed up deep learning with torch.cuda.amp?

NVIDIA GPUs with Tensor Cores (Volta and later for FP16; Ampere and later for BFloat16) have dedicated hardware for 16-bit floating-point matrix operations, which can be 2–8× faster than FP32 for matrix multiplications. Mixed precision training runs eligible forward and backward operations in FP16 (or BF16) for speed, while keeping the weights in FP32 so that the optimizer update retains numerical precision. With autocast, the parameters themselves stay in FP32 and are cast to the lower precision on the fly for each eligible operation.
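To see why the weights are kept in FP32, the following standard-library sketch rounds a weight to FP16 after every update and compares it with an FP32 copy. The helper `round_to`, the names `lr` and `grad`, and the numbers are illustrative assumptions rather than anything from PyTorch; the `struct` format codes `e` and `f` stand in for FP16 and FP32 storage.

```python
import struct

def round_to(fmt: str, value: float) -> float:
    """Round a Python float to the nearest value storable in a struct format.

    "e" is IEEE 754 half precision (FP16); "f" is single precision (FP32).
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

lr, grad = 1e-3, 0.1      # hypothetical learning rate and gradient
w16 = w32 = 1.0           # the same starting weight in both formats

for _ in range(100):
    update = lr * grad                    # about 1e-4, below half the FP16 spacing near 1.0
    w16 = round_to("e", w16 - update)     # weight stored in FP16
    w32 = round_to("f", w32 - update)     # weight stored in FP32 (the master copy)

print(w16)  # the FP16 weight never moves: every update is rounded away
print(w32)  # the FP32 weight has moved to roughly 0.99
```

The gap between adjacent FP16 values just below 1.0 is about 4.9×10⁻⁴, so an update of 10⁻⁴ rounds back to the same weight every time, while FP32 accumulates it.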

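Loss scaling addresses a key challenge: FP16's limited dynamic range (smallest positive subnormal ≈ 6×10⁻⁸, smallest normal ≈ 6×10⁻⁵) can cause small gradient values to underflow to zero. The scaler multiplies the loss by a large factor before backward; by the chain rule this multiplies every gradient by the same factor and moves small values into FP16's representable range. The gradients are then divided by that factor before the optimizer step. PyTorch's GradScaler automates this and adjusts the scale factor dynamically: it lowers the scale when inf/NaN gradients appear and raises it again after a run of successful steps.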
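Both ideas can be illustrated without a GPU. The sketch below uses the standard library's half-precision `struct` format to show a gradient underflowing and then surviving once it is scaled. It then models the dynamic adjustment with a hypothetical helper, `next_scale`. That helper is not a PyTorch function; its default values are assumed to mirror GradScaler's documented defaults (`init_scale=2**16`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

grad = 1e-8            # hypothetical true gradient, below FP16's smallest subnormal
scale = 2.0 ** 16      # 65536, assumed to match GradScaler's default init_scale

lost = to_fp16(grad)               # stored directly in FP16: underflows to 0.0
scaled = to_fp16(grad * scale)     # scaled first: about 6.55e-4, representable
recovered = scaled / scale         # unscaled in higher precision before the update

def next_scale(scale, found_inf, good_steps,
               growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """Illustrative dynamic-scaling policy; returns (new_scale, new_good_steps)."""
    if found_inf:                        # overflow: the step is skipped, the scale shrinks
        return scale * backoff_factor, 0
    good_steps += 1
    if good_steps == growth_interval:    # long run without overflow: try a larger scale
        return scale * growth_factor, 0
    return scale, good_steps

print(lost, recovered)
print(next_scale(scale, found_inf=True, good_steps=10))
```

Using a power of two for the scale means the multiplication and division change only the exponent, so scaling itself adds no rounding error.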

import torch
import torch.nn as nn
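from torch import autocast             # torch.autocast accepts device_type; torch.cuda.amp.autocast does not
from torch.cuda.amp import GradScaler  # recent PyTorch releases also expose torch.amp.GradScaler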

model     = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler    = GradScaler()           # manages loss scaling automatically

x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()

for step in range(100):
    optimizer.zero_grad()

    # autocast: runs eligible ops in FP16 automatically
    with autocast(device_type='cuda', dtype=torch.float16):
        y_hat = model(x)           # FP16 matrix multiply
        loss  = nn.MSELoss()(y_hat, y)

    # Scale loss -> backward in FP16 -> unscale gradients -> optimizer step
    scaler.scale(loss).backward()  # inflate loss to prevent underflow
    scaler.unscale_(optimizer)     # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip after unscale
    scaler.step(optimizer)         # skip step if gradients are inf/NaN
    scaler.update()                # adjust scale factor for next step

# BFloat16 (bfloat16): hardware-accelerated on Ampere (e.g. A100) and newer GPUs
# - Same 8-bit exponent as FP32 (gradient underflow is rarely an issue -> scaler usually unnecessary)
# - Less precision (7-bit mantissa vs 10-bit for FP16)
with autocast(device_type='cuda', dtype=torch.bfloat16):
    y_hat = model(x)  # no scaler needed with BF16
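The range-versus-precision trade-off described in the comments above can be checked with a small standard-library sketch. The helpers `to_fp16` and `to_bf16` and the sample values are illustrative assumptions; `to_bf16` emulates bfloat16 by rounding an FP32 bit pattern to its upper 16 bits, and does not handle inf or NaN inputs.

```python
import struct

def to_fp16(value: float) -> float:
    """Round to IEEE 754 half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("e", struct.pack("e", value))[0]

def to_bf16(value: float) -> float:
    """Round to bfloat16 (8 exponent bits, 7 mantissa bits).

    bfloat16 is the upper 16 bits of an FP32 value, so round the FP32 bit
    pattern to nearest-even at bit 16 and zero the lower half.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny = 1e-8        # hypothetical small gradient
near_one = 1.001   # hypothetical value that needs fine mantissa resolution

print(to_fp16(tiny), to_bf16(tiny))          # FP16 underflows to 0.0; BF16 keeps roughly 1e-8
print(to_fp16(near_one), to_bf16(near_one))  # FP16 stays close to 1.001; BF16 rounds to 1.0
```

The small value survives in BF16 because of the wider exponent, while the value near 1.0 shows the cost: BF16 resolves steps of only about 0.0078 there, against about 0.001 for FP16.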

Quiz: What problem does loss scaling solve in FP16 mixed precision training?

✓ FP16's small dynamic range can cause small gradient values to underflow to zero; multiplying the loss by a large scalar inflates gradients into the representable FP16 range before backward, and the gradients are divided back before the optimizer update.

✗ Loss scaling prevents the gradient from exploding in very deep networks.

✗ It converts the loss to an integer for faster GPU computation.

✗ Loss scaling increases the model's convergence speed by amplifying updates.

Quiz: Why does BFloat16 not require a GradScaler while FP16 does?

✓ BFloat16 has the same 8-bit exponent as FP32, giving it the same dynamic range and effectively avoiding the underflow problem; it sacrifices mantissa precision instead, which is less critical for gradient values.

✗ BFloat16 is always more numerically precise than FP16.

✗ PyTorch's autocast automatically handles BF16 scaling internally.

✗ BFloat16 is only used on CPUs where underflow is not a concern.

Copyright © 2026, javapedia.net, all rights reserved.