What is knowledge distillation and how does it compress large neural networks into smaller ones?
Knowledge distillation (Hinton et al., 2015) trains a small student network to mimic the output distribution of a large, accurate teacher network. Instead of training only on hard labels (the correct class as a one-hot vector), the student is also trained to match the teacher's soft probabilities — the full output distribution including small probabilities assigned to incorrect classes.
The soft probabilities carry richer information than hard labels: if the teacher assigns 0.7 to 'cat' and 0.25 to 'dog', it signals that the image looks mostly cat-like but also somewhat dog-like, a nuance a one-hot label cannot express. A temperature parameter T controls how sharp this distribution is: p_i = exp(z_i/T) / Σ_j exp(z_j/T), where z_i are the logits. T = 1 gives the standard softmax; T > 1 produces a softer, more uniform distribution that exposes how the teacher ranks the incorrect classes relative to one another, giving the student a richer gradient signal. The same T is applied to both the teacher and the student logits when computing the soft-target term. The distillation loss is a weighted sum of the cross-entropy with the hard labels and the KL divergence from the teacher's soft targets, with the KL term multiplied by T² so that its gradient magnitude stays comparable as T changes.
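To make the effect of T concrete, here is a minimal sketch of the temperature-scaled softmax in plain Python. The logits and the three class names are made-up values chosen for illustration, not numbers from a real teacher model.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for the classes (cat, dog, car).
teacher_logits = [4.0, 3.0, 0.0]

for T in (1.0, 3.0, 10.0):
    probs = softmax_with_temperature(teacher_logits, T)
    # Higher T moves probability mass from the top class to the others,
    # while the ranking of the classes stays the same.
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

As T grows, the top-class probability falls and the small probabilities on the other classes rise, which is the "softer" distribution the student is trained to match.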
import torch
import torch.nn as nn
import torch.nn.functional as F

# BigModel, SmallModel, and loader are placeholders assumed to be defined elsewhere.
teacher = BigModel().eval()   # pretrained; eval mode, no gradients taken below
student = SmallModel()        # to be trained
T = 3.0       # temperature: softens both distributions
alpha = 0.7   # weight for distillation vs hard-label loss
ce_loss = nn.CrossEntropyLoss()
kl_div_loss = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

student.train()
for X, y_hard in loader:
    # Teacher forward (no grad)
    with torch.no_grad():
        teacher_logits = teacher(X)
    # Student forward
    student_logits = student(X)
    # Hard-label cross-entropy
    loss_hard = ce_loss(student_logits, y_hard)
    # Soft-target KL divergence (temperature-scaled)
    # KLDivLoss expects log-probabilities as input and probabilities as target
    student_soft = F.log_softmax(student_logits / T, dim=1)
    teacher_soft = F.softmax(teacher_logits / T, dim=1)
    # T^2 scaling: soft-target gradients shrink as 1/T^2, so rescale them
    loss_kl = kl_div_loss(student_soft, teacher_soft) * (T ** 2)
    loss = alpha * loss_kl + (1 - alpha) * loss_hard
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
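The T² factor in the loop above can be checked numerically. The sketch below uses plain Python and the analytic gradient of the KL term with respect to the student logits, which for temperature-scaled softmax is (q_i - p_i) / T, where q is the student's softened distribution and p the teacher's. The logits are hypothetical values chosen only for illustration.

```python
import math


def softmax(logits, T):
    """Temperature-scaled softmax, returned as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_grad_wrt_student_logits(student_logits, teacher_logits, T):
    """Gradient of KL(teacher_T || student_T) w.r.t. the student logits.

    For temperature-scaled softmax this is (q_i - p_i) / T.
    """
    q = softmax(student_logits, T)
    p = softmax(teacher_logits, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]


def l1_norm(vec):
    return sum(abs(x) for x in vec)


# Hypothetical logits for a single three-class example.
student_logits = [1.0, 0.5, -0.5]
teacher_logits = [4.0, 3.0, 0.0]

for T in (2.0, 4.0, 8.0):
    raw = l1_norm(kl_grad_wrt_student_logits(student_logits, teacher_logits, T))
    # The raw gradient shrinks quickly as T grows; multiplying by T^2
    # keeps it at a similar order of magnitude.
    print(f"T={T}: raw={raw:.4f}  rescaled={raw * T ** 2:.4f}")
```

Without the rescaling, raising T would silently shrink the soft-target term relative to the hard-label term, so the effective balance set by alpha would shift every time T is tuned.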
