Python / Python Deep Learning and Neural Networks Interview Questions
What is model quantization in deep learning and how does PyTorch support it?
Quantization reduces model size and inference latency by representing weights and activations in lower-precision integer formats (INT8, INT4, INT2) rather than FP32 or FP16. For INT8, each 32-bit float is replaced by an 8-bit integer, with a scale factor and zero-point shared across a tensor or channel: x_float = scale × (x_int - zero_point). INT8 storage is about 4× smaller than FP32 (2× smaller than FP16), which lets larger models fit on limited hardware and lets CPUs and mobile accelerators use faster integer arithmetic, usually at the cost of a small accuracy loss.
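The affine mapping above can be checked with a few lines of plain Python. This is a minimal sketch, not PyTorch's implementation: the helper names (`choose_qparams`, `quantize`, `dequantize`), the sample weights, and the signed INT8 range [-128, 127] are assumptions chosen for illustration.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick scale and zero_point so the data range maps onto [qmin, qmax]."""
    # Extend the range to include 0 so that 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """x_int = clamp(round(x_float / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """x_float = scale * (x_int - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.2, -0.3, 0.0, 0.45, 2.1]  # illustrative values only
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

For values that are not clamped at the ends of the range, the round-trip error is at most half a quantization step (scale / 2). That rounding noise is what the model has to tolerate after quantization.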
PyTorch supports three main approaches: (1) Post-training static quantization (PTQ): quantize a trained FP32 model without retraining, using a small calibration dataset to determine the activation scale factors and zero-points; (2) Quantization-Aware Training (QAT): simulate quantization during training with fake-quantization modules so the model adapts to the rounding error, typically recovering most of the accuracy lost by PTQ; (3) Dynamic quantization: weights are quantized ahead of time and activation ranges are computed on the fly at inference, so no calibration data is needed (the simplest option and a good baseline for RNNs).
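To make the "fake quantization" idea in QAT concrete, here is a toy sketch in plain Python rather than PyTorch's QAT modules. It fits a single weight while the forward pass only ever sees the weight snapped to a quantization grid, and the update treats the rounding step as if its gradient were 1 (the straight-through estimator). The function name `fake_quantize`, the fixed step size, the learning rate, and the toy data are all assumptions made for illustration.

```python
def fake_quantize(w, scale, qmin=-128, qmax=127):
    """Quantize then immediately dequantize: float in, float out, snapped to the grid."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


# Toy problem: fit y = 0.73 * x with a single weight.
data = [(x, 0.73 * x) for x in (-2.0, -1.0, 0.5, 1.5, 3.0)]
scale = 0.05  # fixed step size (an assumption; real QAT observes or learns it)
w = 0.0       # latent full-precision weight that the optimizer updates
lr = 0.02

for _ in range(200):
    w_q = fake_quantize(w, scale)  # the forward pass uses the quantized weight
    # Gradient of the mean squared error with respect to w_q.
    grad = sum(2 * (w_q * x - y) * x for x, y in data) / len(data)
    # Straight-through estimator: apply that gradient directly to the latent weight.
    w -= lr * grad

w_q = fake_quantize(w, scale)
print('latent weight:', w, 'quantized weight:', w_q)
```

The latent weight settles near the boundary between the two grid points closest to the target, so the quantized weight ends up within one step of 0.73. Training with the rounding in the loop is what lets a model adapt to quantization instead of being surprised by it afterwards.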
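    import torch
    import torch.nn as nn
    from torch.ao.quantization import (
        DeQuantStub, QuantStub, convert, get_default_qconfig, prepare, quantize_dynamic,
    )

    # ─── Dynamic quantization (simplest: INT8 weights, activations quantized on the fly) ───
    # quantize_dynamic swaps child modules, so the LSTM is wrapped in a container.
    model_fp32 = nn.Sequential(nn.LSTM(input_size=64, hidden_size=128))
    model_dynamic = quantize_dynamic(
        model_fp32, qconfig_spec={nn.Linear, nn.LSTM}, dtype=torch.qint8
    )
    print('FP32 size:', sum(p.numel() * 4 for p in model_fp32.parameters()), 'bytes')
    # The INT8 weights take roughly a quarter of that

    # ─── Post-training static quantization (eager mode) ───
    model = nn.Sequential(
        QuantStub(),          # float -> INT8 at the model input
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Linear(256, 10),
        DeQuantStub(),        # INT8 -> float at the model output
    )
    model.eval()
    model.qconfig = get_default_qconfig('fbgemm')  # x86 backend; 'qnnpack' targets ARM
    model_prepared = prepare(model)                # insert observer modules

    # Calibrate with representative data. The random tensors below are only a
    # stand-in; replace them with a DataLoader over real samples.
    calibration_loader = [(torch.randn(32, 784), None) for _ in range(8)]
    with torch.no_grad():
        for X_cal, _ in calibration_loader:
            model_prepared(X_cal)
    model_static = convert(model_prepared)         # swap in INT8 modules

    # ─── Modern approach for LLMs: bitsandbytes / LLM.int8() ───
    # 8-bit quantization of LLM weights with minimal accuracy loss.
    # Allows running 7B+ parameter models on consumer GPUs
    # (requires the bitsandbytes package and a CUDA GPU).
    # from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    # model = AutoModelForCausalLM.from_pretrained(
    #     'gpt2', quantization_config=BitsAndBytesConfig(load_in_8bit=True)
    # )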
