How do you efficiently load large Hugging Face models for inference, including quantization and device placement?
Loading a 7B+ parameter model naively with from_pretrained() and no dtype argument typically materialises every weight in FP32. At 4 bytes per parameter that is ~28 GB for 7B params, which exceeds the memory of most single GPUs. Modern Hugging Face loading relies on three key techniques: precision reduction (bfloat16 / float16), device mapping with device_map, and on-the-fly quantisation with bitsandbytes.
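The memory figures quoted here follow from simple arithmetic: parameter count times bytes per parameter. The sketch below is only an illustration of that calculation; the helper name estimate_weight_gb and the BITS_PER_PARAM table are hypothetical, and the numbers cover weights only.

```python
# Nominal storage width per parameter, in bits (weights only).
# These are assumed round figures; 4-bit schemes also store small
# per-block quantisation constants that are ignored here.
BITS_PER_PARAM = {"fp32": 32, "bf16": 16, "fp16": 16, "int8": 8, "nf4": 4}


def estimate_weight_gb(n_params: float, dtype: str) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = n_params * BITS_PER_PARAM[dtype] / 8
    return total_bytes / 1e9


# Compare the three loading strategies for a nominal 7B-parameter model.
for dtype in ("fp32", "bf16", "nf4"):
    print(f"{dtype}: {estimate_weight_gb(7e9, dtype):.1f} GB")
```

For 7B parameters this gives 28 GB, 14 GB and 3.5 GB respectively. Real usage is higher, because activations, the KV cache and the CUDA context are not counted.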
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'mistralai/Mistral-7B-Instruct-v0.3'

# ── Option 1: Half precision (BF16), 2x memory saving vs FP32, minimal accuracy loss
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 2 bytes per weight instead of 4
    device_map='auto',           # place layers on available GPUs first, then CPU (needs accelerate)
)

# ── Option 2: 4-bit quantization with bitsandbytes (QLoRA-style)
# Requires the bitsandbytes package to be installed.
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',              # NormalFloat4 quantisation
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16 after dequantisation
    bnb_4bit_use_double_quant=True,         # nested quantisation of the quantisation constants
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)
# The 7B weights now take roughly 4 GB of VRAM; activations and the KV cache are extra

# ── Inference with generate()
messages = [{'role': 'user', 'content': 'What is the capital of France?'}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors='pt', add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs,
        max_new_tokens=200,
        temperature=0.6,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the generated tokens (not the input prompt)
generated = tokenizer.decode(
    output_ids[0][inputs.shape[1]:], skip_special_tokens=True
)
print(generated)
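
When device_map='auto' is used, placement can be bounded with the max_memory argument of from_pretrained(), which maps GPU indices and 'cpu' to a size limit. The following is a minimal sketch under stated assumptions: build_max_memory and load_with_budget are hypothetical helper names, and the 10% headroom is an arbitrary choice meant to leave room for activations and the KV cache.

```python
def build_max_memory(gpu_total_gib, cpu_gib, headroom=0.1):
    """Build a max_memory mapping: GPU index -> 'NGiB', plus a 'cpu' entry.

    gpu_total_gib: total memory of each GPU in GiB, in device order.
    headroom: fraction of each GPU left unassigned (assumed value).
    """
    budget = {}
    for index, total in enumerate(gpu_total_gib):
        usable = int(total * (1.0 - headroom))  # round down to whole GiB
        budget[index] = f"{usable}GiB"
    budget["cpu"] = f"{int(cpu_gib)}GiB"
    return budget


def load_with_budget(model_id, max_memory):
    """Load in BF16 under the given budget and report where layers landed."""
    # Imported lazily so build_max_memory works without these packages.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        max_memory=max_memory,
    )
    print(model.hf_device_map)  # module name -> device it was placed on
    print(f"{model.get_memory_footprint() / 1024**3:.1f} GiB of weights")
    return model


# Example budget for two 24 GiB GPUs and 64 GiB of CPU RAM.
print(build_max_memory([24, 24], cpu_gib=64))
```

Calling load_with_budget(model_id, build_max_memory([24, 24], 64)) would then load the model within that budget. Layers that do not fit on the GPUs are offloaded to CPU, which keeps loading possible but slows generation considerably.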
