Python / Python Modern Generative AI and Agents Interview Questions
How does tokenisation work in Hugging Face and what are the key tokenizer concepts?
Tokenisation converts raw text into integer IDs that the model can process. Modern LLMs use subword tokenisation (BPE, WordPiece, or SentencePiece) rather than word or character tokenisation, balancing vocabulary size against the number of tokens per sentence. Each model family has its own tokeniser trained alongside its vocabulary — you must always use the matching tokeniser for a given model.
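To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenisers apply to a single word. It is an illustration under stated assumptions, not the Hugging Face implementation: the function name `wordpiece_tokenize` and the vocabulary `TOY_VOCAB` are hypothetical, and real tokenisers also handle normalisation, pre-tokenisation, and much larger learned vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate span from the right until it matches a vocabulary entry.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"token", "##isation", "##s", "un", "##known", "[UNK]"}

print(wordpiece_tokenize("tokenisation", TOY_VOCAB))  # ['token', '##isation']
print(wordpiece_tokenize("tokens", TOY_VOCAB))        # ['token', '##s']
print(wordpiece_tokenize("xyz", TOY_VOCAB))           # ['[UNK]']
```

A small vocabulary of pieces can cover many words this way, which is the trade-off between vocabulary size and tokens per sentence described above.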
Key concepts to understand: special tokens ([CLS], [SEP], <s>, </s>, <pad>) mark sequence boundaries, the classification position, and padding; attention masks are binary tensors that tell the model which positions are real tokens (1) and which are padding (0); padding and truncation bring variable-length inputs to a common length so they can be stacked into a batch; fast tokenizers (Rust-backed) are typically much faster than their pure-Python equivalents, especially on batched input.
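The relationship between padding, truncation, and the attention mask can be sketched in plain Python. This is a simplified illustration, not the library's code: the helper `pad_batch` is hypothetical, the token IDs are made up, `pad_id=0` is an assumption, and real tokenisers keep special tokens such as [SEP] in place when truncating.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) lists of token IDs to one shared length."""
    if max_length is not None:
        # Simplified truncation: just cut each sequence at max_length.
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)  # pad to the longest sequence in the batch
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs for illustration only.
ids, mask = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(ids)   # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Note that the batch is padded to the longest sequence it contains, not to `max_length`; `max_length` only caps how long a sequence may be.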
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Encode a single sentence
text = 'Hugging Face makes NLP easy.'
encoding = tokenizer(text, return_tensors='pt')
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoding['input_ids'])
# tensor([[101, ..., 102]]): one row of token IDs; 101 is [CLS] and 102 is [SEP]
# Decode back to text
print(tokenizer.decode(encoding['input_ids'][0]))
# [CLS] hugging face makes nlp easy. [SEP]
# Batch encoding with padding and truncation
texts = [
    'Short text.',
    'This is a much longer piece of text that goes on and on.',
]
batch = tokenizer(
    texts,
    padding=True,         # pad shorter sequences to the length of the longest in the batch
    truncation=True,      # truncate sequences longer than max_length
    max_length=128,
    return_tensors='pt',  # return PyTorch tensors
)
print(batch['input_ids'].shape)  # torch.Size([2, N]); N is the longest sequence's token count, not 128
# Use padding='max_length' instead to pad every sequence to exactly max_length
print(batch['attention_mask'])   # 1 for real tokens, 0 for padding
# Token-level operations
tokens = tokenizer.tokenize('unbelievably')
print(tokens)  # WordPiece subwords; continuation pieces start with '##' (exact split depends on the vocabulary)
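# Count tokens before calling an API (avoid surprises); use the tokenizer that matches the target model
# encode() adds special tokens by default, so [CLS] and [SEP] are included in the count
n_tokens = len(tokenizer.encode('Hello world'))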
print(f'{n_tokens} tokens')
