What are the most important text splitting strategies in RAG, and how do chunk size and overlap affect retrieval quality?
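Chunk size and overlap are among the most impactful hyperparameters in a RAG pipeline: they directly affect both retrieval precision and answer quality. A chunk that is too small may contain only a fragment of a complete thought, so the retrieved text lacks the context needed to answer. A chunk that is too large may contain so much irrelevant content that its embedding becomes less specific, the LLM's attention is diluted, and token cost increases. Overlap repeats the end of one chunk at the start of the next, so an idea cut at a boundary still appears whole in at least one chunk, at the price of storing and embedding some text twice.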
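To make the size/overlap trade-off concrete, here is a minimal sketch of a fixed-size sliding window in plain Python. The function name `chunk_with_overlap` is hypothetical, and this is not how any particular library splits text (real splitters usually break on separators first); it only illustrates that each new chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
no_overlap = chunk_with_overlap(text, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=10, chunk_overlap=4)

print(no_overlap)    # 3 chunks, no repeated text
print(with_overlap)  # 4 chunks, each sharing 4 characters with its neighbour
```

With the same chunk size, adding overlap produces more chunks for the same text, which is the storage and embedding cost paid for keeping boundary content intact.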
| Splitter | Logic | Best for |
|---|---|---|
| CharacterTextSplitter | Split on a single separator character (e.g. newline) | Simple documents with clear delimiters |
| RecursiveCharacterTextSplitter | Try paragraph → sentence → word splits in order until chunks are small enough | General purpose; most common default |
| TokenTextSplitter | Split by actual model tokens, not characters | Precise context window management |
| MarkdownHeaderTextSplitter | Split at Markdown headers, preserving structure in metadata | Technical docs, wikis, README files |
| SemanticChunker | Embed sentences, split where embedding similarity drops | Dense prose without clear structure |
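The semantic row of the table can be sketched without any library. The example below is a toy illustration under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the fixed `threshold` of 0.2 is arbitrary, and `semantic_chunks` is a hypothetical name. Library implementations may choose breakpoints differently; the only idea shown is "start a new chunk where similarity between adjacent sentences drops".

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; break where adjacent similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])    # similarity dropped: start a new chunk
        else:
            chunks[-1].append(cur)  # still on topic: extend the current chunk
    return chunks


sentences = [
    "Vector databases store embeddings.",
    "Embeddings in vector databases enable similarity search.",
    "The chef seasoned the soup.",
    "The soup needed more salt.",
]
groups = semantic_chunks(sentences)
for group in groups:
    print(group)
```

Here the two database sentences stay together and the two cooking sentences form a second chunk, because the middle pair shares no words.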
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter,
)

# ── RecursiveCharacterTextSplitter: general default
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,                          # characters per chunk
    chunk_overlap=200,                        # overlap to avoid cutting mid-thought
    separators=['\n\n', '\n', '.', ' ', ''],  # try in order
    length_function=len,                      # can swap for a token-counting function
)

# ── TokenTextSplitter: respect the model context window precisely
# (requires the tiktoken package)
token_splitter = TokenTextSplitter(
    encoding_name='cl100k_base',  # GPT-4 / text-embedding-3 encoding
    chunk_size=256,               # tokens per chunk
    chunk_overlap=50,             # tokens shared between adjacent chunks
)

# ── MarkdownHeaderTextSplitter: preserves document structure
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ('#', 'section'),
        ('##', 'subsection'),
    ]
)
md_text = '# Introduction\nWelcome!\n## Background\nSome history...'
sections = md_splitter.split_text(md_text)
for s in sections:
    # each section is a Document; the headers it sits under land in metadata
    print(s.page_content, s.metadata)

# Rule of thumb for chunk_size:
# - 256–512 tokens: high precision retrieval, lower recall
# - 512–1024 tokens: balanced; most common for dense docs
# - 1024–2048 tokens: higher recall, more noise per chunk
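The rule of thumb above is stated in tokens, while `RecursiveCharacterTextSplitter` with `length_function=len` counts characters. The sketch below shows the back-of-the-envelope arithmetic that links the two and estimates how much context a query pulls in. Both helper names are hypothetical, the ratio of 4 characters per token is only a rough assumption for English text (measure it with the real tokenizer when it matters), and `top_k=5` is an illustrative retrieval setting, not a recommendation.

```python
def approx_chars(chunk_tokens: int, chars_per_token: float = 4.0) -> int:
    """Rough character budget for a token budget (assumed ratio, English text)."""
    return round(chunk_tokens * chars_per_token)


def retrieved_context_tokens(chunk_tokens: int, top_k: int) -> int:
    """Upper bound on tokens added to the prompt when top_k chunks are retrieved."""
    return chunk_tokens * top_k


for chunk_tokens in (256, 512, 1024, 2048):
    print(
        f"{chunk_tokens} tokens/chunk "
        f"~ {approx_chars(chunk_tokens)} chars, "
        f"top_k=5 adds up to {retrieved_context_tokens(chunk_tokens, top_k=5)} tokens"
    )
```

Doubling the chunk size doubles the prompt tokens spent per query at a fixed `top_k`, which is one reason larger chunks raise cost as well as noise.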
