Database / ChromaDB Interview Questions
What are effective document chunking strategies when indexing documents into ChromaDB for RAG?
Before adding documents to ChromaDB, long texts must be split into chunks that fit within the embedding model's token limit and contain cohesive information. Chunk size and overlap directly affect retrieval quality.
```python
# pip install langchain-text-splitters chromadb
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,  # token-based alternative (requires tiktoken); not used below
)
import chromadb

# RecursiveCharacterTextSplitter: tries to split at natural boundaries
# (paragraphs → sentences → words → characters)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # measured in characters, not tokens (very roughly 100–125 tokens of English)
    chunk_overlap=50,  # overlap prevents losing context at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)

long_document = """ChromaDB is an open-source vector database.
It supports multiple embedding functions including OpenAI and HuggingFace.
ChromaDB uses HNSW for approximate nearest-neighbour search.
You can filter results using metadata fields.
Persistent storage uses SQLite under the hood.
""" * 20  # repeat to make it long

chunks = splitter.split_text(long_document)
print(f"Split into {len(chunks)} chunks")
print(f"First chunk length: {len(chunks[0])} chars")

# Add chunks to ChromaDB with source metadata
client = chromadb.Client()  # in-memory client; data is not persisted
col = client.create_collection("chunked_docs")
col.add(
    documents=chunks,
    metadatas=[{"source": "chroma_guide.txt", "chunk_idx": i}
               for i in range(len(chunks))],
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
```

| Strategy | Chunk size | Overlap | Best for |
|---|---|---|---|
| Small chunks | 100–200 tokens | 10–20 tokens | Precise retrieval, FAQ-style docs |
| Medium chunks | 300–500 tokens | 50 tokens | Most RAG use cases — good balance |
| Large chunks | 800–1000 tokens | 100 tokens | Long-form prose where context matters |
| Semantic chunking | Variable | 0 | Academic papers, structured content |
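The sizes in the table are token counts, whereas the splitter example above measures `chunk_size` in characters, so it can be worth checking chunks against a token budget before indexing them. The following is a minimal sketch of such a check. The helper names `approx_token_count` and `oversized_chunks` are hypothetical, and counting whitespace-separated words is only a crude stand-in for the embedding model's real tokenizer, which will usually report more tokens than words.

```python
def approx_token_count(text):
    """Crude token estimate: number of whitespace-separated words.

    This is an assumption for illustration only; replace it with the
    tokenizer of the embedding model you actually use.
    """
    return len(text.split())


def oversized_chunks(chunks, max_tokens, count_tokens=approx_token_count):
    """Return (index, token_count) for every chunk that exceeds max_tokens."""
    too_big = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            too_big.append((i, n))
    return too_big


# Two sample chunks: one with 5 words, one with 600 words.
sample_chunks = ["short chunk of five words", "word " * 600]
flagged = oversized_chunks(sample_chunks, max_tokens=500)
print(flagged)  # → [(1, 600)]
```

Passing the counter as a parameter keeps the check independent of any particular tokenizer, so the word-count placeholder can be swapped out without touching the rest of the code.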
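Key rule: overlap repeats the end of each chunk at the start of the next, so a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk instead of being cut in two and losing its meaning in both halves. Typical overlap is 10–20% of chunk size.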
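To make the overlap mechanics concrete, here is a minimal sketch of a fixed-size sliding window with overlap, written in plain Python. The function name `chunk_with_overlap` is hypothetical, and whitespace-style word tokens stand in for real model tokens; this is not how any particular splitter library implements it.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Placeholder tokens w0..w24 (an assumption standing in for real tokens).
words = [f"w{i}" for i in range(25)]
chunks = chunk_with_overlap(words, chunk_size=10, overlap=2)

for i, chunk in enumerate(chunks):
    print(i, chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

With `chunk_size=10` and `overlap=2` (20% of the chunk size), the last two tokens of each chunk are repeated as the first two tokens of the next, which is the behaviour the key rule describes.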
