
Database / ChromaDB Interview Questions

1. What is ChromaDB and what problem does it solve?
2. What are embeddings and why are they central to how ChromaDB works?
3. What distance metrics does ChromaDB support and how do you choose between them?
4. What is a ChromaDB collection and how do you create, list, get, and delete collections?
5. How do you add documents to a ChromaDB collection?
6. How do you query a ChromaDB collection for similar documents?
7. How do you retrieve, update, and delete specific documents in ChromaDB?
8. How do you filter query results using metadata in ChromaDB?
9. What is the difference between ChromaDB's in-memory and persistent storage modes?
10. What is ChromaDB's default embedding function and how does it work?
11. How do you use the OpenAI embedding function with ChromaDB?
12. How do you use HuggingFace models as embedding functions in ChromaDB?
13. How do you create a custom embedding function for ChromaDB?
14. How does ChromaDB's PersistentClient store data on disk, and what are its limitations?
15. What is the HNSW index in ChromaDB and what parameters can you tune?
16. How do you efficiently add large numbers of documents to ChromaDB using batching?
17. What is the where_document filter in ChromaDB and how does it differ from where?
18. How do you control what data ChromaDB returns in query and get results using include?
19. How do you design metadata schemas for effective filtering in ChromaDB?
20. How do you inspect a ChromaDB collection's contents and configuration?
21. How do you build a basic RAG (Retrieval-Augmented Generation) pipeline with ChromaDB?
22. What are effective document chunking strategies when indexing documents into ChromaDB for RAG?
23. How do you use ChromaDB as a vector store with LangChain?
24. How do you implement multi-tenancy or data isolation in ChromaDB?
25. What is embedding consistency and why is it critical in ChromaDB applications?
26. How do you run ChromaDB as a standalone HTTP server and connect to it from multiple clients?
27. When should you use upsert() instead of add() in ChromaDB, and what are common patterns?
28. What are best practices for structuring ChromaDB collection metadata for production use?
29. How does ChromaDB compare to FAISS, and when should you choose one over the other?
30. What are common ChromaDB errors and how do you handle them in production code?
31. How do you back up and restore a ChromaDB persistent database?
32. How do you ensure the correct embedding function is used when reopening a persistent ChromaDB collection?
33. How do you interpret ChromaDB query distances and convert them into meaningful relevance scores?
34. What are ChromaDB's practical size limits and performance characteristics at scale?
35. How do you use ChromaDB to detect and remove near-duplicate or semantically similar documents?
36. How do you reset or clear a ChromaDB collection without deleting and recreating it?
37. What configuration settings does ChromaDB support and how do you disable telemetry?
38. What is a production readiness checklist for a ChromaDB-based application?

1. What is ChromaDB and what problem does it solve?

ChromaDB is an open-source, AI-native vector database designed to store, index, and query high-dimensional embedding vectors efficiently. It was created specifically to make building LLM-powered applications easy — particularly for retrieval-augmented generation (RAG), semantic search, and recommendation systems.

Traditional databases store and search data using exact matches or SQL-style predicates. ChromaDB instead answers the question: "which stored items are most semantically similar to this query?" It does this by storing numerical vectors (embeddings) that represent the meaning of text, images, or other data, and finding nearest neighbours using vector similarity search.

ChromaDB at a glance
| Feature | Detail |
| --- | --- |
| Language | Python-first (also JS/TS client) |
| License | Apache 2.0 open source |
| Storage modes | In-memory (ephemeral) or persistent (disk-backed) |
| Default embedding model | all-MiniLM-L6-v2 (a Sentence Transformers model; recent releases run it through ONNX Runtime) |
| Distance metrics | cosine, l2 (squared Euclidean), ip (inner product) |
| Primary use cases | RAG pipelines, semantic search, duplicate detection, recommendation |
pip install chromadb

import chromadb

# Quickstart — in-memory client
client = chromadb.Client()
collection = client.create_collection("my_docs")
collection.add(
    documents=["ChromaDB is a vector database", "Python is great"],
    ids=["doc1", "doc2"],
)
results = collection.query(query_texts=["vector store for AI"], n_results=1)
print(results["documents"])  # [["ChromaDB is a vector database"]]
What type of database is ChromaDB?
What is the primary query type that ChromaDB is designed for?
2. What are embeddings and why are they central to how ChromaDB works?

An embedding is a dense numerical vector — a list of floating-point numbers — that represents the semantic meaning of a piece of data. Text, images, audio, and code can all be converted into embeddings by a neural network (embedding model). Items with similar meanings produce vectors that are close together in the high-dimensional vector space.

ChromaDB stores these vectors alongside the original data and metadata. When you query ChromaDB with a new piece of text, the same embedding model converts it to a vector, and ChromaDB uses an approximate nearest-neighbour (ANN) algorithm to find the stored vectors that are geometrically closest — these correspond to the most semantically relevant stored documents.
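To make the idea of "geometrically closest" concrete, here is a minimal sketch of exhaustive nearest-neighbour search in plain Python. It is not how ChromaDB searches internally (ChromaDB uses an approximate index), and the three-dimensional vectors and the `nearest` helper are hand-written stand-ins for real embeddings, chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, stored, k=1):
    """Exhaustive search: score every stored vector and return the top-k ids."""
    ranked = sorted(
        stored,
        key=lambda item_id: cosine_similarity(query, stored[item_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings
stored = {
    "vector-db": [0.9, 0.1, 0.0],
    "ai-apps":   [0.8, 0.3, 0.1],
    "cat-tuna":  [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]

print(nearest(query, stored, k=2))  # ['vector-db', 'ai-apps']
```

An ANN index aims to return the same ranking without scoring every stored vector, which is what keeps queries fast as the collection grows.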

# Conceptual illustration
# "ChromaDB is a vector database"  → [0.12, -0.45, 0.89, ..., 0.03]  (384 numbers)
# "Vector stores for AI apps"      → [0.14, -0.41, 0.91, ..., 0.01]  (close!)
# "My cat loves tuna fish"         → [-0.55, 0.72, -0.11, ..., 0.88] (far away)

import chromadb

# You can inspect the raw embedding vectors ChromaDB generates
client = chromadb.Client()
collection = client.create_collection("demo")
collection.add(documents=["Hello world"], ids=["1"])

# Get the stored embedding
result = collection.get(ids=["1"], include=["embeddings"])
print(len(result["embeddings"][0]))   # 384 — length of the default model's vector
print(result["embeddings"][0][:5])    # first 5 of 384 floats
Embedding dimensions by model
| Model | Dimensions | Notes |
| --- | --- | --- |
| all-MiniLM-L6-v2 (default) | 384 | Fast, small, good for English |
| text-embedding-ada-002 (OpenAI) | 1536 | High quality, API call required |
| text-embedding-3-small (OpenAI) | 1536 | Newer, cheaper than ada-002 |
| all-mpnet-base-v2 | 768 | Higher quality than MiniLM, slower |
| CLIP (images) | 512 | Multimodal: text and images share one space |
What is an embedding in the context of ChromaDB?
Why do semantically similar texts produce vectors that are close together?
3. What distance metrics does ChromaDB support and how do you choose between them?

ChromaDB uses a distance metric to measure how similar two vectors are during nearest-neighbour search. The metric is set at collection creation time and cannot be changed afterward. Choosing the wrong metric for your embedding model can significantly degrade search quality.

ChromaDB distance metrics
| Metric | hnsw:space value | Distance formula | Best for |
| --- | --- | --- | --- |
| Squared L2 (Euclidean) | l2 (default) | Σ(aᵢ − bᵢ)² | When vector magnitude matters; general purpose |
| Cosine distance | cosine | 1 − (a·b)/(‖a‖‖b‖) | Text embeddings; compares direction, not magnitude |
| Inner product | ip | 1 − (a·b) | When embeddings are pre-normalised to unit length |
import chromadb

client = chromadb.Client()

# Set the metric at collection creation; it cannot be changed later
collection_cosine = client.create_collection(
    name="text_cosine",
    metadata={"hnsw:space": "cosine"},  # recommended for text
)

collection_l2 = client.create_collection(
    name="general_l2",
    metadata={"hnsw:space": "l2"},  # default if not specified
)

collection_ip = client.create_collection(
    name="normalised_ip",
    metadata={"hnsw:space": "ip"},  # use when vectors are unit-normalised
)

# Query returns a "distances" field; for every metric, lower = more similar:
# cosine: 0 = same direction, 2 = opposite direction
# l2:     0 = identical, grows without bound (squared Euclidean distance)
# ip:     1 - dot product; ranges from 0 to 2 for unit-normalised vectors

Rule of thumb: most popular text embedding models (OpenAI, Sentence Transformers) are optimised for cosine similarity. Use "hnsw:space": "cosine" for text RAG applications. L2 is the default but is less optimal for text embeddings that vary in magnitude.
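The relationship between the three metrics is easier to see with numbers. The sketch below computes each distance by hand, assuming the definitions used by Chroma's underlying HNSW library (squared L2, 1 minus cosine similarity, and 1 minus the dot product). The helper names are illustrative and not part of the ChromaDB API.

```python
import math

def squared_l2(a, b):
    """Sum of squared component differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity; depends only on direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def ip_distance(a, b):
    """1 minus the raw dot product; sensitive to vector length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = normalise([3.0, 4.0])  # [0.6, 0.8]
b = normalise([4.0, 3.0])  # [0.8, 0.6]

print(cosine_distance(a, b))  # approximately 0.04
print(ip_distance(a, b))      # approximately 0.04, same as cosine for unit vectors
print(squared_l2(a, b))       # approximately 0.08, twice the cosine distance
```

For unit-length vectors all three metrics therefore produce the same ranking. They only diverge when vector magnitudes differ, which is why cosine is the safer choice when you are unsure whether a model normalises its output.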

Which distance metric is generally recommended for text embedding models like those from OpenAI or Sentence Transformers?
When can you change the distance metric of an existing ChromaDB collection?
4. What is a ChromaDB collection and how do you create, list, get, and delete collections?

A collection is ChromaDB's primary organisational unit, analogous to a table in SQL or an index in a search engine. Each collection stores documents, their embeddings, IDs, and optional metadata. All items in a collection share the same embedding function and distance metric.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# CREATE a collection
collection = client.create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"},
    # embedding_function defaults to all-MiniLM-L6-v2
)

# GET an existing collection (raises error if not found)
collection = client.get_collection("research_papers")

# GET or CREATE - idempotent, safe to call on every startup
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"},
)

# LIST all collections
collections = client.list_collections()
for col in collections:
    print(col.name)  # prints collection names

# COUNT documents in a collection
print(collection.count())  # number of items stored

# MODIFY the collection name (do this while the collection still exists)
# metadata= can also be passed, but the hnsw:space metric cannot be changed
collection.modify(name="arxiv_papers")

# DELETE a collection and all its data (use its current name)
client.delete_collection("arxiv_papers")
Collection management methods
| Method | Purpose | Raises if |
| --- | --- | --- |
| create_collection(name) | Creates new collection | Name already exists |
| get_collection(name) | Gets existing collection | Name not found |
| get_or_create_collection(name) | Idempotent get/create | Does not raise for an existing or missing name |
| list_collections() | Returns all collection names | - |
| delete_collection(name) | Permanently deletes collection + data | Name not found |
What is the safest method to use when initialising a collection at application startup, to avoid errors whether the collection already exists or not?
What does collection.count() return?
5. How do you add documents to a ChromaDB collection?

The collection.add() method inserts items into a collection. Each item requires a unique id. You can provide raw documents (strings) and let ChromaDB embed them, or supply pre-computed embeddings directly. Optional metadatas store filterable key-value pairs alongside each document.

import chromadb

client = chromadb.Client()
collection = client.create_collection("articles")

# Basic add — ChromaDB embeds documents automatically
collection.add(
    documents=[
        "ChromaDB is an open-source vector database.",
        "Retrieval-augmented generation improves LLM accuracy.",
        "Python is a popular language for data science.",
    ],
    ids=["art-001", "art-002", "art-003"],
)

# Add with metadata — enables filtered queries later
collection.add(
    documents=[
        "FastAPI is a modern Python web framework.",
        "React is a JavaScript library for building UIs.",
    ],
    metadatas=[
        {"source": "docs", "category": "backend",  "year": 2024},
        {"source": "docs", "category": "frontend", "year": 2024},
    ],
    ids=["art-004", "art-005"],
)

# Add pre-computed embeddings (skip ChromaDB's embedding step)
collection_custom = client.create_collection(
    "custom_embeddings",
    metadata={"hnsw:space": "cosine"},
)
collection_custom.add(
    embeddings=[
        [0.1, 0.5, -0.3, 0.8],  # every vector in a collection must have the same dimension
        [0.4, 0.2,  0.9, -0.1],
    ],
    documents=["Doc A", "Doc B"],  # stored as-is for retrieval
    ids=["e-1", "e-2"],
)
# Query this collection with query_embeddings= of the same dimension (4 here)

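ID rules: IDs must be strings, must be unique within the collection, and must not be empty. add() never overwrites an existing record: in recent releases an ID that already exists is skipped without raising, while repeating an ID within a single add() call raises chromadb.errors.DuplicateIDError. Exact behaviour has varied between versions, so use upsert() when you intend to replace a record.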
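Because IDs must be unique, non-empty strings, one common approach is to derive them from the content itself so that re-running an ingestion job produces the same IDs. The following is a minimal sketch of that idea. `make_doc_id` and the `doc-` prefix are hypothetical choices, not something ChromaDB provides.

```python
import hashlib

def make_doc_id(text: str, source: str = "") -> str:
    """Derive a stable, non-empty string ID from a source label and the document text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"doc-{digest[:16]}"  # 16 hex characters keep IDs short but still distinctive

docs = [
    "ChromaDB is an open-source vector database.",
    "Python is a popular language for data science.",
]
ids = [make_doc_id(d, source="articles") for d in docs]

# The same text and source always map to the same ID,
# so ids can be passed as ids=ids alongside documents=docs.
print(ids[0] == make_doc_id(docs[0], source="articles"))  # True
```

Content-derived IDs pair naturally with upsert(), since re-ingesting unchanged text targets the same record instead of creating a second copy under a new ID.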

What happens if you call collection.add() with an ID that already exists in the collection?
When would you pass embeddings= instead of documents= to collection.add()?
6. How do you query a ChromaDB collection for similar documents?

The primary query method is collection.query(). You pass either query_texts (raw strings that ChromaDB embeds automatically) or query_embeddings (pre-computed vectors). ChromaDB returns the n_results nearest neighbours for each query.

import chromadb

client = chromadb.Client()
collection = client.create_collection("knowledge_base")
collection.add(
    documents=[
        "Python is great for data science and machine learning.",
        "JavaScript is used for web development.",
        "ChromaDB stores and retrieves vector embeddings.",
        "Docker containers package applications with dependencies.",
    ],
    ids=["d1", "d2", "d3", "d4"],
)

# Basic query — returns top 2 most similar documents
results = collection.query(
    query_texts=["vector database for AI"],
    n_results=2,
)
print(results["documents"])  # [[most_similar, second_most_similar]]
print(results["ids"])        # e.g. [["d3", "d1"]]
print(results["distances"])  # one distance per hit; lower = more similar

# Query multiple texts at once (batch query)
results = collection.query(
    query_texts=["machine learning", "web frameworks"],
    n_results=2,
)
# results["documents"][0] = top 2 for "machine learning"
# results["documents"][1] = top 2 for "web frameworks"

# Control what is returned with include=
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    include=["documents", "metadatas", "distances", "embeddings"],
)
# Default include: ["documents", "metadatas", "distances"]
# "embeddings" must be explicitly requested — adds response size
Query result fields
| Field | Type | Description |
| --- | --- | --- |
| ids | list[list[str]] | IDs of matching documents; the outer list has one entry per query |
| documents | list[list[str]] | Original text of matching documents |
| metadatas | list[list[dict]] | Metadata dicts of matching documents |
| distances | list[list[float]] | Distances to the query (lower = more similar) |
| embeddings | list[list[list[float]]] | Raw vectors; only returned if include contains "embeddings" |
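The nested, column-oriented shape of a query result is often easier to work with once it is turned into one row per hit. The helper below is a sketch under the assumption that the result behaves like a dict with the fields listed above and that fields excluded via include= come back as None. `hits_per_query` is a hypothetical name, and the sample values are made up for illustration.

```python
def hits_per_query(results):
    """Convert a column-oriented query result into a list of hit dicts per query."""
    all_hits = []
    for q, ids in enumerate(results["ids"]):
        hits = []
        for rank, item_id in enumerate(ids):
            hit = {"id": item_id}
            for field in ("documents", "metadatas", "distances"):
                column = results.get(field)
                if column is not None:  # skip fields that were not requested
                    hit[field[:-1]] = column[q][rank]  # "documents" -> "document", etc.
            hits.append(hit)
        all_hits.append(hits)
    return all_hits

# Hand-written sample in the same shape as a two-query result (values are illustrative)
sample = {
    "ids": [["d3", "d1"], ["d2"]],
    "documents": [["ChromaDB stores vectors.", "Python is great."], ["JavaScript is for the web."]],
    "metadatas": None,
    "distances": [[0.2, 0.7], [0.4]],
}

for query_index, hits in enumerate(hits_per_query(sample)):
    for hit in hits:
        print(query_index, hit["id"], hit["distance"])
```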
What does n_results=5 mean in a ChromaDB query?
In a ChromaDB query result, what do the distance values represent?
7. How do you retrieve, update, and delete specific documents in ChromaDB?

Beyond querying by similarity, ChromaDB supports exact lookups by ID with get(), in-place updates with update() or upsert(), and deletion with delete().

import chromadb

client = chromadb.Client()
col = client.create_collection("items")
col.add(
    documents=["First document", "Second document", "Third document"],
    metadatas=[{"v": 1}, {"v": 2}, {"v": 3}],
    ids=["id1", "id2", "id3"],
)

# GET — fetch by specific IDs
result = col.get(ids=["id1", "id3"])
print(result["documents"])  # ["First document", "Third document"]

# GET all documents in the collection
all_docs = col.get()  # no ids= returns everything

# GET with include control
result = col.get(
    ids=["id1"],
    include=["documents", "metadatas", "embeddings"],
)

# UPDATE — must already exist, updates only specified fields
col.update(
    ids=["id1"],
    documents=["Updated first document"],
    metadatas=[{"v": 10, "updated": True}],
    # ChromaDB re-embeds the new document text automatically
)

# UPSERT — insert if not exists, update if exists (idempotent)
col.upsert(
    documents=["Brand new doc",        "Updated second doc"],
    metadatas=[{"v": 99},              {"v": 20}],
    ids=       ["id-new",              "id2"],
)
# id-new is inserted; id2 is updated

# DELETE by IDs
col.delete(ids=["id3"])
print(col.count())  # 3 (id1, id2, id-new remain)

# DELETE by metadata filter (where clause)
col.delete(where={"v": {"$gt": 15}})
CRUD methods summary
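| Method | Behaviour when ID exists | Behaviour when ID missing |
| --- | --- | --- |
| add() | Existing record is left unchanged (recent releases skip the duplicate rather than raising) | Inserts new document |
| update() | Updates the document | Nothing is inserted (recent releases log an error and skip the ID) |
| upsert() | Updates the document | Inserts new document |
| delete() | Removes the document | Silently ignores |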
What is the difference between update() and upsert() in ChromaDB?
When you call collection.update(ids=['x'], documents=['new text']), what does ChromaDB do with the stored embedding?

8. How do you filter query results using metadata in ChromaDB?

ChromaDB supports a MongoDB-style where clause for filtering by metadata fields. Filters can be applied during query() (combining semantic search with filtering) or during get() (exact retrieval with filtering). In a query(), the filter restricts which records are eligible, so the n_results nearest neighbours are chosen only from records that match.

import chromadb

client = chromadb.Client()
col = client.create_collection("articles")
col.add(
    documents=["Python intro", "Python advanced", "JS basics", "Rust guide", "Go tutorial"],
    metadatas=[
        {"lang": "python", "level": "beginner", "year": 2022},
        {"lang": "python", "level": "advanced", "year": 2023},
        {"lang": "js",     "level": "beginner", "year": 2023},
        {"lang": "rust",   "level": "beginner", "year": 2024},
        {"lang": "go",     "level": "intermediate", "year": 2024},
    ],
    ids=["a1","a2","a3","a4","a5"],
)

# Equality filter
results = col.query(
    query_texts=["programming tutorial"],
    n_results=3,
    where={"lang": "python"},  # shorthand for $eq
)

# Comparison operators
results = col.query(
    query_texts=["tutorial"],
    n_results=5,
    where={"year": {"$gte": 2023}},  # year >= 2023
)

# Logical AND — all conditions must match
results = col.query(
    query_texts=["guide"],
    n_results=3,
    where={"$and": [
        {"lang":  {"$in":  ["python", "go"]}},
        {"level": {"$ne":  "advanced"}},
    ]},
)

# Logical OR
results = col.query(
    query_texts=["code"],
    n_results=3,
    where={"$or": [
        {"year": {"$eq": 2024}},
        {"level": {"$eq": "beginner"}},
    ]},
)

# Filter on document text content (where_document)
results = col.query(
    query_texts=["programming"],
    n_results=5,
    where_document={"$contains": "Python"},  # document text contains "Python"
)
ChromaDB where clause operators
| Operator | Meaning | Example |
| --- | --- | --- |
| $eq | Equal | {"lang": {"$eq": "python"}} or {"lang": "python"} |
| $ne | Not equal | {"level": {"$ne": "advanced"}} |
| $gt / $gte | Greater than / or equal | {"year": {"$gte": 2023}} |
| $lt / $lte | Less than / or equal | {"year": {"$lt": 2024}} |
| $in | Value in list | {"lang": {"$in": ["python", "go"]}} |
| $nin | Value not in list | {"lang": {"$nin": ["js"]}} |
| $and | All conditions true | {"$and": [{...}, {...}]} |
| $or | Any condition true | {"$or": [{...}, {...}]} |
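To check your reading of a where clause before running it against a collection, it can help to evaluate it against plain dicts. The function below is an illustrative model of the operator semantics, not ChromaDB's implementation. In particular, treating a record that lacks the filtered field as a non-match is an assumption of this sketch.

```python
import operator

_COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # assumption: records lacking the field never match
            if not isinstance(condition, dict):
                condition = {"$eq": condition}  # {"lang": "python"} is shorthand for $eq
            for op, operand in condition.items():
                if not _COMPARATORS[op](metadata[key], operand):
                    return False
    return True

records = {
    "a1": {"lang": "python", "level": "beginner", "year": 2022},
    "a2": {"lang": "python", "level": "advanced", "year": 2023},
    "a3": {"lang": "js", "level": "beginner", "year": 2023},
    "a4": {"lang": "rust", "level": "beginner", "year": 2024},
    "a5": {"lang": "go", "level": "intermediate", "year": 2024},
}

# lang is "python" AND year is 2023 or later
where = {"$and": [{"lang": "python"}, {"year": {"$gte": 2023}}]}
print([rid for rid, meta in records.items() if matches(meta, where)])  # ['a2']
```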
What is the where_document parameter used for in a ChromaDB query?
How do you filter for documents where lang is 'python' AND year is 2023 or later?
9. What is the difference between ChromaDB's in-memory and persistent storage modes?

ChromaDB offers three client modes that control where data is stored. Choosing the right mode depends on whether you need data to survive restarts and whether you're running a single process or a shared service.

ChromaDB client modes
| Mode | Class | Data survives restart? | Best for |
| --- | --- | --- | --- |
| Ephemeral (in-memory) | chromadb.Client() | No; lost when the process ends | Testing, prototyping, CI pipelines |
| Persistent (disk) | chromadb.PersistentClient(path=...) | Yes; written to SQLite + disk files | Single-process apps, local dev |
| HTTP Client | chromadb.HttpClient(host=..., port=...) | Yes; managed by the server | Multi-process apps, production, shared access |
import chromadb

# 1. Ephemeral — data lives only in memory, lost on exit
client_mem = chromadb.Client()

# 2. Persistent — data saved to disk at ./my_chroma_db/
client_disk = chromadb.PersistentClient(path="./my_chroma_db")
# Creates the directory if it does not exist
# Data persists across Python restarts

# 3. HTTP Client — connects to a running ChromaDB server
client_http = chromadb.HttpClient(
    host="localhost",
    port=8000,
    # ssl=True, headers={"Authorization": "Bearer token"}  # if secured
)

# Start the server separately:
# chroma run --path ./chroma_data --port 8000

# Verify connection
client_http.heartbeat()  # raises if server is unreachable

# EphemeralClient — explicit alias for chromadb.Client()
client_eph = chromadb.EphemeralClient()

# All three clients share the same collection API
collection = client_disk.get_or_create_collection("my_data")
collection.add(documents=["Persisted text"], ids=["p1"])
# Restart Python, create PersistentClient with same path → data still there

Important: the persistent client uses SQLite under the hood. It is not designed for concurrent writes from multiple processes. For multi-process or multi-container production use, run ChromaDB as an HTTP server and use HttpClient.
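Since all three clients expose the same collection API, the choice of mode can live in configuration rather than in application code. The sketch below shows one way to do that. The APP_CHROMA_HOST, APP_CHROMA_PORT, and APP_CHROMA_PATH names are hypothetical settings invented for this example, not variables ChromaDB reads.

```python
import os

def choose_client_config(env):
    """Map configuration values to a client mode and constructor keyword arguments."""
    if env.get("APP_CHROMA_HOST"):
        port = int(env.get("APP_CHROMA_PORT", "8000"))
        return "http", {"host": env["APP_CHROMA_HOST"], "port": port}
    if env.get("APP_CHROMA_PATH"):
        return "persistent", {"path": env["APP_CHROMA_PATH"]}
    return "ephemeral", {}

def build_client(env=None):
    """Create the matching ChromaDB client for the given configuration."""
    import chromadb  # imported lazily so the selection logic can be tested without it

    mode, kwargs = choose_client_config(os.environ if env is None else env)
    factories = {
        "http": chromadb.HttpClient,
        "persistent": chromadb.PersistentClient,
        "ephemeral": chromadb.EphemeralClient,
    }
    return factories[mode](**kwargs)

print(choose_client_config({"APP_CHROMA_PATH": "./my_chroma_db"}))
# ('persistent', {'path': './my_chroma_db'})
```

With this arrangement a test suite can run against the ephemeral client while production points the same code at a server, simply by changing the environment.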

Which ChromaDB client should you use in production when multiple services need to read and write to the same database?
What happens to data stored with chromadb.Client() (EphemeralClient) when the Python process exits?
10. What is ChromaDB's default embedding function and how does it work?

When you create a collection without specifying an embedding function, ChromaDB uses its DefaultEmbeddingFunction, which embeds text with the all-MiniLM-L6-v2 Sentence Transformers model. In recent releases this default runs the model through ONNX Runtime, so the sentence-transformers package is not required for it. The model is downloaded automatically on first use and cached locally.

import chromadb
from chromadb.utils import embedding_functions

# Default — uses all-MiniLM-L6-v2 automatically
client = chromadb.Client()
collection_default = client.create_collection("default_embeddings")

# Same model loaded explicitly through the sentence-transformers package
# (requires: pip install sentence-transformers)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",  # 384-dim, fast, good English quality
)

# Using a different Sentence Transformer model
ef_large = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2",  # 768-dim, higher quality, slower
)

collection_large = client.create_collection(
    name="large_model",
    embedding_function=ef_large,
    metadata={"hnsw:space": "cosine"},
)

# You can call embedding functions directly to inspect output
embed = embedding_functions.SentenceTransformerEmbeddingFunction()
vectors = embed(["Hello world", "ChromaDB is great"])
print(len(vectors))      # 2 — one vector per input
print(len(vectors[0]))   # 384 — dimensions
Default model properties
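| Property | Value |
| --- | --- |
| Model name | all-MiniLM-L6-v2 |
| Output dimensions | 384 |
| Download size | ~80 MB (cached after first use) |
| Library required | onnxruntime for the default; sentence-transformers only when using SentenceTransformerEmbeddingFunction |
| Runs on | CPU (default) or GPU |
| Strength | Fast, good English semantic similarity |
| Limitation | Weaker on non-English, domain-specific text |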
What model does ChromaDB use as its default embedding function?
What happens the first time you create a ChromaDB collection with the default embedding function?
11. How do you use the OpenAI embedding function with ChromaDB?

ChromaDB has a built-in OpenAIEmbeddingFunction that calls the OpenAI Embeddings API. This typically gives higher-quality embeddings than the default local model, at the cost of API latency and usage fees. Use text-embedding-3-small for a balance of quality and cost, or text-embedding-3-large for maximum quality.

import chromadb
from chromadb.utils import embedding_functions
import os

client = chromadb.PersistentClient(path="./chroma_openai")

# Built-in OpenAI embedding function
ef_openai = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",  # 1536-dim, fast and cheap
    # model_name="text-embedding-3-large",  # 3072-dim, highest quality
    # model_name="text-embedding-ada-002",  # legacy, 1536-dim
)

collection = client.get_or_create_collection(
    name="openai_docs",
    embedding_function=ef_openai,
    metadata={"hnsw:space": "cosine"},
)

# Usage is identical to the default embedding function
collection.add(
    documents=[
        "FastAPI is a modern Python web framework for building APIs.",
        "ChromaDB stores embeddings for semantic search.",
    ],
    ids=["d1", "d2"],
)

# ChromaDB calls OpenAI API automatically on add() and query()
results = collection.query(
    query_texts=["vector database for retrieval"],
    n_results=1,
)
print(results["documents"])  # [["ChromaDB stores embeddings..."]]
OpenAI embedding models comparison
| Model | Dimensions | Cost | Notes |
| --- | --- | --- | --- |
| text-embedding-3-small | 1536 | ~$0.02/1M tokens | Best value; recommended default |
| text-embedding-3-large | 3072 | ~$0.13/1M tokens | Highest quality |
| text-embedding-ada-002 | 1536 | ~$0.10/1M tokens | Legacy, superseded by v3 |

Important consistency rule: you must use the exact same embedding model for both storing and querying. If you embed documents with text-embedding-3-small, all queries must also use text-embedding-3-small. Mixing models produces meaningless similarity scores.
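One way to enforce this rule is to record the model name in the collection's metadata when it is created and verify it whenever the application starts. The sketch below assumes a metadata key called "embedding_model", which is a convention invented for this example. ChromaDB does not set or check it for you.

```python
class EmbeddingModelMismatch(RuntimeError):
    """Raised when a collection was built with a different embedding model."""

def check_embedding_model(collection_metadata, expected_model):
    """Fail fast if the recorded model differs from the one this process will use."""
    recorded = (collection_metadata or {}).get("embedding_model")
    if recorded is None:
        raise EmbeddingModelMismatch(
            "collection does not record which embedding model was used"
        )
    if recorded != expected_model:
        raise EmbeddingModelMismatch(
            f"collection was built with {recorded!r}, "
            f"but this process is configured for {expected_model!r}"
        )

# At creation time, store the model name next to the index settings, e.g.
#   metadata={"hnsw:space": "cosine", "embedding_model": "text-embedding-3-small"}
# At startup, pass collection.metadata to the check before serving any queries:
check_embedding_model(
    {"hnsw:space": "cosine", "embedding_model": "text-embedding-3-small"},
    "text-embedding-3-small",
)  # passes silently
```

Failing at startup is deliberate: a model mismatch with equal dimensions would otherwise return plausible-looking but meaningless results.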

Why must you use the same embedding model for both adding documents and querying in ChromaDB?
Which OpenAI model is recommended as the best balance of cost and quality for new ChromaDB projects?
12. How do you use HuggingFace models as embedding functions in ChromaDB?

ChromaDB provides a HuggingFaceEmbeddingFunction that calls the HuggingFace Inference API (cloud-hosted), and a SentenceTransformerEmbeddingFunction for running any Sentence Transformer model locally. For production use without per-call API costs, local Sentence Transformer models are the more common choice.

import chromadb
from chromadb.utils import embedding_functions
import os

client = chromadb.Client()

# Option 1: HuggingFace Inference API (cloud, requires API key)
ef_hf_api = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HUGGINGFACE_API_KEY"],
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

# Option 2: Local Sentence Transformers (no API key, runs on your machine)
ef_local = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",    # 384-dim, fast
    # model_name="all-mpnet-base-v2", # 768-dim, higher quality
    # model_name="BAAI/bge-large-en-v1.5",  # excellent quality
    device="cpu",  # or "cuda" for GPU acceleration
)

collection = client.create_collection(
    name="hf_docs",
    embedding_function=ef_local,
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    documents=[
        "Open-source language models are becoming more powerful.",
        "LLaMA and Mistral are popular open-source LLMs.",
    ],
    ids=["h1", "h2"],
)

results = collection.query(
    query_texts=["free LLM models"],
    n_results=2,
)
print(results["documents"])

# Popular local models for RAG
models = {
    "BAAI/bge-small-en-v1.5":  "384-dim, excellent quality/speed ratio",
    "BAAI/bge-large-en-v1.5":  "1024-dim, top English quality",
    "intfloat/e5-base-v2":     "768-dim, English (see intfloat/multilingual-e5-base for other languages)",
    "thenlper/gte-large":      "1024-dim, great for retrieval",
}

Trade-offs: the HuggingFace Inference API requires no local GPU, but it is rate-limited or billed depending on your plan and adds network latency. Local Sentence Transformers are free to run, fast (especially on GPU), work offline, and keep data on your own machine, which makes them the preferred option for sensitive data.

What is the main advantage of using a local SentenceTransformerEmbeddingFunction over the HuggingFace Inference API?
Which local HuggingFace model family is widely considered top-quality for English retrieval tasks in ChromaDB?
13. How do you create a custom embedding function for ChromaDB?

ChromaDB defines a simple protocol for embedding functions: a class with a __call__ method that accepts a list of strings and returns a list of embedding vectors. Implementing this interface lets you plug in any model — a local transformer, a third-party API, or even a mock for testing.

import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from typing import List

# Custom embedding function — must implement __call__
class MyCustomEmbeddingFunction(EmbeddingFunction):
    """Wraps any embedding model in ChromaDB's interface."""

    def __init__(self, model_name: str = "my-model"):
        # Load your model here
        self.model_name = model_name
        # self.model = load_model(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        """
        input:  list of strings to embed
        return: list of lists of floats (one vector per string)
        """
        embeddings = []
        for text in input:
            # Replace with your actual embedding logic
            vector = self._embed_text(text)
            embeddings.append(vector)
        return embeddings

    def _embed_text(self, text: str) -> List[float]:
        # Example: fixed-dim hash-based mock (not for production)
        import hashlib
        h = hashlib.md5(text.encode()).digest()
        return [b / 255.0 for b in h]  # 16-dim mock vector


# Use your custom function exactly like a built-in one
client = chromadb.Client()
custom_ef = MyCustomEmbeddingFunction()

collection = client.create_collection(
    name="custom_embed",
    embedding_function=custom_ef,
)
collection.add(
    documents=["Test document one", "Test document two"],
    ids=["c1", "c2"],
)
results = collection.query(
    query_texts=["test"],
    n_results=1,
)
print(results["ids"])  # [["c1"]] or [["c2"]]

When to write a custom embedding function:

  • Your company uses a proprietary or self-hosted embedding model
  • You need to embed data from a provider not in ChromaDB's built-in list
  • You want to add preprocessing (text cleaning, chunking, domain adaptation) before embedding
  • Testing — inject a deterministic mock that returns predictable vectors
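Whatever model sits behind a custom function, its output has to satisfy the same shape contract: one vector per input, all of the same length, containing only finite numbers. A small check like the one below can catch violations before the function is wired into a collection. Both `validate_embeddings` and the `fake_embed` stand-in are hypothetical helpers written for this sketch.

```python
import hashlib
import math

def validate_embeddings(vectors, n_inputs, expected_dim=None):
    """Check the shape contract of an embedding function's output; return the dimension."""
    if n_inputs < 1 or len(vectors) != n_inputs:
        raise ValueError(f"expected {n_inputs} vectors, got {len(vectors)}")
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"vectors have inconsistent dimensions: {sorted(dims)}")
    dim = dims.pop()
    if expected_dim is not None and dim != expected_dim:
        raise ValueError(f"expected dimension {expected_dim}, got {dim}")
    for v in vectors:
        for x in v:
            if not math.isfinite(float(x)):  # float() also accepts NumPy scalars
                raise ValueError("embedding contains NaN or infinity")
    return dim

def fake_embed(texts):
    """Deterministic stand-in for a real model: 16 values derived from an MD5 digest."""
    return [[b / 255.0 for b in hashlib.md5(t.encode()).digest()] for t in texts]

texts = ["Test document one", "Test document two"]
print(validate_embeddings(fake_embed(texts), len(texts)))  # 16
```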
What is the minimum interface a custom ChromaDB embedding function must implement?
Why might you create a custom embedding function for testing ChromaDB code?
14. How does ChromaDB's PersistentClient store data on disk, and what are its limitations?

The PersistentClient stores data in a directory you specify. Inside, ChromaDB uses SQLite for metadata (IDs, document text, metadata key-value pairs) and binary files for the HNSW vector index. All writes are flushed to disk automatically — there is no explicit save/commit step.

import chromadb
import os

# Create or open a persistent database
client = chromadb.PersistentClient(path="./my_vector_db")

# Once data has been written, ./my_vector_db/ typically contains:
# - chroma.sqlite3         (metadata, documents, IDs)
# - <uuid>/               (one folder per collection's vector index)
#   - header.bin          (HNSW index configuration)
#   - data_level0.bin     (HNSW graph layer 0)
#   - length.bin          (element count)
#   - link_lists.bin      (HNSW links for the upper graph layers)

col = client.get_or_create_collection("notes")
col.add(
    documents=["Remember to buy milk", "Meeting at 3pm tomorrow"],
    ids=["n1", "n2"],
)
# Data is persisted immediately — no commit needed

# Verify data survives restart:
del client, col  # simulate process exit
client2 = chromadb.PersistentClient(path="./my_vector_db")
col2 = client2.get_collection("notes")
print(col2.count())   # 2 — still there!
print(col2.get(ids=["n1"])["documents"])  # ["Remember to buy milk"]

# Check the files on disk
for root, dirs, files in os.walk("./my_vector_db"):
    for f in files:
        print(os.path.join(root, f))
PersistentClient limitations
| Limitation | Detail |
| --- | --- |
| Single writer only | SQLite allows only one writer at a time, so concurrent writes from multiple processes cause errors |
| No built-in replication | The SQLite file is a single point of failure; back it up manually |
| No horizontal scaling | Cannot distribute load across multiple machines |
| File locking | Moving or copying the directory while the client is open can corrupt data |
| Migration | Upgrading ChromaDB versions may require running migration scripts on the SQLite DB |

For multi-process or production deployments, prefer running chroma run --path ./data as a server and connecting with HttpClient.
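Because backups are manual and copying a live directory is unsafe, a simple approach is to stop every process that has the database open and then archive the whole directory, so the SQLite file and the index folders stay in sync. The helper below is a minimal sketch of that approach using only the standard library. `backup_chroma_dir` is a hypothetical name, and it assumes the backup directory is located outside the database directory.

```python
import shutil
from pathlib import Path

def backup_chroma_dir(db_path, backup_dir, label):
    """Archive a PersistentClient directory into <backup_dir>/<label>.zip.

    Call this only while no process has the database open for writing.
    """
    src = Path(db_path)
    if not (src / "chroma.sqlite3").exists():
        raise FileNotFoundError(f"{src} does not look like a ChromaDB directory")
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # make_archive appends the ".zip" suffix and returns the archive's full path
    archive = shutil.make_archive(str(dest / label), "zip", root_dir=str(src))
    return Path(archive)

# Example call (paths are placeholders):
# backup_chroma_dir("./my_vector_db", "./backups", "my_vector_db-snapshot")
```

Restoring is the reverse: unpack the archive into an empty directory (for example with shutil.unpack_archive) and point PersistentClient at it, using the same ChromaDB version that wrote the data.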

What database engine does ChromaDB's PersistentClient use to store metadata and document text?
Why is PersistentClient not suitable for concurrent writes from multiple Python processes?
15. What is the HNSW index in ChromaDB and what parameters can you tune?

ChromaDB uses HNSW (Hierarchical Navigable Small World) as its Approximate Nearest Neighbour (ANN) index. HNSW builds a layered graph in which each node connects to its closest neighbours. A query walks this graph from the sparse top layer down to the dense bottom layer, so it finds approximate nearest neighbours in roughly logarithmic time instead of an exhaustive O(n) linear scan.

import chromadb

client = chromadb.Client()

# HNSW parameters are set as metadata at collection creation
collection = client.create_collection(
    name="tuned_collection",
    metadata={
        "hnsw:space":           "cosine",   # distance metric
        "hnsw:construction_ef": 200,         # default 100
        # Controls quality of index during construction.
        # Higher = better recall, slower inserts.

        "hnsw:search_ef":       100,         # default 10
        # Controls quality of search at query time.
        # Higher = better recall, slower queries.

        "hnsw:M":               16,          # default 16
        # Number of bi-directional links per node.
        # Higher = better recall + more memory + slower inserts.
        # Typical range: 4-64.
    },
)

# Note: HNSW parameters cannot be changed after collection creation
# You would need to recreate the collection and re-insert data

collection.add(
    documents=[f"Document number {i}" for i in range(10000)],
    ids=[str(i) for i in range(10000)],
)
HNSW tuning guide

| Parameter | Default | Effect of increasing | Effect of decreasing |
| --- | --- | --- | --- |
| hnsw:space | l2 | Changes the metric (cosine / ip) | n/a |
| hnsw:M | 16 | Better recall, more memory, slower inserts | Faster inserts, less memory, lower recall |
| hnsw:construction_ef | 100 | Better index quality, slower inserts | Faster inserts, lower quality graph |
| hnsw:search_ef | 10 | Better recall, slower queries | Faster queries, lower recall |

For most RAG use cases, the defaults work well for collections under ~100K documents. For large collections or when recall matters, increase hnsw:search_ef to 50–200 and set hnsw:construction_ef to at least 200 when building the index.
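
Because HNSW is approximate, a change to `hnsw:search_ef` or `hnsw:M` is easiest to judge by measuring recall against an exact search on a small sample. The sketch below is plain Python and not ChromaDB's implementation. `exact_top_k` and `recall_at_k` are hypothetical helper names, and in practice the approximate IDs would come from `collection.query()`.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Brute-force O(n) baseline: rank every stored vector by distance."""
    ranked = sorted(vectors, key=lambda vid: cosine_distance(query, vectors[vid]))
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[str], exact_ids: Sequence[str]) -> float:
    """Fraction of the true nearest neighbours that the ANN result contains."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)
```

If recall on a representative sample is already close to 1.0 with the defaults, raising `hnsw:search_ef` mostly adds query latency.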

What does the hnsw:search_ef parameter control in ChromaDB?
What type of algorithm is HNSW and why does ChromaDB use it instead of exact search?
16. How do you efficiently add large numbers of documents to ChromaDB using batching?

Adding tens of thousands of documents one at a time is slow because each call triggers embedding computation and index updates. The right approach is to batch documents into groups of 100–500 and add each batch with a single add() call — this amortises embedding overhead and index writes.

import chromadb
from chromadb.utils import embedding_functions
from typing import List

client = chromadb.PersistentClient(path="./bulk_db")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "large_corpus", embedding_function=ef
)

# Simulate a large list of documents
documents = [f"Article about topic {i}" for i in range(10_000)]
ids       = [f"doc-{i}" for i in range(10_000)]
metadatas = [{"index": i, "batch": i // 500} for i in range(10_000)]

# Efficient batch insertion
BATCH_SIZE = 500

for start in range(0, len(documents), BATCH_SIZE):
    end = start + BATCH_SIZE
    collection.add(
        documents=documents[start:end],
        ids=ids[start:end],
        metadatas=metadatas[start:end],
    )
    print(f"Added batch {start // BATCH_SIZE + 1}, total: {collection.count()}")

print(f"Final count: {collection.count()}")  # 10000

# Alternative: provide pre-computed embeddings to skip re-embedding
# (useful when you already called the embedding API)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

docs_batch = documents[:500]
vectors = model.encode(docs_batch, batch_size=64, show_progress_bar=True)
collection.upsert(  # these IDs were already inserted by the loop above, so upsert rather than add
    embeddings=vectors.tolist(),
    documents=docs_batch,
    ids=ids[:500],
)
Batching tips

| Tip | Reason |
| --- | --- |
| Batch size 100–500 | Balances memory use and embedding throughput |
| Pre-compute embeddings externally | Avoids re-embedding if you already have vectors from an API call |
| Use GPU for local models | SentenceTransformer encoding is typically much faster on a GPU than on a CPU |
| Use upsert instead of add in loops | upsert() is safe to re-run; add() does not overwrite IDs that already exist |

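
The slicing loop shown earlier can be factored into a small reusable generator. This is a sketch under the assumption that `ids`, `documents`, and `metadatas` are parallel sequences of equal length. `iter_batches` is a hypothetical helper name, not a ChromaDB API.

```python
from typing import Any, Dict, Iterator, Sequence


def iter_batches(
    ids: Sequence[str],
    documents: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    batch_size: int = 500,
) -> Iterator[Dict[str, Any]]:
    """Yield keyword-argument dicts sized for one add()/upsert() call each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not (len(ids) == len(documents) == len(metadatas)):
        raise ValueError("ids, documents and metadatas must have equal length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield {
            "ids": list(ids[start:end]),
            "documents": list(documents[start:end]),
            "metadatas": list(metadatas[start:end]),
        }
```

Usage would look like `for batch in iter_batches(ids, documents, metadatas): collection.upsert(**batch)`, which keeps the ingestion loop re-runnable.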
What is the main performance benefit of batching documents into groups before calling collection.add()?
What is a good batch size for bulk insertion into ChromaDB?
17. What is the where_document filter in ChromaDB and how does it differ from where?

ChromaDB provides two types of filters that can be used together or separately: where filters on metadata fields (structured key-value pairs), while where_document filters on the raw text content of the stored documents. Both can be combined in a single query.

import chromadb

client = chromadb.Client()
col = client.create_collection("mixed_docs")
col.add(
    documents=[
        "Python tutorial for beginners with examples",
        "Advanced Python decorators and metaclasses",
        "JavaScript async/await guide",
        "Python data science with pandas and numpy",
        "Rust memory safety tutorial",
    ],
    metadatas=[
        {"lang": "python", "level": "beginner"},
        {"lang": "python", "level": "advanced"},
        {"lang": "js",     "level": "intermediate"},
        {"lang": "python", "level": "intermediate"},
        {"lang": "rust",   "level": "beginner"},
    ],
    ids=["d1","d2","d3","d4","d5"],
)

# where_document: filter on text content
results = col.query(
    query_texts=["programming guide"],
    n_results=5,
    where_document={"$contains": "tutorial"},  # text must contain "tutorial"
)
print(results["ids"])  # d1 (Python tutorial) and d5 (Rust tutorial); no other document contains "tutorial"

# where_document with NOT
results = col.query(
    query_texts=["programming"],
    n_results=5,
    where_document={"$not_contains": "JavaScript"},
)

# Combine where (metadata) + where_document (text content)
results = col.query(
    query_texts=["learning to code"],
    n_results=3,
    where={"lang": "python"},                    # metadata filter
    where_document={"$contains": "tutorial"},    # content filter
    # Only Python docs whose text contains "tutorial"
)
print(results["documents"])
# Only matches d1: "Python tutorial for beginners with examples"
where vs where_document

| Filter | Operates on | Supported operators |
| --- | --- | --- |
| where | Metadata key-value fields | $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or |
| where_document | Raw document text content | $contains, $not_contains, $and, $or |

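
Several text conditions can be combined by nesting them under `$and`. The helper below is a sketch that only builds the filter dictionary. `text_filter` is a hypothetical name, and it returns a bare clause for a single term on the assumption that logical operators are meant to combine two or more expressions.

```python
from typing import Any, Dict, Sequence


def text_filter(
    must_contain: Sequence[str] = (),
    must_not_contain: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build a where_document dict from required and excluded substrings."""
    clauses = [{"$contains": term} for term in must_contain]
    clauses += [{"$not_contains": term} for term in must_not_contain]
    if not clauses:
        raise ValueError("at least one term is required")
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no logical wrapper
    return {"$and": clauses}
```

The result is passed straight through, for example `col.query(query_texts=["guide"], where_document=text_filter(["Python"], ["Advanced"]))`.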
What does the where_document={'$contains': 'Python'} filter do in a ChromaDB query?
Can where and where_document be used together in a single ChromaDB query?
18. How do you control what data ChromaDB returns in query and get results using include?

Both query() and get() accept an include parameter — a list of strings specifying which fields to return. Omitting fields you don't need reduces network payload and memory, which matters for large result sets.

import chromadb

client = chromadb.Client()
col = client.create_collection("demo")
col.add(
    documents=["Alpha document", "Beta document", "Gamma document"],
    metadatas=[{"tag": "a"}, {"tag": "b"}, {"tag": "c"}],
    ids=["id1", "id2", "id3"],
)

# Default include: documents, metadatas, distances (for query)
# Default include for get(): documents, metadatas (no distances)
results = col.query(query_texts=["document"], n_results=2)
print(results.keys())
# dict_keys(["ids", "distances", "metadatas", "embeddings", "documents", "uris", "data"])
# embeddings, uris, data are None by default

# Only return IDs and distances — smallest possible response
results = col.query(
    query_texts=["alpha"],
    n_results=2,
    include=["distances"],  # ids are always returned
)
print(results["documents"])   # None
print(results["distances"])   # e.g. [[0.05, 0.72]]; actual values depend on the embedding model

# Include raw embedding vectors (large! use only when needed)
results = col.query(
    query_texts=["beta"],
    n_results=1,
    include=["documents", "metadatas", "distances", "embeddings"],
)
print(len(results["embeddings"][0][0]))  # 384 floats per vector

# get() include — embeddings must be explicitly requested
all_data = col.get(
    include=["documents", "metadatas", "embeddings"],
)
print(len(all_data["embeddings"]))  # 3

# get() without include returns ids, documents, and metadatas by default
defaults = col.get()
print(defaults["ids"])         # ["id1", "id2", "id3"]
print(defaults["documents"])   # ["Alpha document", "Beta document", "Gamma document"]
Available include values

| Value | Returned in query()? | Returned in get()? |
| --- | --- | --- |
| documents | Yes (default) | Yes (default) |
| metadatas | Yes (default) | Yes (default) |
| distances | Yes (default) | No (not applicable) |
| embeddings | No (must request) | No (must request) |
| uris | No (multimodal only) | No (multimodal only) |
| data | No (multimodal only) | No (multimodal only) |

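
Since `query()` returns one inner list per query and leaves unrequested fields as `None`, downstream code often reshapes the result into one record per hit. A minimal sketch, assuming the result dictionary has the shape shown above; `hits_for_query` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def hits_for_query(results: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Turn the column-oriented query() result into one dict per hit."""
    hits: List[Dict[str, Any]] = [{"id": i} for i in results["ids"][query_index]]
    for field in ("documents", "metadatas", "distances"):
        column = results.get(field)
        if column is None:
            continue  # field was not listed in include
        for hit, value in zip(hits, column[query_index]):
            hit[field[:-1]] = value  # "documents" -> "document", and so on
    return hits
```

With `include=["distances"]` each record holds only `id` and `distance`, which keeps the payload small when the caller looks documents up elsewhere.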
Why would you exclude 'documents' from the include list in a ChromaDB query?
Are IDs always returned in ChromaDB query and get results?
19. How do you design metadata schemas for effective filtering in ChromaDB?

Metadata in ChromaDB is stored as flat key-value dictionaries whose values must be scalars: strings, integers, or floats, plus booleans in recent releases. Nested dicts and lists are not accepted. Good metadata design makes the difference between fast, precise filtered queries and slow full-collection scans.

import chromadb
from datetime import datetime

client = chromadb.Client()
col = client.create_collection("knowledge_base")

# Good metadata design — flat, filterable fields
col.add(
    documents=[
        "Introduction to transformer architecture in deep learning.",
        "BERT: Pre-training of Deep Bidirectional Transformers.",
        "GPT-4 technical report overview.",
    ],
    metadatas=[
        {
            "source":    "textbook",
            "author":    "Vaswani",
            "year":      2017,           # int: supports $gt, $lt
            "category":  "architecture",
            "citations": 50000,          # int: supports range filters too
            "language":  "en",
            # timestamp as int for range queries
            "added_ts":  int(datetime(2024,1,1).timestamp()),
        },
        {
            "source":    "paper",
            "author":    "Devlin",
            "year":      2018,
            "category":  "pretraining",
            "citations": 40000,
            "language":  "en",
            "added_ts":  int(datetime(2024,1,2).timestamp()),
        },
        {
            "source":    "report",
            "author":    "OpenAI",
            "year":      2023,
            "category":  "LLM",
            "citations": 5000,
            "language":  "en",
            "added_ts":  int(datetime(2024,1,3).timestamp()),
        },
    ],
    ids=["p1","p2","p3"],
)

# Effective filtered queries
results = col.query(
    query_texts=["neural network architecture"],
    n_results=5,
    where={"$and": [
        {"year":     {"$gte": 2017}},
        {"citations":{"$gte": 10000}},
        {"language": "en"},
    ]},
)

# Anti-patterns to avoid in metadata:
# BAD:  {"tags": ["python", "nlp"]}  — lists not supported
# BAD:  {"author": {"name": "Vaswani", "affiliation": "Google"}}  — nested not supported
# GOOD: {"tag_python": 1, "tag_nlp": 1}  — flatten list membership to bool ints
# GOOD: {"author_name": "Vaswani", "author_org": "Google"}  — flatten nested
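Metadata value types

| Type | Supported? | Supports range filters? |
| --- | --- | --- |
| str | Yes | No; only $eq, $ne, $in, $nin |
| int | Yes | Yes: $gt, $gte, $lt, $lte |
| float | Yes | Yes: $gt, $gte, $lt, $lte |
| bool | Yes in recent releases; use int 0/1 on older ones | No; equality only |
| list | No | n/a |
| dict (nested) | No | n/a |
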
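
The flattening rules from the anti-pattern comments can be applied mechanically before calling `add()`. The function below is a sketch of one possible convention and not a ChromaDB feature. `flatten_metadata` is a hypothetical name, list items become `key_item: 1` flags, and booleans are stored as 0/1 as this section recommends.

```python
from typing import Any, Dict


def flatten_metadata(raw: Dict[str, Any], sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts and lists into scalar key-value pairs."""
    flat: Dict[str, Any] = {}
    for key, value in raw.items():
        if isinstance(value, dict):
            # {"author": {"name": "X"}} -> {"author_name": "X"}
            for sub_key, sub_value in flatten_metadata(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        elif isinstance(value, (list, tuple, set)):
            # {"tags": ["nlp"]} -> {"tags_nlp": 1}
            for item in value:
                flat[f"{key}{sep}{item}"] = 1
        elif isinstance(value, bool):
            flat[key] = int(value)  # checked before int because bool subclasses int
        elif isinstance(value, (str, int, float)):
            flat[key] = value
        elif value is not None:
            flat[key] = str(value)  # fall back to a string for other types
    return flat
```

A tag lookup then becomes an equality filter such as `where={"tags_nlp": 1}`.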
What metadata value types does ChromaDB support?
How would you store a list of tags like ['python', 'nlp'] as ChromaDB metadata?
20. How do you inspect a ChromaDB collection's contents and configuration?

ChromaDB provides several methods to examine what is stored in a collection — useful for debugging, verifying ingestion, and monitoring collection health.

import chromadb

client = chromadb.PersistentClient(path="./inspect_demo")
col = client.get_or_create_collection(
    "articles",
    metadata={"hnsw:space": "cosine"},
)
col.add(
    documents=[f"Article {i} about topic {i%3}" for i in range(20)],
    metadatas=[{"topic": i%3, "idx": i} for i in range(20)],
    ids=[f"art-{i}" for i in range(20)],
)

# 1. Count documents
print(col.count())  # 20

# 2. Peek — quick look at first n items (default n=10)
peek = col.peek(limit=5)
print(peek["ids"])        # first 5 IDs
print(peek["documents"])  # first 5 documents

# 3. Get all (careful with large collections!)
all_items = col.get()
print(len(all_items["ids"]))  # 20

# 4. Get a page of results (offset-based)
page = col.get(
    limit=5,
    offset=10,  # skip first 10
)
print(page["ids"])  # art-10 through art-14

# 5. Inspect collection metadata and config
print(col.name)      # "articles"
print(col.id)        # UUID
print(col.metadata)  # {"hnsw:space": "cosine"}

# 6. List all collections
for c in client.list_collections():
    print(c)  # a Collection object or a plain name, depending on the ChromaDB version

# 7. Check if a document exists by ID
result = col.get(ids=["art-5"])
if result["ids"]:
    print("Found:", result["documents"][0])
else:
    print("Not found")
Collection inspection methods

| Method / Property | Purpose |
| --- | --- |
| collection.count() | Number of documents stored |
| collection.peek(limit=10) | Quick sample of first N items |
| collection.get() | Retrieve all items (paginate large collections) |
| collection.get(limit=N, offset=M) | Paginate through collection |
| collection.name | Collection name string |
| collection.metadata | Dict of collection settings (hnsw:space etc.) |
| client.list_collections() | All collections (names or Collection objects, depending on version) |

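
Offset-based pagination with `get(limit=..., offset=...)` can be wrapped in a generator so that only one page is held in memory at a time. A minimal sketch, assuming the object passed in exposes `get()` with the `limit`, `offset`, and `include` parameters used in this guide; `iter_pages` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_pages(
    collection: Any,
    page_size: int = 100,
    include: Optional[List[str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield successive get() pages until the collection is exhausted."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    fields = include if include is not None else ["documents", "metadatas"]
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset, include=fields)
        if not page["ids"]:
            break
        yield page
        if len(page["ids"]) < page_size:
            break  # short page means this was the last one
        offset += len(page["ids"])
```

Typical use is `for page in iter_pages(col, page_size=1000): process(page["ids"], page["documents"])`, where `process` stands for whatever export or validation step you need.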
What does collection.peek() return?
How do you paginate through a large ChromaDB collection to avoid loading everything into memory at once?
21. How do you build a basic RAG (Retrieval-Augmented Generation) pipeline with ChromaDB?

RAG combines ChromaDB's semantic retrieval with an LLM's generation ability. The pipeline has two phases: indexing (chunk documents, embed, store in ChromaDB) and retrieval (embed the user query, fetch similar chunks, inject into LLM prompt).

import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI
import os

# --- INDEXING PHASE (run once) ---
chroma_client = chromadb.PersistentClient(path="./rag_db")
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
collection = chroma_client.get_or_create_collection(
    "company_docs", embedding_function=ef, metadata={"hnsw:space": "cosine"}
)

# Chunk and index your knowledge base
documents = [
    "ChromaDB supports cosine, l2, and inner-product distance metrics.",
    "Persistent storage in ChromaDB uses SQLite under the hood.",
    "The default embedding model is all-MiniLM-L6-v2 with 384 dimensions.",
    "ChromaDB collections support metadata filtering with $eq, $gt, $in operators.",
]
collection.add(
    documents=documents,
    ids=[f"doc-{i}" for i in range(len(documents))],
)

# --- RETRIEVAL + GENERATION PHASE (run per query) ---
def rag_answer(user_question: str, n_results: int = 3) -> str:
    # 1. Retrieve relevant chunks from ChromaDB
    results = collection.query(
        query_texts=[user_question],
        n_results=n_results,
        include=["documents", "distances"],
    )
    context_chunks = results["documents"][0]  # list of retrieved texts
    context = "\n\n".join(
        f"[{i+1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )

    # 2. Build an augmented prompt
    prompt = f"""Answer the question using ONLY the context below.
    If the answer is not in the context, say "I don't know."

    Context:
    {context}

    Question: {user_question}
    Answer:"""

    # 3. Generate answer with LLM
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(rag_answer("What distance metrics does ChromaDB support?"))
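
The prompt-building step of the pipeline can be isolated and unit-tested without calling ChromaDB or an LLM. The sketch below also drops chunks whose distance exceeds a cutoff, which is an addition for illustration. `build_rag_prompt` and the `max_distance` default of 0.6 are assumptions, and a sensible cutoff depends on the embedding model and distance metric.

```python
from typing import List, Sequence


def build_rag_prompt(
    question: str,
    chunks: Sequence[str],
    distances: Sequence[float],
    max_distance: float = 0.6,  # hypothetical cutoff; tune per model and metric
) -> str:
    """Number the chunks that pass the cutoff and wrap them in a prompt."""
    kept: List[str] = [c for c, d in zip(chunks, distances) if d <= max_distance]
    if kept:
        context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(kept))
    else:
        context = "(no relevant context found)"
    return (
        "Answer the question using ONLY the context below.\n"
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Inside `rag_answer()` this would be called with `results["documents"][0]` and `results["distances"][0]`.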
In a RAG pipeline, what role does ChromaDB play?
Why is retrieval-augmented generation (RAG) preferred over fine-tuning for adding domain knowledge to an LLM?
22. What are effective document chunking strategies when indexing documents into ChromaDB for RAG?

Before adding documents to ChromaDB, long texts must be split into chunks that fit within the embedding model's token limit and contain cohesive information. Chunk size and overlap directly affect retrieval quality.

# pip install langchain-text-splitters
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
import chromadb

# RecursiveCharacterTextSplitter — tries to split at natural boundaries
# (paragraphs → sentences → words → characters)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # characters per chunk (roughly 100–125 tokens of English text)
    chunk_overlap=50,    # overlap prevents losing context at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)

long_document = """ChromaDB is an open-source vector database.
It supports multiple embedding functions including OpenAI and HuggingFace.
ChromaDB uses HNSW for approximate nearest-neighbour search.
You can filter results using metadata fields.
Persistent storage uses SQLite under the hood.
""" * 20  # repeat to make it long

chunks = splitter.split_text(long_document)
print(f"Split into {len(chunks)} chunks")
print(f"First chunk length: {len(chunks[0])} chars")

# Add chunks to ChromaDB with source metadata
client = chromadb.Client()
col = client.create_collection("chunked_docs")

col.add(
    documents=chunks,
    metadatas=[{"source": "chroma_guide.txt", "chunk_idx": i}
               for i in range(len(chunks))],
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
Chunking strategy comparison

| Strategy | Chunk size | Overlap | Best for |
| --- | --- | --- | --- |
| Small chunks | 100–200 tokens | 10–20 tokens | Precise retrieval, FAQ-style docs |
| Medium chunks | 300–500 tokens | 50 tokens | Most RAG use cases, a good balance |
| Large chunks | 800–1000 tokens | 100 tokens | Long-form prose where context matters |
| Semantic chunking | Variable | 0 | Academic papers, structured content |

Key rule: chunk overlap prevents the situation where a sentence spanning a chunk boundary gets split, losing its meaning in both halves. Typical overlap is 10–20% of chunk size.
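
To make the overlap mechanics concrete, here is a dependency-free sliding-window splitter. It is a simplified sketch that cuts at fixed character offsets, unlike `RecursiveCharacterTextSplitter`, which prefers natural boundaries. `chunk_text` is a hypothetical helper name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With `chunk_size=500` and `overlap=50`, the last 50 characters of each chunk reappear at the start of the next, which matches the 10% end of the typical range.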

Why is chunk overlap important when splitting documents for ChromaDB RAG indexing?
What is a good general-purpose chunk size (in tokens) for RAG document indexing?
23. How do you use ChromaDB as a vector store with LangChain?

LangChain provides a first-class Chroma vector store integration that wraps ChromaDB's API with LangChain's retriever interface. This enables plugging ChromaDB into LangChain RAG chains, agents, and pipelines without writing low-level ChromaDB code.

# pip install langchain langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# --- Option 1: Create from documents ---
docs = [
    Document(page_content="ChromaDB is a vector database.", metadata={"source": "intro"}),
    Document(page_content="HNSW is used for ANN search.",  metadata={"source": "tech"}),
    Document(page_content="RAG improves LLM accuracy.",    metadata={"source": "ai"}),
]
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="lc_demo",
    persist_directory="./lc_chroma",  # persistent storage
)

# --- Option 2: Load existing ChromaDB ---
vectorstore = Chroma(
    collection_name="lc_demo",
    embedding_function=embeddings,
    persist_directory="./lc_chroma",
)

# Similarity search
results = vectorstore.similarity_search("vector databases", k=2)
for doc in results:
    print(doc.page_content)

# As retriever (for use in chains)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3, "filter": {"source": "tech"}},
)

# Build a simple RAG chain with LCEL
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)
print(rag_chain.invoke("What search algorithm does ChromaDB use?"))
What does vectorstore.as_retriever() return in LangChain?
What is the main advantage of using LangChain's Chroma integration over using chromadb directly?
24. How do you implement multi-tenancy or data isolation in ChromaDB?

ChromaDB does not have built-in user-level access control, but you can implement logical isolation between tenants using separate collections per tenant (strong isolation) or metadata-based filtering (lighter weight). Choose based on your security and scale requirements.

import chromadb

client = chromadb.PersistentClient(path="./multi_tenant")

# --- Strategy 1: Separate collection per tenant ---
# Strong isolation — one tenant cannot accidentally access another's data
def get_tenant_collection(tenant_id: str):
    collection_name = f"tenant_{tenant_id}"  # e.g. "tenant_acme_corp"
    return client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine", "tenant": tenant_id},
    )

col_acme  = get_tenant_collection("acme_corp")
col_globex = get_tenant_collection("globex_inc")

col_acme.add(
    documents=["ACME internal policy v1"],
    ids=["acme-doc-1"],
)
col_globex.add(
    documents=["Globex product catalogue"],
    ids=["globex-doc-1"],
)
# ACME queries only touch ACME's collection, so they cannot return Globex data

# --- Strategy 2: Metadata filtering (shared collection) ---
# Lighter weight — all tenants share one collection, filtered at query time
shared_col = client.get_or_create_collection("shared_docs")

shared_col.add(
    documents=["ACME policy", "Globex catalogue"],
    metadatas=[{"tenant_id": "acme"}, {"tenant_id": "globex"}],
    ids=["s1", "s2"],
)

def tenant_query(tenant_id: str, query: str, n: int = 3):
    return shared_col.query(
        query_texts=[query],
        n_results=n,
        where={"tenant_id": tenant_id},  # ALWAYS filter by tenant
    )

results = tenant_query("acme", "company policies")
print(results["documents"])  # Only ACME docs returned
Multi-tenancy strategies

| Strategy | Isolation | Overhead | Best for |
| --- | --- | --- | --- |
| Separate collections | Strong: no cross-tenant risk | More collections to manage | High-security, regulated industries |
| Metadata filter | Logical: relies on query discipline | Single collection, simpler ops | Many small tenants, lower risk |

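
With one collection per tenant, the collection name is derived from an external tenant ID, so it is worth normalising that ID first. The sketch below assumes collection names should be at most 63 characters, use only letters, digits, underscores, and hyphens, and start and end with an alphanumeric character. Check the naming rules of your ChromaDB version before relying on this. `tenant_collection_name` is a hypothetical helper name.

```python
import hashlib
import re


def tenant_collection_name(tenant_id: str, prefix: str = "tenant") -> str:
    """Derive a stable, conservative collection name from a tenant ID."""
    if not tenant_id:
        raise ValueError("tenant_id must be non-empty")
    # Replace anything outside the assumed safe character set
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", tenant_id).strip("_-").lower()
    # Short digest keeps names unique when two IDs normalise to the same slug
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()[:8]
    head = f"{prefix}_{slug}"[: 63 - len(digest) - 1].rstrip("_-")
    return f"{head}_{digest}"
```

`get_tenant_collection()` could then call `client.get_or_create_collection(name=tenant_collection_name(tenant_id))` instead of formatting the raw ID into the name.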
What is the risk of using metadata filtering for multi-tenancy in ChromaDB instead of separate collections?
How do you name collections for per-tenant isolation in ChromaDB?
25. What is embedding consistency and why is it critical in ChromaDB applications?

Embedding consistency means using the exact same embedding model and version for both indexing (adding documents) and querying. If you embed documents with model A but query with model B, the resulting vectors live in incompatible geometric spaces — similarity distances become meaningless and retrieval quality collapses.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./consistency_demo")

# CORRECT: same embedding function for add and query
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "correct_usage",
    embedding_function=ef,  # pass the same ef every time you open this collection
)
collection.add(
    documents=["Hello world"],
    ids=["d1"],
)
# query() embeds the query text with the ef attached to this collection object
results = collection.query(query_texts=["greetings"], n_results=1)
# Works correctly because the same ef is applied to both documents and queries

# ---
# PITFALL 1: switching models between sessions
# Session 1: add with all-MiniLM-L6-v2 (384 dims)
# Session 2: accidentally use all-mpnet-base-v2 (768 dims) → dimension mismatch error!

# PITFALL 2: updating embedding model version
# Model v1.0 and v1.1 may produce different vector spaces
# Always re-embed ALL documents when upgrading the embedding model

# BEST PRACTICE: store the model name in collection metadata
collection_safe = client.get_or_create_collection(
    "safe_collection",
    embedding_function=ef,
    metadata={
        "hnsw:space": "cosine",
        "embedding_model": "all-MiniLM-L6-v2",  # document which model was used
        "embedding_dim":   "384",
    },
)
# On load, verify the model matches what is stored:
meta = collection_safe.metadata
print(meta["embedding_model"])  # "all-MiniLM-L6-v2"
print(meta["embedding_dim"])    # "384"

# When you need to upgrade the embedding model:
# 1. Create a NEW collection with the new model
# 2. Re-embed and re-insert all documents
# 3. Run validation queries to confirm quality
# 4. Delete the old collection
Embedding consistency checklist

| Check | Why |
| --- | --- |
| Same model name | Different models produce vectors in different spaces |
| Same model version | Even minor version updates can shift the vector space |
| Same preprocessing | Lowercasing, truncation, etc. must be identical |
| Store model name in metadata | Documents which model was used for future reference |
| Re-embed on model upgrade | Old and new vectors cannot coexist in the same collection |

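
Recording the model in collection metadata only helps if something checks it. A minimal sketch of such a guard follows, assuming the `embedding_model` and `embedding_dim` keys used in the example above. `assert_embedding_matches` is a hypothetical helper name, and the check is a convention of this guide rather than something ChromaDB enforces.

```python
from typing import Any, Mapping, Optional


def assert_embedding_matches(
    collection_metadata: Optional[Mapping[str, Any]],
    model_name: str,
    dim: int,
) -> None:
    """Raise if the collection was built with a different model or dimension."""
    meta = collection_metadata or {}
    stored_model = meta.get("embedding_model")
    stored_dim = meta.get("embedding_dim")
    if stored_model is None or stored_dim is None:
        raise LookupError("collection metadata does not record the embedding model")
    # embedding_dim may be stored as a string, so normalise before comparing
    if stored_model != model_name or int(stored_dim) != dim:
        raise ValueError(
            f"collection expects {stored_model} ({stored_dim} dims), "
            f"got {model_name} ({dim} dims)"
        )
```

Calling `assert_embedding_matches(collection_safe.metadata, "all-MiniLM-L6-v2", 384)` at startup turns a silent quality regression into an immediate error.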
What happens if you add documents to ChromaDB with one embedding model and query with a different one?
What is the best practice for recording which embedding model was used for a ChromaDB collection?
26. How do you run ChromaDB as a standalone HTTP server and connect to it from multiple clients?

For production or multi-process environments, run ChromaDB as a persistent HTTP server and connect all clients via chromadb.HttpClient(). This removes the single-writer SQLite limitation and allows any number of clients — including different languages — to share the same database.

# --- SERVER SIDE ---
# Install: pip install chromadb
# Start the server from the command line:
# chroma run --path ./chroma_data --port 8000 --host 0.0.0.0

import chromadb

# --- CLIENT SIDE ---
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
)

# Verify server is reachable
client.heartbeat()  # returns a nanosecond timestamp; raises if the server is unreachable

# Usage is identical to PersistentClient
collection = client.get_or_create_collection(
    "shared_docs",
    metadata={"hnsw:space": "cosine"},
)
collection.add(
    documents=["Shared document from client 1"],
    ids=["s1"],
)
results = collection.query(query_texts=["shared content"], n_results=1)
print(results["documents"])

# With authentication (chromadb server configured with auth)
client_auth = chromadb.HttpClient(
    host="my-server.example.com",
    port=443,
    ssl=True,
    headers={"Authorization": "Bearer my-token"},
)
# docker-compose.yml — containerised ChromaDB server
# version: "3.9"
# services:
#   chromadb:
#     image: chromadb/chroma:latest
#     ports:
#       - "8000:8000"
#     volumes:
#       - chroma_data:/chroma/chroma
#     environment:
#       - IS_PERSISTENT=TRUE
#       - ANONYMIZED_TELEMETRY=FALSE
# volumes:
#   chroma_data:
Client modes comparison

| Mode | Concurrency | Network | Use case |
| --- | --- | --- | --- |
| EphemeralClient | Single process only | None | Tests, notebooks |
| PersistentClient | Single writer only | None | Local scripts, dev |
| HttpClient | Multiple clients | HTTP/HTTPS | Production, microservices |

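
In containerised deployments the client often starts before the server is ready, so a readiness loop around the heartbeat call is a common addition. The sketch below is generic Python. `wait_until_ready` is a hypothetical helper name, and it accepts any zero-argument callable so that it can wrap `chromadb.HttpClient(...).heartbeat()` without importing ChromaDB itself.

```python
import time
from typing import Any, Callable


def wait_until_ready(
    check: Callable[[], Any],
    attempts: int = 5,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call check() until it succeeds, backing off exponentially between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return check()
        except Exception as exc:  # any failure counts as "not ready yet"
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise RuntimeError(f"server not reachable after {attempts} attempts") from last_error
```

A caller might use `wait_until_ready(lambda: chromadb.HttpClient(host="localhost", port=8000).heartbeat())` before creating collections.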
What command starts a ChromaDB HTTP server from the terminal?
Why is HttpClient preferred over PersistentClient in a multi-container deployment?
27. When should you use upsert() instead of add() in ChromaDB, and what are common patterns?

upsert() is the idempotent write operation in ChromaDB: it inserts a document if the ID does not exist, or updates it if the ID already exists. This makes it safe to call repeatedly without checking whether a document has been indexed before — a critical property for ETL pipelines, scheduled sync jobs, and incremental indexing.

import chromadb
from datetime import datetime

client = chromadb.PersistentClient(path="./upsert_demo")
col = client.get_or_create_collection("products")

# Pattern 1: Safe initial load
# Can re-run the script without duplicate ID errors
def sync_products(products: list[dict]):
    col.upsert(
        documents=[p["description"] for p in products],
        ids=       [str(p["id"])   for p in products],
        metadatas= [{"name": p["name"], "price": p["price"], "updated": int(datetime.now().timestamp())}
                    for p in products],
    )

products_v1 = [
    {"id": 1, "name": "Widget", "description": "A blue widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "description": "A red gadget",  "price": 14.99},
]
sync_products(products_v1)  # inserts both
print(col.count())  # 2

# Product 1 description changed — upsert handles it cleanly
products_v2 = [
    {"id": 1, "name": "Widget", "description": "An improved blue widget v2", "price": 11.99},
    {"id": 3, "name": "Doohickey", "description": "A green doohickey", "price": 4.99},
]
sync_products(products_v2)  # updates id=1, inserts id=3
print(col.count())          # 3

# Verify the update
result = col.get(ids=["1"])
print(result["documents"][0])   # "An improved blue widget v2"
print(result["metadatas"][0]["price"])  # 11.99

# Pattern 2: Incremental indexing — only upsert changed documents
def incremental_sync(items, last_sync_ts: int):
    changed = [i for i in items if i["updated_at"] > last_sync_ts]
    if changed:
        col.upsert(
            documents=[i["body"] for i in changed],
            ids=       [i["id"]  for i in changed],
            metadatas= [{"updated_at": i["updated_at"]} for i in changed],
        )
add vs upsert decision guide

| Scenario | Use |
| --- | --- |
| First-time bulk load with guaranteed unique IDs | add(): faster, errors catch duplicate bugs |
| Recurring sync job (daily/hourly) | upsert(): safe to re-run without cleanup |
| User-triggered document update | upsert(): no need to check if the doc exists first |
| Append-only event log | add(): duplicates should be errors, not updates |

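
Incremental sync by timestamp works only when the source tracks `updated_at`. When it does not, comparing a content hash is a common alternative. The sketch below assumes each item is a dict with `id` and `body` keys, as in Pattern 2, and that the caller can supply the previously stored hashes. `content_hash` and `select_changed` are hypothetical helper names.

```python
import hashlib
from typing import Any, Dict, List, Mapping, Sequence


def content_hash(text: str) -> str:
    """Stable fingerprint of a document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(
    items: Sequence[Mapping[str, Any]],
    stored_hashes: Mapping[str, str],
) -> List[Dict[str, Any]]:
    """Return only new or modified items, each annotated with its hash."""
    changed: List[Dict[str, Any]] = []
    for item in items:
        digest = content_hash(item["body"])
        if stored_hashes.get(item["id"]) != digest:
            changed.append({**item, "content_hash": digest})
    return changed
```

Storing `content_hash` in each document's metadata lets the next run rebuild `stored_hashes` from `col.get(include=["metadatas"])`, so unchanged documents are never re-embedded.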
Why is upsert() preferred over add() for a nightly ETL job that syncs a product catalogue into ChromaDB?
What happens to the stored embedding when you upsert() a document with updated text?
28. What are best practices for structuring ChromaDB collection metadata for production use?

Collection-level metadata (set via create_collection(metadata=...)) stores configuration about the collection itself. Document-level metadata (set per document via add(metadatas=[...])) enables filtered retrieval. Both need thoughtful design for maintainable production systems.

import chromadb
from datetime import datetime

client = chromadb.PersistentClient(path="./prod_db")

# Good collection-level metadata: document operational details
collection = client.get_or_create_collection(
    name="support_tickets_v2",
    metadata={
        # HNSW config
        "hnsw:space":           "cosine",
        "hnsw:construction_ef": 200,
        "hnsw:search_ef":       100,
        # Operational metadata
        "embedding_model":      "text-embedding-3-small",
        "embedding_dims":       "1536",
        "schema_version":       "2",
        "created_at":           "2024-01-15",
        "description":          "Customer support ticket embeddings for semantic search",
    },
)

# Good document-level metadata: filterable, flat, typed
def add_ticket(ticket: dict):
    collection.upsert(
        documents=[ticket["description"]],
        ids=[f"ticket-{ticket['id']}"],
        metadatas=[{
            # Filterable dimensions
            "status":    ticket["status"],         # "open"/"closed"/"pending"
            "priority":  ticket["priority"],       # "low"/"medium"/"high"
            "category":  ticket["category"],       # "billing"/"technical"/"general"
            "agent_id":  ticket["agent_id"],       # str identifier
            # Date as Unix timestamp (int) — enables $gt/$lt range queries
            "created_ts": int(datetime.fromisoformat(ticket["created_at"]).timestamp()),
            "year":       int(ticket["created_at"][:4]),
            # Boolean as int: portable across ChromaDB versions (recent releases also accept bool)
            "is_escalated": int(ticket.get("escalated", False)),
        }],
    )

# Effective compound filter
results = collection.query(
    query_texts=["payment failed cannot checkout"],
    n_results=10,
    where={"$and": [
        {"status":    "open"},
        {"priority":  {"$in": ["high", "medium"]}},
        {"category":  "billing"},
        {"created_ts":{"$gte": int(datetime(2024, 1, 1).timestamp())}},
    ]},
)

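Key rules: store dates as Unix timestamps (int) so that $gt/$lt range filters work. Store booleans as 0/1 integers if you need to support older ChromaDB releases; recent releases also accept bool values. Keep metadata flat (scalar values only) and keep keys short and snake_case. Document your schema in collection-level metadata so future developers know what fields exist.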
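The rules above can be sketched as a small normalisation step that runs before add() or upsert(). The helper name to_chroma_metadata and the input field names are hypothetical, and the sketch assumes dates arrive as timezone-aware datetime objects.

```python
from datetime import datetime, timezone


def to_chroma_metadata(record: dict) -> dict:
    """Flatten a raw record into scalar metadata values (hypothetical helper)."""
    meta = {}
    for key, value in record.items():
        if value is None:
            continue  # drop missing values instead of storing None
        if isinstance(value, bool):
            # Check bool before int: bool is a subclass of int in Python
            meta[key] = int(value)
        elif isinstance(value, (str, int, float)):
            meta[key] = value
        elif isinstance(value, datetime):
            # Dates become integer Unix timestamps so range filters work
            meta[f"{key}_ts"] = int(value.timestamp())
        else:
            raise TypeError(f"Unsupported metadata type for {key!r}: {type(value).__name__}")
    return meta


example = to_chroma_metadata({
    "status": "open",
    "escalated": True,
    "created": datetime(2024, 1, 15, tzinfo=timezone.utc),
    "agent": None,
})
print(example)
```

Rejecting nested lists and dicts up front keeps the failure close to the source record rather than inside a batch insert.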

Why should dates be stored as Unix timestamps (integers) rather than ISO date strings in ChromaDB metadata?
How do you store a boolean value (True/False) in ChromaDB document metadata?
29. How does ChromaDB compare to FAISS, and when should you choose one over the other?

FAISS (Facebook AI Similarity Search) and ChromaDB both store and search embedding vectors, but they are designed for very different use cases. FAISS is a low-level library optimised for raw performance; ChromaDB is a higher-level database designed for developer ergonomics and full-stack AI applications.

ChromaDB vs FAISS
Feature | ChromaDB | FAISS
Type | Vector database (full-stack) | Vector index library (low-level)
Storage | Persistent SQLite + HNSW files | In-memory or flat files (manual)
Metadata | Built-in key-value filtering | No metadata; must be managed separately
Documents | Stores original text alongside vectors | Stores vectors only; text management is manual
Persistence | Built-in PersistentClient | Manual save/load with faiss.write_index() and faiss.read_index()
CRUD | add, get, update, delete, upsert | add(), plus remove_ids() on some index types; no in-place update
API | High-level Python + REST | Low-level Python/C++ bindings
Performance | Good for <10M docs | Excellent for 10M+ docs (optional GPU acceleration)
Embedding function | Built-in (auto-embed text) | You must manage embeddings yourself
Best for | RAG apps, prototyping, small-medium scale | High-throughput ML systems, research, scale
# FAISS — lower level, manage everything manually
import faiss
import numpy as np

# Build index manually
dim = 384
index = faiss.IndexFlatIP(dim)           # inner product
vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(vectors)
index.add(vectors)                        # add vectors
query_vec = vectors[:1]                   # shape (1, dim): reuse a stored vector as the query
D, I = index.search(query_vec, k=5)      # search
faiss.write_index(index, "index.faiss")  # save manually

# ChromaDB — higher level, text in, results out
import chromadb
client = chromadb.Client()
col = client.create_collection("demo")
col.add(documents=["text one", "text two"], ids=["1","2"])
results = col.query(query_texts=["similar text"], n_results=2)
# Embeddings, persistence, metadata all handled automatically
What is the key limitation of FAISS compared to ChromaDB for RAG application development?
In what scenario would you choose FAISS over ChromaDB?
30. What are common ChromaDB errors and how do you handle them in production code?

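ChromaDB raises specific exception types that should be caught and handled gracefully in production applications. The class names and some behaviours have changed between releases (a missing collection, for example, has surfaced as ValueError, InvalidCollectionException, or NotFoundError depending on the version), so check the chromadb.errors module of the release you deploy. Understanding the error hierarchy helps you write resilient ingestion pipelines and retrieval code.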

import chromadb
from chromadb.errors import (
    InvalidCollectionException,
    IDAlreadyExistsError,
    InvalidDimensionException,
)

client = chromadb.PersistentClient(path="./error_demo")

# --- Error 1: Collection not found ---
try:
    col = client.get_collection("does_not_exist")
except InvalidCollectionException as e:
    print(f"Collection missing: {e}")
    col = client.create_collection("does_not_exist")  # create it

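# --- Error 2: Duplicate ID ---
# Behaviour is version-dependent: recent releases silently skip an existing ID
# on add() instead of raising, so this except branch may never run.
col.add(documents=["Original doc"], ids=["doc-1"])
try:
    col.add(documents=["Duplicate doc"], ids=["doc-1"])
except IDAlreadyExistsError:
    print("ID already exists: use upsert() instead")
    col.upsert(documents=["Updated doc"], ids=["doc-1"])  # safe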

# --- Error 3: Dimension mismatch ---
# Occurs when pre-computed embeddings don't match collection's embedding dimensions
col2 = client.create_collection("fixed_dim")
col2.add(embeddings=[[0.1, 0.2, 0.3]], documents=["Doc"], ids=["x"])
try:
    col2.add(embeddings=[[0.1, 0.2]], documents=["Wrong dim"], ids=["y"])  # 2-dim
except InvalidDimensionException as e:
    print(f"Dimension mismatch: {e}")

# --- Error 4: Connection error (HttpClient) ---
try:
    remote = chromadb.HttpClient(host="bad-host", port=9999)
    remote.heartbeat()
except Exception as e:
    print(f"Server unreachable: {e}")

# --- Production pattern: retry wrapper ---
import time
from functools import wraps

def with_retry(max_attempts=3, delay=1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    print(f"Attempt {attempt+1} failed: {e}. Retrying...")
                    time.sleep(delay * (attempt + 1))
        return wrapper
    return decorator

@with_retry(max_attempts=3)
def safe_add(collection, documents, ids):
    collection.upsert(documents=documents, ids=ids)
Common ChromaDB exceptions (class names vary by release)
Exception | Cause | Fix
InvalidCollectionException | get_collection() on non-existent name | Use get_or_create_collection()
IDAlreadyExistsError | add() with duplicate IDs (only on releases that raise; others skip the record silently) | Use upsert() for idempotent writes
InvalidDimensionException | Pre-computed embeddings wrong size | Match dimensions to collection's model
ValueError | Empty IDs, bad metadata types | Validate inputs before calling ChromaDB
ConnectionError / HTTP library exception | HttpClient cannot reach server | Check server health, retry with backoff
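The "retry with backoff" advice in the last row can be sketched as a decorator that retries only on the exception types you name, so that validation errors such as ValueError fail immediately instead of being retried. The name retry_on and its parameters are illustrative, not part of ChromaDB.

```python
import time
from functools import wraps


def retry_on(exceptions, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry only on the given exception types, doubling the delay each time."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


# Usage sketch: wrap a write so that only connection failures are retried.
@retry_on((ConnectionError, TimeoutError), max_attempts=4)
def safe_upsert(collection, documents, ids):
    collection.upsert(documents=documents, ids=ids)
```

The sleep argument is injectable so that tests can run without real delays.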
What is the correct fix when ChromaDB raises IDAlreadyExistsError during an add() call?
Which exception does client.get_collection('nonexistent') raise?
31. How do you back up and restore a ChromaDB persistent database?

A PersistentClient database is simply a directory on disk. Backing it up is as straightforward as copying that directory — but you must ensure no writes are occurring during the copy to avoid a corrupted SQLite file.

import chromadb
import shutil
import os
from datetime import datetime

DB_PATH    = "./my_chroma_db"
BACKUP_DIR = "./backups"

# --- Backup strategy 1: Simple directory copy ---
# SAFE when: no active PersistentClient writes during the copy
os.makedirs(BACKUP_DIR, exist_ok=True)
timestamp   = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = os.path.join(BACKUP_DIR, f"chroma_backup_{timestamp}")
shutil.copytree(DB_PATH, backup_path)
print(f"Backup saved to {backup_path}")

# --- Backup strategy 2: SQLite online backup (safe during reads) ---
import sqlite3

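def backup_sqlite(db_path: str, backup_path: str):
    """SQLite online backup: safe even with active readers."""
    # The target directory must exist before SQLite can create the backup file
    os.makedirs(backup_path, exist_ok=True)
    src = sqlite3.connect(os.path.join(db_path, "chroma.sqlite3"))
    dst = sqlite3.connect(os.path.join(backup_path, "chroma.sqlite3"))
    with dst:
        src.backup(
            dst,
            pages=100,
            # progress receives (status, remaining, total) after each batch of pages
            progress=lambda status, remaining, total: print(
                f"Backed up {total - remaining} of {total} pages"
            ),
        )
    dst.close()
    src.close()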
    # Also copy the HNSW index binary files
    for root, dirs, files in os.walk(db_path):
        for f in files:
            if f != "chroma.sqlite3":
                rel = os.path.relpath(root, db_path)
                dest_dir = os.path.join(backup_path, rel)
                os.makedirs(dest_dir, exist_ok=True)
                shutil.copy2(os.path.join(root, f), os.path.join(dest_dir, f))

# --- Restore ---
def restore_backup(backup_path: str, restore_path: str):
    if os.path.exists(restore_path):
        shutil.rmtree(restore_path)  # remove current
    shutil.copytree(backup_path, restore_path)
    print(f"Restored from {backup_path} to {restore_path}")

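# Verify restored database
def verify_restore(restore_path: str):
    client = chromadb.PersistentClient(path=restore_path)
    for col in client.list_collections():
        # list_collections() returns Collection objects or plain names, depending on version
        name = col if isinstance(col, str) else col.name
        print(f"  {name}: {client.get_collection(name).count()} documents")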

For the HttpClient / server mode: stop the ChromaDB server before copying the data directory, or use SQLite's online backup API. Never copy a SQLite file while it has active writers — this can produce a corrupted backup.
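A backup is only useful if it can be opened. As a minimal sketch, the check below runs SQLite's built-in integrity check against the copied chroma.sqlite3 file. The function name sqlite_is_healthy is illustrative, and the check covers only the SQLite file, not the HNSW index files stored beside it.

```python
import os
import sqlite3


def sqlite_is_healthy(sqlite_path: str) -> bool:
    """Return True if SQLite's integrity check reports no problems."""
    if not os.path.isfile(sqlite_path):
        return False  # avoid sqlite3.connect() creating an empty file
    conn = sqlite3.connect(sqlite_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # e.g. "file is not a database"
    finally:
        conn.close()
    return rows == [("ok",)]
```

Running this right after each backup, and again before a restore, catches a truncated or corrupted copy early.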

What files make up a ChromaDB PersistentClient database that must be backed up together?
Why is it unsafe to copy a ChromaDB PersistentClient directory while the client has active write operations?
32. How do you ensure the correct embedding function is used when reopening a persistent ChromaDB collection?

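ChromaDB stores document text and vectors persistently, but it has not historically stored which embedding function was used (check the release notes of your version before relying on any automatic behaviour). When you reopen a PersistentClient, re-supply the same embedding function to the collection; otherwise ChromaDB may fall back to its default model, and queries will either fail with a dimension error or compare vectors produced by two different models.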

import chromadb
from chromadb.utils import embedding_functions
import os

DB_PATH = "./persistent_ef_demo"

# === SESSION 1: Create and populate collection ===
client1 = chromadb.PersistentClient(path=DB_PATH)
ef_openai = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
col1 = client1.get_or_create_collection(
    name="my_docs",
    embedding_function=ef_openai,                   # set the EF
    metadata={"hnsw:space": "cosine",
              "embedding_model": "text-embedding-3-small"},  # document it
)
col1.add(documents=["ChromaDB is great"], ids=["d1"])
print("Session 1 done, process exits...")
del client1, col1

# === SESSION 2: Reopen — MUST re-supply the same embedding function ===
client2 = chromadb.PersistentClient(path=DB_PATH)

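# WRONG: without embedding_function=, ChromaDB falls back to all-MiniLM-L6-v2 (384-dim).
# Its query vectors do not match the stored 1536-dim vectors, so query() fails with a
# dimension error (or returns meaningless matches when the dimensions happen to agree).
# col_wrong = client2.get_collection("my_docs")  # DO NOT DO THIS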

# CORRECT: Re-supply the exact same embedding function
ef_openai_v2 = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",  # must match session 1
)
col2 = client2.get_collection(
    name="my_docs",
    embedding_function=ef_openai_v2,  # required!
)
results = col2.query(query_texts=["vector databases"], n_results=1)
print(results["documents"])  # correct result

# TIP: Read model name from collection metadata to avoid hardcoding
saved_model = col2.metadata.get("embedding_model", "all-MiniLM-L6-v2")
print(f"Using model: {saved_model}")
EF persistence gotchas
Scenario | Problem | Solution
Reopen collection without EF | Defaults to all-MiniLM-L6-v2, mismatches stored vectors | Always pass embedding_function= on get_collection()
Upgrade embedding model | Old vectors incompatible with new model | Create new collection, re-embed all docs, migrate
Team member uses different EF | Silent quality degradation | Store model name in collection metadata, document in README
What does ChromaDB use as the embedding function when you call get_collection() without specifying one?
What is the safest pattern for ensuring embedding function consistency across application restarts?
33. How do you interpret ChromaDB query distances and convert them into meaningful relevance scores?

ChromaDB query results include a distances field. The interpretation depends on the distance metric. Raw distances are not directly comparable across metrics, but they can be normalised into a [0, 1] relevance score for display or thresholding.

import chromadb

client = chromadb.Client()
col = client.create_collection("relevance_demo", metadata={"hnsw:space": "cosine"})
col.add(
    documents=[
        "ChromaDB is an open-source vector database",
        "Python is a popular programming language",
        "The Eiffel Tower is in Paris France",
    ],
    ids=["d1","d2","d3"],
)

results = col.query(
    query_texts=["vector database for AI"],
    n_results=3,
    include=["documents","distances"],
)

raw_distances = results["distances"][0]
print("Raw cosine distances:", raw_distances)
# e.g. [0.18, 0.72, 1.31]
# cosine distance: 0 = identical, 2 = completely opposite

# Convert cosine distance to similarity score [0, 1]
def cosine_distance_to_score(distance: float) -> float:
    """cosine distance [0,2] → relevance score [0,1]"""
    return 1 - (distance / 2)

for doc, dist in zip(results["documents"][0], raw_distances):
    score = cosine_distance_to_score(dist)
    print(f"  Score: {score:.3f} | {doc[:50]}")
# Score: 0.910 | ChromaDB is an open-source vector database
# Score: 0.640 | Python is a popular programming language
# Score: 0.345 | The Eiffel Tower is in Paris France

# Threshold: only return results above minimum relevance
MIN_SCORE = 0.7
filtered = [
    (doc, cosine_distance_to_score(dist))
    for doc, dist in zip(results["documents"][0], raw_distances)
    if cosine_distance_to_score(dist) >= MIN_SCORE
]
print(f"\nResults above {MIN_SCORE} threshold: {len(filtered)}")
for doc, score in filtered:
    print(f"  {score:.3f}: {doc}")
Distance metric interpretation
Metric | Range | Most similar | Conversion to [0,1] score
cosine | 0 to 2 | 0 (identical direction) | score = 1 - distance/2
l2 (squared Euclidean) | 0 to ∞ | 0 (identical) | score = 1 / (1 + distance)
ip (1 - inner product) | 0 to 2 for unit-normalised vectors; unbounded otherwise | Smallest value | score = 1 - distance/2 (unit-normalised vectors only)
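The cosine and l2 conversions can be collected into one helper, sketched below. The function name distance_to_score is illustrative; the cosine branch clamps the result because floating-point error can push a distance slightly outside [0, 2].

```python
def distance_to_score(distance: float, space: str = "cosine") -> float:
    """Map a ChromaDB distance to a relevance score in [0, 1] (illustrative helper)."""
    if space == "cosine":
        # Cosine distance lies in [0, 2]; clamp to absorb rounding error
        return max(0.0, min(1.0, 1.0 - distance / 2.0))
    if space == "l2":
        # Distance is unbounded, so squash it: 0 -> 1.0, large -> near 0
        return 1.0 / (1.0 + distance)
    raise ValueError(f"No conversion defined for space {space!r}")


print(distance_to_score(0.4))        # cosine distance 0.4
print(distance_to_score(1.0, "l2"))  # l2 distance 1.0
```

Scores from different metrics are still not comparable with each other; the conversion only gives each metric a bounded scale for thresholds and display.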
For cosine distance in ChromaDB, what does a distance value of 0 indicate?
How do you convert a ChromaDB cosine distance value of 0.4 to a relevance score on a 0–1 scale?
34. What are ChromaDB's practical size limits and performance characteristics at scale?

ChromaDB does not impose hard document count limits, but practical performance degrades at different thresholds depending on storage mode, hardware, and HNSW configuration. Understanding these helps you plan capacity and know when to consider alternatives.

ChromaDB scale guidelines
Collection size | Storage mode | Typical behaviour
< 100K docs | PersistentClient or HttpClient | Excellent: sub-10ms query latency
100K – 1M docs | HttpClient (server mode) | Good: 10–100ms queries with default settings
1M – 10M docs | HttpClient + HNSW tuning | Acceptable: tune hnsw:M and hnsw:search_ef
> 10M docs | Consider FAISS or Weaviate | ChromaDB may struggle; these are better at extreme scale
import chromadb
import time

client = chromadb.Client()
col = client.create_collection(
    "scale_test",
    metadata={
        "hnsw:space":           "cosine",
        "hnsw:construction_ef": 200,   # higher quality index
        "hnsw:search_ef":       100,   # higher recall at query time
        "hnsw:M":               32,    # more connections per node
    },
)

# Batch insert 50,000 documents
BATCH = 500
for i in range(0, 50_000, BATCH):
    col.add(
        documents=[f"Document about topic {j % 100}" for j in range(i, i+BATCH)],
        ids=[str(j) for j in range(i, i+BATCH)],
    )

print(f"Collection has {col.count()} documents")

# Measure query latency
start = time.perf_counter()
results = col.query(query_texts=["topic 42"], n_results=10)
elapsed = time.perf_counter() - start
print(f"Query latency: {elapsed*1000:.1f}ms")

# Memory footprint estimate:
# 384-dim float32 vectors: 384 * 4 bytes = 1.5 KB per doc
# 50K docs * 1.5 KB = ~75 MB just for vectors
# HNSW graph adds ~20-30% overhead → ~100 MB total for 50K docs

Memory rule of thumb: each 384-dim vector requires ~1.5 KB. A 1M document collection with 384-dim embeddings needs ~1.5 GB just for vectors, plus HNSW graph overhead (~25%). Plan memory accordingly when deploying the ChromaDB server.
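The rule of thumb is plain arithmetic and can be sketched as a capacity estimator. The function name is illustrative, and the 25% HNSW overhead is the rough figure quoted above rather than a measured value.

```python
def estimate_vector_memory(
    num_docs: int,
    dims: int = 384,
    bytes_per_value: int = 4,     # float32
    hnsw_overhead: float = 0.25,  # rough figure from the rule of thumb
) -> dict:
    """Estimate memory for raw vectors and for vectors plus HNSW graph, in GB."""
    vector_bytes = num_docs * dims * bytes_per_value
    total_bytes = vector_bytes * (1 + hnsw_overhead)
    return {
        "bytes_per_doc": dims * bytes_per_value,
        "vector_gb": vector_bytes / 1e9,
        "total_gb": total_bytes / 1e9,
    }


estimate = estimate_vector_memory(1_000_000)
print(estimate)
```

Document text and metadata live in SQLite and are not included in this estimate.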

What is the approximate memory footprint per document for 384-dimensional float32 embeddings in ChromaDB?
At what approximate collection size does ChromaDB start to show performance degradation without tuning?
35. How do you use ChromaDB to detect and remove near-duplicate or semantically similar documents?

ChromaDB's similarity search makes it straightforward to detect semantic duplicates — documents that express the same idea with different wording. Before inserting a new document, query ChromaDB to see if a highly similar document already exists and decide whether to skip or replace it.

import chromadb

client = chromadb.Client()
col = client.create_collection(
    "dedup_store",
    metadata={"hnsw:space": "cosine"},
)

# Duplicate threshold on the rescaled score below (1 - distance/2); tune for your use case
DUPLICATE_THRESHOLD = 0.95  # score >= 0.95 means cosine distance <= 0.10, i.e. cosine similarity >= 0.90

def cosine_dist_to_score(d: float) -> float:
    return 1 - d / 2

def add_if_unique(
    collection,
    document: str,
    doc_id: str,
    metadata: dict = None,
    threshold: float = DUPLICATE_THRESHOLD,
) -> bool:
    """Returns True if document was added, False if it was a duplicate."""
    if collection.count() == 0:
        # Pass None rather than {}: some ChromaDB versions reject empty metadata dicts
        collection.add(documents=[document], ids=[doc_id],
                       metadatas=[metadata] if metadata else None)
        return True

    # Query for the nearest existing document
    results = collection.query(
        query_texts=[document],
        n_results=1,
        include=["documents", "distances"],
    )
    nearest_dist  = results["distances"][0][0]
    nearest_score = cosine_dist_to_score(nearest_dist)
    nearest_doc   = results["documents"][0][0]

    if nearest_score >= threshold:
        print(f"DUPLICATE detected (score={nearest_score:.3f}):")
        print(f"  New:      {document[:60]}")
        print(f"  Existing: {nearest_doc[:60]}")
        return False  # skip insertion

    # Pass None rather than {}: some ChromaDB versions reject empty metadata dicts
    collection.add(documents=[document], ids=[doc_id],
                   metadatas=[metadata] if metadata else None)
    return True

# Test deduplication
phrases = [
    ("ChromaDB is a vector database for AI apps.", "p1"),
    ("Chroma DB is a vector store built for AI applications.", "p2"),  # near-dup of p1
    ("Python is great for machine learning.", "p3"),
]
for text, pid in phrases:
    added = add_if_unique(col, text, pid)
    print(f"Added: {added} — {text[:40]}")

print(f"\nFinal collection size: {col.count()}")  # 2 if p2 is flagged as a duplicate of p1 (depends on the embedding model)

Use cases: deduplication during web scraping, preventing duplicate knowledge base entries in RAG systems, clustering similar customer support tickets, and identifying near-identical product descriptions in e-commerce catalogues.
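For cleaning a batch before it is loaded, the same idea works offline on pre-computed vectors. The sketch below is a greedy pass in pure Python; the names are illustrative, vectors are assumed to be non-zero, and the pairwise comparison is quadratic, so it suits small batches only.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_dedup(vectors: dict, min_similarity: float = 0.9) -> list:
    """Keep the first document of each near-duplicate group, in insertion order."""
    kept = []
    for doc_id, vec in vectors.items():
        # Keep the document only if it is not too close to anything already kept
        if all(cosine_similarity(vec, vectors[k]) < min_similarity for k in kept):
            kept.append(doc_id)
    return kept


batch = {"a": [1.0, 0.0], "b": [0.99, 0.05], "c": [0.0, 1.0]}
print(greedy_dedup(batch))
```

Because the pass is greedy, the result depends on input order: whichever member of a near-duplicate group comes first is the one that survives.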

What cosine similarity score range would you typically use to classify two documents as near-duplicates?
What ChromaDB operation is at the core of semantic deduplication before inserting a new document?
36. How do you reset or clear a ChromaDB collection without deleting and recreating it?

ChromaDB does not have a direct clear() or truncate() method. The idiomatic way to reset a collection is to delete it and recreate it with the same parameters. For selective deletion, use delete() with ID lists or where filters.

import chromadb

client = chromadb.PersistentClient(path="./reset_demo")

# Setup
col = client.get_or_create_collection(
    "my_col",
    metadata={"hnsw:space": "cosine", "version": "1"},
)
col.add(
    documents=[f"Document {i}" for i in range(100)],
    ids=[str(i) for i in range(100)],
    metadatas=[{"batch": i // 10} for i in range(100)],
)
print(col.count())  # 100

# --- Option 1: Reset (delete all + recreate) ---
def reset_collection(client, name: str, metadata: dict = None):
    """Delete and recreate a collection, preserving its metadata.

    Any custom embedding function must be re-supplied by the caller.
    """
    saved_meta = {}
    try:
        saved_meta = client.get_collection(name).metadata or {}
    except Exception:
        pass
    client.delete_collection(name)
    return client.create_collection(
        name=name,
        metadata=metadata or saved_meta,
    )

col = reset_collection(client, "my_col")
print(col.count())  # 0

# Re-add fresh data after reset
col.add(documents=["Fresh start"], ids=["new-1"])

# --- Option 2: Selective delete by filter ---
col2 = client.get_or_create_collection("selective")
col2.add(
    documents=[f"Doc {i}" for i in range(20)],
    ids=[str(i) for i in range(20)],
    metadatas=[{"batch": i // 5} for i in range(20)],
)

# Delete only batch 0 (documents 0-4)
col2.delete(where={"batch": 0})
print(col2.count())  # 15 remaining

# Delete specific IDs
col2.delete(ids=["5","6","7"])
print(col2.count())  # 12 remaining

# Delete ALL via get + delete (when no useful metadata filter exists)
all_ids = col2.get(include=[])["ids"]  # get all IDs
if all_ids:
    col2.delete(ids=all_ids)
print(col2.count())  # 0
Collection reset options
Method | When to use | Preserves schema?
delete_collection + create_collection | Full reset; cleanest approach | Yes (manual)
delete(where={...}) | Selective clear by metadata condition | Yes
delete(ids=[...]) | Remove specific known documents | Yes
get all IDs then delete | Clear all without metadata | Yes
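For large collections, the "get all IDs then delete" option is safer in batches, so that no single call carries every ID at once. The sketch below assumes only the get(include=[], limit=...) and delete(ids=...) methods of a collection; the function name clear_collection and the batch size are illustrative.

```python
def clear_collection(collection, batch_size: int = 1000) -> int:
    """Delete every record in batches; returns the number of records removed."""
    removed = 0
    while True:
        # include=[] asks for IDs only, skipping documents and embeddings
        ids = collection.get(include=[], limit=batch_size)["ids"]
        if not ids:
            return removed
        collection.delete(ids=ids)
        removed += len(ids)
```

The collection object, its name, metadata, and embedding function all stay in place; only the records are removed.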
Why does ChromaDB not have a built-in clear() or truncate() method?
What is the most efficient way to delete all documents matching a metadata condition from a ChromaDB collection?
37. What configuration settings does ChromaDB support and how do you disable telemetry?

By default, ChromaDB sends anonymised usage telemetry to help the development team understand how the product is used. In enterprise or privacy-sensitive environments this should be disabled. ChromaDB also supports several configuration settings via environment variables and the Settings class.

import chromadb
from chromadb.config import Settings
import os

# --- Option 1: Disable telemetry via environment variable ---
# Set this before the first client is created (ideally in the process environment)
os.environ["ANONYMIZED_TELEMETRY"] = "False"

# --- Option 2: Disable via Settings class ---
client = chromadb.PersistentClient(
    path="./my_db",
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True,           # enables client.reset() — wipes all data!
    ),
)

# --- Option 3: Disable telemetry for HttpClient ---
client_http = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False),
)

# Settings available via Settings class
all_settings = Settings(
    anonymized_telemetry=False,
    allow_reset=False,          # default: False — prevents accidental wipe
    # chroma_db_impl="duckdb+parquet",  # legacy v0.3 setting (not used in v0.4+)
)

# allow_reset=True enables client.reset() — DELETES ALL DATA
# Only use in testing environments!
client_test = chromadb.EphemeralClient(
    settings=Settings(allow_reset=True)
)
client_test.create_collection("temp")
client_test.reset()  # wipes everything — use in test fixtures
print(client_test.list_collections())  # []
Key ChromaDB settings
Setting | Default | Notes
anonymized_telemetry | True | Set False in production for privacy
allow_reset | False | Set True only in test environments; reset() wipes all data
ANONYMIZED_TELEMETRY env var | True | Environment variable alternative to Settings class
How do you disable ChromaDB's anonymised telemetry in a production application?
What does allow_reset=True in ChromaDB Settings enable and why is it dangerous?
38. What is a production readiness checklist for a ChromaDB-based application?

Moving a ChromaDB application from prototype to production involves several architectural decisions around storage, concurrency, reliability, and observability. This checklist covers the key concerns.

ChromaDB production checklist
Area | Recommendation
Storage mode | Use HttpClient connecting to a ChromaDB server, not PersistentClient, in multi-process apps
Embedding consistency | Store embedding model name in collection metadata; always re-supply EF on get_collection()
Distance metric | Set hnsw:space='cosine' at collection creation for text; cannot change later
Backups | Schedule regular directory snapshots or SQLite online backups; test restore procedure
Telemetry | Set ANONYMIZED_TELEMETRY=False for privacy
Batching | Insert in batches of 100–500; use upsert() for idempotent pipelines
Error handling | Catch IDAlreadyExistsError, InvalidCollectionException; implement retry logic for HttpClient
HNSW tuning | Increase hnsw:construction_ef to 200 and hnsw:search_ef to 50–100 for large collections
Metadata schema | Use ints for dates/booleans; document schema in collection metadata
Security | Run server behind a reverse proxy with TLS; add auth headers for HttpClient
Monitoring | Log query latency, collection size, and embedding function errors
Scale planning | Plan ~1.5 KB/doc for 384-dim vectors + 25% HNSW overhead; consider alternatives above 10M docs
# Minimal production-ready ChromaDB setup
import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "prod_knowledge_base"

def create_client():
    return chromadb.HttpClient(
        host=os.environ["CHROMA_HOST"],
        port=int(os.environ.get("CHROMA_PORT", 8000)),
        settings=Settings(anonymized_telemetry=False),
    )

def get_collection(client):
    ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"],
        model_name=EMBEDDING_MODEL,
    )
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        embedding_function=ef,
        metadata={
            "hnsw:space":           "cosine",
            "hnsw:construction_ef": 200,
            "hnsw:search_ef":       100,
            "embedding_model":      EMBEDDING_MODEL,
        },
    )

client = create_client()
client.heartbeat()   # fail fast if server is unreachable
collection = get_collection(client)
logger.info(f"Connected to collection with {collection.count()} documents")
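The monitoring item in the checklist can be sketched with a small timing context manager built on the standard library. The logger name chroma.metrics, the function name timed, and the 200 ms slow-query threshold are all illustrative choices.

```python
import logging
import time
from contextlib import contextmanager

metrics_logger = logging.getLogger("chroma.metrics")  # hypothetical logger name


@contextmanager
def timed(operation: str, slow_ms: float = 200.0):
    """Log how long the wrapped block took; warn when it exceeds slow_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        level = logging.WARNING if elapsed_ms > slow_ms else logging.INFO
        metrics_logger.log(level, "%s took %.1f ms", operation, elapsed_ms)


# Usage sketch (assumes a `collection` obtained as in the setup above):
# with timed("query"):
#     results = collection.query(query_texts=["refund policy"], n_results=5)
```

Because the timing is logged in a finally block, failed calls are measured too, which helps separate slow queries from timeouts.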
Which ChromaDB client should a production multi-service application use?
What is the recommended first check after connecting to a production ChromaDB HttpClient?