Database / ChromaDB Interview Questions
How do you use ChromaDB to detect and remove near-duplicate or semantically similar documents?
ChromaDB's similarity search makes it straightforward to detect semantic duplicates — documents that express the same idea with different wording. Before inserting a new document, query ChromaDB to see if a highly similar document already exists and decide whether to skip or replace it.
from typing import Optional

import chromadb

client = chromadb.Client()  # in-memory client; nothing is persisted
col = client.create_collection(
    "dedup_store",
    metadata={"hnsw:space": "cosine"},  # index with cosine distance
)

# Similarity threshold: tune based on your use case
DUPLICATE_THRESHOLD = 0.95  # treat cosine similarity >= 0.95 as a duplicate


def cosine_dist_to_score(d: float) -> float:
    # In cosine space ChromaDB reports distance = 1 - cosine similarity,
    # so the similarity is recovered as 1 - distance.
    return 1.0 - d


def add_if_unique(
    collection,
    document: str,
    doc_id: str,
    metadata: Optional[dict] = None,
    threshold: float = DUPLICATE_THRESHOLD,
) -> bool:
    """Return True if the document was added, False if it was a duplicate."""
    # Only search for a neighbour when the collection already holds documents
    if collection.count() > 0:
        # Query for the nearest existing document
        results = collection.query(
            query_texts=[document],
            n_results=1,
            include=["documents", "distances"],
        )
        nearest_dist = results["distances"][0][0]
        nearest_score = cosine_dist_to_score(nearest_dist)
        nearest_doc = results["documents"][0][0]
        if nearest_score >= threshold:
            print(f"DUPLICATE detected (score={nearest_score:.3f}):")
            print(f"  New:      {document[:60]}")
            print(f"  Existing: {nearest_doc[:60]}")
            return False  # skip insertion

    # Pass None rather than an empty dict: some ChromaDB versions reject {}
    collection.add(
        documents=[document],
        ids=[doc_id],
        metadatas=[metadata] if metadata else None,
    )
    return True


# Test deduplication
phrases = [
    ("ChromaDB is a vector database for AI apps.", "p1"),
    ("Chroma DB is a vector store built for AI applications.", "p2"),  # near-dup of p1
    ("Python is great for machine learning.", "p3"),
]
for text, pid in phrases:
    added = add_if_unique(col, text, pid)
    print(f"Added: {added} | {text[:40]}")
# Prints 2 if p2 is flagged as a near-duplicate of p1; the outcome depends on
# the embedding model and the chosen threshold.
print(f"\nFinal collection size: {col.count()}")

Use cases: deduplication during web scraping, preventing duplicate knowledge base entries in RAG systems, clustering similar customer support tickets, and identifying near-identical product descriptions in e-commerce catalogues.
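
The example above only prevents duplicates at insertion time. To remove near-duplicates that are already stored, one possible approach is a greedy pass over the collection: keep each document the first time it is seen and flag any close neighbours for deletion. The sketch below is an illustration under stated assumptions, not a ChromaDB feature: `find_duplicate_ids` and its `k` parameter are hypothetical names, the collection is assumed to use cosine space (distance = 1 - cosine similarity), and it is assumed to be small enough to load all embeddings at once.

```python
from typing import Any, List, Set


def find_duplicate_ids(collection: Any, threshold: float = 0.95, k: int = 5) -> List[str]:
    """Return IDs of stored documents that closely match an earlier kept document.

    Hypothetical helper: assumes a cosine-space collection, where
    distance = 1 - cosine similarity.
    """
    stored = collection.get(include=["embeddings"])
    ids = list(stored["ids"])
    if not ids:
        return []
    embeddings = stored["embeddings"]

    kept: Set[str] = set()
    removed: Set[str] = set()
    for doc_id, emb in zip(ids, embeddings):
        if doc_id in removed:
            continue  # already flagged as a duplicate of a kept document
        kept.add(doc_id)
        hits = collection.query(
            # Plain floats keep the call valid whether get() returned lists or arrays
            query_embeddings=[[float(x) for x in emb]],
            # k + 1 because the document itself is normally its own nearest hit
            n_results=min(k + 1, len(ids)),
            include=["distances"],
        )
        for other_id, dist in zip(hits["ids"][0], hits["distances"][0]):
            if other_id in kept or other_id in removed:
                continue
            if 1.0 - dist >= threshold:
                removed.add(other_id)

    # Preserve the original storage order in the result
    return [i for i in ids if i in removed]
```

With the `col` collection from the example, the flagged entries could then be removed with `col.delete(ids=dup_ids)` after checking that `dup_ids = find_duplicate_ids(col)` is not empty. Because only the `k` nearest neighbours of each document are inspected, a dense cluster of more than `k` near-duplicates may need a larger `k` or a second pass.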
