
How do you use ChromaDB to detect and remove near-duplicate or semantically similar documents?

ChromaDB's similarity search makes it straightforward to detect semantic duplicates: documents that express the same idea in different wording. Before inserting a new document, query the collection for its nearest neighbour; if the similarity exceeds a chosen threshold, skip the insert or replace the existing entry.

import chromadb

client = chromadb.Client()
col = client.create_collection(
    "dedup_store",
    metadata={"hnsw:space": "cosine"},
)

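# Similarity threshold: tune based on your use case.
# Chroma's cosine distance is d = 1 - cosine similarity (range 0 to 2), and the
# score below is 1 - d/2, so score >= 0.95 means d <= 0.10 (cosine similarity >= 0.90).
DUPLICATE_THRESHOLD = 0.95  # score >= 0.95 → treat as duplicate

def cosine_dist_to_score(d: float) -> float:
    """Map a cosine distance in [0, 2] to a score in [0, 1]."""
    return 1 - d / 2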

def add_if_unique(
    collection,
    document: str,
    doc_id: str,
    metadata: dict = None,
    threshold: float = DUPLICATE_THRESHOLD,
) -> bool:
    """Returns True if document was added, False if it was a duplicate."""
    # Some ChromaDB versions reject an empty metadata dict, so pass None instead
    metadatas = [metadata] if metadata else None

    # An empty collection has nothing to compare against, so insert directly
    if collection.count() == 0:
        collection.add(documents=[document], ids=[doc_id], metadatas=metadatas)
        return True

    # Query for the nearest existing document
    results = collection.query(
        query_texts=[document],
        n_results=1,
        include=["documents", "distances"],
    )
    nearest_dist  = results["distances"][0][0]
    nearest_score = cosine_dist_to_score(nearest_dist)
    nearest_doc   = results["documents"][0][0]

    if nearest_score >= threshold:
        print(f"DUPLICATE detected (score={nearest_score:.3f}):")
        print(f"  New:      {document[:60]}")
        print(f"  Existing: {nearest_doc[:60]}")
        return False  # skip insertion

    collection.add(documents=[document], ids=[doc_id], metadatas=metadatas)
    return True

# Test deduplication
phrases = [
    ("ChromaDB is a vector database for AI apps.", "p1"),
    ("Chroma DB is a vector store built for AI applications.", "p2"),  # near-dup of p1
    ("Python is great for machine learning.", "p3"),
]
for text, pid in phrases:
    added = add_if_unique(col, text, pid)
    print(f"Added: {added} — {text[:40]}")

print(f"\nFinal collection size: {col.count()}")  # 2 if p2 is flagged as a duplicate of p1 (depends on the embedding model and threshold)
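
The score used above is derived from the distance ChromaDB returns, so it helps to see how the three quantities relate. The following is a minimal pure-Python sketch, not part of the original answer: it assumes the "cosine" space reports distance as 1 minus cosine similarity, and the helper `score_threshold_to_max_distance` is a hypothetical name introduced only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance as reported for a cosine space: 1 - similarity (range 0 to 2)."""
    return 1.0 - cosine_similarity(a, b)

def cosine_dist_to_score(d):
    """Same mapping as in the answer's code: distance [0, 2] -> score [0, 1]."""
    return 1 - d / 2

def score_threshold_to_max_distance(threshold):
    """Hypothetical helper: invert score = 1 - d/2 to get the largest allowed distance."""
    return 2.0 * (1.0 - threshold)

reference = [1.0, 0.0]
examples = {
    "identical":  [1.0, 0.0],
    "orthogonal": [0.0, 1.0],
    "opposite":   [-1.0, 0.0],
}
for name, vec in examples.items():
    dist = cosine_distance(reference, vec)
    print(f"{name:10s} similarity={1 - dist:+.2f} distance={dist:.2f} "
          f"score={cosine_dist_to_score(dist):.2f}")

# A score threshold of 0.95 allows a cosine distance of at most about 0.10
print(f"max distance for score 0.95: {score_threshold_to_max_distance(0.95):.2f}")
```

Under this mapping, a score threshold of 0.95 corresponds to a cosine similarity of roughly 0.90, which is worth keeping in mind when comparing thresholds quoted as scores with thresholds quoted as similarities.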

Use cases: deduplication during web scraping, preventing duplicate knowledge base entries in RAG systems, clustering similar customer support tickets, and identifying near-identical product descriptions in e-commerce catalogues.
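
The insert-time check only prevents new duplicates; cleaning up a collection that already contains them needs a separate pass. One possible approach, sketched here in pure Python under stated assumptions, is a greedy scan that keeps the first occurrence of each idea and flags later items that are too similar to anything already kept. The function name `find_near_duplicate_ids` and the toy vectors are illustrative and not from the original answer; the threshold here is a raw cosine similarity, not the 0-to-1 score used earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_near_duplicate_ids(ids, embeddings, threshold=0.90):
    """Greedy pass: keep the first item of each group, flag later near-duplicates.

    Compares every item against all kept items, so it is O(n^2) and only
    suitable for small collections.
    """
    kept_indices = []
    duplicate_ids = []
    for i, emb in enumerate(embeddings):
        is_dup = any(
            cosine_similarity(emb, embeddings[k]) >= threshold
            for k in kept_indices
        )
        if is_dup:
            duplicate_ids.append(ids[i])
        else:
            kept_indices.append(i)
    return duplicate_ids

# Toy 2-D "embeddings": b points almost the same way as a, c is unrelated
ids = ["a", "b", "c"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]

to_delete = find_near_duplicate_ids(ids, embeddings, threshold=0.90)
print(to_delete)  # ['b']
```

With a real collection, the IDs and vectors could come from `collection.get(include=["embeddings"])`, and the flagged IDs could then be passed to `collection.delete(ids=...)`. For larger collections, querying each document's nearest neighbours through the index would likely scale better than this pairwise scan.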

What ChromaDB operation is at the core of semantic deduplication before inserting a new document?
What cosine similarity score range would you typically use to classify two documents as near-duplicates?


