
LangChain / LangGraph Interview Questions II

Could not find what you were looking for? Send us your question and we will be happy to answer it.

1. What are the different memory types in LangChain?

LangChain provides several memory classes that differ in how they store and compress conversation history. Choosing the right one involves balancing context quality, token cost, and retrieval precision.

  • ConversationBufferMemory — stores every message verbatim. Best for short conversations where full context matters.
  • ConversationBufferWindowMemory — keeps only the last k messages. Best for long conversations; avoids context overflow.
  • ConversationSummaryMemory — uses an LLM to summarise older messages. Best for very long sessions where token savings matter more than verbatim detail.
  • ConversationSummaryBufferMemory — summarises messages beyond a token limit and keeps recent messages verbatim. Best when you need a balance between detail and cost.
  • VectorStoreRetrieverMemory — stores messages as embeddings and retrieves semantically relevant past context. Best for long-running assistants that need to recall specific facts.
  • ConversationEntityMemory — extracts and tracks named entities (people, places, concepts) from the conversation. Best for personal assistants that must remember facts about people and topics.

In LCEL-based applications, the memory pattern has shifted from these classes towards explicitly managing a messages list in chain state (with RunnableWithMessageHistory for automatic persistence per session ID), or using LangGraph's checkpointing for full state persistence.
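The windowing idea behind ConversationBufferWindowMemory can be illustrated with a minimal plain-Python sketch. The class and method names below are invented for illustration, not LangChain API; they only show the "keep the last k exchanges" behaviour:

```python
from collections import deque

class WindowMemory:
    """Sketch of windowed conversation memory: only the last k
    exchanges are kept, older ones are silently dropped."""

    def __init__(self, k: int):
        # a deque with maxlen discards the oldest entry automatically
        self.turns = deque(maxlen=k)

    def save_context(self, user_msg: str, ai_msg: str) -> None:
        self.turns.append((user_msg, ai_msg))

    def load_memory(self) -> list:
        return list(self.turns)
```

With k=2, saving a third exchange silently evicts the first, which is exactly why windowed memory avoids context overflow at the cost of forgetting older facts.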

Which memory type keeps only the last k conversation messages to avoid context overflow?
When would VectorStoreRetrieverMemory be more useful than ConversationBufferMemory?
2. How do you implement conversation memory?

The recommended modern approach for conversation memory uses RunnableWithMessageHistory, which wraps an LCEL chain and automatically loads and saves message history per session ID from a configurable store — no manual history tracking in application code.

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# In-memory store (swap for Redis, DynamoDB etc. in production)
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder("history"),
    ("human", "{input}"),
])

chain = prompt | ChatOpenAI()

chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# session_id identifies the conversation thread
chain_with_memory.invoke(
    {"input": "My name is Alice."},
    config={"configurable": {"session_id": "alice-123"}},
)
chain_with_memory.invoke(
    {"input": "What is my name?"},
    config={"configurable": {"session_id": "alice-123"}},
)
What does the get_session_history function provide to RunnableWithMessageHistory?
In a ChatPromptTemplate with conversation memory, what is the purpose of MessagesPlaceholder?
3. How do vector stores work in LangChain?

A vector store in LangChain stores text (documents, chunks) as high-dimensional embedding vectors so you can perform semantic similarity search — finding documents whose meaning is close to a query, even if the exact words don't match. Every vector store integrates an embedding model and a storage backend.

The standard workflow:

  1. Embed documents with an embedding model (OpenAIEmbeddings, HuggingFaceEmbeddings, etc.)
  2. Store the vectors in a vector database (FAISS, Chroma, Pinecone, Weaviate, PGVector)
  3. At query time, embed the query and retrieve the k nearest vectors
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Create store from documents
vectorstore = FAISS.from_documents(docs, embeddings)

# Similarity search
results = vectorstore.similarity_search("How does LangChain work?", k=4)

# Use as a retriever in a chain
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})

Search types: similarity returns the k most similar documents; mmr (Maximal Marginal Relevance) balances similarity with diversity to avoid returning near-duplicate chunks. Most production vector stores (Pinecone, Weaviate, Qdrant) support metadata filtering so you can scope searches to a subset of documents.
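The MMR trade-off can be sketched in plain Python. This is a simplified greedy implementation of the idea, not LangChain's internal code; function names are invented for the sketch:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy Maximal Marginal Relevance: each step picks the document
    that best trades off relevance to the query against redundancy
    with documents already selected. Returns indices in pick order."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_mult=1.0 this degenerates to plain similarity ranking; lower values penalise near-duplicate chunks more heavily, which is the diversity behaviour described above.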

What does MMR (Maximal Marginal Relevance) retrieval prioritise over simple similarity search?
What does vectorstore.as_retriever() return?
4. How do you build RAG pipelines with LangChain?

A RAG (Retrieval-Augmented Generation) pipeline enriches LLM responses with external knowledge by retrieving relevant documents at query time and injecting them into the prompt. A complete LangChain RAG pipeline has five stages:

  1. Load — ingest source documents with a DocumentLoader
  2. Split — chunk documents with a TextSplitter for efficient retrieval
  3. Embed & Store — embed chunks and store in a vector store
  4. Retrieve — at query time, fetch the most relevant chunks
  5. Generate — inject retrieved context into the prompt and generate an answer
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Load
loader = WebBaseLoader("https://python.langchain.com/docs/get_started/introduction")
docs = loader.load()

# 2. Split
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3. Embed & Store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 4 & 5. Retrieve + Generate
retriever = vectorstore.as_retriever()
rag_prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI()
    | StrOutputParser()
)

answer = rag_chain.invoke("What is LangChain?")
What is the purpose of chunk_overlap in RecursiveCharacterTextSplitter?
In a RAG LCEL chain, why is RunnablePassthrough used for the 'question' key?
5. What are document loaders and splitters?

Document loaders ingest content from various sources and return a list of Document objects (each containing page_content and metadata). Text splitters then divide those documents into smaller chunks suitable for embedding and retrieval.

Common document loaders:

  • PyPDFLoader — extracts text from PDF files, one page per Document
  • WebBaseLoader — scrapes a web page, returns its text content
  • CSVLoader — each row becomes a Document
  • DirectoryLoader — recursively loads all files in a directory
  • UnstructuredFileLoader — handles Word, PowerPoint, HTML, email, and more
  • GitHubLoader — loads files from a GitHub repository

Common text splitters:

  • RecursiveCharacterTextSplitter — splits on paragraphs, then sentences, then words until chunks fit the target size. Most commonly used.
  • CharacterTextSplitter — splits on a single character separator
  • TokenTextSplitter — splits by token count, precise for context window budgeting
  • MarkdownHeaderTextSplitter — splits Markdown by header sections, preserving structure
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max chars per chunk
    chunk_overlap=200,  # overlap to preserve context at boundaries
    length_function=len,
)
chunks = splitter.split_documents(documents)
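The "paragraphs, then sentences, then words" fallback can be sketched in a few lines. This is a simplification of what RecursiveCharacterTextSplitter does: the real splitter also merges adjacent small pieces back up to chunk_size and can keep separators, which this sketch omits:

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=50):
    """Sketch of recursive splitting: try the coarsest separator first
    and only fall back to finer separators for pieces that are still
    too large; hard-cut characters as a last resort."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # no separators left: hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if piece:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks
```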
Why is RecursiveCharacterTextSplitter preferred over CharacterTextSplitter?
What does the chunk_overlap parameter in a text splitter do?
6. How do retrievers work in LangChain?

A Retriever in LangChain is a Runnable that takes a string query and returns a list of Document objects. It is the standard abstraction that decouples the RAG chain from the specific search mechanism — you can swap a vector store retriever for a keyword search retriever or a hybrid retriever without changing the chain.

Types of retrievers available in LangChain:

  • VectorStoreRetriever — most common; wraps a vector store and performs similarity (or MMR) search. Created via vectorstore.as_retriever()
  • MultiQueryRetriever — uses an LLM to generate multiple query variants, retrieves for each, deduplicates results
  • ContextualCompressionRetriever — post-processes retrieved documents to extract only the relevant sentences, reducing noise injected into the prompt
  • SelfQueryRetriever — parses natural language queries to extract both a semantic search string and metadata filters (e.g. 'articles from 2024 about Python')
  • ParentDocumentRetriever — retrieves small chunks for precision but returns their larger parent documents for fuller context
  • EnsembleRetriever — combines results from multiple retrievers (e.g. BM25 keyword + vector) using reciprocal rank fusion
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(docs, k=4)
vector = vectorstore.as_retriever(search_kwargs={"k": 4})

hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.5, 0.5])
results = hybrid.invoke("How does LangChain memory work?")
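The reciprocal rank fusion that EnsembleRetriever uses to merge rankings can be sketched directly. This is the standard RRF formula, not LangChain's internal implementation; the function name is invented for the sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Sketch of reciprocal rank fusion: every document earns
    1 / (k + rank) from each ranking it appears in, and scores are
    summed; k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks reasonably high in both the keyword and the vector list beats one that ranks first in only one of them, which is why hybrid retrieval is robust to either method failing.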
What does ContextualCompressionRetriever do to retrieved documents?
What does EnsembleRetriever combine, and why is this useful?
7. What is multi-query retrieval?

Multi-query retrieval addresses a key weakness of single-vector search: a user's question may be phrased in a way that doesn't closely match how the relevant information is worded in the document store. MultiQueryRetriever solves this by using an LLM to automatically generate several alternative phrasings of the query, running each against the vector store, and deduplicating the union of all results.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

# For a query like "What is LangChain memory?"
# The LLM might generate:
#   1. "How does LangChain handle conversation state?"
#   2. "What memory classes are available in LangChain?"
#   3. "How do you persist context between LangChain calls?"
# Then retrieves for all 3 and deduplicates
results = retriever.invoke("What is LangChain memory?")

Multi-query retrieval improves recall — you're less likely to miss relevant documents due to vocabulary mismatch — but it increases latency and cost since it makes multiple LLM calls (for query generation) and multiple vector search calls per user query. It works best for knowledge bases with varied terminology or when users ask high-level questions that could be answered by multiple document sections.

What problem does MultiQueryRetriever primarily solve?
What is the main trade-off of using MultiQueryRetriever?

8. What are parent document retrieval patterns?

The parent document retrieval pattern addresses a fundamental tension in RAG systems: small chunks improve retrieval precision (the embedding closely matches the query), but large chunks provide richer context for the LLM to answer from. ParentDocumentRetriever resolves this by indexing small child chunks for search but returning their larger parent documents (or the full original documents) to the LLM.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Child splitter: small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Parent splitter: larger chunks returned to LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # stores parent documents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)

# Query retrieves small child chunks but returns their 2000-char parents
results = retriever.invoke("What is LCEL?")

This pattern significantly improves answer quality for knowledge-intensive tasks because the LLM receives enough surrounding context to reason about the answer, while the vector search remains precise.

In ParentDocumentRetriever, what are the child chunks used for?
What problem does ParentDocumentRetriever solve compared to using uniform-size chunks?
9. What are production deployment patterns for LangChain?

Moving a LangChain application from prototype to production requires addressing reliability, scalability, observability, and cost. The key patterns are:

  • LangServe + Docker — wrap chains as FastAPI endpoints with add_routes(), containerise with Docker, deploy to a managed container service (AWS ECS, GCP Cloud Run, Kubernetes). Expose via an API gateway with rate limiting.
  • Async endpoints — use ainvoke() / astream() with FastAPI async routes (async def) to handle concurrent requests without blocking worker threads. Pair with uvicorn --workers N or Gunicorn.
  • Response caching — use InMemoryCache for same-process caching or SQLiteCache / Redis-backed cache for multi-process. Cache key is the full prompt + model parameters, so identical requests skip the LLM call entirely.
  • Observability — enable LangSmith tracing with LANGCHAIN_TRACING_V2=true. Set up alerts on p95 latency and error rate. Track token usage per request to control costs.
  • Resilience — apply .with_retry() for transient API errors and .with_fallbacks([cheaper_model]) for budget management under load.
  • Secrets management — never hardcode API keys; use environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault).
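The resilience bullet can be illustrated with a plain-Python sketch of what .with_retry() plus .with_fallbacks() amounts to. The wrapper below is invented for illustration and is not a LangChain API:

```python
def resilient(primary, fallback, max_retries=2):
    """Sketch of the retry-then-fallback pattern: retry the primary
    model on transient errors, then hand the request to a cheaper
    fallback if every attempt fails."""
    def invoke(prompt):
        for _ in range(max_retries + 1):
            try:
                return primary(prompt)
            except Exception:
                continue
        return fallback(prompt)
    return invoke
```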
What LangChain library wraps a chain as a FastAPI REST API with /invoke and /stream endpoints?
Why should LangChain chains use async (ainvoke/astream) in production FastAPI deployments?
10. How do you implement caching in LangChain?

LangChain supports LLM response caching at the global level, so any chain that calls an LLM automatically benefits from cache hits without modifying individual chains. The cache key is the serialised prompt plus model parameters — if the same prompt is sent twice, the second call returns the cached response without hitting the API.

In-memory cache — fastest, lost on process restart, suitable for development and single-request deduplication:

from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

set_llm_cache(InMemoryCache())

SQLite cache — persists across restarts, suitable for single-process production servers or CLIs:

from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

Semantic cache — uses embedding similarity to serve cached responses for queries that are semantically equivalent but not character-identical:

# GPTCache is another semantic option; RedisSemanticCache shown here:
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings

set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.1,
))

Caching is most effective for knowledge base Q&A where many users ask similar questions, and for evaluation pipelines where the same prompts are run repeatedly.
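The exact-match keying described above can be sketched like this. This is illustrative, not LangChain's internal cache implementation:

```python
class ExactMatchCache:
    """Sketch of exact-match LLM caching: the key combines the prompt
    with the model parameters, so changing either one is a miss."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt, params):
        # sort parameters so key construction is order-independent
        return (prompt, tuple(sorted(params.items())))

    def lookup(self, prompt, params):
        return self._store.get(self._key(prompt, params))

    def update(self, prompt, params, response):
        self._store[self._key(prompt, params)] = response
```

Note that even a temperature change produces a different key; a semantic cache relaxes the prompt half of the key to embedding similarity instead of string equality.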

What is the cache key used to identify a cached LLM response in LangChain?
What makes RedisSemanticCache different from SQLiteCache?
11. What are cost optimization techniques for LangChain?

LLM API costs are primarily driven by token usage. LangChain applications can apply several techniques at different layers to reduce costs without significantly degrading quality:

  • Response caching — the single highest-impact technique for repetitive queries. InMemoryCache or RedisSemanticCache returns stored responses for identical or semantically similar prompts, paying zero tokens for cache hits.
  • Model tiering — use cheaper models (GPT-4o-mini, Claude Haiku) for simple classification, routing, and extraction tasks; reserve expensive models (GPT-4o, Claude Sonnet) for complex reasoning. Implement this with RunnableBranch routing.
  • Prompt compression — use LLMLingua (via langchain-community) to compress retrieved context by removing low-information tokens before injecting into the prompt.
  • Token counting before calling — use llm.get_num_tokens(text) to check prompt size; truncate or summarise if it exceeds a budget:
llm = ChatOpenAI()
token_count = llm.get_num_tokens(prompt_text)
if token_count > 3000:
    # Summarise or truncate before proceeding
    ...
  • Streaming — stream responses to clients early; use max_tokens to cap output length for use cases where truncation is acceptable.
  • Batch processing — use .batch() with appropriate concurrency for offline workloads to maximise throughput per dollar.
  • Avoid over-engineering with agents — a simple RAG chain is 10-50x cheaper per query than a multi-step agent. Only use agents when the task genuinely requires dynamic decision-making.
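Model tiering can be sketched with a simple router. The complexity heuristic below is invented for illustration; production systems typically use a small classifier model or RunnableBranch conditions instead:

```python
def tiered_invoke(question, cheap_model, strong_model):
    """Route simple-looking queries to an inexpensive model and
    reserve the strong model for ones that look like they need
    reasoning. The heuristic is deliberately crude."""
    looks_complex = len(question.split()) > 25 or any(
        kw in question.lower() for kw in ("why", "explain", "compare", "design")
    )
    return (strong_model if looks_complex else cheap_model)(question)
```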
Which cost optimisation technique has the highest impact for applications with repetitive queries?
What does model tiering mean in the context of LangChain cost optimisation?
12. What security best practices should you follow for LangChain applications?

LangChain applications interact with LLMs, external tools, and user-supplied data, creating several attack surfaces that require explicit mitigation:

  • Prompt injection prevention — the most critical LLM-specific risk. Malicious users craft inputs that override system instructions (e.g. 'Ignore all previous instructions and...'). Mitigate with input sanitisation, structural separation of user input from system context, and output validation that rejects responses that claim to override system behaviour.
  • Secrets management — never hardcode API keys in source code or commit them to version control. Use environment variables, .env files (excluded from git), or a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager).
  • Tool permission minimisation — agents with write access to databases, file systems, or APIs can cause significant damage if manipulated via prompt injection. Grant tools the minimum permissions required: read-only where possible, scoped API tokens.
  • Human-in-the-loop for irreversible actions — use LangGraph's interrupt_before to pause before any tool that modifies data, deletes files, or sends emails, requiring human approval.
  • Output filtering — validate and filter LLM outputs for PII, harmful content, or off-topic responses before returning to users. Libraries like Guardrails AI or NeMo Guardrails integrate with LangChain.
  • Rate limiting on LangServe endpoints — prevent abuse and runaway costs from unauthenticated requests using API gateway rate limiting or middleware.
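A first line of defence against prompt injection is a keyword screen on user input. The patterns below are illustrative only; a real deployment combines screening with structural role separation and output validation, since keyword lists are trivially bypassed:

```python
import re

# illustrative patterns only; attackers rephrase easily
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (the )?system prompt",
    r"you are now (in )?developer mode",
]

def looks_like_injection(user_input: str) -> bool:
    """Naive keyword screen for prompt-injection attempts."""
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in INJECTION_PATTERNS)
```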
What is prompt injection in the context of LangChain applications?
What is the recommended approach for agents that perform irreversible actions (delete, send email)?
13. How do you test LangChain applications?

Testing LangChain applications requires strategies for both unit testing individual components without real LLM calls, and end-to-end evaluation of response quality.

Unit testing with fake LLMs — use FakeListLLM or FakeListChatModel to return predetermined responses so tests run fast and deterministically without API calls:

from langchain_community.llms.fake import FakeListLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

fake_llm = FakeListLLM(responses=["Paris", "Berlin", "Tokyo"])
chain = ChatPromptTemplate.from_template("{q}") | fake_llm | StrOutputParser()

def test_capital_chain():
    result = chain.invoke({"q": "Capital of France?"})
    assert result == "Paris"

LangSmith evaluations — create a dataset of input/expected-output pairs in LangSmith and run evaluations using built-in evaluators (qa, criteria, labeled_score_string) or custom LLM-as-judge evaluators:

from langsmith.evaluation import evaluate
results = evaluate(
    my_chain.invoke,
    data="my-golden-dataset",
    evaluators=["qa"],
    experiment_prefix="rag-v2-test",
)

For integration tests, use pytest with responses or httpx mocks to simulate LLM API responses. Always test that your chain handles empty outputs, malformed JSON from the LLM, and retriever returning zero documents.

What does FakeListLLM do in a LangChain unit test?
What is an LLM-as-judge evaluator in LangSmith?
14. How do you monitor LangChain applications?

Monitoring LangChain applications in production means tracking latency, error rates, token usage, and response quality over time. LangSmith is the primary tool, but you can also integrate with standard observability infrastructure.

LangSmith tracing — enabled with two environment variables (a third, LANGCHAIN_PROJECT, optionally names the project), it captures every run automatically with full context:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=production-chat-v2

In LangSmith you get: latency distribution per chain step, error rate trends, token cost per request, feedback scores from users or evaluators, and the ability to filter/search runs by any metadata tag you add.

Custom metadata tagging — tag runs with user ID, feature flag, model version, etc. to enable filtering in LangSmith dashboards:

chain.invoke(
    {"input": user_query},
    config={
        "metadata": {"user_id": user_id, "ab_group": "control"},
        "tags": ["production", "rag-v2"],
    }
)

Custom callbacks for metrics — implement a callback handler that pushes latency, token counts, and error flags to your existing metrics backend (Prometheus, Datadog, CloudWatch) on each LLM call end:

from langchain_core.callbacks import BaseCallbackHandler

class MetricsCallback(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        # llm_output can be None for some providers
        tokens = (response.llm_output or {}).get("token_usage", {})
        prometheus_counter.inc(tokens.get("total_tokens", 0))
What two environment variables are required to enable LangSmith tracing?
What is the purpose of adding metadata tags to a chain invocation config in production?
15. What are common pitfalls in LangChain/LangGraph development?

Developers new to LangChain and LangGraph frequently encounter the same set of issues. Knowing them in advance saves significant debugging time:

  • Context window overflow — injecting the full conversation history into every prompt causes failures on long conversations. Fix: use ConversationBufferWindowMemory, summarisation memory, or LangGraph's message trimming.
  • Agent infinite loops — an agent can keep calling tools indefinitely if it never reaches a satisfying answer. Fix: always set max_iterations in AgentExecutor or add a loop count check in LangGraph conditional edges.
  • Prompt injection from user inputs — if raw user text is inserted into system-level prompts, attackers can override your instructions. Fix: sanitise inputs, use structured message roles, never directly concatenate user text into the system message.
  • Over-engineering with agents — using a 5-step agent for a task that a single RAG call handles. Agents are slower, more expensive, and less predictable. Fix: start with the simplest approach and only add agent complexity when necessary.
  • Ignoring async in high-concurrency servers — using invoke() instead of ainvoke() in FastAPI handlers blocks the event loop and degrades performance under load.
  • Hallucinated tool calls — ReAct agents can sometimes hallucinate tool calls or their inputs. Fix: use structured output (OpenAI Tools Agent) instead of text-parsed ReAct, and add input validation to tool functions.
  • Unpinned package versions — LangChain releases frequently; unpinned dependencies in production cause unexpected breaking changes. Pin versions and always use a lockfile.
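The loop-guard fix for agent infinite loops can be sketched as follows. This is illustrative; in AgentExecutor you would set max_iterations, and in LangGraph you would track an iteration counter in state and route to END from a conditional edge:

```python
def run_with_iteration_cap(step, state, max_iterations=10):
    """Sketch of the agent loop guard: `step` advances the agent one
    tool call and returns (new_state, done); the loop refuses to run
    more than max_iterations steps."""
    for _ in range(max_iterations):
        state, done = step(state)
        if done:
            return state
    raise RuntimeError("agent exceeded max_iterations without an answer")
```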
How do you prevent LangChain agents from running indefinitely?
Why is using invoke() instead of ainvoke() a pitfall in FastAPI-based LangChain servers?