AI / LangGraph LangChain Interview questions II
LangChain provides several memory classes that differ in how they store and compress conversation history. Choosing the right one involves balancing context quality, token cost, and retrieval precision.
| Memory Type | How It Works | Best For |
|---|---|---|
| ConversationBufferMemory | Stores every message verbatim | Short conversations where full context matters |
| ConversationBufferWindowMemory | Keeps only the last k messages | Long conversations; avoids context overflow |
| ConversationSummaryMemory | Uses an LLM to summarise older messages | Very long sessions; quality over token savings |
| ConversationSummaryBufferMemory | Summarises messages beyond a token limit; keeps recent messages verbatim | Balance between detail and cost |
| VectorStoreRetrieverMemory | Stores messages as embeddings; retrieves semantically relevant past context | Long-running assistants that need to recall specific facts |
| ConversationEntityMemory | Extracts and tracks named entities (people, places, concepts) from conversation | Personal assistants that must remember facts about people/topics |
In LCEL-based applications, the memory pattern has shifted from these classes towards explicitly managing a messages list in chain state (with RunnableWithMessageHistory for automatic persistence per session ID), or using LangGraph's checkpointing for full state persistence.
The recommended modern approach for conversation memory uses RunnableWithMessageHistory which wraps an LCEL chain and automatically loads and saves message history per session ID from a configurable store — without any manual history tracking in application code.
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
# In-memory store (swap for Redis, DynamoDB etc. in production)
store = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
if session_id not in store:
store[session_id] = ChatMessageHistory()
return store[session_id]
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
MessagesPlaceholder("history"),
("human", "{input}"),
])
chain = prompt | ChatOpenAI()
chain_with_memory = RunnableWithMessageHistory(
chain,
get_session_history,
input_messages_key="input",
history_messages_key="history",
)
# session_id identifies the conversation thread
chain_with_memory.invoke(
{"input": "My name is Alice."},
config={"configurable": {"session_id": "alice-123"}},
)
chain_with_memory.invoke(
{"input": "What is my name?"},
config={"configurable": {"session_id": "alice-123"}},
)
A vector store in LangChain stores text (documents, chunks) as high-dimensional embedding vectors so you can perform semantic similarity search — finding documents whose meaning is close to a query, even if the exact words don't match. Every vector store integrates an embedding model and a storage backend.
The standard workflow:
- Embed documents with an embedding model (
OpenAIEmbeddings,HuggingFaceEmbeddings, etc.) - Store the vectors in a vector database (FAISS, Chroma, Pinecone, Weaviate, PGVector)
- At query time, embed the query and retrieve the k nearest vectors
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# Create store from documents
vectorstore = FAISS.from_documents(docs, embeddings)
# Similarity search
results = vectorstore.similarity_search("How does LangChain work?", k=4)
# Use as a retriever in a chain
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})
Search types: similarity returns the k most similar documents; mmr (Maximal Marginal Relevance) balances similarity with diversity to avoid returning near-duplicate chunks. Most production vector stores (Pinecone, Weaviate, Qdrant) support metadata filtering so you can scope searches to a subset of documents.
A RAG (Retrieval-Augmented Generation) pipeline enriches LLM responses with external knowledge by retrieving relevant documents at query time and injecting them into the prompt. A complete LangChain RAG pipeline has five stages:
- Load — ingest source documents with a DocumentLoader
- Split — chunk documents with a TextSplitter for efficient retrieval
- Embed & Store — embed chunks and store in a vector store
- Retrieve — at query time, fetch the most relevant chunks
- Generate — inject retrieved context into the prompt and generate an answer
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# 1. Load
loader = WebBaseLoader("https://python.langchain.com/docs/get_started/introduction")
docs = loader.load()
# 2. Split
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# 3. Embed & Store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
# 4 & 5. Retrieve + Generate
retriever = vectorstore.as_retriever()
rag_prompt = hub.pull("rlm/rag-prompt")
rag_chain = (
{"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
"question": RunnablePassthrough()}
| rag_prompt
| ChatOpenAI()
| StrOutputParser()
)
answer = rag_chain.invoke("What is LangChain?")
Document loaders ingest content from various sources and return a list of Document objects (each containing page_content and metadata). Text splitters then divide those documents into smaller chunks suitable for embedding and retrieval.
Common document loaders:
PyPDFLoader— extracts text from PDF files, one page per DocumentWebBaseLoader— scrapes a web page, returns its text contentCSVLoader— each row becomes a DocumentDirectoryLoader— recursively loads all files in a directoryUnstructuredFileLoader— handles Word, PowerPoint, HTML, email, and moreGitHubLoader— loads files from a GitHub repository
Common text splitters:
RecursiveCharacterTextSplitter— splits on paragraphs, then sentences, then words until chunks fit the target size. Most commonly used.CharacterTextSplitter— splits on a single character separatorTokenTextSplitter— splits by token count, precise for context window budgetingMarkdownHeaderTextSplitter— splits Markdown by header sections, preserving structure
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # max chars per chunk
chunk_overlap=200, # overlap to preserve context at boundaries
length_function=len,
)
chunks = splitter.split_documents(documents)
A Retriever in LangChain is a Runnable that takes a string query and returns a list of Document objects. It is the standard abstraction that decouples the RAG chain from the specific search mechanism — you can swap a vector store retriever for a keyword search retriever or a hybrid retriever without changing the chain.
Types of retrievers available in LangChain:
- VectorStoreRetriever — most common; wraps a vector store and performs similarity (or MMR) search. Created via
vectorstore.as_retriever() - MultiQueryRetriever — uses an LLM to generate multiple query variants, retrieves for each, deduplicates results
- ContextualCompressionRetriever — post-processes retrieved documents to extract only the relevant sentences, reducing noise injected into the prompt
- SelfQueryRetriever — parses natural language queries to extract both a semantic search string and metadata filters (e.g. 'articles from 2024 about Python')
- ParentDocumentRetriever — retrieves small chunks for precision but returns their larger parent documents for fuller context
- EnsembleRetriever — combines results from multiple retrievers (e.g. BM25 keyword + vector) using reciprocal rank fusion
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
bm25 = BM25Retriever.from_documents(docs, k=4)
vector = vectorstore.as_retriever(search_kwargs={"k": 4})
hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.5, 0.5])
results = hybrid.invoke("How does LangChain memory work?")
Multi-query retrieval addresses a key weakness of single-vector search: a user's question may be phrased in a way that doesn't closely match how the relevant information is worded in the document store. MultiQueryRetriever solves this by using an LLM to automatically generate several alternative phrasings of the query, running each against the vector store, and deduplicating the union of all results.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)
retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=llm,
)
# For a query like "What is LangChain memory?"
# The LLM might generate:
# 1. "How does LangChain handle conversation state?"
# 2. "What memory classes are available in LangChain?"
# 3. "How do you persist context between LangChain calls?"
# Then retrieves for all 3 and deduplicates
results = retriever.invoke("What is LangChain memory?")
Multi-query retrieval improves recall — you're less likely to miss relevant documents due to vocabulary mismatch — but it increases latency and cost since it makes multiple LLM calls (for query generation) and multiple vector search calls per user query. It works best for knowledge bases with varied terminology or when users ask high-level questions that could be answered by multiple document sections.
The parent document retrieval pattern addresses a fundamental tension in RAG systems: small chunks improve retrieval precision (the embedding closely matches the query), but large chunks provide richer context for the LLM to answer from. ParentDocumentRetriever resolves this by indexing small child chunks for search but returning their larger parent documents (or the full original documents) to the LLM.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Child splitter: small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# Parent splitter: larger chunks returned to LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore() # stores parent documents
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=docstore,
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
# Query retrieves small child chunks but returns their 2000-char parents
results = retriever.invoke("What is LCEL?")
This pattern significantly improves answer quality for knowledge-intensive tasks because the LLM receives enough surrounding context to reason about the answer, while the vector search remains precise.
Moving a LangChain application from prototype to production requires addressing reliability, scalability, observability, and cost. The key patterns are:
- LangServe + Docker — wrap chains as FastAPI endpoints with
add_routes(), containerise with Docker, deploy to a managed container service (AWS ECS, GCP Cloud Run, Kubernetes). Expose via an API gateway with rate limiting. - Async endpoints — use
ainvoke()/astream()with FastAPI async routes (async def) to handle concurrent requests without blocking worker threads. Pair withuvicorn --workers Nor Gunicorn. - Response caching — use
InMemoryCachefor same-process caching orSQLiteCache/ Redis-backed cache for multi-process. Cache key is the full prompt + model parameters, so identical requests skip the LLM call entirely. - Observability — enable LangSmith tracing with
LANGCHAIN_TRACING_V2=true. Set up alerts on p95 latency and error rate. Track token usage per request to control costs. - Resilience — apply
.with_retry()for transient API errors and.with_fallbacks([cheaper_model])for budget management under load. - Secrets management — never hardcode API keys; use environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault).
LangChain supports LLM response caching at the global level, so any chain that calls an LLM automatically benefits from cache hits without modifying individual chains. The cache key is the serialised prompt plus model parameters — if the same prompt is sent twice, the second call returns the cached response without hitting the API.
In-memory cache — fastest, lost on process restart, suitable for development and single-request deduplication:
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
set_llm_cache(InMemoryCache())
SQLite cache — persists across restarts, suitable for single-process production servers or CLIs:
from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))
Semantic cache — uses embedding similarity to serve cached responses for queries that are semantically equivalent but not character-identical:
from langchain_community.cache import GPTCache
# Or use RedisSemanticCache:
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings
set_llm_cache(RedisSemanticCache(
redis_url="redis://localhost:6379",
embedding=OpenAIEmbeddings(),
score_threshold=0.1,
))
Caching is most effective for knowledge base Q&A where many users ask similar questions, and for evaluation pipelines where the same prompts are run repeatedly.
LLM API costs are primarily driven by token usage. LangChain applications can apply several techniques at different layers to reduce costs without significantly degrading quality:
- Response caching — the single highest-impact technique for repetitive queries. InMemoryCache or RedisSemanticCache returns stored responses for identical or semantically similar prompts, paying zero tokens for cache hits.
- Model tiering — use cheaper models (GPT-4o-mini, Claude Haiku) for simple classification, routing, and extraction tasks; reserve expensive models (GPT-4o, Claude Sonnet) for complex reasoning. Implement this with
RunnableBranchrouting. - Prompt compression — use
LLMLingua(vialangchain-community) to compress retrieved context by removing low-information tokens before injecting into the prompt. - Token counting before calling — use
llm.get_num_tokens(text)to check prompt size; truncate or summarise if it exceeds a budget:
llm = ChatOpenAI()
token_count = llm.get_num_tokens(prompt_text)
if token_count > 3000:
# Summarise or truncate before proceeding
...
- Streaming — stream responses to clients early; use
max_tokensto cap output length for use cases where truncation is acceptable. - Batch processing — use
.batch()with appropriate concurrency for offline workloads to maximise throughput per dollar. - Avoid over-engineering with agents — a simple RAG chain is 10-50x cheaper per query than a multi-step agent. Only use agents when the task genuinely requires dynamic decision-making.
LangChain applications interact with LLMs, external tools, and user-supplied data, creating several attack surfaces that require explicit mitigation:
- Prompt injection prevention — the most critical LLM-specific risk. Malicious users craft inputs that override system instructions (e.g. 'Ignore all previous instructions and...'). Mitigate with input sanitisation, structural separation of user input from system context, and output validation that rejects responses that claim to override system behaviour.
- Secrets management — never hardcode API keys in source code or commit them to version control. Use environment variables,
.envfiles (excluded from git), or a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager). - Tool permission minimisation — agents with write access to databases, file systems, or APIs can cause significant damage if manipulated via prompt injection. Grant tools the minimum permissions required: read-only where possible, scoped API tokens.
- Human-in-the-loop for irreversible actions — use LangGraph's
interrupt_beforeto pause before any tool that modifies data, deletes files, or sends emails, requiring human approval. - Output filtering — validate and filter LLM outputs for PII, harmful content, or off-topic responses before returning to users. Libraries like
Guardrails AIorNeMo Guardrailsintegrate with LangChain. - Rate limiting on LangServe endpoints — prevent abuse and runaway costs from unauthenticated requests using API gateway rate limiting or middleware.
Testing LangChain applications requires strategies for both unit testing individual components without real LLM calls, and end-to-end evaluation of response quality.
Unit testing with fake LLMs — use FakeListLLM or FakeListChatModel to return predetermined responses so tests run fast and deterministically without API calls:
from langchain_community.llms.fake import FakeListLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
fake_llm = FakeListLLM(responses=["Paris", "Berlin", "Tokyo"])
chain = ChatPromptTemplate.from_template("{q}") | fake_llm | StrOutputParser()
def test_capital_chain():
result = chain.invoke({"q": "Capital of France?"})
assert result == "Paris"
LangSmith evaluations — create a dataset of input/expected-output pairs in LangSmith and run evaluations using built-in evaluators (qa, criteria, labeled_score_string) or custom LLM-as-judge evaluators:
from langsmith.evaluation import evaluate
results = evaluate(
my_chain.invoke,
data="my-golden-dataset",
evaluators=["qa"],
experiment_prefix="rag-v2-test",
)
For integration tests, use pytest with responses or httpx mocks to simulate LLM API responses. Always test that your chain handles empty outputs, malformed JSON from the LLM, and retriever returning zero documents.
Monitoring LangChain applications in production means tracking latency, error rates, token usage, and response quality over time. LangSmith is the primary tool, but you can also integrate with standard observability infrastructure.
LangSmith tracing — enabled with two env vars, it captures every run automatically with full context:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=production-chat-v2
In LangSmith you get: latency distribution per chain step, error rate trends, token cost per request, feedback scores from users or evaluators, and the ability to filter/search runs by any metadata tag you add.
Custom metadata tagging — tag runs with user ID, feature flag, model version, etc. to enable filtering in LangSmith dashboards:
chain.invoke(
{"input": user_query},
config={
"metadata": {"user_id": user_id, "ab_group": "control"},
"tags": ["production", "rag-v2"],
}
)
Custom callbacks for metrics — implement a callback handler that pushes latency, token counts, and error flags to your existing metrics backend (Prometheus, Datadog, CloudWatch) on each LLM call end:
class MetricsCallback(BaseCallbackHandler):
def on_llm_end(self, response, **kwargs):
tokens = response.llm_output.get("token_usage", {})
prometheus_counter.inc(tokens.get("total_tokens", 0))
Developers new to LangChain and LangGraph frequently encounter the same set of issues. Knowing them in advance saves significant debugging time:
- Context window overflow — injecting the full conversation history into every prompt causes failures on long conversations. Fix: use
ConversationBufferWindowMemory, summarisation memory, or LangGraph's message trimming. - Agent infinite loops — an agent can keep calling tools indefinitely if it never reaches a satisfying answer. Fix: always set
max_iterationsin AgentExecutor or add a loop count check in LangGraph conditional edges. - Prompt injection from user inputs — if raw user text is inserted into system-level prompts, attackers can override your instructions. Fix: sanitise inputs, use structured message roles, never directly concatenate user text into the system message.
- Over-engineering with agents — using a 5-step agent for a task that a single RAG call handles. Agents are slower, more expensive, and less predictable. Fix: start with the simplest approach and only add agent complexity when necessary.
- Ignoring async in high-concurrency servers — using
invoke()instead ofainvoke()in FastAPI handlers blocks the event loop and degrades performance under load. - Hallucinated tool calls — ReAct agents can sometimes hallucinate tool calls or their inputs. Fix: use structured output (OpenAI Tools Agent) instead of text-parsed ReAct, and add input validation to tool functions.
- Pinning package versions — LangChain releases frequently; unpinned dependencies in production cause unexpected breaking changes. Always use a lockfile.
