AI / LangGraph LangChain Interview questions II

1. What are the different memory types in LangChain? 2. How do you implement conversation memory? 3. How do vector stores work in LangChain? 4. How do you build RAG pipelines with LangChain? 5. What are document loaders and splitters? 6. How do retrievers work in LangChain? 7. What is multi-query retrieval? 8. What are parent document retrieval patterns? 9. What are production deployment patterns for LangChain? 10. How do you implement caching in LangChain? 11. What are cost optimization techniques for LangChain? 12. What security best practices should you follow for LangChain applications? 13. How do you test LangChain applications? 14. How do you monitor LangChain applications? 15. What are common pitfalls in LangChain/LangGraph development?

Could not find what you were looking for? send us the question and we would be happy to answer your question.

1. What are the different memory types in LangChain?

LangChain provides several memory classes that differ in how they store and compress conversation history. Choosing the right one involves balancing context quality, token cost, and retrieval precision.

Memory Type	How It Works	Best For
ConversationBufferMemory	Stores every message verbatim	Short conversations where full context matters
ConversationBufferWindowMemory	Keeps only the last k messages	Long conversations; avoids context overflow
ConversationSummaryMemory	Uses an LLM to summarise older messages	Very long sessions; quality over token savings
ConversationSummaryBufferMemory	Summarises messages beyond a token limit; keeps recent messages verbatim	Balance between detail and cost
VectorStoreRetrieverMemory	Stores messages as embeddings; retrieves semantically relevant past context	Long-running assistants that need to recall specific facts
ConversationEntityMemory	Extracts and tracks named entities (people, places, concepts) from conversation	Personal assistants that must remember facts about people/topics

In LCEL-based applications, the memory pattern has shifted from these classes towards explicitly managing a messages list in chain state (with RunnableWithMessageHistory for automatic persistence per session ID), or using LangGraph's checkpointing for full state persistence.

Take quiz

Which memory type keeps only the last k conversation messages to avoid context overflow? ConversationBufferMemory

✗ Try again.

ConversationSummaryMemory

✗ Try again.

ConversationBufferWindowMemory

✓ Correct! Well done.

VectorStoreRetrieverMemory

✗ Try again.

When would VectorStoreRetrieverMemory be more useful than ConversationBufferMemory? For very short conversations with fewer than 5 turns

✗ Try again.

For long-running assistants that need to recall specific facts from distant past messages

✓ Correct! Well done.

When you want to reduce LLM API costs per message

✗ Try again.

When the conversation involves code generation

✗ Try again.

2. How do you implement conversation memory?

The recommended modern approach for conversation memory uses RunnableWithMessageHistory which wraps an LCEL chain and automatically loads and saves message history per session ID from a configurable store — without any manual history tracking in application code.

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# In-memory store (swap for Redis, DynamoDB etc. in production)
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder("history"),
    ("human", "{input}"),
])

chain = prompt | ChatOpenAI()

chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# session_id identifies the conversation thread
chain_with_memory.invoke(
    {"input": "My name is Alice."},
    config={"configurable": {"session_id": "alice-123"}},
)
chain_with_memory.invoke(
    {"input": "What is my name?"},
    config={"configurable": {"session_id": "alice-123"}},
)

Take quiz

What does the get_session_history function provide to RunnableWithMessageHistory? The LLM model to use for the session

✗ Try again.

A ChatMessageHistory object keyed by session ID to load and save conversation history

✓ Correct! Well done.

A rate limiter to prevent too many messages per session

✗ Try again.

The system prompt to use for the session

✗ Try again.

In a ChatPromptTemplate with conversation memory, what is the purpose of MessagesPlaceholder? It adds a mandatory user turn to every conversation

✗ Try again.

It holds the slot where the loaded chat history is injected into the prompt

✓ Correct! Well done.

It formats the model's response as structured messages

✗ Try again.

It limits the number of messages sent to the model

✗ Try again.

3. How do vector stores work in LangChain?

A vector store in LangChain stores text (documents, chunks) as high-dimensional embedding vectors so you can perform semantic similarity search — finding documents whose meaning is close to a query, even if the exact words don't match. Every vector store integrates an embedding model and a storage backend.

The standard workflow:

Embed documents with an embedding model (OpenAIEmbeddings, HuggingFaceEmbeddings, etc.)
Store the vectors in a vector database (FAISS, Chroma, Pinecone, Weaviate, PGVector)
At query time, embed the query and retrieve the k nearest vectors

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Create store from documents
vectorstore = FAISS.from_documents(docs, embeddings)

# Similarity search
results = vectorstore.similarity_search("How does LangChain work?", k=4)

# Use as a retriever in a chain
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})

Search types: similarity returns the k most similar documents; mmr (Maximal Marginal Relevance) balances similarity with diversity to avoid returning near-duplicate chunks. Most production vector stores (Pinecone, Weaviate, Qdrant) support metadata filtering so you can scope searches to a subset of documents.

Take quiz

What does MMR (Maximal Marginal Relevance) retrieval prioritise over simple similarity search? Retrieval speed over result quality

✗ Try again.

Diversity of results in addition to similarity, reducing near-duplicate chunks

✓ Correct! Well done.

Exact keyword matching over semantic similarity

✗ Try again.

Cost reduction by fetching fewer embeddings

✗ Try again.

What does vectorstore.as_retriever() return? A list of documents from the vector store

✗ Try again.

A Retriever Runnable that can be used directly in an LCEL chain

✓ Correct! Well done.

An embedding model configured for the vector store

✗ Try again.

A cursor object for iterating through all stored documents

✗ Try again.

4. How do you build RAG pipelines with LangChain?

A RAG (Retrieval-Augmented Generation) pipeline enriches LLM responses with external knowledge by retrieving relevant documents at query time and injecting them into the prompt. A complete LangChain RAG pipeline has five stages:

Load — ingest source documents with a DocumentLoader
Split — chunk documents with a TextSplitter for efficient retrieval
Embed & Store — embed chunks and store in a vector store
Retrieve — at query time, fetch the most relevant chunks
Generate — inject retrieved context into the prompt and generate an answer

from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Load
loader = WebBaseLoader("https://python.langchain.com/docs/get_started/introduction")
docs = loader.load()

# 2. Split
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3. Embed & Store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 4 & 5. Retrieve + Generate
retriever = vectorstore.as_retriever()
rag_prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI()
    | StrOutputParser()
)

answer = rag_chain.invoke("What is LangChain?")

Take quiz

What is the purpose of chunk_overlap in RecursiveCharacterTextSplitter? To duplicate documents for improved retrieval recall

✗ Try again.

To ensure context is not lost at chunk boundaries by repeating a few hundred characters

✓ Correct! Well done.

To limit the maximum chunk size to fit the LLM context window

✗ Try again.

To reduce the number of embeddings computed

✗ Try again.

In a RAG LCEL chain, why is RunnablePassthrough used for the 'question' key? To preprocess the question before retrieval

✗ Try again.

To forward the original user question unchanged to the prompt while the retriever fetches documents

✓ Correct! Well done.

To cache the question for subsequent requests

✗ Try again.

To validate that the question is not empty

✗ Try again.

5. What are document loaders and splitters?

Document loaders ingest content from various sources and return a list of Document objects (each containing page_content and metadata). Text splitters then divide those documents into smaller chunks suitable for embedding and retrieval.

Common document loaders:

PyPDFLoader — extracts text from PDF files, one page per Document
WebBaseLoader — scrapes a web page, returns its text content
CSVLoader — each row becomes a Document
DirectoryLoader — recursively loads all files in a directory
UnstructuredFileLoader — handles Word, PowerPoint, HTML, email, and more
GitHubLoader — loads files from a GitHub repository

Common text splitters:

RecursiveCharacterTextSplitter — splits on paragraphs, then sentences, then words until chunks fit the target size. Most commonly used.
CharacterTextSplitter — splits on a single character separator
TokenTextSplitter — splits by token count, precise for context window budgeting
MarkdownHeaderTextSplitter — splits Markdown by header sections, preserving structure

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max chars per chunk
    chunk_overlap=200,  # overlap to preserve context at boundaries
    length_function=len,
)
chunks = splitter.split_documents(documents)

Take quiz

Why is RecursiveCharacterTextSplitter preferred over CharacterTextSplitter? It is faster at processing large documents

✗ Try again.

It tries multiple separators (paragraphs, sentences, words) to find natural split points

✓ Correct! Well done.

It counts tokens instead of characters for more accurate splitting

✗ Try again.

It preserves Markdown headers in the split output

✗ Try again.

What does the chunk_overlap parameter in a text splitter do? Limits how many chunks can be stored in the vector store

✗ Try again.

Repeats a portion of the previous chunk at the start of the next to preserve boundary context

✓ Correct! Well done.

Sets the number of overlapping words between search results

✗ Try again.

Controls the maximum number of documents loaded simultaneously

✗ Try again.

6. How do retrievers work in LangChain?

A Retriever in LangChain is a Runnable that takes a string query and returns a list of Document objects. It is the standard abstraction that decouples the RAG chain from the specific search mechanism — you can swap a vector store retriever for a keyword search retriever or a hybrid retriever without changing the chain.

Types of retrievers available in LangChain:

VectorStoreRetriever — most common; wraps a vector store and performs similarity (or MMR) search. Created via vectorstore.as_retriever()
MultiQueryRetriever — uses an LLM to generate multiple query variants, retrieves for each, deduplicates results
ContextualCompressionRetriever — post-processes retrieved documents to extract only the relevant sentences, reducing noise injected into the prompt
SelfQueryRetriever — parses natural language queries to extract both a semantic search string and metadata filters (e.g. 'articles from 2024 about Python')
ParentDocumentRetriever — retrieves small chunks for precision but returns their larger parent documents for fuller context
EnsembleRetriever — combines results from multiple retrievers (e.g. BM25 keyword + vector) using reciprocal rank fusion

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(docs, k=4)
vector = vectorstore.as_retriever(search_kwargs={"k": 4})

hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.5, 0.5])
results = hybrid.invoke("How does LangChain memory work?")

Take quiz

What does ContextualCompressionRetriever do to retrieved documents? Compresses the documents to reduce storage size

✗ Try again.

Extracts only the most relevant sentences from retrieved chunks to reduce prompt noise

✓ Correct! Well done.

Combines multiple chunks into a single large document

✗ Try again.

Re-ranks documents using a cross-encoder model

✗ Try again.

What does EnsembleRetriever combine, and why is this useful? Multiple LLM models, for better response quality

✗ Try again.

Multiple retrieval strategies (e.g. keyword BM25 + vector) to improve recall over either alone

✓ Correct! Well done.

Multiple vector stores with different embedding models

✗ Try again.

Multiple chat histories to produce a consensus answer

✗ Try again.

7. What is multi-query retrieval?

Multi-query retrieval addresses a key weakness of single-vector search: a user's question may be phrased in a way that doesn't closely match how the relevant information is worded in the document store. MultiQueryRetriever solves this by using an LLM to automatically generate several alternative phrasings of the query, running each against the vector store, and deduplicating the union of all results.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

# For a query like "What is LangChain memory?"
# The LLM might generate:
#   1. "How does LangChain handle conversation state?"
#   2. "What memory classes are available in LangChain?"
#   3. "How do you persist context between LangChain calls?"
# Then retrieves for all 3 and deduplicates
results = retriever.invoke("What is LangChain memory?")

Multi-query retrieval improves recall — you're less likely to miss relevant documents due to vocabulary mismatch — but it increases latency and cost since it makes multiple LLM calls (for query generation) and multiple vector search calls per user query. It works best for knowledge bases with varied terminology or when users ask high-level questions that could be answered by multiple document sections.

Take quiz

What problem does MultiQueryRetriever primarily solve? Slow vector search in large databases

✗ Try again.

Vocabulary mismatch between user query phrasing and document wording

✓ Correct! Well done.

Duplicate documents stored in the vector store

✗ Try again.

LLM hallucinations in the generated answer

✗ Try again.

What is the main trade-off of using MultiQueryRetriever? It requires a larger embedding model

✗ Try again.

It increases latency and cost because it generates multiple LLM calls per query

✓ Correct! Well done.

It only works with OpenAI models

✗ Try again.

It reduces recall compared to single-query retrieval

✗ Try again.

8. What are parent document retrieval patterns?

The parent document retrieval pattern addresses a fundamental tension in RAG systems: small chunks improve retrieval precision (the embedding closely matches the query), but large chunks provide richer context for the LLM to answer from. ParentDocumentRetriever resolves this by indexing small child chunks for search but returning their larger parent documents (or the full original documents) to the LLM.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Child splitter: small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Parent splitter: larger chunks returned to LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # stores parent documents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)

# Query retrieves small child chunks but returns their 2000-char parents
results = retriever.invoke("What is LCEL?")

This pattern significantly improves answer quality for knowledge-intensive tasks because the LLM receives enough surrounding context to reason about the answer, while the vector search remains precise.

Take quiz

In ParentDocumentRetriever, what are the child chunks used for? They are the documents returned to the LLM for context

✗ Try again.

They are embedded and stored for precise vector search

✓ Correct! Well done.

They are used to create the parent chunks by merging

✗ Try again.

They store metadata about the original documents

✗ Try again.

What problem does ParentDocumentRetriever solve compared to using uniform-size chunks? It reduces the cost of embedding the document collection

✗ Try again.

It provides precise retrieval via small chunks while giving the LLM larger context via parent chunks

✓ Correct! Well done.

It prevents duplicate documents from being indexed

✗ Try again.

It enables real-time document updates without re-indexing

✗ Try again.

9. What are production deployment patterns for LangChain?

Moving a LangChain application from prototype to production requires addressing reliability, scalability, observability, and cost. The key patterns are:

LangServe + Docker — wrap chains as FastAPI endpoints with add_routes(), containerise with Docker, deploy to a managed container service (AWS ECS, GCP Cloud Run, Kubernetes). Expose via an API gateway with rate limiting.
Async endpoints — use ainvoke() / astream() with FastAPI async routes (async def) to handle concurrent requests without blocking worker threads. Pair with uvicorn --workers N or Gunicorn.
Response caching — use InMemoryCache for same-process caching or SQLiteCache / Redis-backed cache for multi-process. Cache key is the full prompt + model parameters, so identical requests skip the LLM call entirely.
Observability — enable LangSmith tracing with LANGCHAIN_TRACING_V2=true. Set up alerts on p95 latency and error rate. Track token usage per request to control costs.
Resilience — apply .with_retry() for transient API errors and .with_fallbacks([cheaper_model]) for budget management under load.
Secrets management — never hardcode API keys; use environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault).

Take quiz

What LangChain library wraps a chain as a FastAPI REST API with /invoke and /stream endpoints? LangSmith

✗ Try again.

LangGraph Cloud

✗ Try again.

LangServe

✓ Correct! Well done.

LangHub

✗ Try again.

Why should LangChain chains use async (ainvoke/astream) in production FastAPI deployments? Async methods are required for chains with more than 3 steps

✗ Try again.

Async allows handling multiple concurrent requests without blocking worker threads

✓ Correct! Well done.

Async methods automatically retry failed requests

✗ Try again.

Async is needed to enable LangSmith tracing

✗ Try again.

10. How do you implement caching in LangChain?

LangChain supports LLM response caching at the global level, so any chain that calls an LLM automatically benefits from cache hits without modifying individual chains. The cache key is the serialised prompt plus model parameters — if the same prompt is sent twice, the second call returns the cached response without hitting the API.

In-memory cache — fastest, lost on process restart, suitable for development and single-request deduplication:

from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

set_llm_cache(InMemoryCache())

SQLite cache — persists across restarts, suitable for single-process production servers or CLIs:

from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

Semantic cache — uses embedding similarity to serve cached responses for queries that are semantically equivalent but not character-identical:

from langchain_community.cache import GPTCache
# Or use RedisSemanticCache:
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings

set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.1,
))

Caching is most effective for knowledge base Q&A where many users ask similar questions, and for evaluation pipelines where the same prompts are run repeatedly.

Take quiz

What is the cache key used to identify a cached LLM response in LangChain? The user's session ID

✗ Try again.

The serialised prompt plus the model parameters

✓ Correct! Well done.

A hash of the LLM response content

✗ Try again.

The model name and temperature setting only

✗ Try again.

What makes RedisSemanticCache different from SQLiteCache? RedisSemanticCache stores cache entries in JSON format

✗ Try again.

RedisSemanticCache serves hits for semantically similar (not just identical) prompts

✓ Correct! Well done.

RedisSemanticCache only caches responses under 100 tokens

✗ Try again.

RedisSemanticCache requires no configuration beyond the Redis URL

✗ Try again.

11. What are cost optimization techniques for LangChain?

LLM API costs are primarily driven by token usage. LangChain applications can apply several techniques at different layers to reduce costs without significantly degrading quality:

Response caching — the single highest-impact technique for repetitive queries. InMemoryCache or RedisSemanticCache returns stored responses for identical or semantically similar prompts, paying zero tokens for cache hits.
Model tiering — use cheaper models (GPT-4o-mini, Claude Haiku) for simple classification, routing, and extraction tasks; reserve expensive models (GPT-4o, Claude Sonnet) for complex reasoning. Implement this with RunnableBranch routing.
Prompt compression — use LLMLingua (via langchain-community) to compress retrieved context by removing low-information tokens before injecting into the prompt.
Token counting before calling — use llm.get_num_tokens(text) to check prompt size; truncate or summarise if it exceeds a budget:

llm = ChatOpenAI()
token_count = llm.get_num_tokens(prompt_text)
if token_count > 3000:
    # Summarise or truncate before proceeding
    ...

Streaming — stream responses to clients early; use max_tokens to cap output length for use cases where truncation is acceptable.
Batch processing — use .batch() with appropriate concurrency for offline workloads to maximise throughput per dollar.
Avoid over-engineering with agents — a simple RAG chain is 10-50x cheaper per query than a multi-step agent. Only use agents when the task genuinely requires dynamic decision-making.

Take quiz

Which cost optimisation technique has the highest impact for applications with repetitive queries? Using smaller chunk sizes in the text splitter

✗ Try again.

Response caching (InMemoryCache or RedisSemanticCache)

✓ Correct! Well done.

Reducing the number of retrieved documents to k=1

✗ Try again.

Disabling LangSmith tracing in production

✗ Try again.

What does model tiering mean in the context of LangChain cost optimisation? Running multiple models in parallel and selecting the best response

✗ Try again.

Routing simple tasks to cheaper models and complex reasoning to expensive models

✓ Correct! Well done.

Downgrading the model version after a cost threshold is reached

✗ Try again.

Using different embedding models for different document types

✗ Try again.

12. What security best practices should you follow for LangChain applications?

LangChain applications interact with LLMs, external tools, and user-supplied data, creating several attack surfaces that require explicit mitigation:

Prompt injection prevention — the most critical LLM-specific risk. Malicious users craft inputs that override system instructions (e.g. 'Ignore all previous instructions and...'). Mitigate with input sanitisation, structural separation of user input from system context, and output validation that rejects responses that claim to override system behaviour.
Secrets management — never hardcode API keys in source code or commit them to version control. Use environment variables, .env files (excluded from git), or a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager).
Tool permission minimisation — agents with write access to databases, file systems, or APIs can cause significant damage if manipulated via prompt injection. Grant tools the minimum permissions required: read-only where possible, scoped API tokens.
Human-in-the-loop for irreversible actions — use LangGraph's interrupt_before to pause before any tool that modifies data, deletes files, or sends emails, requiring human approval.
Output filtering — validate and filter LLM outputs for PII, harmful content, or off-topic responses before returning to users. Libraries like Guardrails AI or NeMo Guardrails integrate with LangChain.
Rate limiting on LangServe endpoints — prevent abuse and runaway costs from unauthenticated requests using API gateway rate limiting or middleware.

Take quiz

What is prompt injection in the context of LangChain applications? Injecting extra tokens to improve LLM response quality

✗ Try again.

A user crafting input that overrides the system prompt instructions

✓ Correct! Well done.

Injecting cached responses into the LLM context

✗ Try again.

Adding few-shot examples dynamically to the prompt

✗ Try again.

What is the recommended approach for agents that perform irreversible actions (delete, send email)? Set max_iterations=1 to limit the agent to one action

✗ Try again.

Use LangGraph interrupt_before to pause for human approval before the action executes

✓ Correct! Well done.

Only allow read-only tool access in all agents

✗ Try again.

Log all agent actions to LangSmith for post-hoc review

✗ Try again.

13. How do you test LangChain applications?

Testing LangChain applications requires strategies for both unit testing individual components without real LLM calls, and end-to-end evaluation of response quality.

Unit testing with fake LLMs — use FakeListLLM or FakeListChatModel to return predetermined responses so tests run fast and deterministically without API calls:

from langchain_community.llms.fake import FakeListLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

fake_llm = FakeListLLM(responses=["Paris", "Berlin", "Tokyo"])
chain = ChatPromptTemplate.from_template("{q}") | fake_llm | StrOutputParser()

def test_capital_chain():
    result = chain.invoke({"q": "Capital of France?"})
    assert result == "Paris"

LangSmith evaluations — create a dataset of input/expected-output pairs in LangSmith and run evaluations using built-in evaluators (qa, criteria, labeled_score_string) or custom LLM-as-judge evaluators:

from langsmith.evaluation import evaluate
results = evaluate(
    my_chain.invoke,
    data="my-golden-dataset",
    evaluators=["qa"],
    experiment_prefix="rag-v2-test",
)

For integration tests, use pytest with responses or httpx mocks to simulate LLM API responses. Always test that your chain handles empty outputs, malformed JSON from the LLM, and retriever returning zero documents.

Take quiz

What does FakeListLLM do in a LangChain unit test? It makes real LLM calls but reduces token usage

✗ Try again.

It returns a predetermined list of responses without making API calls

✓ Correct! Well done.

It generates random responses for fuzz testing

✗ Try again.

It records and replays real LLM responses

✗ Try again.

What is an LLM-as-judge evaluator in LangSmith? A human reviewer who scores LLM outputs manually

✗ Try again.

A second LLM called to assess the quality of the first LLM's response against criteria

✓ Correct! Well done.

An automated spell checker for LLM outputs

✗ Try again.

A metric that counts how many tokens the answer uses

✗ Try again.

14. How do you monitor LangChain applications?

Monitoring LangChain applications in production means tracking latency, error rates, token usage, and response quality over time. LangSmith is the primary tool, but you can also integrate with standard observability infrastructure.

LangSmith tracing — enabled with two env vars, it captures every run automatically with full context:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=production-chat-v2

In LangSmith you get: latency distribution per chain step, error rate trends, token cost per request, feedback scores from users or evaluators, and the ability to filter/search runs by any metadata tag you add.

Custom metadata tagging — tag runs with user ID, feature flag, model version, etc. to enable filtering in LangSmith dashboards:

chain.invoke(
    {"input": user_query},
    config={
        "metadata": {"user_id": user_id, "ab_group": "control"},
        "tags": ["production", "rag-v2"],
    }
)

Custom callbacks for metrics — implement a callback handler that pushes latency, token counts, and error flags to your existing metrics backend (Prometheus, Datadog, CloudWatch) on each LLM call end:

class MetricsCallback(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        tokens = response.llm_output.get("token_usage", {})
        prometheus_counter.inc(tokens.get("total_tokens", 0))

Take quiz

What two environment variables are required to enable LangSmith tracing? LANGCHAIN_DEBUG=true and LANGSMITH_URL

✗ Try again.

LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY

✓ Correct! Well done.

LANGCHAIN_LOG_LEVEL=trace and LANGCHAIN_PROJECT

✗ Try again.

OPENAI_TRACE=true and LANGCHAIN_PROJECT

✗ Try again.

What is the purpose of adding metadata tags to a chain invocation config in production? To change the LLM model used for that specific request

✗ Try again.

To enable filtering and grouping of runs in LangSmith dashboards by user, version, or experiment

✓ Correct! Well done.

To increase the priority of the request in LangSmith

✗ Try again.

To prevent the run from being traced in LangSmith

✗ Try again.

15. What are common pitfalls in LangChain/LangGraph development?

Developers new to LangChain and LangGraph frequently encounter the same set of issues. Knowing them in advance saves significant debugging time:

Context window overflow — injecting the full conversation history into every prompt causes failures on long conversations. Fix: use ConversationBufferWindowMemory, summarisation memory, or LangGraph's message trimming.
Agent infinite loops — an agent can keep calling tools indefinitely if it never reaches a satisfying answer. Fix: always set max_iterations in AgentExecutor or add a loop count check in LangGraph conditional edges.
Prompt injection from user inputs — if raw user text is inserted into system-level prompts, attackers can override your instructions. Fix: sanitise inputs, use structured message roles, never directly concatenate user text into the system message.
Over-engineering with agents — using a 5-step agent for a task that a single RAG call handles. Agents are slower, more expensive, and less predictable. Fix: start with the simplest approach and only add agent complexity when necessary.
Ignoring async in high-concurrency servers — using invoke() instead of ainvoke() in FastAPI handlers blocks the event loop and degrades performance under load.
Hallucinated tool calls — ReAct agents can sometimes hallucinate tool calls or their inputs. Fix: use structured output (OpenAI Tools Agent) instead of text-parsed ReAct, and add input validation to tool functions.
Pinning package versions — LangChain releases frequently; unpinned dependencies in production cause unexpected breaking changes. Always use a lockfile.

Take quiz

How do you prevent LangChain agents from running indefinitely? Set temperature=0 to make the agent more deterministic

✗ Try again.

Set max_iterations in AgentExecutor or add an iteration check in LangGraph conditional edges

✓ Correct! Well done.

Use a synchronous (non-async) executor

✗ Try again.

Limit the number of tools available to the agent

✗ Try again.

Why is using invoke() instead of ainvoke() a pitfall in FastAPI-based LangChain servers? invoke() does not support streaming output

✗ Try again.

invoke() blocks the async event loop, reducing concurrency and performance under load

✓ Correct! Well done.

invoke() is deprecated in LangChain 0.3+

✗ Try again.

invoke() does not support LangSmith tracing

✗ Try again.

Github copilot interview questions

	Interviews Questions Java Spring Hibernate Maven Testing API BigData Web DataStructures AI Database Integration Cloud Scala Python Tools Golang	About Javapedia.net Javapedia.net is for Java and J2EE developers, technologist and college students who prepare of interview. Also this site includes many practical examples. This site is developed using J2EE technologies by Steve Antony, a senior Developer/lead at one of the logistics based company.
	contact: javatutorials2016[at]gmail[dot]com
Kindly consider donating for maintaining this website. Thanks.
	Copyright © 2026, javapedia.net, all rights reserved. privacy policy.

AI / LangGraph LangChain Interview questions II

1. What are the different memory types in LangChain?

2. How do you implement conversation memory?

3. How do vector stores work in LangChain?

4. How do you build RAG pipelines with LangChain?

5. What are document loaders and splitters?

6. How do retrievers work in LangChain?

7. What is multi-query retrieval?

8. What are parent document retrieval patterns?

9. What are production deployment patterns for LangChain?

10. How do you implement caching in LangChain?

11. What are cost optimization techniques for LangChain?

12. What security best practices should you follow for LangChain applications?

13. How do you test LangChain applications?

14. How do you monitor LangChain applications?

15. What are common pitfalls in LangChain/LangGraph development?

Comments & Discussions

Recently added...