How do you evaluate LLM outputs for quality, factual accuracy, and hallucination?
Traditional NLP metrics like BLEU and ROUGE measure surface-level token overlap but correlate poorly with human quality judgments for open-ended generation. Modern LLM evaluation uses a combination of reference-based metrics, LLM-as-judge evaluation, and task-specific benchmarks.
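To make the overlap problem concrete, here is a minimal sketch of a ROUGE-1-style unigram F1 written from scratch (it is not the official ROUGE implementation, and the example sentences are made up for illustration). A factually wrong answer that copies the reference wording outscores a correct paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (assumed for this sketch, not taken from a benchmark).
reference = "guido van rossum created python"
paraphrase = "python was invented by its dutch author guido"  # correct, reworded
wrong = "guido van rossum created java"                       # wrong, same wording

print(f"paraphrase: {unigram_f1(paraphrase, reference):.2f}")
print(f"wrong:      {unigram_f1(wrong, reference):.2f}")
```

The wrong answer shares four of five tokens with the reference, while the correct paraphrase shares only two, which is why overlap metrics alone are a weak signal for factual accuracy.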
| Method | What it measures | Limitation |
|---|---|---|
| BLEU / ROUGE | N-gram overlap with reference text | Poor correlation with quality for open-ended generation |
| BERTScore | Semantic similarity using BERT embeddings | Misses factual accuracy |
| LLM-as-judge | GPT-4 / Claude rates responses for quality, accuracy, relevance | Bias toward verbose responses; expensive |
| Faithfulness (RAG) | Is every claim in the answer supported by retrieved context? | Requires context; slow to compute |
| Hallucination detection | NLI model checks whether the source entails or contradicts each claim | NLI models may themselves be wrong |
| Benchmark suites | MMLU, HumanEval, MT-Bench — standardised task batteries | May not reflect domain-specific needs |
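The faithfulness and hallucination-detection rows share one idea: split the answer into claims, check each claim against the source, and report the fraction that is supported. The sketch below shows that aggregation only. The names `split_claims`, `faithfulness_score`, and `toy_entails` are hypothetical, and `toy_entails` is a crude word-containment stand-in for a real NLI model or LLM check, used here so the example runs offline.

```python
import re
from typing import Callable

# (premise, hypothesis) -> True if the premise supports the hypothesis
EntailFn = Callable[[str, str], bool]


def split_claims(answer: str) -> list[str]:
    """Naive claim splitter: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def faithfulness_score(answer: str, context: str, entails: EntailFn) -> float:
    """Fraction of claims in the answer that the context supports."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = sum(entails(context, claim) for claim in claims)
    return supported / len(claims)


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an NLI model: every word of the claim
    must appear in the premise. A real checker would be a trained model."""
    return _words(hypothesis) <= _words(premise)


context = "Guido van Rossum created Python in 1991."
answer = "Guido van Rossum created Python. Python was released in 1989."

score = faithfulness_score(answer, context, toy_entails)
print(score)  # 0.5: the first claim is supported, the second is not
```

In practice the `entails` argument would wrap an NLI classifier or a judge model; keeping it as a parameter makes it easy to swap checkers and to compare their error rates.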
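# ── RAGAS: RAG evaluation framework
# pip install ragas datasets
# Note: the column names below follow the older ragas schema; newer releases
# may expect different names, so check the docs for your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Prepare evaluation data: one entry per question, with the generated answer,
# the retrieved context passages, and a reference answer.
# 'ground_truth' is only needed by reference-based metrics such as context_recall.
eval_data = {
    'question': ['What is RAG?', 'Who created Python?'],
    'answer': ['RAG is retrieval augmented generation.',
               'Python was created by Guido van Rossum.'],
    'contexts': [['RAG combines retrieval with generation...'],
                 ['Guido van Rossum created Python in 1991...']],
    'ground_truth': ['RAG stands for Retrieval Augmented Generation.',
                     'Guido van Rossum invented Python.'],
}
dataset = Dataset.from_dict(eval_data)

# Both metrics call an LLM, so model credentials (typically OPENAI_API_KEY) must be configured.
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results)  # e.g. {'faithfulness': 0.95, 'answer_relevancy': 0.91} (illustrative; actual scores vary)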
# ── LLM-as-judge (simple implementation)
# pip install openai
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Doubled braces become literal braces after str.format().
JUDGE_PROMPT = '''Rate the following answer for factual accuracy on a scale of 1-5.
Question: {question}
Answer: {answer}
Return only a JSON object: {{"score": <1-5>, "reason": "<brief reason>"}}'''


def llm_judge(question: str, answer: str) -> dict:
    """Ask the judge model to score an answer; returns the parsed JSON verdict."""
    resp = client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user',
                   'content': JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # reduce run-to-run variation in scores
        response_format={'type': 'json_object'},  # ask the API to return valid JSON
    )
    return json.loads(resp.choices[0].message.content)
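A judge's JSON can parse correctly and still carry an unusable score, so it can help to validate each verdict before aggregating. The following is a minimal sketch under stated assumptions: `validated_score`, `mean_judge_score`, and `fake_judge` are hypothetical names, and `fake_judge` is an offline stub with the same call shape as `llm_judge` so the example runs without an API key.

```python
from statistics import mean
from typing import Callable, Iterable

# Same call shape as llm_judge(question, answer) -> dict
JudgeFn = Callable[[str, str], dict]


def validated_score(verdict: dict) -> int:
    """Return the judge's score, or raise if it is missing or outside 1-5."""
    score = verdict.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"unexpected judge output: {verdict!r}")
    return score


def mean_judge_score(pairs: Iterable[tuple[str, str]], judge: JudgeFn) -> float:
    """Average validated judge scores over (question, answer) pairs."""
    return mean(validated_score(judge(q, a)) for q, a in pairs)


def fake_judge(question: str, answer: str) -> dict:
    """Hypothetical offline stand-in for llm_judge, for demonstration only."""
    return {"score": 5 if "Guido" in answer else 1, "reason": "stub"}


pairs = [
    ("Who created Python?", "Python was created by Guido van Rossum."),
    ("Who created Python?", "Python was created by Dennis Ritchie."),
]
print(mean_judge_score(pairs, fake_judge))
```

Passing the judge in as a parameter means `llm_judge` can be dropped in unchanged, and the same aggregation can be unit-tested against a stub.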
