The RAG Evaluation Problem
How do you know your RAG pipeline is actually good? Human evaluation is expensive and slow. Simple string matching misses semantically correct answers. RAGAS (Retrieval Augmented Generation Assessment) addresses this gap: it is a framework of metrics that use an LLM as judge to score RAG outputs, with some metrics also comparing against ground-truth answers, as described in the original paper.
The Four Core Metrics
1. Faithfulness (0–1)
Measures whether the generated answer is factually consistent with the retrieved context. The LLM decomposes the answer into atomic claims, then checks each claim against the context. A claim invented outside the context reduces the score.
Formula: faithfulness = verified_claims / total_claims
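To make the ratio concrete, here is a minimal sketch of the final scoring step only. It assumes the LLM judge has already produced one supported/unsupported verdict per atomic claim; the function name and the verdict list are hypothetical, not part of the RAGAS API.

```python
def faithfulness_score(verdicts):
    """verdicts: list of bools, True if the claim is supported by the context."""
    if not verdicts:
        return 0.0  # assumption for this sketch: an answer with no claims scores 0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for an answer that decomposes into four atomic claims,
# three of which the judge found in the retrieved context
verdicts = [True, True, True, False]
print(faithfulness_score(verdicts))  # 0.75
```

One unsupported claim out of four already pulls the score down to 0.75, which is why short, tightly grounded answers tend to score well on this metric.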
2. Answer Relevancy (0–1)
Measures whether the answer actually addresses the question. The LLM generates N hypothetical questions from the answer, then the metric takes the mean cosine similarity between the embeddings of those questions and the embedding of the original question. Irrelevant padding lowers the score.
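The similarity step can be sketched in a few lines. This is an illustration under assumptions, not the RAGAS implementation: the two-dimensional vectors are toy stand-ins for real embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_vecs):
    """Mean cosine similarity between the original question and each generated question."""
    if not generated_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original, one is unrelated
question_vec = [1.0, 0.0]
generated_vecs = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy_score(question_vec, generated_vecs))  # 0.5
```

An answer padded with off-topic material leads the LLM to generate off-topic questions, which drags the mean similarity down.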
3. Context Precision (0–1)
Measures whether the retrieved chunks that are actually useful are ranked first. Higher-ranked useful chunks produce a higher score. Evaluates retriever quality.
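One common way to formalize "useful chunks ranked first" is to average precision@k over the ranks that hold a useful chunk. The sketch below assumes that formulation and assumes the LLM judge has already labeled each retrieved chunk as useful (1) or not (0); the function name is hypothetical.

```python
def context_precision_score(relevance):
    """relevance: 0/1 flags in rank order (index 0 is the top-ranked chunk)."""
    hits = 0
    weighted = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / k  # precision@k, counted only at ranks holding a useful chunk
    return weighted / hits if hits else 0.0


# Same two useful chunks, different ranking
print(context_precision_score([1, 1, 0]))  # 1.0
print(round(context_precision_score([0, 1, 1]), 3))  # 0.583
```

Both retrievals return the same two useful chunks, but pushing the irrelevant chunk to the top cuts the score from 1.0 to about 0.58.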
4. Context Recall (0–1)
Measures whether the retrieved context contains all the information needed to answer the question. Requires a ground-truth answer. Checks what fraction of ground-truth claims can be found in the retrieved context.
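The data flow can be sketched as follows. The judge here is a deliberately naive keyword check standing in for the LLM call, and all names (`context_recall_score`, `keyword_judge`) and the sample claims are hypothetical.

```python
from typing import Callable, List


def context_recall_score(
    gt_claims: List[str],
    contexts: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of ground-truth claims that the judge can attribute to the context."""
    if not gt_claims:
        return 0.0
    joined = " ".join(contexts)
    supported = sum(1 for claim in gt_claims if judge(claim, joined))
    return supported / len(gt_claims)


def keyword_judge(claim: str, context: str) -> bool:
    # Toy stand-in for the LLM judge: every word of the claim must appear in the context
    ctx = context.lower()
    return all(word in ctx for word in claim.lower().split())


claims = ["kv cache pages", "invented in 2023"]
contexts = ["PagedAttention divides the KV cache into fixed-size pages."]
print(context_recall_score(claims, contexts, keyword_judge))  # 0.5
```

The second claim has no support in the retrieved text, so only half of the ground truth is recoverable: a signal that the retriever missed a relevant document.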
Installation and Usage
pip install ragas datasets langchain-openai
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Assumes OPENAI_API_KEY is set: RAGAS uses OpenAI models as the judge by default.

# Build a dataset of Q&A pairs with retrieved contexts
data = {
    "question": ["What is PagedAttention?", "What is HNSW?"],
    "answer": [
        "PagedAttention manages KV cache as virtual memory pages.",
        "HNSW is a graph-based approximate nearest neighbor algorithm.",
    ],
    # One list of retrieved chunks per question
    "contexts": [
        ["PagedAttention divides the KV cache into fixed-size pages stored non-contiguously..."],
        ["Hierarchical Navigable Small World (HNSW) builds a multi-layer graph for ANN search..."],
    ],
    "ground_truth": [
        "PagedAttention is a memory management technique for LLM KV caches.",
        "HNSW is an efficient algorithm for approximate nearest neighbor search.",
    ],
}
dataset = Dataset.from_dict(data)

# Score every row with all four metrics
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Example output (actual values vary between runs and judge models):
# {'faithfulness': 0.94, 'answer_relevancy': 0.88, 'context_precision': 0.91, 'context_recall': 0.87}
LangChain Integration
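If you are using LangChain for your RAG pipeline, RAGAS provides an evaluator chain that wraps a single metric and scores one record at a time; the same records can come from traces logged to LangSmith. The import path may differ between RAGAS versions: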
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import faithfulness
evaluator = EvaluatorChain(metric=faithfulness)
result = evaluator({"question": "...", "answer": "...", "contexts": ["..."]})
CI Pipeline Integration
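Add RAGAS scores as quality gates in your CI pipeline. The command below is illustrative; confirm that your installed RAGAS version provides such a CLI, or implement the gate as a short script around evaluate():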
ragas test --dataset eval_dataset.json --threshold faithfulness=0.85
If faithfulness drops below 0.85 after a prompt change, the pipeline fails and the change is blocked. This is the foundation of continuous RAG quality assurance.
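The gating logic itself needs nothing beyond a threshold comparison and an exit code. Here is a minimal sketch, assuming the metric scores are already available as a plain dict; the function names and the sample scores are hypothetical, and nothing here is a RAGAS API.

```python
def failed_gates(scores, thresholds):
    """Return (metric, value, minimum) for every metric below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        # A missing metric counts as a failure so a gate cannot be skipped silently
        if value is None or value < minimum:
            failures.append((metric, value, minimum))
    return failures


def gate_exit_code(scores, thresholds):
    """Print each failure and return 1 if any gate failed, else 0."""
    failures = failed_gates(scores, thresholds)
    for metric, value, minimum in failures:
        print(f"FAIL {metric}: got {value}, need at least {minimum}")
    return 1 if failures else 0


# Hypothetical scores, e.g. copied from an evaluation run
scores = {"faithfulness": 0.80, "answer_relevancy": 0.88}
thresholds = {"faithfulness": 0.85}
print(gate_exit_code(scores, thresholds))
```

In a CI job, the script would pass the returned value to `sys.exit()` so that a non-zero code fails the pipeline step.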
What RAGAS Does Not Cover
RAGAS does not evaluate latency, cost, or retriever speed. For production, pair RAGAS (quality metrics) with Langfuse (latency/cost tracing) for a complete picture.