RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted frameworks for evaluating RAG pipelines, and for good reason. Its four core metrics cover the four ways a RAG system can fail: generating answers not grounded in retrieved documents, answering the wrong question, retrieving irrelevant chunks, and failing to retrieve relevant chunks. A low score on any one metric points directly to the part of your pipeline that needs work.
Why RAG Evaluation Is Different
Evaluating a RAG system is more complex than evaluating a plain LLM call because there are two separate components that can fail: the retrieval step and the generation step. A generation model that works well with perfect context can give wrong answers if retrieval is poor. A retrieval system that finds the right documents cannot help if the generation model ignores them or hallucinates anyway.
RAGAS was introduced by Es et al. (2023) specifically to address this: it defines metrics that measure each component independently so you know where to focus your engineering effort.
The Four Core RAGAS Metrics
1. Faithfulness
Faithfulness measures whether the generated answer is grounded in the retrieved context. A faithful answer only makes claims that are directly supported by the retrieved documents. An unfaithful answer introduces information not present in the retrieved context — this is the hallucination signal in a RAG pipeline.
Score range: 0 to 1. A score of 1 means every claim in the answer can be traced to the retrieved documents. A score of 0.6 means 40% of the claims are not grounded.
What a low faithfulness score tells you: Your generation model is ignoring the retrieved context and relying on its parametric knowledge (what it learned during training). Fix: strengthen your system prompt to explicitly instruct the model to use only the provided context, or add a verification step that checks claims against retrieved documents.
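The score can be read as a simple ratio of supported claims to total claims. The sketch below assumes the per-claim verdicts are already available (in RAGAS an LLM extracts and judges the claims); `faithfulness_score` and the hard-coded verdicts are hypothetical, for illustration only.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("need at least one claim to score")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for a five-claim answer: three supported, two not.
verdicts = [True, True, True, False, False]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")  # 3 of 5 claims supported -> 0.60
```

With three of five claims supported the score is 0.6, matching the reading above that 40% of the claims are not grounded.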
2. Answer Relevance
Answer relevance measures whether the generated answer actually addresses the user's question. A highly relevant answer directly responds to what was asked. A low-relevance answer might be factually accurate but not address the question, or might drift into tangentially related territory.
What a low answer relevance score tells you: Your generation model is answering a different question than the one asked, or is being verbose in ways that dilute the relevant content. Fix: refine your generation prompt to be more directive about format and focus, or add post-processing to trim off-topic content.
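One way to make this measurable, and roughly how the RAGAS answer relevancy metric is described, is to have an LLM generate questions that the answer would be a good response to, embed them, and average their cosine similarity to the original question. The sketch below shows only the scoring step; the two-dimensional vectors stand in for real embeddings, and `answer_relevance` is a hypothetical helper, not the library's API.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevance(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the real question and questions implied by the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed values): the user question points along the first axis.
question = [1.0, 0.0]
on_topic = [[1.0, 0.0], [0.8, 0.6]]   # questions implied by a focused answer
off_topic = [[0.0, 1.0], [0.6, 0.8]]  # questions implied by a drifting answer

focused_score = answer_relevance(question, on_topic)
drifting_score = answer_relevance(question, off_topic)
print(focused_score > drifting_score)
```

An answer that drifts implies questions far from the one asked, so its average similarity falls even if every sentence in it is true.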
3. Context Precision
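Context precision measures how much of the retrieved context is actually relevant to answering the question. As a plain fraction, retrieving 10 chunks of which only 2 contain information relevant to the query gives a precision of 0.2. The RAGAS implementation is rank-aware, so it also rewards placing relevant chunks near the top of the results and will not always equal that plain fraction.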
What a low context precision score tells you: Your retrieval system is returning too much irrelevant content. This is costly because it wastes context window space (and therefore money) and can confuse the generation model. Fix: tune your similarity threshold (retrieve fewer chunks with higher confidence), improve your chunking strategy, or add a reranking step after initial retrieval.
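The plain fraction of relevant chunks ignores ordering. A rank-weighted variant, which is how the RAGAS documentation describes its context precision, averages precision@k over the ranks that hold a relevant chunk. The sketch below assumes relevance flags are already known per chunk; both helper names are hypothetical.

```python
def plain_precision(relevant_flags: list[bool]) -> float:
    """Share of retrieved chunks that are relevant, ignoring rank."""
    return sum(relevant_flags) / len(relevant_flags)


def rank_weighted_precision(relevant_flags: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Ten retrieved chunks, two relevant, in two different orderings.
relevant_first = [True, True] + [False] * 8
relevant_last = [False] * 8 + [True, True]

print(plain_precision(relevant_first), plain_precision(relevant_last))
print(rank_weighted_precision(relevant_first), rank_weighted_precision(relevant_last))
```

Both orderings have a plain precision of 0.2, but the rank-weighted score is 1.0 when the relevant chunks come first and about 0.16 when they come last, which is why a reranking step can raise this metric without changing what is retrieved.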
4. Context Recall
Context recall measures whether all the relevant information needed to answer the question was actually retrieved. High recall means nothing important was missed. Low recall means relevant source documents exist in your knowledge base but your retrieval system failed to surface them.
What a low context recall score tells you: Your retrieval system is missing relevant documents. Fix: check your embedding model (a domain-specific embedding model often dramatically improves recall for technical content), adjust chunk size (chunks that are too small or too large both hurt recall), or try hybrid retrieval (BM25 keyword search plus dense vector search).
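Hybrid retrieval needs a way to merge the keyword ranking and the vector ranking into one list. The article does not prescribe a method; the sketch below uses reciprocal rank fusion, a common choice, with the conventional constant k=60. The document IDs and both input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 index and a dense vector index.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents found by only one retriever (`doc_b`, `doc_d`) still make the fused list, which is where the recall gain comes from; documents both retrievers agree on rise to the top.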
Setting Up RAGAS in Python
RAGAS is a Python library. Install it with:
pip install ragas
A minimal evaluation run:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Your evaluation data: one entry per question in every column.
# Column names follow the schema used in this example; other ragas
# releases may expect different names, so check your installed version.
data = {
    "question": ["What is the return policy?", "How do I cancel my subscription?"],
    "answer": ["You can return items within 30 days.", "Go to Settings and click Cancel."],
    # Each question gets a list of retrieved chunks.
    "contexts": [
        ["Our return policy allows returns within 30 days of purchase."],
        ["To cancel, navigate to Account Settings, then Subscription, then Cancel Plan."],
    ],
    "ground_truth": ["Items can be returned within 30 days.", "Cancel via Account Settings > Subscription."],
}
dataset = Dataset.from_dict(data)

# The metrics are LLM-based, so credentials for the judge model
# must be configured before this call.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
The ground_truth field is required for context recall. Depending on the RAGAS version, context precision may also use it; faithfulness and answer relevance do not need it.
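Evaluation data often starts as one record per question rather than the column-oriented dictionary used above. A minimal sketch of the conversion, with basic validation so a missing field fails early instead of mid-evaluation; `to_ragas_columns` is a hypothetical helper, not part of the library.

```python
REQUIRED_FIELDS = ("question", "answer", "contexts")


def to_ragas_columns(records: list[dict], require_ground_truth: bool = True) -> dict:
    """Turn per-question records into the column dict expected by Dataset.from_dict."""
    fields = REQUIRED_FIELDS + (("ground_truth",) if require_ground_truth else ())
    columns = {field: [] for field in fields}
    for i, record in enumerate(records):
        missing = [field for field in fields if field not in record]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
        if not isinstance(record["contexts"], list):
            raise TypeError(f"record {i}: contexts must be a list of strings")
        for field in fields:
            columns[field].append(record[field])
    return columns


records = [
    {
        "question": "What is the return policy?",
        "answer": "You can return items within 30 days.",
        "contexts": ["Our return policy allows returns within 30 days of purchase."],
        "ground_truth": "Items can be returned within 30 days.",
    },
]
columns = to_ragas_columns(records)
print(list(columns))
```

The resulting dictionary can be passed to `Dataset.from_dict` exactly as in the example above.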
What Good Scores Look Like
There are no universal benchmarks for RAGAS scores because they depend heavily on your domain and task. That said, here are practical targets for a production RAG system:
- Faithfulness > 0.85: anything below suggests regular hallucination
- Answer relevance > 0.80: below this means your model is frequently drifting
- Context precision > 0.70: below this suggests retrieval is too noisy
- Context recall > 0.75: below this suggests important documents are being missed
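These targets are easy to encode as a check that runs after each evaluation. The sketch below assumes the scores are available as a plain dictionary keyed by metric name; the score values are invented and `below_target` is a hypothetical helper.

```python
# Targets from the list above; a metric passes only if it is strictly greater.
TARGETS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.75,
}


def below_target(scores: dict, targets: dict = TARGETS) -> dict:
    """Return {metric: (score, target)} for every metric that misses its target."""
    return {
        name: (scores[name], minimum)
        for name, minimum in targets.items()
        if name in scores and scores[name] <= minimum
    }


# Hypothetical scores from one evaluation run.
scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.78,
    "context_precision": 0.74,
    "context_recall": 0.69,
}
failing = below_target(scores)
for name, (score, minimum) in failing.items():
    print(f"{name}: {score:.2f} is not above {minimum:.2f}")
```

Here answer relevance and context recall are flagged, which points at the generation prompt and the retriever respectively.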
If you are launching a new RAG system, run RAGAS on a 50-100 question evaluation set before going to production. A baseline run gives you something to compare against after you make changes.
RAGAS in CI/CD
For teams shipping RAG applications, running RAGAS in CI is valuable. Add a GitHub Actions step that runs the eval on every pull request that modifies retrieval logic or generation prompts. Block merges if any metric drops more than 0.05 from the baseline.
# .github/workflows/rag-eval.yml
- name: Run RAGAS evaluation
  run: python scripts/run_ragas_eval.py
- name: Check score thresholds
  run: python scripts/check_ragas_thresholds.py --min-faithfulness 0.85
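The contents of the threshold script are not shown here. As one possible shape for the "drops more than 0.05 from the baseline" rule, the sketch below compares two score dictionaries; `find_regressions`, the metric values, and the way scores are stored are all assumptions.

```python
def find_regressions(baseline: dict, current: dict, max_drop: float = 0.05) -> dict:
    """Return {metric: drop} for every metric that fell more than max_drop."""
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - current[metric]
        # Small tolerance so a drop of exactly max_drop is not flagged
        # because of floating-point rounding.
        if drop > max_drop + 1e-9:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores: the baseline from main, the current run from the PR.
baseline = {"faithfulness": 0.90, "context_recall": 0.80}
current = {"faithfulness": 0.88, "context_recall": 0.72}

regressions = find_regressions(baseline, current)
exit_code = 1 if regressions else 0  # a CI script would pass this to sys.exit()
print(regressions, exit_code)
```

A non-zero exit code fails the workflow step, which is what blocks the merge. Comparing against a stored baseline catches gradual regressions that a fixed minimum such as `--min-faithfulness 0.85` would let through.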
Beyond the Four Core Metrics
RAGAS also supports additional metrics for specific scenarios:
- Answer correctness: combines factual agreement with the ground truth and semantic similarity to it, useful when exact answers matter
- Aspect critique: uses an LLM judge to evaluate specific aspects like harmlessness or conciseness
- Context entity recall: measures what fraction of the named entities in the ground truth also appear in the retrieved context
Start with the four core metrics. Add the specialized ones once you understand what your specific pipeline is failing on.