RAGAS: The Standard Framework for Evaluating RAG Systems

RAGAS gives you four metrics that cover every major failure mode in a retrieval-augmented generation pipeline. Here is what each metric measures and how to act on low scores.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#rag-evaluation #ragas #llm-evaluation #retrieval-augmented-generation


RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted frameworks for evaluating RAG pipelines, and for good reason. Its four core metrics map onto the four ways a RAG system can fail: generating answers not grounded in the retrieved documents (faithfulness), answering the wrong question (answer relevance), retrieving irrelevant chunks (context precision), and failing to retrieve relevant chunks (context recall). A low score on any one of them points directly to the part of your pipeline that needs work.

Why RAG Evaluation Is Different

Evaluating a RAG system is more complex than evaluating a plain LLM call because there are two separate components that can fail: the retrieval step and the generation step. A generation model that works well with perfect context can give wrong answers if retrieval is poor. A retrieval system that finds the right documents cannot help if the generation model ignores them or hallucinates anyway.

RAGAS was introduced by Es et al. (2023) specifically to address this: it defines metrics that measure each component independently so you know where to focus your engineering effort.

The Four Core RAGAS Metrics

1. Faithfulness

Faithfulness measures whether the generated answer is grounded in the retrieved context. A faithful answer only makes claims that are directly supported by the retrieved documents. An unfaithful answer introduces information not present in the retrieved context — this is the hallucination signal in a RAG pipeline.

Score range: 0 to 1. A score of 1 means every claim in the answer can be traced to the retrieved documents. A score of 0.6 means 40% of the claims are not grounded.

What a low faithfulness score tells you: Your generation model is ignoring the retrieved context and relying on its parametric knowledge (what it learned during training). Fix: strengthen your system prompt to explicitly instruct the model to use only the provided context, or add a verification step that checks claims against retrieved documents.
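To make the arithmetic behind the score concrete, here is a minimal sketch that computes faithfulness as the share of answer claims supported by the retrieved context. The verdicts are hand-written for illustration; in RAGAS an LLM judge extracts the claims and decides each verdict. The `faithfulness_score` helper and the sample verdicts are assumptions, not part of the library.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        # No claims to verify. Returning 0.0 here is an assumption of this sketch.
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for five claims extracted from a single answer:
# three are backed by the retrieved context, two are not.
verdicts = [True, True, True, False, False]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

Three supported claims out of five gives 0.6, the case described above where 40% of the claims are not grounded.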

2. Answer Relevance

Answer relevance measures whether the generated answer actually addresses the user's question. A highly relevant answer directly responds to what was asked. A low-relevance answer might be factually accurate but not address the question, or might drift into tangentially related territory.

What a low answer relevance score tells you: Your generation model is answering a different question than the one asked, or is being verbose in ways that dilute the relevant content. Fix: refine your generation prompt to be more directive about format and focus, or add post-processing to trim off-topic content.
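For intuition about how this score is produced: RAGAS defines answer relevancy as the mean cosine similarity between the original question and a few questions that an LLM generates back from the answer. The sketch below reproduces only that averaging step. The two-dimensional vectors are made-up stand-ins for real embeddings, and the helper names are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the user's question and the questions
    reverse-engineered from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy embeddings (hypothetical): the user's question, then questions generated
# from an on-topic answer and from an answer that drifted off topic.
question = [1.0, 0.0]
on_topic = answer_relevancy_score(question, [[1.0, 0.0], [0.8, 0.6]])
off_topic = answer_relevancy_score(question, [[0.0, 1.0], [0.6, 0.8]])
print(f"on-topic: {on_topic:.2f}, off-topic: {off_topic:.2f}")
```

An answer that addresses the question yields generated questions close to the original, so the mean similarity is high. A drifting answer yields questions about something else and the score drops.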

3. Context Precision

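Context precision measures how much of the retrieved context is actually relevant to answering the question, and whether the relevant chunks are ranked near the top. As a first intuition, if you retrieve 10 chunks and only 2 of them contain information relevant to the query, plain precision is 0.2. The RAGAS score is rank-weighted (it averages precision@k over the positions of the relevant chunks), so the exact value also depends on where those 2 chunks appear in the ranking.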

What a low context precision score tells you: Your retrieval system is returning too much irrelevant content. This is costly because it wastes context window space (and therefore money) and can confuse the generation model. Fix: tune your similarity threshold (retrieve fewer chunks with higher confidence), improve your chunking strategy, or add a reranking step after initial retrieval.
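The RAGAS definition of context precision is rank-aware: it averages precision@k over the positions where a relevant chunk appears, so the same set of chunks scores differently depending on their order. The sketch below compares that with the plain fraction of relevant chunks. The relevance labels are hand-written here (RAGAS obtains them from an LLM judge), and the function names are assumptions for illustration.

```python
def plain_precision(relevance: list[int]) -> float:
    """Share of retrieved chunks that are relevant, ignoring rank."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def rank_weighted_precision(relevance: list[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Ten retrieved chunks, two relevant (1 = relevant, 0 = not relevant).
relevant_first = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
relevant_last = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

for name, labels in [("relevant first", relevant_first), ("relevant last", relevant_last)]:
    print(name, round(plain_precision(labels), 3), round(rank_weighted_precision(labels), 3))
```

Both rankings have a plain precision of 0.2, but the rank-weighted score rewards putting the relevant chunks first. That is why adding a reranking step can raise context precision without changing which chunks are retrieved.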

4. Context Recall

Context recall measures whether all the relevant information needed to answer the question was actually retrieved. High recall means nothing important was missed. Low recall means relevant source documents exist in your knowledge base but your retrieval system failed to surface them.

What a low context recall score tells you: Your retrieval system is missing relevant documents. Fix: check your embedding model (a domain-specific embedding model often dramatically improves recall for technical content), adjust chunk size (too small or too large both hurt recall), or try hybrid retrieval (BM25 keyword search plus dense vector search).
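As one way to picture the hybrid retrieval fix, the sketch below merges a keyword ranking and a dense-vector ranking with reciprocal rank fusion (RRF). RRF is a common fusion technique chosen here for illustration; it is not something RAGAS provides, and the document IDs, the two input rankings, and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) in every list that contains it,
    and documents are ordered by their summed score.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for one query from two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # keyword search
dense_ranking = ["doc_c", "doc_a", "doc_d"]  # vector search

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents found by only one retriever (`doc_b`, `doc_d`) still make it into the fused list, which is how hybrid retrieval can surface chunks that a single retriever misses.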

Setting Up RAGAS in Python

RAGAS is a Python library. Install it with:

pip install ragas

A minimal evaluation run:

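# The metrics are scored by an LLM judge. By default RAGAS uses OpenAI models,
# so set OPENAI_API_KEY in your environment before running this.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Your evaluation data: one entry per question. "contexts" holds the list of
# retrieved chunks for each question. Column names vary between RAGAS releases,
# so check the documentation for the version you have installed.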
data = {
    "question": ["What is the return policy?", "How do I cancel my subscription?"],
    "answer": ["You can return items within 30 days.", "Go to Settings and click Cancel."],
    "contexts": [
        ["Our return policy allows returns within 30 days of purchase."],
        ["To cancel, navigate to Account Settings, then Subscription, then Cancel Plan."]
    ],
    "ground_truth": ["Items can be returned within 30 days.", "Cancel via Account Settings > Subscription."]
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)

The ground_truth field is required for context recall and, in recent versions of the library, for the default context precision metric as well. Faithfulness and answer relevance do not need it.

What Good Scores Look Like

There are no universal benchmarks for RAGAS scores because they depend heavily on your domain and task. That said, here are practical targets for a production RAG system:

Faithfulness > 0.85: anything below suggests regular hallucination
Answer relevance > 0.80: below this means your model is frequently drifting
Context precision > 0.70: below this suggests retrieval is too noisy
Context recall > 0.75: below this suggests important documents are being missed
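The targets above can be turned into a small check, sketched below. The target values come from the list; the `below_target` helper, the metric key names, and the sample scores are assumptions for illustration, so adapt them to however you export your RAGAS results.

```python
# Practical targets from the list above, keyed by metric name.
TARGETS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.75,
}


def below_target(scores: dict[str, float], targets: dict[str, float] = TARGETS) -> dict[str, tuple[float, float]]:
    """Return {metric: (score, target)} for every metric that misses its target."""
    return {
        name: (scores[name], target)
        for name, target in targets.items()
        if name in scores and scores[name] < target
    }


# Hypothetical scores from one evaluation run.
run_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.78,
    "context_precision": 0.74,
    "context_recall": 0.69,
}

flagged = below_target(run_scores)
for name, (score, target) in flagged.items():
    print(f"{name}: {score:.2f} is below the target of {target:.2f}")
```

In this made-up run, answer relevancy and context recall are flagged, which points at the generation prompt and the retriever respectively.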

If you are launching a new RAG system, run RAGAS on a 50-100 question evaluation set before going to production. A baseline run gives you something to compare against after you make changes.

RAGAS in CI/CD

For teams shipping RAG applications, running RAGAS in CI is valuable. Add a GitHub Actions step that runs the eval on every pull request that modifies retrieval logic or generation prompts. Block merges if any metric drops more than 0.05 from the baseline.

# .github/workflows/rag-eval.yml
- name: Run RAGAS evaluation
  run: python scripts/run_ragas_eval.py
- name: Check score thresholds
  run: python scripts/check_ragas_thresholds.py --min-faithfulness 0.85
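The merge-blocking rule described above could be implemented roughly as follows. This is a sketch of what a threshold script might contain, not the actual contents of `scripts/check_ragas_thresholds.py`; the function names, the way scores are passed in, and the sample numbers are all assumptions.

```python
MAX_DROP = 0.05  # largest allowed drop from the baseline, per the rule above


def regressions(baseline: dict[str, float], current: dict[str, float], max_drop: float = MAX_DROP) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell more than max_drop below baseline.

    A metric missing from the current run is treated as a score of 0.0.
    """
    failed = {}
    for name, base_score in baseline.items():
        drop = base_score - current.get(name, 0.0)
        if drop > max_drop:
            failed[name] = round(drop, 4)
    return failed


def exit_code(failed: dict[str, float]) -> int:
    """Non-zero when any metric regressed; a CI script would pass this to sys.exit()."""
    return 1 if failed else 0


# Hypothetical scores: the stored baseline and the current pull request's run.
baseline = {"faithfulness": 0.90, "context_recall": 0.80}
current = {"faithfulness": 0.88, "context_recall": 0.72}

failed = regressions(baseline, current)
for name, drop in failed.items():
    print(f"{name} dropped by {drop:.2f} (limit {MAX_DROP:.2f})")
print("exit code:", exit_code(failed))
```

Because LLM-judged scores vary slightly from run to run, comparing against a tolerance such as 0.05 is more practical than requiring an exact match with the baseline.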

Beyond the Four Core Metrics

RAGAS also supports additional metrics for specific scenarios:

Answer correctness: combines factual overlap with the ground truth and semantic similarity to it, useful when exact answers matter
Aspect critique: uses an LLM judge to evaluate specific aspects like harmlessness or conciseness
Context entity recall: measures whether the named entities in the ground truth are present in the retrieved context, useful when specific entities matter

Start with the four core metrics. Add the specialized ones once you understand what your specific pipeline is failing on.

Keep Reading

Building an LLM Eval From Zero — How to structure your evaluation pipeline before adding specialized tools.
LLM-as-Judge: Using LLMs to Evaluate LLM Outputs — The underlying technique that powers RAGAS's LLM-based metrics.
How to Evaluate LLMs: The Complete Guide — Full framework covering every type of LLM evaluation.

