RAGAS: The Standard Framework for Evaluating RAG Systems

RAGAS gives you four metrics that cover every major failure mode in a retrieval-augmented generation pipeline. Here is what each metric measures and how to act on low scores.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read



RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted frameworks for evaluating RAG pipelines, and for good reason. It covers four main ways a RAG system can fail: generating answers not grounded in the retrieved documents, answering the wrong question, retrieving irrelevant chunks, and failing to retrieve relevant chunks. A low score on any of the four metrics points directly to the part of your pipeline that needs work.

Why RAG Evaluation Is Different

Evaluating a RAG system is more complex than evaluating a plain LLM call because there are two separate components that can fail: the retrieval step and the generation step. A generation model that works well with perfect context can give wrong answers if retrieval is poor. A retrieval system that finds the right documents cannot help if the generation model ignores them or hallucinates anyway.

RAGAS was introduced by Es et al. (2023) specifically to address this: it defines metrics that measure each component independently so you know where to focus your engineering effort.

The Four Core RAGAS Metrics

1. Faithfulness

Faithfulness measures whether the generated answer is grounded in the retrieved context. A faithful answer makes only claims that are directly supported by the retrieved documents. An unfaithful answer introduces information not present in the retrieved context: this is the hallucination signal in a RAG pipeline.

Score range: 0 to 1. A score of 1 means every claim in the answer can be traced to the retrieved documents. A score of 0.6 means 40% of the claims are not grounded.
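The arithmetic behind the score can be sketched in a few lines. This is a minimal illustration, not the RAGAS implementation: it assumes an upstream judge (in RAGAS, an LLM) has already split the answer into claims and marked each one as supported or unsupported by the retrieved context. The function name `faithfulness_score` and the sample verdicts are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Return the fraction of answer claims supported by the retrieved context.

    claim_supported holds one verdict per claim, produced by some upstream
    judge (assumed here; not part of this sketch).
    """
    if not claim_supported:
        raise ValueError("need at least one claim to score")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: 3 of 5 claims are grounded in the context.
verdicts = [True, True, True, False, False]
score = faithfulness_score(verdicts)  # 3 / 5 = 0.6, so 40% of claims are ungrounded
```

Keeping the per-claim verdicts around, rather than only the final number, makes it easier to see which statements the model invented.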

What a low faithfulness score tells you: Your generation model is ignoring the retrieved context and relying on its parametric knowledge (what it learned during training). Fix: strengthen your system prompt to explicitly instruct the model to use only the provided context, or add a verification step that checks claims against retrieved documents.

2. Answer Relevance

Answer relevance measures whether the generated answer actually addresses the user's question. A highly relevant answer directly responds to what was asked. A low-relevance answer might be factually accurate but not address the question, or might drift into tangentially related territory.
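One way to compute this, following the approach described in the RAGAS paper, is to have an LLM generate questions that the answer would plausibly respond to, embed them, and average their cosine similarity to the original question. The sketch below covers only the final averaging step. It assumes the question generation and embedding have already happened; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec: list[float],
                     generated_question_vecs: list[list[float]]) -> float:
    """Mean similarity between the user's question and questions
    reverse-engineered from the answer (embeddings assumed precomputed)."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original exactly,
# the other is orthogonal to it (completely off-topic).
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
relevance = answer_relevance(original, generated)  # (1.0 + 0.0) / 2 = 0.5
```

Note that nothing here checks factual accuracy: an answer can be entirely correct and still score low if it responds to a different question.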

What a low answer relevance score tells you: Your generation model is answering a different question than the one asked, or is being verbose in ways that dilute the relevant content. Fix: refine your generation prompt to be more directive about format and focus, or add post-processing to trim off-topic content.

3. Context Precision

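Context precision measures how much of what you retrieved is actually relevant to answering the question. In its simplest form it is the fraction of relevant chunks: if you retrieve 10 chunks and only 2 of them contain information relevant to the query, that fraction is 0.2. The RAGAS implementation is also rank-aware: it averages precision@k over the positions of the relevant chunks, so relevant chunks ranked near the top score higher than the same chunks buried at the bottom of the list.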
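The difference between the plain fraction and a rank-aware average can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes each retrieved chunk already carries a relevant/not-relevant verdict (in RAGAS, an LLM judges this), and both function names are hypothetical. The rank-aware version averages precision@k over the positions where a relevant chunk appears.

```python
def simple_precision(relevant: list[bool]) -> float:
    """Plain fraction of retrieved chunks that are relevant."""
    if not relevant:
        raise ValueError("need at least one retrieved chunk")
    return sum(relevant) / len(relevant)


def rank_aware_precision(relevant: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    Rewards retrievers that place relevant chunks near the top.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Ten retrieved chunks, two relevant, in two different orderings.
top_ranked = [True, True] + [False] * 8
bottom_ranked = [False] * 8 + [True, True]

plain = simple_precision(top_ranked)            # 2 / 10 = 0.2 in both orderings
ranked_top = rank_aware_precision(top_ranked)   # (1/1 + 2/2) / 2 = 1.0
ranked_bottom = rank_aware_precision(bottom_ranked)  # (1/9 + 2/10) / 2
```

The two orderings have the same plain fraction but very different rank-aware scores, which is why a reranking step can raise this metric without changing which chunks are retrieved.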

What a low context precision score tells you: Your retrieval system is returning too much irrelevant content. This is costly because it wastes context window space (and therefore money) and can confuse the generation model. Fix: tune your similarity threshold (retrieve fewer chunks with higher confidence), improve your chunking strategy, or add a reranking step after initial retrieval.

4. Context Recall

Context recall measures whether all the relevant information needed to answer the question was actually retrieved. High recall means nothing important was missed. Low recall means relevant source documents exist in your knowledge base but your retrieval system failed to surface them.
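Because recall is about what was missed, it has to be measured against something: RAGAS uses a reference (ground-truth) answer, breaks it into statements, and checks how many of them can be attributed to the retrieved context. A minimal sketch of that ratio is below. The `naive_judge` substring check is a deliberately crude stand-in for the LLM judgment, and all names and sample data are hypothetical.

```python
from typing import Callable


def context_recall(reference_statements: list[str],
                   contexts: list[str],
                   is_attributable: Callable[[str, str], bool]) -> float:
    """Fraction of reference-answer statements found in at least one chunk."""
    if not reference_statements:
        raise ValueError("need at least one reference statement")
    found = [
        any(is_attributable(statement, chunk) for chunk in contexts)
        for statement in reference_statements
    ]
    return sum(found) / len(found)


def naive_judge(statement: str, chunk: str) -> bool:
    """Stand-in judge: case-insensitive substring match (not what RAGAS does)."""
    return statement.lower() in chunk.lower()


# Hypothetical reference statements and a single retrieved chunk.
statements = ["Paris is the capital of France", "France uses the euro"]
retrieved = ["Paris is the capital of France and its largest city."]

recall = context_recall(statements, retrieved, naive_judge)  # 1 of 2 found = 0.5
```

The statements that come back unmatched are the useful output: they tell you which facts exist in the reference answer but never reached the generator.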

What a low context recall score tells you: Your retrieval system is missing relevant documents. Fix: check your embedding model (a domain-specific embedding model can substantially improve recall for technical content), adjust chunk size (chunks that are too small or too large both hurt recall), or try hybrid retrieval (BM25 keyword search plus dense vector search).
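Taken together, the four metrics split cleanly by component: faithfulness and answer relevance point at generation, while context precision and context recall point at retrieval. A small triage helper can make that mapping explicit. This is a sketch only: the metric keys, the `triage` function, the sample scores, and the threshold are all hypothetical, and the threshold is left as a required argument because a sensible cutoff depends on your use case.

```python
# Which pipeline component each metric diagnoses (per the sections above).
COMPONENT = {
    "faithfulness": "generation",
    "answer_relevance": "generation",
    "context_precision": "retrieval",
    "context_recall": "retrieval",
}


def triage(scores: dict[str, float], threshold: float) -> dict[str, list[str]]:
    """Group metrics scoring below the threshold by the component to fix."""
    flagged: dict[str, list[str]] = {"generation": [], "retrieval": []}
    for metric, score in scores.items():
        if score < threshold:
            flagged[COMPONENT[metric]].append(metric)
    return flagged


# Hypothetical evaluation run with an illustrative cutoff of 0.7.
sample_scores = {
    "faithfulness": 0.9,
    "answer_relevance": 0.85,
    "context_precision": 0.4,
    "context_recall": 0.8,
}
to_fix = triage(sample_scores, threshold=0.7)
# Only context_precision falls below the cutoff, so retrieval is flagged.
```

In this example the generation side is healthy and the work belongs in retrieval, for instance by adding a reranking step.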
