RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted frameworks for evaluating RAG pipelines, and for good reason. It covers four common ways a RAG system can fail: generating answers not grounded in retrieved documents, answering the wrong question, retrieving irrelevant chunks, and failing to retrieve relevant chunks. Each failure mode has its own metric, so a low score on any of the four points to the part of your pipeline that needs work.
Why RAG Evaluation Is Different
Evaluating a RAG system is more complex than evaluating a plain LLM call because there are two separate components that can fail: the retrieval step and the generation step. A generation model that works well with perfect context can give wrong answers if retrieval is poor. A retrieval system that finds the right documents cannot help if the generation model ignores them or hallucinates anyway.
RAGAS was introduced by Es et al. (2023) specifically to address this: it defines metrics that measure each component independently so you know where to focus your engineering effort.
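To make the "where to focus" idea concrete, here is a minimal sketch of a triage helper that groups low-scoring metrics by the component they point to. The metric key names, the `triage` function, and the 0.7 cutoff are all assumptions chosen for illustration; they are not part of the RAGAS API, and a sensible cutoff depends on your data.

```python
from typing import Dict, List

# Hypothetical mapping from metric name to the pipeline stage it diagnoses.
METRIC_TO_COMPONENT: Dict[str, str] = {
    "faithfulness": "generation",
    "answer_relevance": "generation",
    "context_precision": "retrieval",
    "context_recall": "retrieval",
}


def triage(scores: Dict[str, float], threshold: float = 0.7) -> Dict[str, List[str]]:
    """Group metrics scoring below `threshold` by the component they implicate."""
    flagged: Dict[str, List[str]] = {"retrieval": [], "generation": []}
    for metric, score in scores.items():
        component = METRIC_TO_COMPONENT.get(metric)
        # Unknown metric names are ignored rather than guessed at.
        if component is not None and score < threshold:
            flagged[component].append(metric)
    return flagged


# Made-up scores for a single evaluation run.
example_scores = {
    "faithfulness": 0.92,
    "answer_relevance": 0.88,
    "context_precision": 0.45,
    "context_recall": 0.81,
}
flagged = triage(example_scores)
print(flagged)
```

With these made-up scores only the retrieval side is flagged, which suggests looking at chunking or reranking before touching the generation prompt.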
The Four Core RAGAS Metrics
1. Faithfulness
Faithfulness measures whether the generated answer is grounded in the retrieved context. A faithful answer only makes claims that are directly supported by the retrieved documents. An unfaithful answer introduces information not present in the retrieved context; this is the hallucination signal in a RAG pipeline.
Score range: 0 to 1. A score of 1 means every claim in the answer can be traced to the retrieved documents. A score of 0.6 means 40% of the claims are not grounded.
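The arithmetic behind that score can be sketched in a few lines. This assumes the answer has already been split into individual claims and that each claim has a supported/unsupported verdict (in practice an LLM judge usually produces these). The function name and the hard-coded verdicts are illustrative only.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims that are supported by the retrieved context."""
    if not claim_supported:
        raise ValueError("need at least one claim to score")
    return sum(claim_supported) / len(claim_supported)


# Five claims extracted from an answer; two have no support in the context.
verdicts = [True, True, True, False, False]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

Three supported claims out of five gives the 0.6 from the example above, which means 40% of the claims are ungrounded.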
What a low faithfulness score tells you: Your generation model is probably ignoring parts of the retrieved context and falling back on its parametric knowledge (what it learned during training). Fix: strengthen your system prompt to explicitly instruct the model to use only the provided context, or add a verification step that checks claims against retrieved documents.
2. Answer Relevance
Answer relevance measures whether the generated answer actually addresses the user's question. A highly relevant answer directly responds to what was asked. A low-relevance answer might be factually accurate but not address the question, or might drift into tangentially related territory.
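One way to turn this into a number, following the approach described in the RAGAS paper, is to have an LLM generate questions that the answer would plausibly respond to, embed them, and average their cosine similarity to the original question. The sketch below shows only that final averaging step. The two-dimensional vectors are toy stand-ins for real embeddings, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec: Sequence[float],
                     generated_question_vecs: List[Sequence[float]]) -> float:
    """Mean similarity between the user's question and questions
    reverse-engineered from the answer."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: on-topic vectors point the same way as the question,
# off-topic vectors are orthogonal to it.
question = [1.0, 0.0]
on_topic = [[1.0, 0.0], [2.0, 0.0]]
off_topic = [[0.0, 1.0], [0.0, 3.0]]

print(answer_relevance(question, on_topic))
print(answer_relevance(question, off_topic))
```

An answer that stays on the question yields generated questions close to the original, so its score is high. An answer that drifts yields questions pointing elsewhere in embedding space, which pulls the average down.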
What a low answer relevance score tells you: Your generation model is answering a different question than the one asked, or is being verbose in ways that dilute the relevant content. Fix: refine your generation prompt to be more directive about format and focus, or add post-processing to trim off-topic content.
3. Context Precision
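Context precision measures what fraction of the retrieved chunks are actually relevant to answering the question. If you retrieve 10 chunks and only 2 of them contain information relevant to the query, your context precision is roughly 0.2. The RAGAS implementation also takes ranking into account and rewards pipelines that place relevant chunks near the top, so the reported score can differ from this simple ratio.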
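The 10-chunk example can be reproduced with a simple ratio. The sketch also includes a rank-aware variant (the mean of precision@k over the positions that hold a relevant chunk), which I believe is close to what RAGAS computes; treat that as an assumption, not a description of its exact implementation. The per-chunk relevance labels are assumed to come from a human or an LLM judge.

```python
from typing import List


def simple_precision(relevant: List[bool]) -> float:
    """Share of retrieved chunks that are relevant, ignoring order."""
    if not relevant:
        raise ValueError("need at least one retrieved chunk")
    return sum(relevant) / len(relevant)


def rank_aware_precision(relevant: List[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Ten retrieved chunks, two relevant, in two different orderings.
top_heavy = [True, True] + [False] * 8
bottom_heavy = [False] * 8 + [True, True]

print(simple_precision(top_heavy), simple_precision(bottom_heavy))
print(rank_aware_precision(top_heavy), rank_aware_precision(bottom_heavy))
```

Both orderings have the same simple ratio, but the rank-aware variant separates them. That distinction matters when the generation model pays more attention to the first chunks in its context.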
What a low context precision score tells you: Your retrieval system is returning too much irrelevant content. This is costly because it wastes context window space (and therefore money) and can confuse the generation model. Fix: tune your similarity threshold (retrieve fewer chunks with higher confidence), improve your chunking strategy, or add a reranking step after initial retrieval.
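As a rough illustration of the first fix, the sketch below keeps only chunks whose similarity score clears a threshold and then caps how many are passed to the model. The `filter_chunks` name, the 0.75 threshold, the cap of 5, and the chunk IDs are all hypothetical; suitable values depend on your embedding model and data.

```python
from typing import List, Tuple

ScoredChunk = Tuple[str, float]  # (chunk id, similarity score)


def filter_chunks(scored: List[ScoredChunk],
                  min_score: float = 0.75,
                  max_chunks: int = 5) -> List[ScoredChunk]:
    """Drop low-similarity chunks, then keep at most `max_chunks` of the best."""
    kept = [(chunk, score) for chunk, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_chunks]


# Made-up retrieval results from a vector store.
candidates = [
    ("chunk-a", 0.91),
    ("chunk-b", 0.62),
    ("chunk-c", 0.78),
    ("chunk-d", 0.40),
]
kept = filter_chunks(candidates)
print(kept)
```

Raising the threshold trades recall for precision, so it is worth re-checking context recall after any change here.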
4. Context Recall
Context recall measures whether all the relevant information needed to answer the question was actually retrieved. High recall means nothing important was missed. Low recall means relevant source documents exist in your knowledge base but your retrieval system failed to surface them.
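If you have labeled which chunks are needed to answer each test question, recall reduces to a set comparison. The sketch below takes that ID-based view, and the document IDs are made up. As far as I understand, the LLM-based RAGAS metric instead checks whether statements in a reference answer can be attributed to the retrieved context, so this is a simplified stand-in and not its actual computation.

```python
from typing import Set


def context_recall(needed_ids: Set[str], retrieved_ids: Set[str]) -> float:
    """Share of the needed chunks that the retriever actually returned."""
    if not needed_ids:
        raise ValueError("need at least one labeled relevant chunk")
    return len(needed_ids & retrieved_ids) / len(needed_ids)


# Three chunks are needed for a full answer; the retriever found two of them.
needed = {"doc-1", "doc-4", "doc-7"}
retrieved = {"doc-1", "doc-2", "doc-3", "doc-4"}
recall = context_recall(needed, retrieved)
print(f"context recall = {recall:.2f}")
```

Note that the irrelevant chunks in `retrieved` do not lower recall; they only show up in context precision.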
What a low context recall score tells you: Your retrieval system is missing relevant documents. Fix: check your embedding model (a domain-specific embedding model can noticeably improve recall for technical content), adjust chunk size (chunks that are too small or too large can both hurt recall), or try hybrid retrieval (BM25 keyword search plus dense vector search).
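Hybrid retrieval needs some way to merge the keyword ranking with the vector ranking. One common choice is reciprocal rank fusion (RRF), sketched below; it is just one option, not something the metric itself prescribes. The two input rankings are hard-coded stand-ins for real BM25 and dense search results, and `k = 60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list earn a larger contribution.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Stand-in results from the two retrievers, best match first.
bm25_ranking = ["doc-3", "doc-1", "doc-5"]
dense_ranking = ["doc-1", "doc-2", "doc-3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, while documents found by only one retriever still make it into the fused list. That second property is what helps recall.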