Precision and recall are the two fundamental metrics for any information retrieval system, including the retrieval component of a RAG pipeline. Precision measures how much of what you retrieved was actually useful. Recall measures how much of what was useful you actually retrieved. F1 is the harmonic mean of the two. The right metric to optimize depends on the consequences of different failure types in your specific application.
Precision Defined
Precision is the fraction of retrieved items that are relevant. If your retrieval system returns 10 chunks and 7 of them are relevant to the query, precision is 7/10 = 0.70.
The formula:
Precision = True Positives / (True Positives + False Positives)
Where:
- True Positives = retrieved chunks that are relevant
- False Positives = retrieved chunks that are not relevant
A precision-focused retrieval system is conservative. It only retrieves items when it is confident they are relevant, which means it may miss some relevant items but rarely wastes context window space on irrelevant ones.
Recall Defined
Recall is the fraction of all relevant items that were actually retrieved. If your knowledge base contains 20 relevant chunks for a given query and your retrieval system finds 14 of them, recall is 14/20 = 0.70.
The formula:
Recall = True Positives / (True Positives + False Negatives)
Where:
- True Positives = relevant chunks that were retrieved
- False Negatives = relevant chunks that were not retrieved
A recall-focused retrieval system is aggressive. It retrieves anything that might be relevant, which means it surfaces most of the important information but may also include many irrelevant chunks.
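The two definitions can be checked against the numbers used in the text. The snippet below is a minimal sketch; the chunk IDs are invented purely to reproduce the counts (7 of 10 retrieved chunks relevant, 14 of 20 relevant chunks found).

```python
# Hypothetical chunk IDs, chosen only to reproduce the counts in the text.

# Precision example: 10 chunks retrieved, 7 of them relevant.
retrieved = {f"chunk-{i}" for i in range(10)}
relevant_among_retrieved = {f"chunk-{i}" for i in range(7)}
precision = len(retrieved & relevant_among_retrieved) / len(retrieved)

# Recall example: 20 relevant chunks exist, 14 of them were retrieved.
all_relevant = {f"doc-{i}" for i in range(20)}
found = {f"doc-{i}" for i in range(14)}
recall = len(found & all_relevant) / len(all_relevant)

print(f"precision = {precision:.2f}")  # 0.70
print(f"recall    = {recall:.2f}")     # 0.70
```

Note that the denominators differ: precision divides by what was retrieved, recall by what is relevant in the whole knowledge base.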
The Fundamental Tradeoff
Precision and recall trade off against each other. If you lower your similarity threshold to retrieve more chunks, you will capture more relevant ones (higher recall) but also more irrelevant ones (lower precision). If you raise your threshold to retrieve only highly similar chunks, you get cleaner results (higher precision) but miss some relevant chunks (lower recall).
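A small threshold sweep makes the tradeoff concrete. This is an illustrative sketch: the scores and relevance labels are made up, and it assumes every relevant chunk for the query appears somewhere in the scored list.

```python
# Hypothetical similarity scores for one query: (chunk_id, score, is_relevant).
scored_chunks = [
    ("a", 0.92, True),
    ("b", 0.88, True),
    ("c", 0.81, False),
    ("d", 0.77, True),
    ("e", 0.70, False),
    ("f", 0.64, True),
    ("g", 0.55, False),
    ("h", 0.41, False),
]
total_relevant = sum(1 for _, _, rel in scored_chunks if rel)


def precision_recall_at_threshold(threshold):
    """Keep chunks scoring at or above the threshold, then score the result."""
    retrieved = [rel for _, score, rel in scored_chunks if score >= threshold]
    hits = sum(retrieved)  # True counts as 1
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


for threshold in (0.85, 0.75, 0.60, 0.40):
    p, r = precision_recall_at_threshold(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.85 to 0.40 raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.50.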
There is no universally "better" choice. The right operating point depends on the consequences of each failure type:
Optimize for precision when:
- Your context window is limited (irrelevant chunks crowd out relevant ones)
- False information is more dangerous than no information
- You have a reranking step that can filter the final set of chunks strictly
- Users trust your application's outputs and rarely cross-check
Optimize for recall when:
- Missing information is more dangerous than including extra noise
- Your generation model is good at ignoring irrelevant context
- You have a large context window and cost is not a constraint
- The stakes are high (medical, legal, compliance) and missing a relevant passage could be harmful
F1 Score: Balancing Both
F1 is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean punishes extreme imbalances. A system with precision 0.9 and recall 0.1 has F1 = 0.18, which correctly signals that the system is practically useless even though precision is high. A system with precision 0.7 and recall 0.7 has F1 = 0.70.
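To see why the harmonic mean is used rather than a plain average, the sketch below compares the two on the pairs mentioned above. The function name `f1_score` is just a local helper for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for p, r in [(0.9, 0.1), (0.7, 0.7)]:
    arithmetic = (p + r) / 2
    print(f"P={p}, R={r}: arithmetic mean={arithmetic:.2f}, F1={f1_score(p, r):.2f}")
```

For the lopsided pair the arithmetic mean is 0.50, which would hide the problem, while F1 is 0.18.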
Use F1 when you want a single number to compare retrieval systems and you do not have a strong prior about whether precision or recall matters more.
F-Beta Score: Weighted Tradeoff
When you do have a preference, use F-beta instead of F1:
F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall)
- beta > 1 weights recall more heavily (beta = 2 is common when recall matters more)
- beta < 1 weights precision more heavily (beta = 0.5 is common when precision matters more)
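The formula above translates directly into code. This is a minimal sketch with a hypothetical helper name, `f_beta`, and illustrative precision and recall values.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative values where recall is higher than precision.
p, r = 0.60, 0.75
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F-beta={f_beta(p, r, beta):.3f}")
```

Because recall exceeds precision here, the score rises as beta grows (about 0.625, 0.667, and 0.714), reflecting the increasing weight on recall.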
Applying These Metrics to RAG Evaluation
In a RAG context, you evaluate retrieval precision and recall at the chunk level: for a given query, which chunks in your knowledge base are relevant, and did your retrieval system return them?
Building a precision/recall evaluation requires:
- A set of evaluation queries
- For each query, a labeled set of relevant chunks (created by human annotators or an LLM-as-judge)
- Running your retrieval system on each query and recording which chunks it returned
- Calculating precision, recall, and F1 per query, then averaging
    def calculate_retrieval_metrics(retrieved_ids, relevant_ids):
        """Return precision, recall, and F1 for a single query."""
        retrieved_set = set(retrieved_ids)
        relevant_set = set(relevant_ids)

        true_positives = len(retrieved_set & relevant_set)   # relevant and retrieved
        false_positives = len(retrieved_set - relevant_set)  # retrieved but not relevant
        false_negatives = len(relevant_set - retrieved_set)  # relevant but missed

        # Guard against empty sets to avoid division by zero.
        precision = true_positives / (true_positives + false_positives) if retrieved_set else 0.0
        recall = true_positives / (true_positives + false_negatives) if relevant_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

        return {"precision": precision, "recall": recall, "f1": f1}
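The final step, averaging per-query metrics across the evaluation set, might look like the sketch below. It restates a compact per-query helper so that it runs on its own; the query keys, chunk IDs, and the name `per_query_metrics` are all hypothetical.

```python
from statistics import mean


def per_query_metrics(retrieved_ids, relevant_ids):
    """Compact per-query precision, recall, and F1 (same definitions as above)."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical evaluation set: query -> (retrieved chunk IDs, labeled relevant chunk IDs).
eval_set = {
    "q1": (["c1", "c2", "c3", "c4"], ["c1", "c2", "c9"]),
    "q2": (["c5", "c6"], ["c5", "c6", "c7", "c8"]),
    "q3": (["c10", "c11", "c12"], ["c13"]),
}

per_query = {q: per_query_metrics(ret, rel) for q, (ret, rel) in eval_set.items()}

# Macro-average: every query counts equally, regardless of how many chunks it has.
macro = {
    name: mean(m[name] for m in per_query.values())
    for name in ("precision", "recall", "f1")
}
print(macro)
```

This averages F1 per query rather than computing F1 from the averaged precision and recall; the two generally give different numbers, so it is worth stating which one a report uses.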
Precision at K
In practice, most RAG systems retrieve a fixed number of top-k chunks (often k=5 or k=10). Precision@K is precision calculated only over the top k retrieved items:
Precision@K = (Relevant items in top K) / K
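This is the most practical metric for fixed-k retrieval. RAGAS's "context precision" metric builds on it: it averages Precision@k over the ranks where relevant chunks appear, so it also rewards ranking relevant chunks near the top.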
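A minimal sketch of the Precision@K formula follows. The helper name `precision_at_k` and the chunk IDs are hypothetical, and the function assumes `ranked_ids` is already sorted from most to least similar.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked chunks that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Divide by k, per the formula, even if fewer than k chunks were returned.
    return hits / k


ranked = ["c3", "c8", "c1", "c5", "c9"]  # hypothetical ranking, best first
relevant = {"c1", "c3", "c7"}            # hypothetical relevance labels

for k in (1, 3, 5):
    print(f"Precision@{k} = {precision_at_k(ranked, relevant, k):.2f}")
```

On this toy ranking the values are 1.00, 0.67, and 0.40, which shows how the metric depends on the choice of k.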
A Real RAG Evaluation Example
Say you are building a RAG system for a company's internal documentation. You have 5,000 document chunks. For the query "What is the process for requesting a budget increase?" your knowledge base has 8 relevant chunks.
Your retrieval system returns 10 chunks. 6 of those are among the 8 relevant ones.
- Precision = 6/10 = 0.60 (40% of what you retrieved was not relevant)
- Recall = 6/8 = 0.75 (25% of the relevant content was missed)
- F1 = 2 × (0.60 × 0.75) / (0.60 + 0.75) = 0.667
Is this good? For a budget process query, where missing information could cause someone to follow the wrong procedure, a recall of 0.75 is borderline, and you should try to get it above 0.85. The 0.60 precision is acceptable if your context window can handle the noise.
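The arithmetic in this example can be reproduced with a few lines. The chunk IDs below are invented to match the counts (8 relevant, 10 retrieved, 6 in common), and the last step works out what the 0.85 recall target means for a query with only 8 relevant chunks.

```python
import math

# Hypothetical IDs that reproduce the counts in the example.
relevant = {f"budget-{i}" for i in range(8)}
retrieved = {f"budget-{i}" for i in range(6)} | {f"other-{i}" for i in range(4)}

tp = len(retrieved & relevant)
precision = tp / len(retrieved)
recall = tp / len(relevant)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")

# Smallest number of relevant chunks that must be retrieved to exceed 0.85 recall.
target = 0.85
needed = math.ceil(target * len(relevant))
print(f"need {needed} of {len(relevant)} relevant chunks (recall {needed / len(relevant):.3f})")
```

With 8 relevant chunks, clearing 0.85 means finding 7 of them, so the system in this example is one relevant chunk short of the target.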