Precision, Recall, and F1 Explained for LLM Evaluation
Precision, recall, and F1 are the foundation of retrieval evaluation. Understanding the tradeoff between them tells you whether to optimize your RAG system for fewer wrong answers or fewer missed answers.
Precision and recall are the two fundamental metrics for any information retrieval system, including the retrieval component of a RAG pipeline. Precision measures how much of what you retrieved was actually useful. Recall measures how much of what was useful you actually retrieved. F1 is the harmonic mean of the two. The right metric to optimize depends on the consequences of different failure types in your specific application.
Precision Defined
Precision is the fraction of retrieved items that are relevant. If your retrieval system returns 10 chunks and 7 of them are relevant to the query, precision is 7/10 = 0.70.
Precision = True Positives / (True Positives + False Positives)
True Positives = retrieved chunks that are relevant
False Positives = retrieved chunks that are not relevant
A precision-focused retrieval system is conservative. It only retrieves items when it is confident they are relevant, which means it may miss some relevant items but rarely wastes context window space on irrelevant ones.
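A minimal sketch of the precision calculation, assuming chunks are identified by IDs and relevance labels are available as a set. The function name `precision` and the integer IDs are illustrative, not part of any library:

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved chunk IDs that are relevant: TP / (TP + FP)."""
    if not retrieved:
        # Convention chosen for this sketch: nothing retrieved scores 0.
        return 0.0
    true_positives = len(retrieved & relevant)
    return true_positives / len(retrieved)


# Hypothetical chunk IDs: 10 retrieved, 7 of them relevant.
retrieved_ids = set(range(10))
relevant_ids = {0, 1, 2, 3, 4, 5, 6, 42, 43}
print(precision(retrieved_ids, relevant_ids))  # 0.7
```

Note that the two relevant chunks that were never retrieved (42 and 43) do not affect precision at all. Precision only looks at what came back.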
Recall Defined
Recall is the fraction of all relevant items that were actually retrieved. If your knowledge base contains 20 relevant chunks for a given query and your retrieval system finds 14 of them, recall is 14/20 = 0.70.
Recall = True Positives / (True Positives + False Negatives)
True Positives = relevant chunks that were retrieved
False Negatives = relevant chunks that were not retrieved
A recall-focused retrieval system is aggressive. It retrieves anything that might be relevant, which means it surfaces all the important information but may also include a lot of irrelevant chunks.
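The recall calculation can be sketched the same way, again assuming set-based chunk IDs. The name `recall` and the example IDs are hypothetical:

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant chunk IDs that were retrieved: TP / (TP + FN)."""
    if not relevant:
        # Convention chosen for this sketch; some evaluations skip such queries.
        return 0.0
    true_positives = len(retrieved & relevant)
    return true_positives / len(relevant)


# Hypothetical chunk IDs: 20 relevant chunks, 14 of them retrieved,
# plus 2 irrelevant chunks that were retrieved as well.
relevant_ids = set(range(20))
retrieved_ids = set(range(14)) | {100, 101}
print(recall(retrieved_ids, relevant_ids))  # 0.7
```

The two irrelevant chunks (100 and 101) lower precision but leave recall unchanged, because recall only counts how many of the relevant chunks were found.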
The Precision-Recall Tradeoff
Precision and recall trade off against each other. If you lower your similarity threshold to retrieve more chunks, you will capture more relevant ones (higher recall) but also more irrelevant ones (lower precision). If you raise your threshold to retrieve only highly similar chunks, you get cleaner results (higher precision) but miss some relevant chunks (lower recall).
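The effect of moving the threshold can be sketched with a small sweep. The similarity scores and relevance labels below are made up for illustration, and `precision_recall_at_threshold` is a hypothetical helper, not a library function:

```python
# Hypothetical (similarity score, is_relevant) pairs for a single query.
scored_chunks = [
    (0.92, True), (0.88, True), (0.81, False), (0.77, True),
    (0.70, False), (0.64, True), (0.55, False), (0.41, False),
]
total_relevant = sum(1 for _, is_relevant in scored_chunks if is_relevant)


def precision_recall_at_threshold(threshold: float) -> tuple:
    """Retrieve every chunk scoring at or above the threshold, then score it."""
    retrieved = [rel for score, rel in scored_chunks if score >= threshold]
    if not retrieved:
        return 0.0, 0.0
    hits = sum(retrieved)  # True counts as 1
    return hits / len(retrieved), hits / total_relevant


for threshold in (0.85, 0.75, 0.60, 0.40):
    p, r = precision_recall_at_threshold(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.85 to 0.40 moves precision from 1.00 down to 0.50 while recall climbs from 0.50 to 1.00.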
There is no universally "better" choice. The right operating point depends on the consequences of each failure type:
Optimize for precision when:
Your context window is limited (irrelevant chunks crowd out relevant ones)
False information is more dangerous than no information
You have a reranking step that can be very strict
Users trust your application's outputs and rarely cross-check
Optimize for recall when:
Missing information is more dangerous than including extra noise
Your generation model is good at ignoring irrelevant context
You have a large context window and cost is not a constraint
The stakes are high (medical, legal, compliance) and missing a relevant passage could be harmful
F1 Score
F1 is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean punishes extreme imbalances. A system with precision 0.9 and recall 0.1 has F1 = 0.18, which correctly signals that the system is practically useless even though precision is high. A system with precision 0.7 and recall 0.7 has F1 = 0.70.
Use F1 when you want a single number to compare retrieval systems and you do not have a strong prior about whether precision or recall matters more.
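A short sketch of the F1 calculation, using the standard harmonic-mean formula F1 = 2 × P × R / (P + R). The function name `f1_score` is chosen here for illustration:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both are zero, so F1 is defined as 0 here
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.9, 0.1), 2))  # 0.18
print(round(f1_score(0.7, 0.7), 2))  # 0.7
```

For comparison, the arithmetic mean of 0.9 and 0.1 is 0.5, which would make the lopsided system look far healthier than it is.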
F-Beta Score: Weighted Tradeoff
When you do have a preference, use F-beta instead of F1:
F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall)
beta > 1 weights recall more heavily (beta = 2 is common when recall matters more)
beta < 1 weights precision more heavily (beta = 0.5 is common when precision matters more)
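The weighting can be sketched with the standard F-beta formula, (1 + beta²) × P × R / (beta² × P + R). The function name `fbeta_score` and the example values are illustrative:

```python
def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 favors precision."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# The same system (precision 0.60, recall 0.75) under three weightings.
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {fbeta_score(0.60, 0.75, beta):.3f}")
```

Because this system has higher recall than precision, its score rises as beta increases (roughly 0.625, 0.667, and 0.714). A system with the opposite imbalance would show the reverse pattern.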
Applying These Metrics to RAG Evaluation
In a RAG context, you evaluate retrieval precision and recall at the chunk level: for a given query, which chunks in your knowledge base are relevant, and did your retrieval system return them?
Building a precision/recall evaluation requires:
A set of evaluation queries
For each query, a labeled set of relevant chunks (created by human annotators or an LLM-as-judge)
Running your retrieval system on each query and recording which chunks it returned
Calculating precision, recall, and F1 per query, then averaging
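The steps above can be sketched as a small evaluation loop. Everything here is a stand-in: the queries, the chunk IDs, and the `retrieve` function are hypothetical placeholders for your own labeled data and retriever:

```python
from statistics import mean

# Hypothetical labels: query -> IDs of the chunks judged relevant.
relevant_by_query = {
    "budget increase process": {"c1", "c2", "c3", "c4"},
    "parental leave policy": {"c7", "c8"},
}


def retrieve(query: str) -> list:
    """Stand-in for a real retriever; returns canned chunk IDs for this demo."""
    canned = {
        "budget increase process": ["c1", "c2", "c9", "c3"],
        "parental leave policy": ["c7", "c5"],
    }
    return canned[query]


def score_query(retrieved: list, relevant: set) -> tuple:
    """Return (precision, recall, F1) for one query."""
    hits = len(set(retrieved) & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Score each query, then average each metric across queries (macro average).
per_query = [score_query(retrieve(q), rel) for q, rel in relevant_by_query.items()]
macro_p, macro_r, macro_f1 = (mean(column) for column in zip(*per_query))
print(f"precision={macro_p:.3f} recall={macro_r:.3f} f1={macro_f1:.3f}")
```

This sketch averages per-query scores, so every query counts equally regardless of how many relevant chunks it has. Pooling true positives across all queries before dividing is a different design choice and can give different numbers.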
In practice, most RAG systems retrieve a fixed number of top-k chunks (often k=5 or k=10). Precision@K is precision calculated only over the top k retrieved items:
Precision@K = (Relevant items in top K) / K
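This is the most practical metric for fixed-k retrieval. RAGAS's "context precision" metric is closely related: it averages Precision@K over the ranks at which relevant chunks appear, so it also rewards placing relevant chunks near the top.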
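A minimal sketch of Precision@K over a ranked list of chunk IDs. The function name `precision_at_k` and the IDs are illustrative assumptions:

```python
def precision_at_k(ranked_ids: list, relevant: set, k: int) -> float:
    """Relevant items among the top k results, divided by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    return hits / k


# Hypothetical ranked retrieval output, best match first.
ranked = ["c4", "c9", "c1", "c7", "c2", "c8"]
relevant = {"c1", "c2", "c4"}
print(precision_at_k(ranked, relevant, 3))  # 2 of the top 3 are relevant
print(precision_at_k(ranked, relevant, 5))  # 0.6
```

This sketch always divides by k, matching the formula above, so a retriever that returns fewer than k chunks is penalized for the empty slots.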
A Real RAG Evaluation Example
Say you are building a RAG system for a company's internal documentation. You have 5,000 document chunks. For the query "What is the process for requesting a budget increase?" your knowledge base has 8 relevant chunks.
Your retrieval system returns 10 chunks. 6 of those are among the 8 relevant ones.
Precision = 6/10 = 0.60 (40% of what you retrieved was not relevant)
Recall = 6/8 = 0.75 (25% of the relevant content was missed)
F1 = 2 × (0.60 × 0.75) / (0.60 + 0.75) = 0.667
Is this good? For a budget process query where missing information could cause someone to follow the wrong procedure, a recall of 0.75 is borderline. You should try to get it above 0.85. The 0.60 precision is acceptable if your context window can handle the noise.
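The arithmetic for this example can be checked in a few lines. The variable names are illustrative, and the last step works out how many of the 8 relevant chunks the retriever would need to find to reach the 0.85 recall target:

```python
import math

retrieved_count = 10  # chunks returned by the retriever
relevant_total = 8    # relevant chunks in the knowledge base
hits = 6              # returned chunks that are relevant

precision = hits / retrieved_count
recall = hits / relevant_total
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")

# Smallest number of relevant chunks that gives recall >= 0.85.
needed_hits = math.ceil(0.85 * relevant_total)
print(needed_hits, needed_hits / relevant_total)
```

With only 8 relevant chunks, recall moves in steps of 0.125, so clearing 0.85 means finding 7 of the 8 (recall 0.875).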
Written by Mahmudul Haque Qudrati, CEO and ML Engineer at Pristren.