Precision, Recall, and F1 Explained for LLM Evaluation

Precision, recall, and F1 are the foundation of retrieval evaluation. Understanding the tradeoff between them tells you whether to optimize your RAG system for fewer wrong answers or fewer missed answers.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#precision-recall #f1-score #rag-evaluation #information-retrieval

The Fundamental Tradeoff

Precision and recall trade off against each other. If you lower your similarity threshold to retrieve more chunks, you will capture more relevant ones (higher recall) but also more irrelevant ones (lower precision). If you raise your threshold to retrieve only highly similar chunks, you get cleaner results (higher precision) but miss some relevant chunks (lower recall).
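As a minimal sketch of this tradeoff, the Python snippet below sweeps a similarity threshold over six made-up chunks. The scores, the relevance labels, and the helper name precision_recall_at_threshold are all illustrative assumptions, not output from a real retriever.

```python
# Hypothetical (similarity score, is_relevant) pairs for six chunks.
scored_chunks = [
    (0.92, True),
    (0.85, True),
    (0.78, False),
    (0.71, True),
    (0.64, False),
    (0.55, True),
]
total_relevant = sum(1 for _, relevant in scored_chunks if relevant)


def precision_recall_at_threshold(threshold):
    """Retrieve every chunk scoring at or above threshold and score the result."""
    retrieved = [relevant for score, relevant in scored_chunks if score >= threshold]
    hits = sum(retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# A strict threshold favors precision; a loose one favors recall.
for threshold in (0.9, 0.7, 0.5):
    p, r = precision_recall_at_threshold(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.9 to 0.5 moves precision from 1.00 down to 0.67 while recall climbs from 0.25 to 1.00.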

There is no universally "better" choice. The right operating point depends on the consequences of each failure type:

Optimize for precision when:

Your context window is limited (irrelevant chunks crowd out relevant ones)
False information is more dangerous than no information
You have a reranking step that can be very strict
Users trust your application's outputs and rarely cross-check

Optimize for recall when:

Missing information is more dangerous than including extra noise
Your generation model is good at ignoring irrelevant context
You have a large context window and cost is not a constraint
The stakes are high (medical, legal, compliance) and missing a relevant passage could be harmful

F1 Score: Balancing Both

F1 is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean punishes extreme imbalances. A system with precision 0.9 and recall 0.1 has F1 = 0.18, which correctly signals that the system is practically useless even though precision is high. A system with precision 0.7 and recall 0.7 has F1 = 0.70.
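To see why the harmonic mean is used rather than a plain average, the short Python sketch below compares the two on the same pairs of values. The function names f1_score and arithmetic_mean are illustrative choices, not part of any library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


for p, r in [(0.9, 0.1), (0.7, 0.7)]:
    print(f"P={p} R={r}  arithmetic={arithmetic_mean(p, r):.2f}  F1={f1_score(p, r):.2f}")
```

For the lopsided pair, the arithmetic mean reports 0.50 while F1 reports 0.18. For the balanced pair, both report 0.70.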

Use F1 when you want a single number to compare retrieval systems and you do not have a strong prior about whether precision or recall matters more.

F-Beta Score: Weighted Tradeoff

When you do have a preference, use F-beta instead of F1:

F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall)

beta > 1 weights recall more heavily (beta = 2 is common when recall matters more)
beta < 1 weights precision more heavily (beta = 0.5 is common when precision matters more)
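The formula above can be sketched in Python as follows. The function name f_beta_score and the sample values (precision 0.60, recall 0.75) are assumptions chosen for illustration.

```python
def f_beta_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative values only.
precision, recall = 0.60, 0.75
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta_score(precision, recall, beta):.3f}")
```

With these inputs, beta = 0.5 pulls the score toward the precision value (0.625), and beta = 2 pulls it toward the recall value (0.714).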

Applying These Metrics to RAG Evaluation

In a RAG context, you evaluate retrieval precision and recall at the chunk level: for a given query, which chunks in your knowledge base are relevant, and did your retrieval system return them?

Building a precision/recall evaluation requires:

A set of evaluation queries
For each query, a labeled set of relevant chunks (created by human annotators or an LLM-as-judge)
Running your retrieval system on each query and recording which chunks it returned
Calculating precision, recall, and F1 per query, then averaging

def calculate_retrieval_metrics(retrieved_ids, relevant_ids):
    """Return set-based precision, recall, and F1 for a single query."""
    # Sets ignore duplicate IDs and ranking order.
    retrieved_set = set(retrieved_ids)
    relevant_set = set(relevant_ids)

    true_positives = len(retrieved_set & relevant_set)   # relevant and retrieved
    false_positives = len(retrieved_set - relevant_set)  # retrieved but not relevant
    false_negatives = len(relevant_set - retrieved_set)  # relevant but missed

    # Fall back to 0.0 whenever a denominator would be zero.
    precision = true_positives / (true_positives + false_positives) if retrieved_set else 0.0
    recall = true_positives / (true_positives + false_negatives) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
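The last step, averaging per-query scores, could look like the sketch below. It is self-contained, so it defines its own compact per-query scorer. The names score_query and macro_average, the query strings, the chunk IDs, and the canned retrieval results are all hypothetical stand-ins for a real evaluation set and retriever.

```python
from statistics import mean


def score_query(retrieved_ids, relevant_ids):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(eval_set, retrieve):
    """Score each query on its own, then average the per-query scores."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one query")
    per_query = [
        score_query(retrieve(query), relevant_ids)
        for query, relevant_ids in eval_set.items()
    ]
    precisions, recalls, f1s = zip(*per_query)
    return {"precision": mean(precisions), "recall": mean(recalls), "f1": mean(f1s)}


# Hypothetical labels: query -> IDs of chunks judged relevant.
eval_set = {
    "budget increase process": ["c1", "c2", "c3", "c4"],
    "vacation policy": ["c7", "c8"],
}

# Hypothetical canned results standing in for a real retriever.
canned_results = {
    "budget increase process": ["c1", "c2", "c9", "c10"],
    "vacation policy": ["c7", "c8", "c11", "c12"],
}

summary = macro_average(eval_set, canned_results.get)
print(summary)
```

Averaging per query (a macro average) gives every query equal weight, regardless of how many relevant chunks it has. Pooling all hits before dividing would instead let queries with many relevant chunks dominate the result.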

Precision at K

In practice, most RAG systems retrieve a fixed number of top-k chunks (often k=5 or k=10). Precision@K is precision calculated only over the top k retrieved items:

Precision@K = (Relevant items in top K) / K

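This is the most practical metric for fixed-k retrieval. RAGAS's "context precision" metric is closely related: it averages Precision@k over the ranks at which relevant chunks appear, so it also rewards placing relevant chunks near the top.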
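A minimal Python sketch of the Precision@K formula follows. The function name precision_at_k and the chunk IDs are illustrative, and the sketch assumes the ranked list contains no duplicate IDs.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked results that are relevant.

    Divides by k even if fewer than k results were returned, matching
    the formula above.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant)
    return hits / k


# Hypothetical ranking, best match first.
ranked = ["c3", "c9", "c1", "c7", "c4"]
relevant = {"c1", "c3", "c4"}

print(f"P@3 = {precision_at_k(ranked, relevant, 3):.2f}")
print(f"P@5 = {precision_at_k(ranked, relevant, 5):.2f}")
```

Unlike the set-based precision computed earlier, this version depends on ranking order, because only the first k positions count.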

A Real RAG Evaluation Example

Say you are building a RAG system for a company's internal documentation. You have 5,000 document chunks. For the query "What is the process for requesting a budget increase?" your knowledge base has 8 relevant chunks.

Your retrieval system returns 10 chunks. 6 of those are among the 8 relevant ones.

Precision = 6/10 = 0.60 (40% of what you retrieved was not relevant)
Recall = 6/8 = 0.75 (25% of the relevant content was missed)
F1 = 2 × (0.60 × 0.75) / (0.60 + 0.75) = 0.667

Is this good? For a budget process query, where missing information could cause someone to follow the wrong procedure, a recall of 0.75 is borderline; aim for above 0.85. A precision of 0.60 is acceptable if your context window and generation model can tolerate the noise.
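The arithmetic in this example can be checked with a few lines of Python. The integer chunk IDs are made up: 0 through 7 stand in for the 8 relevant chunks, and 100 through 103 stand in for the 4 irrelevant chunks that were retrieved.

```python
# Hypothetical IDs for the 8 relevant chunks.
relevant_ids = set(range(8))
# 10 retrieved chunks: 6 relevant (0-5) plus 4 irrelevant (100-103).
retrieved_ids = set(range(6)) | {100, 101, 102, 103}

true_positives = len(retrieved_ids & relevant_ids)
precision = true_positives / len(retrieved_ids)
recall = true_positives / len(relevant_ids)
f1 = 2 * precision * recall / (precision + recall)

# Compare against the 0.85 recall target suggested in the text.
meets_recall_target = recall >= 0.85

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
print(f"meets recall target: {meets_recall_target}")
```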

Keep Reading

RAGAS: The Standard Framework for Evaluating RAG Systems - How RAGAS operationalizes these metrics into a complete RAG eval framework.
How to Evaluate LLMs: The Complete Guide - Broader framework for evaluating all components of an LLM application.
Building an LLM Eval From Zero - Practical guide to implementing evaluation for your specific application.
