Precision, Recall, and F1 Explained for LLM Evaluation

Precision, recall, and F1 are the foundation of retrieval evaluation. Understanding the tradeoff between them tells you whether to optimize your RAG system for fewer wrong answers or fewer missed answers.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#precision-recall #f1-score #rag-evaluation #information-retrieval


Precision and recall are the two fundamental metrics for any information retrieval system, including the retrieval component of a RAG pipeline. Precision measures how much of what you retrieved was actually useful. Recall measures how much of what was useful you actually retrieved. F1 is the harmonic mean of the two. The right metric to optimize depends on the consequences of different failure types in your specific application.

Precision Defined

Precision is the fraction of retrieved items that are relevant. If your retrieval system returns 10 chunks and 7 of them are relevant to the query, precision is 7/10 = 0.70.

The formula:

Precision = True Positives / (True Positives + False Positives)

Where:

True Positives = retrieved chunks that are relevant
False Positives = retrieved chunks that are not relevant

A precision-focused retrieval system is conservative. It only retrieves items when it is confident they are relevant, which means it may miss some relevant items but rarely wastes context window space on irrelevant ones.

Recall Defined

Recall is the fraction of all relevant items that were actually retrieved. If your knowledge base contains 20 relevant chunks for a given query and your retrieval system finds 14 of them, recall is 14/20 = 0.70.

The formula:

Recall = True Positives / (True Positives + False Negatives)

Where:

True Positives = relevant chunks that were retrieved
False Negatives = relevant chunks that were not retrieved

A recall-focused retrieval system is aggressive. It retrieves anything that might be relevant, which means it surfaces most of the important information but may also include many irrelevant chunks.

The Fundamental Tradeoff

Precision and recall trade off against each other. If you lower your similarity threshold to retrieve more chunks, you will capture more relevant ones (higher recall) but also more irrelevant ones (lower precision). If you raise your threshold to retrieve only highly similar chunks, you get cleaner results (higher precision) but miss some relevant documents (lower recall).
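To make the threshold effect concrete, here is a minimal sketch that sweeps a similarity cutoff over one query's scored chunks. The scores, the relevance labels, and the function name precision_recall_at_threshold are all hypothetical, and the sketch assumes every relevant chunk for the query appears somewhere in the scored list.

```python
# Hypothetical (similarity score, is_relevant) pairs for a single query.
scored_chunks = [
    (0.92, True), (0.88, True), (0.81, False), (0.77, True),
    (0.70, False), (0.64, True), (0.58, False), (0.51, False),
]
# Assumption: all relevant chunks for the query are in the list above.
total_relevant = sum(1 for _, is_relevant in scored_chunks if is_relevant)


def precision_recall_at_threshold(scored, threshold, total_relevant):
    """Keep chunks scoring at or above the threshold, then score that set."""
    retrieved = [is_relevant for score, is_relevant in scored if score >= threshold]
    true_positives = sum(retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


for threshold in (0.85, 0.75, 0.60, 0.50):
    p, r = precision_recall_at_threshold(scored_chunks, threshold, total_relevant)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the cutoff from 0.85 to 0.50 raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.50.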

There is no universally "better" choice. The right operating point depends on the consequences of each failure type:

Optimize for precision when:

Your context window is limited (irrelevant chunks crowd out relevant ones)
False information is more dangerous than no information
You have a reranking step that can apply a strict relevance cutoff
Users trust your application's outputs and rarely cross-check

Optimize for recall when:

Missing information is more dangerous than including extra noise
Your generation model is good at ignoring irrelevant context
You have a large context window and cost is not a constraint
The stakes are high (medical, legal, compliance) and missing a relevant passage could be harmful

F1 Score: Balancing Both

F1 is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean punishes extreme imbalances. A system with precision 0.9 and recall 0.1 has F1 = 0.18, which correctly signals that the system is practically useless even though precision is high. A system with precision 0.7 and recall 0.7 has F1 = 0.70.
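A short sketch can show why the harmonic mean is used rather than a plain average. The helper name f1_score is an assumption made for illustration; the two precision/recall pairs are the ones from the paragraph above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for p, r in [(0.9, 0.1), (0.7, 0.7)]:
    arithmetic_mean = (p + r) / 2
    print(f"P={p} R={r}  arithmetic mean={arithmetic_mean:.2f}  F1={f1_score(p, r):.2f}")
```

For the imbalanced pair, the arithmetic mean is 0.50, which looks respectable, while F1 is 0.18. For the balanced pair, both means agree at 0.70.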

Use F1 when you want a single number to compare retrieval systems and you do not have a strong prior about whether precision or recall matters more.

F-Beta Score: Weighted Tradeoff

When you do have a preference, use F-beta instead of F1:

F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall)

beta > 1 weights recall more heavily (beta = 2 is common when recall matters more)
beta < 1 weights precision more heavily (beta = 0.5 is common when precision matters more)
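The formula above translates directly into code. This is a minimal sketch; the function name f_beta and the zero-denominator guard are choices made here for illustration, and the precision and recall values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative values: precision 0.60, recall 0.75.
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(0.60, 0.75, beta):.3f}")
```

Because recall is the higher of the two values here, the score rises as beta increases: roughly 0.625 at beta = 0.5, 0.667 at beta = 1, and 0.714 at beta = 2.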

Applying These Metrics to RAG Evaluation

In a RAG context, you evaluate retrieval precision and recall at the chunk level: for a given query, which chunks in your knowledge base are relevant, and did your retrieval system return them?

Building a precision/recall evaluation requires:

A set of evaluation queries
For each query, a labeled set of relevant chunks (created by human annotators or an LLM-as-judge)
Running your retrieval system on each query and recording which chunks it returned
Calculating precision, recall, and F1 per query, then averaging

def calculate_retrieval_metrics(retrieved_ids, relevant_ids):
    """Return precision, recall, and F1 for one query, given chunk IDs."""
    # Compare as sets so duplicate IDs are counted once.
    retrieved_set = set(retrieved_ids)
    relevant_set = set(relevant_ids)

    true_positives = len(retrieved_set & relevant_set)   # relevant and retrieved
    false_positives = len(retrieved_set - relevant_set)  # retrieved but not relevant
    false_negatives = len(relevant_set - retrieved_set)  # relevant but missed

    # Guard against division by zero when nothing was retrieved or nothing is relevant.
    precision = true_positives / (true_positives + false_positives) if retrieved_set else 0.0
    recall = true_positives / (true_positives + false_negatives) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
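The last step in the list, averaging per-query scores, might look like the sketch below. The names query_metrics and evaluate_retriever, the query strings, and the chunk IDs are all hypothetical, and the retriever is stubbed with a dictionary lookup. The per-query scoring is repeated in compact form so the block runs on its own.

```python
from statistics import mean


def query_metrics(retrieved_ids, relevant_ids):
    """Precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def evaluate_retriever(retrieve, labeled_queries):
    """Average each metric over all labeled queries (every query weighs the same)."""
    per_query = [
        query_metrics(retrieve(query), relevant_ids)
        for query, relevant_ids in labeled_queries.items()
    ]
    return {name: mean(m[name] for m in per_query) for name in ("precision", "recall", "f1")}


# Hypothetical labels and a stubbed retriever, for illustration only.
labeled_queries = {
    "budget increase process": {"c1", "c2", "c3"},
    "vacation policy": {"c7", "c8"},
}
stub_results = {
    "budget increase process": ["c1", "c2", "c9", "c10"],
    "vacation policy": ["c7", "c8"],
}
summary = evaluate_retriever(lambda query: stub_results[query], labeled_queries)
print(summary)
```

Averaging per query this way gives every query equal weight regardless of how many relevant chunks it has. Note that the averaged F1 is the mean of the per-query F1 values, which is generally not the same as the F1 computed from the averaged precision and recall.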

Precision at K

In practice, most RAG systems retrieve a fixed number of top-k chunks (often k=5 or k=10). Precision@K is precision calculated only over the top k retrieved items:

Precision@K = (Relevant items in top K) / K

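This is the most practical metric for fixed-k retrieval. RAGAS's "context precision" metric builds on Precision@K: it averages Precision@k over the ranks at which relevant chunks appear, so it also rewards placing relevant chunks higher in the list.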
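A minimal sketch of Precision@K follows. The function name precision_at_k and the chunk IDs are hypothetical, and the sketch assumes the ranked list contains no duplicate IDs. It divides by K, as in the formula above, even when fewer than K items were returned.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    return hits / k


# Hypothetical ranking, best match first, and hypothetical relevance labels.
ranked = ["c3", "c9", "c1", "c4", "c7"]
relevant = {"c1", "c3", "c5"}

for k in (1, 3, 5):
    print(f"Precision@{k} = {precision_at_k(ranked, relevant, k):.2f}")
```

Since the denominator is fixed at K, a query with fewer than K relevant chunks in the whole knowledge base can never reach a Precision@K of 1.0, which is worth keeping in mind when comparing scores across queries.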

A Real RAG Evaluation Example

Say you are building a RAG system for a company's internal documentation. You have 5,000 document chunks. For the query "What is the process for requesting a budget increase?" your knowledge base has 8 relevant chunks.

Your retrieval system returns 10 chunks. 6 of those are among the 8 relevant ones.

Precision = 6/10 = 0.60 (40% of what you retrieved was not relevant)
Recall = 6/8 = 0.75 (25% of the relevant content was missed)
F1 = 2 × (0.60 × 0.75) / (0.60 + 0.75) = 0.667

Is this good? For a budget process query, where missing information could cause someone to follow the wrong procedure, a recall of 0.75 is borderline, and you should try to get it above 0.85. The 0.60 precision is acceptable if your context window can handle the noise.
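The arithmetic in this example can be checked with a few lines. The counts come from the example itself; the variable names are chosen here, and the final step, working out how many of the 8 relevant chunks must be retrieved to clear a 0.85 recall target, is an added illustration rather than part of the original walkthrough.

```python
import math

retrieved_count = 10     # chunks returned by the retriever
relevant_total = 8       # relevant chunks in the knowledge base
relevant_retrieved = 6   # returned chunks that are relevant (true positives)

false_positives = retrieved_count - relevant_retrieved  # irrelevant chunks returned
false_negatives = relevant_total - relevant_retrieved   # relevant chunks missed

precision = relevant_retrieved / retrieved_count
recall = relevant_retrieved / relevant_total
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.3f}")

# Smallest number of relevant chunks that must be retrieved to reach the target.
recall_target = 0.85
needed = math.ceil(recall_target * relevant_total)
print(f"need {needed} of {relevant_total} relevant chunks, recall={needed / relevant_total:.3f}")
```

With only 8 relevant chunks, recall moves in steps of 0.125, so clearing 0.85 means retrieving 7 of the 8, one more than the current 6.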

Keep Reading

RAGAS: The Standard Framework for Evaluating RAG Systems — How RAGAS operationalizes these metrics into a complete RAG eval framework.
How to Evaluate LLMs: The Complete Guide — Broader framework for evaluating all components of an LLM application.
Building an LLM Eval From Zero — Practical guide to implementing evaluation for your specific application.
