The right way to evaluate an LLM for your use case is: define the task precisely, create 50-100 representative test inputs, define a scoring criterion for success, run it across the models you are considering, and track it over time as you iterate. Industry benchmarks like MMLU and HumanEval tell you something about a model's general capability, but they do not tell you whether it will work for your specific application. Vibes ("I played with it for 20 minutes and it felt good") do not scale and are unreproducible. The only evaluation worth making decisions on is one you can rerun.
This guide covers how to read benchmark scores correctly, why they are insufficient on their own, and how to build evaluations specific to your application.
Section 1: Major Benchmarks Explained
Before building your own evals, it helps to understand what industry benchmarks actually measure and where their limits are.
MMLU (Massive Multitask Language Understanding)
What it tests: 57 academic subjects from elementary school to professional level, covering STEM, humanities, social sciences, and other domains. Questions are multiple choice (4 options).
How it scores: Percentage of questions answered correctly. Higher is better. Random chance is 25%.
Current top scores:
- GPT-4o: ~88.7% (OpenAI technical report, GPT-4o, 2024)
- Claude 3.5 Sonnet: ~88.7% (Anthropic model card, Claude 3.5, 2024)
- Gemini 1.5 Pro: ~85.9% (Google Gemini technical report, 2024)
What it actually tells you: Breadth of knowledge across academic domains. Good for measuring whether a model knows facts.
What it does not tell you: Whether the model can reason, write well, follow instructions, or do anything useful in a real application. A model can score 85% on MMLU and still produce useless outputs for your specific task.
The benchmark's weakness: Multiple choice format means models can sometimes answer correctly for the wrong reasons (statistical patterns in answer distributions). Also, MMLU has been publicly available long enough that training data contamination is a legitimate concern.
HumanEval
What it tests: 164 Python programming problems. Each problem provides a function signature and docstring; the model must generate code that passes the tests.
How it scores: pass@k, the probability that at least one of k sampled solutions to a problem passes all of its tests, averaged over all problems. pass@1 (one attempt per problem) is the standard reported metric.
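To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The function names are illustrative, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures means every size-k draw contains a passing sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k; results holds (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k=1 this reduces to the fraction of samples that pass, which is why pass@1 reads like a plain accuracy number.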
Current top scores:
- GPT-4o: ~90% pass@1 (OpenAI technical report, 2024)
- Claude 3.5 Sonnet: ~92% pass@1 (Anthropic model card, 2024)
- DeepSeek R1: ~91% pass@1 (DeepSeek technical report, 2025)
What it actually tells you: Can the model generate Python code that passes unit tests for standard programming problems?
What it does not tell you: Whether the model can handle real-world coding tasks: understanding existing codebases, writing tests for non-trivial logic, refactoring, debugging production issues. HumanEval's 164 problems are relatively standard — the kind of things that appear in coding interviews. Real software engineering involves far more complexity.
Better alternative for coding evaluation: SWE-Bench Verified, a set of 500 real GitHub issues from popular Python repositories. Models must actually fix the bug or implement the feature. Claude 3.5 Sonnet achieves ~49% on SWE-Bench Verified (SWE-Bench leaderboard, May 2026). That number is a more realistic picture of what AI models can do on real engineering work than a HumanEval score.
LMSYS Chatbot Arena Elo
What it tests: Human preferences. Real users have conversations with two anonymous models simultaneously and vote for which response they preferred.
How it scores: Elo rating, the same system used in chess rankings. Higher is better, with scores roughly in the 1000-1400 range for current top models.
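For intuition about what a rating gap means, here is a sketch of the textbook Elo formulas. It is not the leaderboard's actual fitting code, and the update step size k=32 is an assumption borrowed from common chess practice.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return new ratings after one vote.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Under this model a 100-point gap corresponds to the higher-rated model being preferred in roughly 64% of head-to-head votes, so small gaps near the top of the leaderboard represent close to coin-flip preferences.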
Current top positions: GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro cluster near the top, with GPT-4o typically in the top 3 (LMSYS Chatbot Arena, May 2026).
What it actually tells you: What real users prefer in open-ended conversations. For general chat, this is the most realistic of the benchmarks covered here because it involves actual humans with actual tasks, not curated academic questions.
What it does not tell you: How a model performs on your specific narrow task. A model that performs well on open-ended chat may underperform on a specialized domain task. Also, Chatbot Arena voters are self-selected — they skew toward technical users with specific preferences.
The benchmark's weakness: Slow to update (new models take weeks to accumulate enough votes), and the conversation topics are whatever random users bring. This makes it harder to isolate specific capabilities.
GSM8K (Grade School Math 8K)
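What it tests: About 8,500 multi-step math word problems at roughly 6th-grade math level, split into roughly 7,500 training problems and 1,300 test problems. Each problem requires 2-8 reasoning steps to solve. Scores are reported on the test split.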
How it scores: Percentage of problems answered correctly.
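In practice, "answered correctly" means comparing final numbers. Reference solutions in the public GSM8K dataset end with a line of the form "#### 18". The sketch below assumes the model's final answer is the last number in its output, which is a common but imperfect heuristic; the helper names are illustrative.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators like 1,250.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output: str, reference_solution: str) -> bool:
    """Compare the model's last number with the value after '####'."""
    gold = last_number(reference_solution.split("####")[-1])
    predicted = last_number(model_output)
    if gold is None or predicted is None:
        return False
    return abs(gold - predicted) < 1e-6


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (model_output, reference_solution) pairs judged correct."""
    return sum(is_correct(out, ref) for out, ref in pairs) / len(pairs)
```

Answer extraction is itself a source of score differences between reports: a stricter or looser parser can move a model's GSM8K number without any change in the model.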
Current top scores (model technical reports and leaderboard data, 2024-2025):
- GPT-4o: ~95%
- Claude 3.5 Sonnet: ~97%
- Gemini 1.5 Pro: ~91%
What it actually tells you: Can the model follow multi-step arithmetic reasoning chains and produce a correct final answer? GSM8K is primarily a test of chained reasoning, not complex mathematics.
What it does not tell you: Performance on genuinely hard math (MATH benchmark), symbolic reasoning, or real-world quantitative problems with ambiguous setups.
SWE-Bench Verified
What it tests: 500 verified real-world GitHub issues from popular Python open source repositories (Django, Flask, Requests, etc.). Models must produce a code patch that resolves the issue.
How it scores: Percentage of issues resolved. An issue counts as resolved when the submitted patch makes the tests tied to that issue pass without breaking tests that passed before the change.
Current top scores: Claude 3.5 Sonnet ~49%, GPT-4o ~31% (SWE-Bench leaderboard, May 2026). These numbers are for the best AI agent approaches, not single-inference results.
Why this benchmark matters: Unlike HumanEval's synthetic problems, SWE-Bench uses real issues from production code. Solving a Django bug requires understanding the existing codebase, the bug report, the code structure, and how to write a fix that does not break existing tests. This is far closer to actual software engineering.
The benchmark's weakness: Scoring requires running the full test suite, which is computationally expensive and slow. Results depend heavily on the scaffolding (how the AI is prompted and given tools), not just the base model quality.
Section 2: Why Benchmarks Alone Are Misleading
Train-Test Contamination
Modern LLMs are trained on massive web crawls. MMLU, HumanEval, and GSM8K have been public for years. It is likely that many test problems appear verbatim or near-verbatim in training data. A model that "knows" the answer from training data scores higher than a model that must reason to the answer, without that difference reflecting genuine capability.
This is not a solved problem. Benchmark creators attempt to detect contamination, but it is difficult to verify completely. When a new model claims a 95% MMLU score, that number is less meaningful than it appears because the contamination question is never fully resolved.
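One common style of contamination check is n-gram overlap between a test item and a reference corpus. The sketch below is a simplified illustration of that idea. It assumes you have text you can inspect (for example, a fine-tuning set you control), which is usually not true for a third-party model's pretraining data, and the window size n=8 is an arbitrary assumption.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams in the text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the corpus.

    Values near 1.0 suggest the item appears almost verbatim.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)
```

A check like this only catches verbatim or near-verbatim copies. Paraphrased or translated test items pass straight through, which is part of why contamination is hard to rule out.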
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." AI labs optimize their models for benchmark performance as part of the training process. RLHF and post-training techniques can improve benchmark scores without improving real-world task performance. A model fine-tuned specifically to do well on MMLU may score 90% but perform worse than a model scoring 85% on tasks your users actually care about.
The Domain Mismatch Problem
MMLU measures academic knowledge. HumanEval measures Python coding on synthetic problems. Neither tells you how a model performs on:
- Customer support conversations in your specific domain
- Legal document summarization
- Medical record extraction
- Financial analysis
- Your application's unique prompt templates and response requirements
The only way to know how a model performs on your task is to test it on your task.
Section 3: Building Your Own Eval
Step 1: Define the Task Precisely
Vague tasks produce vague evals. Before writing a single test case, write a one-paragraph definition of exactly what success looks like.
Bad definition: "The model should answer customer questions helpfully."
Good definition: "Given a customer question about a product return, the model should: (1) identify whether the return is within our 30-day policy, (2) state the correct answer (eligible/ineligible) in the first sentence, (3) explain the reason in 1-2 sentences, and (4) include the next step the customer should take. The response should be under 100 words."
The good definition is testable. You can write an automated scoring function against it.
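As a rough sketch of what such a scoring function could look like for the return-policy definition: the keyword checks are deliberately crude, and the phrase list for detecting a "next step" is a hypothetical placeholder you would replace with wording from your own support playbook.

```python
import re

# Hypothetical phrases that signal a next step; adjust for your own product.
NEXT_STEP_HINTS = ("please", "you can", "to start", "contact", "visit")


def score_return_response(response: str, expected: str) -> dict:
    """Check a response against the four requirements.

    expected is 'eligible' or 'ineligible'.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    first = sentences[0].lower() if sentences else ""
    # Test the negative forms first because 'eligible' is a substring of both.
    if "ineligible" in first or "not eligible" in first:
        stated = "ineligible"
    elif "eligible" in first:
        stated = "eligible"
    else:
        stated = None
    rest = " ".join(sentences[1:]).lower()
    checks = {
        "verdict_in_first_sentence": stated is not None,
        "verdict_correct": stated == expected,
        "has_explanation": len(sentences) >= 2,
        "mentions_next_step": any(hint in rest for hint in NEXT_STEP_HINTS),
        "under_100_words": len(response.split()) < 100,
    }
    checks["passed"] = all(checks.values())
    return checks
```

Returning the individual checks, not just a single pass/fail flag, makes it easier to see which requirement a model tends to miss.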
Step 2: Create Test Cases
Collect 50-100 representative inputs. For a customer support application, this means real customer questions from your history (anonymized if needed). For a coding assistant, real code review requests from your team. Do not make up examples — use real inputs from your actual use case.
Your test set should include:
- Typical inputs that represent the common case
- Edge cases (ambiguous inputs, unusual requests, borderline situations)
- Hard cases where the correct answer is not obvious
- Adversarial inputs if your application is public-facing
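One way to keep that mix honest is to tag every test case with its category and check coverage when loading the set. The sketch below assumes a JSONL file with one case per line; the field names and category labels are illustrative choices, not a standard format.

```python
import json
from collections import Counter
from dataclasses import dataclass

CATEGORIES = {"typical", "edge", "hard", "adversarial"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected: str
    category: str


def parse_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per line; keys must match the EvalCase fields."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {record['category']!r}")
        cases.append(EvalCase(**record))
    return cases


def coverage(cases: list[EvalCase]) -> Counter:
    """Number of cases per category."""
    return Counter(case.category for case in cases)


def missing_categories(cases: list[EvalCase]) -> set[str]:
    """Categories with no cases at all."""
    return CATEGORIES - set(coverage(cases))
```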
Step 3: Define Scoring
Exact match: For tasks with a single correct answer (classification, extraction). Did the model produce the exact correct output?
Rubric scoring: For tasks with subjective quality (writing, explanation, analysis). Define 3-5 criteria and score each 1-5. Sum to get a total score. Have human scorers apply the same rubric to a sample of outputs to establish a baseline. This takes more time but captures quality nuances that exact match misses.
LM-as-judge: Use a separate LLM call to score the output against your criteria. Faster than human evaluation and scalable. Has its own biases (LLMs tend to prefer outputs from models in the same family).
Pass/fail against requirements: For structured outputs. Does the JSON parse? Does it contain the required fields? Are the values in the expected format?
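The three deterministic methods are short enough to sketch directly. Two assumptions to note: exact match here ignores case and surrounding whitespace, which is a loosening you may not want, and the structured-output check only verifies that fields exist with the expected Python type.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Exact match after trimming whitespace and lowercasing (an assumption)."""
    return output.strip().lower() == expected.strip().lower()


def passes_json_requirements(output: str, required_fields: dict[str, type]) -> bool:
    """True if the output parses as a JSON object with the required typed fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(data.get(name), kind) for name, kind in required_fields.items()
    )


def rubric_total(scores: dict[str, int], low: int = 1, high: int = 5) -> int:
    """Sum per-criterion rubric scores, rejecting values outside the scale."""
    for criterion, score in scores.items():
        if not low <= score <= high:
            raise ValueError(f"{criterion} score {score} is outside {low}-{high}")
    return sum(scores.values())
```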
Step 4: Run Across Models
Run your eval set against all candidate models. Record:
- Score per test case
- Average score across the eval set
- Score on edge cases specifically
- Latency per request
- Cost per request
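A minimal harness for recording those numbers might look like the sketch below. It assumes each model is wrapped in a function that takes the input text and returns the output plus the request cost in USD; that wrapper interface is hypothetical, and you would implement it with each provider's SDK and pricing.

```python
import time
from statistics import mean
from typing import Callable

# Hypothetical wrapper signature: input text -> (output text, cost in USD).
ModelFn = Callable[[str], tuple[str, float]]


def run_eval(
    models: dict[str, ModelFn],
    cases: list[dict],
    scorer: Callable[[str, str], float],
) -> dict[str, dict]:
    """Run every case against every model and collect per-case and summary stats."""
    report = {}
    for name, call_model in models.items():
        rows = []
        for case in cases:
            start = time.perf_counter()
            output, cost = call_model(case["input"])
            latency = time.perf_counter() - start
            rows.append({
                "id": case["id"],
                "category": case["category"],
                "score": scorer(output, case["expected"]),
                "latency_s": latency,
                "cost_usd": cost,
            })
        edge_scores = [r["score"] for r in rows if r["category"] == "edge"]
        report[name] = {
            "rows": rows,
            "mean_score": mean(r["score"] for r in rows),
            "edge_mean_score": mean(edge_scores) if edge_scores else None,
            "mean_latency_s": mean(r["latency_s"] for r in rows),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
        }
    return report
```

Keeping the per-case rows alongside the averages matters later: two models with the same mean score can fail on entirely different cases.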
Step 5: Track Over Time
Evals are not one-time activities. Run your eval set when:
- Considering switching to a new model
- Updating your prompts
- A provider releases a new model version
- You add new features that change the prompt structure
Regressions happen silently. A prompt change that improves one aspect may degrade another. Your eval set is the only systematic way to catch this.
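To see why averages alone miss this, compare two runs case by case. The sketch below assumes each run is stored as a mapping from test case ID to score; the function name and the tolerance parameter are illustrative.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict:
    """Compare per-case scores between two runs over the cases they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no test cases in common")
    regressed = [c for c in shared if candidate[c] < baseline[c] - tolerance]
    improved = [c for c in shared if candidate[c] > baseline[c] + tolerance]
    mean_delta = sum(candidate[c] - baseline[c] for c in shared) / len(shared)
    return {"regressed": regressed, "improved": improved, "mean_delta": mean_delta}
```

For example, if a prompt change fixes case "c" but breaks case "b", the mean score does not move at all, yet the per-case comparison flags "b" as a regression.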
Section 4: LM-as-Judge
Using an LLM to evaluate another LLM's outputs is practical at scale but has known biases.
How it works:
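import json

import anthropic


def judge_response(question: str, response: str, criteria: str) -> dict:
    """Use Claude to evaluate another model's response."""
    # The client reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()
    judge_prompt = f"""
    You are evaluating an AI assistant's response to a customer question.
    Question: {question}
    Response: {response}
    Evaluate the response against these criteria:
    {criteria}
    Score each criterion 1-5 and provide brief reasoning.
    Output as JSON: {{"criterion_name": {{"score": X, "reasoning": "..."}}}}
    """
    result = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    text = result.content[0].text
    # The judge may wrap the JSON in prose or a code fence, so parse only
    # the outermost {...} span. json.loads still raises on malformed JSON.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"Judge returned no JSON object: {text!r}")
    return json.loads(text[start : end + 1])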
Known biases in LM-as-judge:
Length bias: LLMs tend to rate longer responses higher, even when longer does not mean better. Control for this by including length as an explicit criterion with both a minimum and maximum.
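Self-preference bias: Judge models tend to rate outputs from their own model family higher. A Claude judge tends to favor Claude outputs, and a GPT judge tends to favor GPT outputs. For high-stakes evals, use a judge from a different model family than the model being evaluated.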
Position bias: When shown two responses side-by-side, LLMs tend to favor one position, often the first. Control for this by alternating which response appears first.
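A stricter variant of alternating the order is to judge every pair in both orders and only count a preference when the two verdicts agree. The sketch below assumes a pairwise judge function that returns "first" or "second"; that interface is hypothetical and would wrap your own judge prompt.

```python
from typing import Callable

# Hypothetical judge signature: (question, first, second) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    judge: PairwiseJudge, question: str, response_a: str, response_b: str
) -> str:
    """Return 'a', 'b', or 'inconsistent' after judging in both orders."""
    forward = judge(question, response_a, response_b)  # A shown first
    reverse = judge(question, response_b, response_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_reverse = reverse == "second"
    if a_wins_forward and a_wins_reverse:
        return "a"
    if not a_wins_forward and not a_wins_reverse:
        return "b"
    # The verdict flipped with the order, so position drove the result.
    return "inconsistent"
```

This doubles the number of judge calls. The share of "inconsistent" results is a useful number in its own right, since it estimates how often position alone decides the outcome.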
Despite these biases, LM-as-judge is genuinely useful for quick iteration during prompt development. For final model selection decisions, supplement with human evaluation.
Section 5: Eval Frameworks and Tools
PromptFoo: Open source evaluation framework for LLM outputs. Write test cases in YAML, define scoring criteria, run evaluations across multiple models. Good for systematic prompt testing.
Braintrust: Managed evaluation platform. Tracks eval results over time, provides a UI for viewing and annotating outputs, supports A/B testing of prompts and models. Good for teams that want infrastructure without building it themselves.
LangSmith: Tracing and evaluation platform from LangChain. Integrates well with LangChain-based applications. Provides eval datasets, scoring, and experiment tracking.
Weave (W&B): Weights & Biases' LLM evaluation product. Good if your team is already using W&B for ML experiment tracking.
For most teams starting out, PromptFoo covers most needs at no cost. For teams that want a managed platform and better collaboration features, Braintrust is the most polished option.