The most useful LLM evaluation is the one you build specifically for your task, not the one you borrow from a benchmark paper. Start by collecting real examples from production, defining what "good" means for your specific outputs, and building a test harness that runs in under five minutes. A well-designed task-specific eval will catch regressions that general benchmarks like MMLU and HumanEval completely miss.
Why Generic Benchmarks Are Not Enough
Generic benchmarks measure broad, general-purpose capabilities, and model developers often tune for them. MMLU measures broad academic knowledge through multiple-choice questions. HumanEval measures Python coding on short, self-contained problems. Chatbot Arena measures general conversational preference. None of these tells you whether your specific application works.
If you are building a customer support bot, you need to know whether it answers your company's product questions accurately. If you are building a code generation tool, you need to know whether the generated code passes your test suite. Generic benchmarks tell you nothing about this.
The first step in building your eval is to stop looking at general benchmarks and ask: what does failure look like in my specific application?
Step 1: Define the Task Precisely
Before writing a single test case, write one paragraph that precisely defines what your application is supposed to do and what a successful output looks like. If you cannot write this paragraph, your eval will be useless.
Example of a vague definition: "The chatbot should give helpful responses to user questions."
Example of a precise definition: "Given a user question about our SaaS product's billing, the chatbot should: (1) answer the specific question asked without adding irrelevant information, (2) cite the correct pricing tier when pricing is mentioned, (3) recommend contacting support for account-specific issues it cannot answer from documentation, and (4) never make up features or pricing that do not exist."
The precise definition gives you four concrete things to test. The vague definition gives you nothing.
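As a rough illustration, criteria (2) and (3) from the precise definition can be turned into mechanical checks. This is a minimal sketch, not a complete implementation: the field names `expected_tier` and `needs_support` are hypothetical annotations you would add to each test case by hand, and the phrase "contact support" is an assumed wording. Criteria (1) and (4) usually need a human or a judge model rather than a string check.

```python
# Hypothetical per-case annotations: "expected_tier" and "needs_support".

def cites_expected_tier(case, answer):
    """Criterion 2: when a pricing tier is expected, the answer must name it."""
    tier = case.get("expected_tier")
    return tier is None or tier.lower() in answer.lower()

def escalates_when_required(case, answer):
    """Criterion 3: account-specific questions must point the user to support."""
    if not case.get("needs_support", False):
        return True
    # Assumed wording; adjust to however your bot phrases an escalation.
    return "contact support" in answer.lower()

CRITERIA = {
    "cites_expected_tier": cites_expected_tier,
    "escalates_when_required": escalates_when_required,
}

def check_criteria(case, answer):
    """Return a per-criterion pass/fail report for one answer."""
    return {name: check(case, answer) for name, check in CRITERIA.items()}
```

A per-criterion report like this is more useful than a single score, because it tells you which part of the definition a failing answer violated.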
Step 2: Collect Real Test Cases
The most valuable test cases are ones that came from real production usage. Start logging model inputs and outputs in production from day one. When users report problems, save those examples as test cases. When the model fails in an interesting way during manual testing, save that too.
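Logging can be as simple as appending one JSON object per call to a file. The sketch below assumes a JSON Lines log and uses hypothetical helper names (`log_interaction`, `to_test_case`); a real deployment would more likely write to your existing logging or tracing system.

```python
import json
import time
from pathlib import Path

def log_interaction(log_path, model_input, model_output, metadata=None):
    """Append one model call to a JSON Lines file and return the record."""
    record = {
        "timestamp": time.time(),
        "input": model_input,
        "output": model_output,
        "metadata": metadata or {},
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

def to_test_case(record, expected):
    """Turn a logged call into a test case once someone has written
    down what the correct output should have been."""
    return {"input": record["input"], "expected": expected, "source": "production"}
```

The `expected` value still has to come from a person: the logged output is what the model said, not necessarily what it should have said.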
For a new application where you have no production data yet:
- Write 20-30 examples that represent the "happy path" — the most common, straightforward inputs your users will send.
- Write 10-15 adversarial examples — inputs designed to trip up your model. These include edge cases, ambiguous questions, and inputs that were problematic in similar applications.
- Write 5-10 boundary cases — inputs at the edge of what your application is supposed to handle. The model should either handle them gracefully or refuse them cleanly.
That gives you roughly 35-55 test cases to start; grow toward 100 as production reveals new failure modes. More is better, but 50 well-chosen examples beat 500 random ones.
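One way to keep the three groups balanced is to tag each test case with its category and validate the file before running anything. This is a sketch under assumptions: the `category` field and its three values are illustrative names, while `input` and `expected` match the keys the harness in Step 4 reads.

```python
from collections import Counter

REQUIRED_KEYS = {"input", "expected", "category"}
# Illustrative category names mirroring the three groups above.
CATEGORIES = {"happy_path", "adversarial", "boundary"}

def validate_cases(cases):
    """Check every test case is well-formed; return a count per category."""
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {i} is missing keys: {sorted(missing)}")
        if case["category"] not in CATEGORIES:
            raise ValueError(f"case {i} has unknown category {case['category']!r}")
    return Counter(case["category"] for case in cases)
```

Printing the returned counts makes it obvious when a suite has drifted into being all happy-path examples.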
Step 3: Define Your Scoring Method
The right scoring method depends on your task. There are three main categories:
Exact match works for tasks with deterministic correct answers. If you are asking the model to extract a phone number from a paragraph of text, the right answer is a specific string. You can check it with string equality or a regex.
def score_exact_match(expected, actual):
    return 1.0 if expected.strip() == actual.strip() else 0.0
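Plain string equality is often too strict for the phone-number example, because the same number can be formatted several ways. A small sketch of a normalized comparison, assuming that only the digits matter (the function names are illustrative):

```python
import re

def normalize_phone(text):
    """Keep only the digits, so "(555) 123-4567" and "555-123-4567" compare equal."""
    return re.sub(r"\D", "", text)

def score_phone_match(expected, actual):
    """Score 1.0 when both strings contain the same non-empty digit sequence."""
    expected_digits = normalize_phone(expected)
    if not expected_digits:
        return 0.0
    return 1.0 if expected_digits == normalize_phone(actual) else 0.0
```

Decide deliberately how much normalization to allow. If your application requires one specific output format, the stricter exact match is the right check.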
Unit tests work for code generation tasks. If the model writes a function, you run it against test cases and score based on how many pass.
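def score_code_output(generated_code, test_cases):
    # WARNING: exec/eval run arbitrary code. Only use this inside a sandbox.
    if not test_cases:
        return 0.0
    namespace = {}
    try:
        # Define the generated function(s) once, in an isolated namespace.
        exec(generated_code, namespace)
    except Exception:
        return 0.0  # code that fails to load passes no tests
    passed = 0
    for test in test_cases:
        try:
            # test["call"] is an expression such as "add(2, 3)".
            if eval(test["call"], namespace) == test["expected"]:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(test_cases)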
Rubric-based LM-as-judge works for quality tasks where there is no single correct answer. You define a rubric, send the model output plus rubric to a judge model, and get a score back. Use this for tasks like summarization, tone checking, and question answering where the correct answer varies.
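A minimal sketch of rubric-based judging follows. It assumes a hypothetical `judge_fn` that takes a prompt string and returns the judge model's text reply; how you call the judge model is up to you. The prompt wording and the 1 to 5 scale are illustrative choices, not a standard.

```python
import re

JUDGE_TEMPLATE = """You are grading a model response against a rubric.

Rubric:
{rubric}

Question:
{question}

Response:
{response}

Reply with a single line in the form "SCORE: <integer from 1 to 5>"."""

def score_with_judge(judge_fn, rubric, question, response):
    """Ask the judge for a 1-5 score and map it to the range 0.0-1.0.

    Returns None when the judge's reply cannot be parsed, so the caller
    can retry or flag the case instead of silently recording a score.
    """
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, question=question, response=response)
    reply = judge_fn(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        return None
    return (int(match.group(1)) - 1) / 4
```

Because `judge_fn` is just a callable, you can test the parsing logic with a stub such as `lambda prompt: "SCORE: 4"` before spending money on real judge calls.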
Step 4: Build the Test Harness
The test harness is the code that runs your model against all test cases and reports results. Keep it simple:
import json
from pathlib import Path

def run_eval(model_fn, test_cases_path, score_fn, pass_threshold=0.8):
    # Test cases: a JSON list of {"input": ..., "expected": ...} objects.
    test_cases = json.loads(Path(test_cases_path).read_text())
    if not test_cases:
        raise ValueError("no test cases found in " + str(test_cases_path))
    results = []
    for case in test_cases:
        output = model_fn(case["input"])
        score = score_fn(case["expected"], output)
        results.append({
            "input": case["input"],
            "expected": case["expected"],
            "actual": output,
            "score": score,
            "passed": score >= pass_threshold,
        })
    avg_score = sum(r["score"] for r in results) / len(results)
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    return {"avg_score": avg_score, "pass_rate": pass_rate, "results": results}
Run this before every deployment. If the pass rate drops more than 5 percentage points from the previous run, investigate before shipping.
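The 5-point rule can be enforced with a small gate. This sketch assumes pass rates are fractions between 0 and 1, as returned by the harness; the function name and return shape are illustrative.

```python
def check_regression(previous_pass_rate, current_pass_rate, max_drop_points=5.0):
    """Compare two pass rates (fractions in [0, 1]) in percentage points.

    The drop is rounded to avoid floating-point noise, so a drop of
    exactly 5 points is still accepted.
    """
    drop_points = round((previous_pass_rate - current_pass_rate) * 100, 6)
    return {"drop_points": drop_points, "ok": drop_points <= max_drop_points}
```

In a CI job, you could exit with a non-zero status when `ok` is false so the deployment step never runs.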
Step 5: Track Results Over Time
A single eval run is useful but not enough. You need to track results over time to catch regressions. Store eval results in a database or a simple JSON file with a timestamp. Plot the pass rate over time. When it drops, look at which test cases newly started failing — that tells you exactly what changed.
The most important signal is not absolute score but trend. A 70% pass rate that is stable is fine. An 85% pass rate that dropped from 95% last week is a problem.
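Here is one way to sketch the "simple JSON file" option: append a summary of each run to a JSON Lines history file, then diff the failing inputs between two runs. The helper names are hypothetical, and the sketch assumes each test case's `input` is a string that uniquely identifies it.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def append_run(history_path, eval_result):
    """Append a timestamped summary of one eval run to a JSON Lines file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pass_rate": eval_result["pass_rate"],
        "avg_score": eval_result["avg_score"],
        # Assumes inputs are unique strings, so they can serve as case IDs.
        "failed_inputs": sorted(
            r["input"] for r in eval_result["results"] if not r["passed"]
        ),
    }
    with Path(history_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def load_history(history_path):
    """Read all recorded runs, oldest first."""
    lines = Path(history_path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def newly_failing(previous_run, current_run):
    """Inputs that fail in the current run but did not fail in the previous one."""
    return sorted(set(current_run["failed_inputs"]) - set(previous_run["failed_inputs"]))
```

Calling `newly_failing(history[-2], history[-1])` after each run gives you the short list of cases to inspect first.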
The 80/20 Rule for Evals
You do not need to achieve 100% coverage of every possible input. The goal is to catch 80% of regressions with 20% of the effort. In practice, this means:
- 30-50 well-chosen test cases covering your most common inputs and your known failure modes
- A scoring method that takes under one minute to run per test case
- An automated run triggered by every prompt change or model switch
- A Slack or email alert when pass rate drops below threshold
That is it. Start there. Add more test cases as you discover new failure modes in production.
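The alert itself can be a few lines. The sketch below builds a message when the pass rate falls under a threshold and posts it to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and message wording are illustrative, and the webhook URL is something you would create in your own Slack workspace.

```python
import json
import urllib.request

def build_alert(pass_rate, threshold):
    """Return an alert message, or None when the pass rate is acceptable."""
    if pass_rate >= threshold:
        return None
    return f"Eval pass rate {pass_rate:.0%} is below the {threshold:.0%} threshold"

def send_slack_alert(webhook_url, message):
    """POST the message to a Slack incoming webhook; returns the HTTP status."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping message construction separate from sending means the threshold logic can be tested without any network access.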
Tools to Accelerate Eval Development
Rather than building everything from scratch, consider these tools:
- PromptFoo: open source, define tests in YAML, run against multiple models, integrates with CI
- Braintrust: hosted eval platform, good UI for reviewing failures, supports LM-as-judge
- LangSmith: traces every LLM call, lets you turn production failures into test cases
- Evals (OpenAI): OpenAI's open source framework, built around the OpenAI API; other models can be plugged in through custom completion functions
For a small team getting started, PromptFoo is the best choice. It is free, runs locally, and can be added to a GitHub Actions workflow in about 30 minutes.