A production LLM evaluation system has two layers: offline eval (a golden dataset you run before every deploy) and online eval (sampling and scoring production outputs in real time). Together they form a flywheel: production failures become new offline test cases, and better offline coverage means fewer production surprises. With only one layer, regressions that could have been caught before release will regularly reach users.
The Two-Layer Architecture
Most teams start with only one layer and learn the hard way that it is not enough.
Teams that only do offline eval find that their carefully labeled test cases do not cover the long tail of what real users actually send. They ship a change that improves average eval scores but degrades performance on an edge case they did not think to include.
Teams that only do online monitoring (watching metrics in production) catch problems late, after users have already experienced them, and have no systematic way to prevent recurrence.
The right architecture combines both:
Layer 1: Offline eval. A golden dataset of 50-500 test cases with known expected outputs. Run automatically before every deploy. Blocks deploys that cause regression. Updated continuously as new failure modes are discovered.
Layer 2: Online eval. Sampling 5-10% of production outputs and scoring them in real time. Tracks metrics over time. Alerts when quality drops. Feeds new failures back into the offline dataset.
Building the Offline Eval
Start with your golden dataset. For a new application with no production history, seed it with:
- 30-50 representative examples from your expected user distribution
- 10-20 adversarial examples (known edge cases, potential failure modes)
- 5-10 examples from similar applications' known failure modes
As you run in production, continuously add new test cases from two sources:
- Production failures you discover through online monitoring
- User bug reports and feedback
The golden dataset should grow by roughly 10-20 new cases per month, or more if production surfaces them. After six months of running, you should have a dataset that is meaningfully representative of your actual user distribution.
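One way to keep the golden dataset manageable as it grows is to store each case with a tag for where it came from. The sketch below is illustrative only: `GoldenCase`, `add_case`, `save_jsonl`, and the `source` labels are hypothetical names, and JSONL is just one reasonable storage choice.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class GoldenCase:
    case_id: str
    input_text: str
    expected: str
    # Hypothetical labels: "representative", "adversarial",
    # "borrowed", "production_failure", "user_report".
    source: str


def add_case(dataset: List[GoldenCase], case: GoldenCase) -> bool:
    """Append a case unless the same input is already covered."""
    if any(existing.input_text == case.input_text for existing in dataset):
        return False
    dataset.append(case)
    return True


def save_jsonl(dataset: List[GoldenCase], path: str) -> None:
    """Write one JSON object per line so diffs in code review stay readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for case in dataset:
            handle.write(json.dumps(asdict(case)) + "\n")
```

Tagging the source makes it easy to check how much of the dataset comes from real production traffic versus the initial seed.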
Scoring for the offline eval:
Choose the simplest scoring method that is reliable for your task:
- Exact match for structured outputs (JSON fields, extracted entities)
- Rubric-based LM-as-judge for quality judgments
- Unit tests for code generation
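The first two scoring methods can be sketched in a few lines. This is a minimal sketch under stated assumptions: `exact_match_fields` and `rubric_pass` are hypothetical helper names, the `judge` callable stands in for whatever LM-as-judge call you use (no specific provider API is assumed), and the 0.7 cutoff is an arbitrary placeholder.

```python
import json
from typing import Callable, Dict


def exact_match_fields(raw_output: str, expected: Dict[str, object]) -> bool:
    """Pass only if the output parses as a JSON object and every expected field matches."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    # Extra fields in the output are tolerated; missing or different ones fail.
    return all(key in parsed and parsed[key] == value for key, value in expected.items())


def rubric_pass(
    output: str,
    rubric: str,
    judge: Callable[[str, str], float],  # hypothetical: returns a score in [0, 1]
    min_score: float = 0.7,              # assumed cutoff, tune for your task
) -> bool:
    """Pass if the judge's rubric score meets the cutoff."""
    return judge(output, rubric) >= min_score
```

Passing the judge in as a function keeps the scorer testable with a stub and independent of any one model provider.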
Run the eval automatically on every pull request that touches prompt files or model configuration. Use a CI step that fails if the pass rate drops below your threshold (typically 85-95%, depending on how strict your test cases are).
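The CI gate itself can be very small. The sketch below assumes you already have a list of per-case pass/fail results; `pass_rate` and `gate_exit_code` are hypothetical names, and the 0.90 default is just one point in the 85-95% range mentioned above.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of test cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results) / len(results)


def gate_exit_code(results: List[bool], threshold: float = 0.90) -> int:
    """Return 0 if the pass rate meets the threshold, 1 otherwise."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1
```

In a CI script you would typically finish with `sys.exit(gate_exit_code(results))`, since most CI systems treat a nonzero exit code as a failed step.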
Building the Online Eval
Online eval monitors a sample of production outputs in real time. The goal is to catch quality degradation that your offline dataset does not cover.
What to monitor:
- Output quality sample. Run LM-as-judge scoring on a 5-10% random sample of outputs. Track the moving average of quality scores. Alert if the 7-day moving average drops more than 0.05 below your baseline (this assumes judge scores normalized to a 0-1 scale).
- User feedback signals. Thumbs up/down, correction rate, session abandonment. These are noisy but accumulate quickly.
- Technical metrics. Latency, error rate, token count. These do not measure quality but catch provider-side issues.
- Cost per task. Track cost per meaningful unit of work (per email classified, per meeting summarized). Unusual spikes indicate something changed upstream.
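The sampling and moving-average alert from the first item above can be sketched as follows. The names `should_sample` and `QualityMonitor` are hypothetical, scores are assumed to be on a 0-1 scale, and the baseline is assumed to be a fixed number you measured earlier.

```python
import random
from collections import deque
from typing import List, Optional


def should_sample(rate: float = 0.05) -> bool:
    """Decide per request whether to send the output for judge scoring."""
    return random.random() < rate


class QualityMonitor:
    """Tracks daily mean scores and flags a drop in the moving average."""

    def __init__(self, baseline: float, window_days: int = 7, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        # deque with maxlen discards the oldest day once the window is full.
        self.daily_means = deque(maxlen=window_days)

    def record_day(self, scores: List[float]) -> None:
        if scores:
            self.daily_means.append(sum(scores) / len(scores))

    def moving_average(self) -> Optional[float]:
        if not self.daily_means:
            return None
        return sum(self.daily_means) / len(self.daily_means)

    def should_alert(self) -> bool:
        average = self.moving_average()
        return average is not None and (self.baseline - average) > self.max_drop
```

Averaging daily means rather than raw scores keeps a single high-traffic day from dominating the window, which is a design choice you may want to revisit if your traffic is very uneven.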
Tooling for online eval:
Several platforms support production LLM monitoring out of the box:
- LangSmith (LangChain's platform) — traces every LLM call, lets you flag and save production outputs for your eval dataset
- Braintrust — hosted eval platform with production logging and LM-as-judge scoring
- Helicone — lightweight logging proxy that adds observability without changing your application code
- Custom solution — log to a database, run a nightly scoring job, push to your metrics dashboard
For a small team, Helicone for logging plus a custom nightly scoring script is often the most practical starting point. It requires minimal setup and gives you full control over what you track.
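A custom nightly scoring job can be as simple as the sketch below. It assumes your logs already land in a local SQLite table, however you collect them; the `llm_logs` table, its columns, and the `judge` callable are all hypothetical, and nothing here relies on any particular logging vendor's API.

```python
import sqlite3
from typing import Callable

# Hypothetical schema: one row per logged LLM call, score filled in later.
SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_logs (
    id INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL,
    prompt TEXT NOT NULL,
    output TEXT NOT NULL,
    score REAL
)
"""


def score_unscored(
    conn: sqlite3.Connection,
    judge: Callable[[str, str], float],  # hypothetical LM-as-judge wrapper
    limit: int = 500,
) -> int:
    """Score up to `limit` rows that have no score yet; return how many were scored."""
    rows = conn.execute(
        "SELECT id, prompt, output FROM llm_logs "
        "WHERE score IS NULL ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    for row_id, prompt, output in rows:
        conn.execute(
            "UPDATE llm_logs SET score = ? WHERE id = ?",
            (judge(prompt, output), row_id),
        )
    conn.commit()
    return len(rows)
```

Run from a scheduler such as cron, a job like this is idempotent: rows that already have a score are skipped, so a retried run does not double-score anything.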
The Eval Flywheel
The flywheel is what makes the system compound over time. Here is how it works:
- Production outputs are sampled and scored by the online eval system
- Failures are flagged (outputs below a quality threshold)
- Flagged outputs are reviewed (manually or by a second LM-as-judge pass)
- Confirmed failures are added to the offline dataset as new test cases
- The offline eval is re-run on the expanded dataset
- The next deploy is held to a higher standard because the dataset now covers more failure modes
After three to six months of running this loop, your offline eval becomes significantly more representative of real traffic, and your production failure rate should fall as more failure modes are caught before deploy.
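One pass of the loop above might look like the following sketch. The dictionary keys (`input`, `score`, `corrected_output`), the 0.6 threshold, and the `confirm_failure` reviewer callback are assumptions made for illustration; the review step could be a human queue or a second judge pass.

```python
from typing import Callable, Dict, List


def flywheel_step(
    scored_outputs: List[Dict],
    golden: List[Dict],
    confirm_failure: Callable[[Dict], bool],  # hypothetical review step
    threshold: float = 0.6,                   # assumed quality cutoff
) -> List[Dict]:
    """Flag low scores, review them, and add confirmed failures to the golden set."""
    known_inputs = {case["input"] for case in golden}
    added = []
    for item in scored_outputs:
        if item["score"] >= threshold:
            continue  # not flagged
        if item["input"] in known_inputs:
            continue  # already covered by an existing test case
        if not confirm_failure(item):
            continue  # reviewer judged it a false alarm
        case = {
            "input": item["input"],
            # The expected output has to come from review, not from the failed output.
            "expected": item.get("corrected_output"),
            "source": "production_failure",
        }
        golden.append(case)
        known_inputs.add(item["input"])
        added.append(case)
    return added
```

The function returns only the newly added cases, so the caller can re-run the offline eval on them first before running the full expanded dataset.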
Setting Up Alerts
For online monitoring to be actionable, it needs to alert you when something changes, not just log everything.
Recommended alert thresholds:
- 7-day moving average of quality score drops more than 0.05 below baseline: investigate
- Error rate exceeds 2% on any feature in a 1-hour window: page on-call
- Cost per task increases more than 30% with no corresponding change in task volume: check for model API changes
- Latency p95 exceeds 3x baseline for more than 15 minutes: check provider status
Avoid alert fatigue by starting with wide thresholds and tightening them as you learn your system's natural variance. An alert that fires daily because its threshold is too tight for normal variation will be ignored.
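The four thresholds above can be expressed as one small rule check. This is a sketch, not a monitoring product: `Snapshot` and `evaluate_alerts` are hypothetical names, the caller is assumed to compute the metrics, and the 10% band used to decide that task volume is "unchanged" is an assumption that the text does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    quality_baseline: float
    quality_avg_7d: float
    error_rate_1h: float          # fraction of failed calls in the last hour
    cost_per_task: float
    cost_baseline: float
    volume_change: float          # fractional change in task volume, e.g. 0.05
    latency_p95: float
    latency_baseline: float
    latency_breach_minutes: float  # consecutive minutes above 3x baseline, tracked by caller


def evaluate_alerts(s: Snapshot, stable_volume_band: float = 0.10) -> List[str]:
    """Return the actions triggered by the current metrics snapshot."""
    alerts = []
    if s.quality_baseline - s.quality_avg_7d > 0.05:
        alerts.append("quality: investigate")
    if s.error_rate_1h > 0.02:
        alerts.append("errors: page on-call")
    cost_up = s.cost_per_task > 1.30 * s.cost_baseline
    volume_stable = abs(s.volume_change) <= stable_volume_band  # assumed band
    if cost_up and volume_stable:
        alerts.append("cost: check for model API changes")
    if s.latency_p95 > 3 * s.latency_baseline and s.latency_breach_minutes > 15:
        alerts.append("latency: check provider status")
    return alerts
```

Keeping the thresholds as plain numbers in one function makes them easy to widen or tighten as you learn the system's normal variance.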
Tools Summary
| Layer | Tool | Cost | Best For |
|-------|------|------|----------|
| Offline eval | PromptFoo | Free | Teams that change prompts frequently |
| Offline eval | Braintrust | Free tier + paid | Teams that want a polished UI |
| Online logging | Helicone | Free tier + paid | Lightweight observability |
| Online logging + eval | LangSmith | Free tier + paid | LangChain users |
| Online eval | Custom script | Engineering time | Full control, minimal vendor dependency |