A production LLM evaluation system has two layers: offline eval (a golden dataset you run before every deploy) and online eval (sampling and scoring production outputs in real time). Together they form a flywheel: production failures become new offline test cases, and better offline coverage means fewer production surprises. Without both layers, you will regularly be surprised by regressions you could have caught before they reached users.
The Two-Layer Architecture
Most teams start with only one layer and learn the hard way that it is not enough.
Teams that only do offline eval find that their carefully labeled test cases do not cover the long tail of what real users actually send. They ship a change that improves average eval scores but degrades performance on an edge case they did not think to include.
Teams that only do online monitoring (watching metrics in production) catch problems late, after users have already experienced them, and have no systematic way to prevent recurrence.
The right architecture combines both:
Layer 1: Offline eval. A golden dataset of 50-500 test cases with known expected outputs. Run automatically before every deploy. Blocks deploys that cause regression. Updated continuously as new failure modes are discovered.
Layer 2: Online eval. Sampling 5-10% of production outputs and scoring them in real time. Tracks metrics over time. Alerts when quality drops. Feeds new failures back into the offline dataset.
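One way to pick which production outputs get scored is to hash a request identifier, so the same request always gets the same sampling decision. The sketch below is a minimal illustration, not a prescribed implementation: `should_sample`, `SAMPLE_RATE`, and the idea of a string `request_id` are assumptions made for the example.

```python
import hashlib

# Assumed value: 5%, the low end of the 5-10% range described above.
SAMPLE_RATE = 0.05


def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Return True for roughly `rate` of request IDs, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Hashing rather than calling a random number generator means a retried or replayed request is either always scored or never scored, which keeps the sampled set stable when you investigate a quality drop.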
Building the Offline Eval
Start with your golden dataset. For a new application with no production history, seed it with:
- 30-50 representative examples from your expected user distribution
- 10-20 adversarial examples (known edge cases, potential failure modes)
- 5-10 examples drawn from the known failure modes of similar applications
As you run in production, continuously add new test cases from two sources:
- Production failures you discover through online monitoring
- User bug reports and feedback
The golden dataset should grow by roughly 10-20 new cases per month. After six months in production, you should have a dataset that is meaningfully representative of your actual user distribution.
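A golden dataset can be as simple as a list of records that remember where each case came from. The following is a sketch under assumptions: the `GoldenCase` fields, the `source` labels, and the choice to deduplicate on input text are all illustrative rather than required.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    expected: str
    # Hypothetical labels: "seed", "adversarial", "production_failure", "user_report"
    source: str


def add_case(dataset: list[GoldenCase], case: GoldenCase) -> list[GoldenCase]:
    """Return a new dataset with `case` appended, skipping duplicate inputs."""
    if any(existing.input_text == case.input_text for existing in dataset):
        return dataset
    return dataset + [case]


def to_jsonl(dataset: list[GoldenCase]) -> str:
    """Serialize the dataset as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in dataset)
```

Recording the source of each case makes it easy to check later how much of the dataset comes from real production failures versus the original seed examples.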
Scoring for the offline eval:
Choose the simplest scoring method that is reliable for your task:
- Exact match for structured outputs (JSON fields, extracted entities)
- Rubric-based LLM-as-judge for quality judgments
- Unit tests for code generation
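For the exact-match option, a scorer for JSON outputs might look like the sketch below. It assumes the model returns a JSON object as a string and that only the fields listed in `expected` matter; `exact_match_fields` is a hypothetical helper name, and ignoring extra fields is a design choice you may not want.

```python
import json


def exact_match_fields(output: str, expected: dict) -> bool:
    """Pass only if `output` is a JSON object whose fields match `expected`.

    Fields present in the output but absent from `expected` are ignored.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        # Unparseable output counts as a failure, not an error.
        return False
    if not isinstance(parsed, dict):
        return False
    return all(
        key in parsed and parsed[key] == value
        for key, value in expected.items()
    )
```

Treating malformed JSON as a failed case rather than a crash keeps one bad output from aborting the whole eval run.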
Run the eval automatically on every pull request that touches prompt files or model configuration. Use a CI step that fails if the pass rate drops below your threshold (typically 85-95%, depending on how strict your test cases are).
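The gating logic itself can be very small. The sketch below assumes you already have one boolean per golden case; the 90% threshold is an arbitrary value inside the range above, and `pass_rate` and `gate` are hypothetical names.

```python
# Assumed value: pick a threshold that fits your own test cases.
PASS_RATE_THRESHOLD = 0.90


def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def gate(results: list[bool], threshold: float = PASS_RATE_THRESHOLD) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.1%} (threshold {threshold:.1%})")
    return 0 if rate >= threshold else 1
```

A CI script would pass the return value of `gate(results)` to `sys.exit`, since most CI systems treat a nonzero exit code as a failed step. Counting an empty result list as a failure guards against a misconfigured run that silently scores nothing.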