How to Build a Complete Evaluation System for a Production LLM App

A production eval system has three layers: offline testing before deploy, online monitoring in production, and a feedback loop that turns failures into new test cases. Here is how to build all three.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read

// tags

#llm-production #eval-system #monitoring #ai-engineering

Building the Online Eval

Online eval monitors a sample of production outputs in real time. The goal is to catch quality degradation that your offline dataset does not cover.

What to monitor:

Output quality sample. Run LM-as-judge scoring on a random 5-10% sample of outputs and track the 7-day moving average of the quality scores. Alert if that average drops more than 0.05 below your baseline.
User feedback signals. Thumbs up/down, correction rate, session abandonment. These are noisy but accumulate quickly.
Technical metrics. Latency, error rate, token count. These do not measure quality but catch provider-side issues.
Cost per task. Track cost per meaningful unit of work (per email classified, per meeting summarized). Unusual spikes indicate something changed upstream.
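As a rough illustration of the first item, the sketch below samples a fraction of outputs and compares a 7-day moving average against a fixed baseline. The names `should_sample` and `RollingQuality`, and the 0-to-1 score scale, are assumptions made for this example and not part of any platform's API. The judge that produces each score is left out.

```python
import random
from collections import deque
from datetime import datetime, timedelta

SAMPLE_RATE = 0.05            # low end of the 5-10% range
MAX_DROP = 0.05               # allowed drop below baseline before alerting
WINDOW = timedelta(days=7)    # moving-average window


def should_sample(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    """Decide whether a single production output gets scored."""
    return rng.random() < rate


class RollingQuality:
    """Keeps scores from the last `window` and compares their mean to a baseline."""

    def __init__(self, baseline: float, window: timedelta = WINDOW):
        self.baseline = baseline
        self.window = window
        self.scores = deque()  # (timestamp, score) pairs, oldest first

    def add(self, ts: datetime, score: float) -> None:
        self.scores.append((ts, score))
        cutoff = ts - self.window
        # Drop scores that are older than the window.
        while self.scores and self.scores[0][0] < cutoff:
            self.scores.popleft()

    def average(self):
        if not self.scores:
            return None
        return sum(score for _, score in self.scores) / len(self.scores)

    def degraded(self, max_drop: float = MAX_DROP) -> bool:
        avg = self.average()
        return avg is not None and (self.baseline - avg) > max_drop
```

In use, each request would call `should_sample` once, and only sampled outputs would be sent to the judge and then passed to `RollingQuality.add`.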

Tooling for online eval:

Several platforms support production LLM monitoring out of the box:

LangSmith (LangChain's platform) - traces every LLM call, lets you flag and save production outputs for your eval dataset
Braintrust - hosted eval platform with production logging and LM-as-judge scoring
Helicone - lightweight logging proxy that adds observability without changing your application code
Custom solution - log to a database, run a nightly scoring job, push to your metrics dashboard

For a small team, Helicone for logging plus a custom nightly scoring script is often the most practical starting point. It requires minimal setup and gives you full control over what you track.
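A custom nightly scoring script could look roughly like the sketch below. It uses a local SQLite table as a stand-in for whatever log store you choose and does not call Helicone's API. The table name `llm_logs`, the helper names, and the `judge` callable are all hypothetical; `judge` stands for your own LM-as-judge call and is assumed to return a numeric score.

```python
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    """Create the log table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS llm_logs (
               id INTEGER PRIMARY KEY,
               prompt TEXT NOT NULL,
               response TEXT NOT NULL,
               score REAL          -- NULL until the nightly job scores the row
           )"""
    )
    conn.commit()


def log_call(conn: sqlite3.Connection, prompt: str, response: str) -> int:
    """Record one production call; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO llm_logs (prompt, response) VALUES (?, ?)",
        (prompt, response),
    )
    conn.commit()
    return cur.lastrowid


def score_unscored(
    conn: sqlite3.Connection,
    judge: Callable[[str, str], float],
    limit: int = 500,
) -> int:
    """Score up to `limit` unscored rows; returns how many were scored."""
    rows = conn.execute(
        "SELECT id, prompt, response FROM llm_logs "
        "WHERE score IS NULL ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    for row_id, prompt, response in rows:
        conn.execute(
            "UPDATE llm_logs SET score = ? WHERE id = ?",
            (judge(prompt, response), row_id),
        )
    conn.commit()
    return len(rows)
```

Running `score_unscored` from a scheduler once a night keeps judge costs predictable, and the `limit` argument caps how many rows a single run will score.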

The Eval Flywheel

The flywheel is what makes the system compound over time. Here is how it works:

1. Production outputs are sampled and scored by the online eval system.
2. Failures are flagged (outputs below a quality threshold).
3. Flagged outputs are reviewed (manually or by a second LM-as-judge pass).
4. Confirmed failures are added to the offline dataset as new test cases.
5. The offline eval is re-run on the expanded dataset.
6. The next deploy is held to a higher standard because the dataset now covers more failure modes.
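The flagging, review, and dataset-growth steps of the loop can be sketched as follows. This is a minimal illustration under stated assumptions: `ScoredOutput`, `flag_failures`, `review`, and `add_confirmed_failures` are hypothetical names, the dataset is a plain dictionary, and the `confirm` callable stands for either a human decision or a second judge pass.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ScoredOutput:
    prompt: str
    response: str
    score: float


def flag_failures(outputs: List[ScoredOutput], threshold: float) -> List[ScoredOutput]:
    """Keep outputs whose quality score is below the threshold."""
    return [o for o in outputs if o.score < threshold]


def review(
    flagged: List[ScoredOutput],
    confirm: Callable[[ScoredOutput], bool],
) -> List[ScoredOutput]:
    """Keep only the flagged outputs that the reviewer confirms as failures."""
    return [o for o in flagged if confirm(o)]


def case_id(prompt: str) -> str:
    """Stable key so the same prompt is not added to the dataset twice."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


def add_confirmed_failures(dataset: Dict[str, dict], confirmed: List[ScoredOutput]) -> int:
    """Add confirmed failures as new test cases; returns how many were new."""
    added = 0
    for o in confirmed:
        key = case_id(o.prompt)
        if key not in dataset:
            dataset[key] = {"input": o.prompt, "bad_output": o.response}
            added += 1
    return added
```

Keying cases by a hash of the prompt is one simple way to avoid duplicates. Near-duplicate prompts would still need a separate check.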

After three to six months of running this loop, your offline eval becomes significantly more representative of real traffic, and your production failure rate should fall as the dataset covers more of the failure modes you actually see.

Setting Up Alerts

For online monitoring to be actionable, it needs to alert you when something changes, not just log everything.

Recommended alert thresholds:

Quality score moving average drops more than 0.05 from 7-day baseline: investigate
Error rate exceeds 2% on any feature in a 1-hour window: page on-call
Cost per task increases more than 30% with no corresponding change in task volume: check for model API changes
Latency p95 exceeds 3x baseline for more than 15 minutes: check provider status
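The four thresholds above can be expressed as a single check, sketched here under a few assumptions. The `Snapshot` fields and the `evaluate_alerts` function are hypothetical names. "No corresponding change in task volume" is interpreted as volume within 10% of its baseline, and latency is assumed to arrive as one p95 sample per minute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    quality_avg: float                   # current moving average of quality scores
    quality_baseline: float              # 7-day baseline
    error_rate_1h: float                 # fraction of failed calls in the last hour
    cost_per_task: float
    cost_baseline: float
    task_volume: float
    volume_baseline: float
    latency_p95_by_minute: List[float]   # one p95 sample per minute, newest last
    latency_baseline: float


def evaluate_alerts(s: Snapshot) -> List[str]:
    alerts = []

    if s.quality_baseline - s.quality_avg > 0.05:
        alerts.append("quality: investigate")

    if s.error_rate_1h > 0.02:
        alerts.append("errors: page on-call")

    cost_up = s.cost_per_task > 1.30 * s.cost_baseline
    # Assumption: "no corresponding change" means volume within 10% of baseline.
    volume_stable = abs(s.task_volume - s.volume_baseline) <= 0.10 * s.volume_baseline
    if cost_up and volume_stable:
        alerts.append("cost: check for model API changes")

    # Every one of the last 15 one-minute samples must exceed 3x baseline.
    recent = s.latency_p95_by_minute[-15:]
    if len(recent) == 15 and all(v > 3 * s.latency_baseline for v in recent):
        alerts.append("latency: check provider status")

    return alerts
```

The error-rate check here takes a single number. To apply it per feature, as the list suggests, you would build one `Snapshot` per feature.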

Avoid alert fatigue by starting with wide thresholds and tightening them as you learn your system's natural variance. An alert that fires daily because its threshold sits inside normal day-to-day variation will be ignored.
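One way to tighten a threshold from observed variance is sketched below. This is an assumption-laden illustration, not a prescribed method: `drop_threshold` is a hypothetical helper, and the wide default of 0.10, the 14-day minimum history, and the multiplier `k = 3` are all placeholder values to tune for your system.

```python
import statistics
from typing import List


def drop_threshold(
    daily_averages: List[float],
    k: float = 3.0,
    wide_default: float = 0.10,
    min_days: int = 14,
) -> float:
    """Return the allowed drop below baseline before an alert fires.

    With too little history, fall back to a deliberately wide default.
    With enough history, use k sample standard deviations of the daily
    averages, so ordinary day-to-day noise stays below the threshold.
    """
    if len(daily_averages) < min_days:
        return wide_default
    return k * statistics.stdev(daily_averages)
```

If the computed value comes out wider than the default, that is a sign the metric is noisier than expected and the wider threshold is the safer choice.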

Tools Summary

Layer	Tool	Cost	Best For
Offline eval	PromptFoo	Free	Teams that change prompts frequently
Offline eval	Braintrust	Free tier + paid	Teams that want a polished UI
Online logging	Helicone	Free tier + paid	Lightweight observability
Online logging + eval	LangSmith	Free tier + paid	LangChain users
Online eval	Custom script	Engineering time	Full control, minimal vendor dependency

Keep Reading

PromptFoo Eval Tool Guide - Detailed setup for the best open source offline eval tool.
LM-as-Judge: Using LLMs to Evaluate LLM Outputs - The technique that powers automated online eval scoring.
A/B Testing LLM Outputs in Production - How to validate model changes before fully committing to them.
