LM-as-judge (Zheng et al. 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") is a reliable technique for evaluating LLM outputs at scale when you cannot label every response manually. It works best for relative ranking, where you ask a judge model which of two answers is better, and degrades when you ask for absolute quality scores. Set it up with a clear rubric, run the judge on real outputs, and flip answer order to cancel out positional bias.
What LM-as-Judge Actually Is
The core idea is simple: you use one language model to evaluate the output of another. Instead of paying human annotators to score every response, you send the model output, along with a rubric and sometimes a reference answer, to a "judge" model and ask it to rate or rank the response.
The technique was formalized in the MT-Bench paper from LMSYS (Zheng et al. 2023). The authors collected thousands of human preference judgments and compared them with GPT-4's verdicts as a judge. GPT-4 agreed with human annotators more than 80% of the time, about the same rate at which human annotators agreed with each other. That result validated LM-as-judge as a serious evaluation method, not just a shortcut.
When LM-as-Judge Is Reliable
Relative preference ranking is where LM-as-judge shines. Given two responses to the same prompt, ask the judge: "Which response is better and why?" Judges tend to answer this comparative question more consistently than they assign absolute scores. It mirrors how Chatbot Arena works (users vote between two anonymous models), and it produces rankings that correlate well with downstream human satisfaction.
Clear, objective criteria also improve reliability. If your rubric says "does the response contain a code example that compiles," a judge can check that reliably. If it says "is the response high quality," you get noise.
Reference-based evaluation, where you provide the judge with a ground-truth answer and ask it to compare the model's response against it, works well for factual and structured tasks. The judge is essentially doing a diff, which is a task models handle accurately.
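A reference-based judge prompt can be sketched as follows. This is a minimal illustration, not a canonical template: the function name `build_reference_prompt` and the three verdict labels are assumptions made for this example.

```python
def build_reference_prompt(question: str, reference: str, response: str) -> str:
    """Assemble a judge prompt that compares a response against a ground-truth answer."""
    return (
        "You are grading a response against a reference answer.\n\n"
        f"Question: {question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Response to grade:\n{response}\n\n"
        "Does the response agree with the reference on every factual point? "
        'Answer "correct", "partially correct", or "incorrect", '
        "then name the first point of disagreement, if any."
    )


# Hypothetical example inputs
reference_prompt = build_reference_prompt(
    question="What is the capital of Australia?",
    reference="Canberra",
    response="The capital of Australia is Sydney.",
)
```

Putting the reference directly in the prompt means the judge compares two texts and does not have to rely on its own recall.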
When LM-as-Judge Fails
Absolute quality scores are unreliable. Asking a judge to rate a response from 1 to 10 produces inconsistent results across sessions. The same response might get a 7 one day and a 5 the next. If you need absolute scores, you need humans or a very constrained rubric (for example, "does this response contain a hallucination? yes or no").
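If you go the constrained-rubric route, it helps to parse the yes/no verdict strictly and to discard anything that does not fit. The sketch below assumes the judge was instructed to begin its answer with "yes" or "no"; `parse_yes_no` and the sample outputs are hypothetical.

```python
from typing import Optional


def parse_yes_no(judge_output: str) -> Optional[bool]:
    """Return True for "yes", False for "no", None if the verdict is unparseable."""
    words = judge_output.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


# Hypothetical judge outputs for "does this response contain a hallucination?"
outputs = ["Yes, it cites a paper that does not exist.", "No.", "no", "unclear"]
verdicts = [parse_yes_no(o) for o in outputs]
valid = [v for v in verdicts if v is not None]  # drop unparseable verdicts
hallucination_rate = sum(valid) / len(valid)
```

Counting unparseable verdicts separately, instead of guessing, keeps the resulting rate honest.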
Factual errors can slip past judge models. If a response says the capital of Australia is Sydney instead of Canberra, a judge model trained on the same kind of internet data may accept the error. LM-as-judge is not a good substitute for factual verification tools.
Long responses with mixed quality confuse judges. If a response is excellent in the first three paragraphs and wrong in the fourth, many judge models give it a high overall score because they weight the good parts more heavily.
Positional Bias: The Biggest Failure Mode
The best-documented failure mode is positional bias. When you show a judge two responses and ask which is better, the judge disproportionately favors the response shown first (or sometimes second). Zheng et al. measured this in their paper and found it affects even GPT-4 judgments.
The fix is simple: run every comparison twice with the order flipped, then map the verdict from the flipped run back to the original responses. If the judge picks the same response under both orderings, trust the result. If it picks whichever response occupies the same position in both runs, the preference follows position rather than content, so call it a tie.
Here is a minimal judge prompt structure:
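def call_judge_model(judge_prompt):
    # Placeholder: replace with a call to your judge model's API client.
    raise NotImplementedError

def judge_responses(prompt, response_a, response_b, rubric):
    # Labels "A" and "B" refer to display order, not to a fixed response.
    judge_prompt = f"""
    You are evaluating two AI responses to a user prompt.
    User prompt: {prompt}
    Response A:
    {response_a}
    Response B:
    {response_b}
    Rubric: {rubric}
    Which response is better? Answer with "A", "B", or "tie" and explain why in one sentence.
    """
    return call_judge_model(judge_prompt)

# Run both orderings
result_1 = judge_responses(prompt, a, b, rubric)
result_2 = judge_responses(prompt, b, a, rubric)  # flipped: "A" now means b

# Reconcile result_1 and result_2 (map the flipped labels back before comparing)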
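The reconciliation step left as a comment above can be sketched like this. It assumes the judge's answer starts with "A", "B", or "tie" as the prompt requests, and it treats anything else as no preference. Both helper names are made up for this example.

```python
import string


def normalize_verdict(raw: str) -> str:
    """Reduce a judge answer to "A", "B", or "tie" based on its first token."""
    tokens = raw.strip().split()
    first = tokens[0].strip(string.punctuation).lower() if tokens else ""
    # Anything other than a leading "A" or "B" counts as no preference.
    return {"a": "A", "b": "B"}.get(first, "tie")


def reconcile(verdict_original: str, verdict_flipped: str) -> str:
    """Combine two runs; the flipped run's labels refer to the swapped order."""
    first = normalize_verdict(verdict_original)
    # In the flipped run, label "A" was the original B and vice versa.
    swapped = {"A": "B", "B": "A", "tie": "tie"}
    second = swapped[normalize_verdict(verdict_flipped)]
    # Only a preference that survives the order swap counts as a win.
    return first if first == second else "tie"
```

Under this rule a judge that answers "A" in both runs is following position, not content, so the pair is scored as a tie. A win in one run paired with a tie in the other is also scored as a tie, which is a conservative choice, not the only reasonable one.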
Self-Evaluation Bias
Another documented failure mode is self-evaluation bias. Models tend to prefer their own outputs when acting as judges. If you use GPT-4 to evaluate GPT-4 responses against Claude responses, you are likely to see inflated scores for GPT-4. The same applies to Claude judging Claude.
The practical fix: do not use the same model family as both the candidate and the judge. If you are evaluating GPT-4o outputs, use Claude as the judge. If you are evaluating Claude outputs, use GPT-4o. This is not a perfect solution because both models were trained on similar internet data, but it removes the most direct source of self-preference.
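One way to enforce this rule is a guard that runs before any eval job starts. The sketch below is illustrative only. It assumes model names begin with a family prefix followed by a hyphen (for example "gpt-4o"), which will not hold for every provider, and the model names in the usage line are placeholders.

```python
def model_family(model_name: str) -> str:
    """Crude family key: the text before the first hyphen, lowercased."""
    return model_name.lower().split("-", 1)[0]


def check_cross_family(candidate: str, judge: str) -> None:
    """Raise if the judge comes from the same model family as the candidate."""
    if model_family(candidate) == model_family(judge):
        raise ValueError(
            f"judge {judge!r} and candidate {candidate!r} share a model family; "
            "pick a judge from a different family to limit self-preference"
        )


# Placeholder model names; passes silently because the families differ
check_cross_family(candidate="gpt-4o", judge="claude-judge")
```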
Building a Judge Rubric
The rubric is the most important part of getting consistent judgments. A good rubric:
- Defines each dimension separately (accuracy, clarity, completeness, format)
- Gives concrete examples of what scores mean
- Specifies exactly what the judge should ignore (length bias is common)
Example rubric for evaluating customer support responses:
- Accuracy (0-2): Does the response correctly answer the user's question? 0 = wrong or misleading, 1 = partially correct, 2 = fully correct.
- Completeness (0-2): Does the response address all parts of the user's question? 0 = misses major parts, 1 = misses minor parts, 2 = addresses everything.
- Tone (0-1): Is the tone professional and empathetic? 0 = cold or inappropriate, 1 = appropriate.
Ask the judge to score each dimension separately, then sum. This is more reliable than asking for a single overall score.
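The score-then-sum step for the rubric above might look like this. The sketch assumes the judge was asked to reply with a JSON object keyed by dimension name. That reply format, along with the names `RUBRIC_MAX` and `total_score`, is an assumption of this example.

```python
import json
from typing import Dict

# Maximum score per dimension, taken from the customer support rubric above
RUBRIC_MAX: Dict[str, int] = {"accuracy": 2, "completeness": 2, "tone": 1}


def total_score(judge_output: str) -> int:
    """Parse per-dimension scores, validate them against the rubric, and sum."""
    scores = json.loads(judge_output)
    if set(scores) != set(RUBRIC_MAX):
        raise ValueError(
            f"expected dimensions {sorted(RUBRIC_MAX)}, got {sorted(scores)}"
        )
    for dimension, value in scores.items():
        if not isinstance(value, int) or not 0 <= value <= RUBRIC_MAX[dimension]:
            raise ValueError(
                f"{dimension} score {value!r} is outside 0-{RUBRIC_MAX[dimension]}"
            )
    return sum(scores.values())


example_output = '{"accuracy": 2, "completeness": 1, "tone": 1}'
example_total = total_score(example_output)
```

Rejecting out-of-range or missing dimensions, instead of clamping them, makes judge formatting failures visible in your metrics.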
Practical Setup: LM-as-Judge in Production
A practical production setup uses LM-as-judge as a sampling layer, not a scoring layer on every response. You cannot afford to judge every output in a high-volume application. Instead:
- Sample 5-10% of production outputs
- Run judge scoring on the sample
- Track moving averages over time
- Alert when judge scores drop below a threshold
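The four steps above can be sketched as a small monitor class. The default sample rate, window size, and threshold are placeholders that you would tune for your own traffic and rubric scale, and `JudgeMonitor` is a hypothetical name.

```python
import random
from collections import deque
from typing import Optional


class JudgeMonitor:
    """Samples outputs for judging and tracks a moving average of judge scores."""

    def __init__(self, sample_rate: float = 0.05, window: int = 200,
                 threshold: float = 3.5, seed: Optional[int] = None) -> None:
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the most recent scores
        self.rng = random.Random(seed)

    def should_sample(self) -> bool:
        """Decide whether this production output goes to the judge."""
        return self.rng.random() < self.sample_rate

    def moving_average(self) -> float:
        """Mean of the scores currently in the window."""
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Store a judge score; return True when an alert should fire."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.moving_average() < self.threshold
```

Waiting until the window is full before alerting avoids paging someone over the first low score after a deploy.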
For regression testing when you change a prompt or switch models, run the full eval suite (your offline golden dataset) through the judge and compare aggregate scores before and after the change.
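A before/after comparison on the golden dataset can be as simple as the sketch below. It assumes both runs scored the same examples in the same order. The function name `compare_runs` and the 0.1 tolerance are arbitrary choices for this example.

```python
from statistics import mean
from typing import Dict, List


def compare_runs(before: List[float], after: List[float],
                 tolerance: float = 0.1) -> Dict[str, object]:
    """Compare per-example judge scores from two runs over the same dataset."""
    if len(before) != len(after):
        raise ValueError("both runs must score the same examples")
    deltas = [a - b for b, a in zip(before, after)]
    mean_delta = mean(deltas)
    return {
        "mean_before": mean(before),
        "mean_after": mean(after),
        "mean_delta": mean_delta,
        "worse_examples": sum(d < 0 for d in deltas),  # count of regressions
        "regressed": mean_delta < -tolerance,
    }


# Hypothetical judge scores before and after a prompt change
report = compare_runs(before=[4, 5, 3, 4], after=[4, 4, 3, 3])
```

Looking at the count of worse examples alongside the mean helps catch a change that improves most cases but breaks a few badly.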
Tools That Implement LM-as-Judge
Several eval frameworks have built-in LM-as-judge support:
- Braintrust: hosted eval platform with LM-as-judge scoring built in
- LangSmith: has an "evaluator" type that wraps any model as a judge
- promptfoo: open source, supports custom judge prompts in YAML config
- OpenAI Evals: eval framework whose model-graded evals are a form of LM-as-judge