Evaluating an AI agent is harder than evaluating a chatbot. A chatbot produces a single response to a single input. An agent executes a multi-step trajectory: it reasons, calls tools, receives results, adjusts, and continues. Every step is an opportunity for error, and errors compound across steps. A task completion rate of 70% sounds acceptable until you realize that the failing 30% may be the hardest, most consequential tasks.
Why Agent Evaluation Is Harder
Chatbot evaluation is relatively straightforward: does the response answer the question correctly? You can compare the output to a ground truth answer and compute a score.
Agent evaluation requires evaluating the trajectory, not just the final output. An agent that completes a task by accident (flawed reasoning that happens to work, or a fortuitous tool result) is not behaving correctly, even though its task completion score is 1. An agent that fails, recognizes the failure, and reports it accurately is behaving better than one that produces a confident wrong answer.
Agents also run in environments, and the environment matters: a coding agent evaluated in a local code execution environment can behave differently from one evaluated in a remote sandbox. Evaluation results are only comparable when the environment is held constant.
Metric 1: Task Completion Rate
The baseline metric. Did the agent complete the assigned task? This is binary: yes or no. Partial credit requires a rubric.
Task completion rate is necessary but insufficient. It does not tell you how efficiently the task was completed, whether the correct path was taken, or what it cost.
For a meaningful task completion rate, you need:
- A clear, unambiguous definition of "complete" for each task.
- A deterministic evaluation function (not an LLM judge for the primary metric, since LLM judges introduce variance).
- A large enough test set to distinguish signal from noise (at least 50 tasks per category, ideally 200+).
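To make the test-set-size requirement concrete, here is a minimal sketch that computes a completion rate and an approximate 95% confidence interval around it. The Wilson score interval is one common choice and is an assumption here, not something the metric itself prescribes; the function names are illustrative.

```python
import math


def completion_rate(results: list[bool]) -> float:
    """Fraction of tasks whose deterministic check passed."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a pass rate (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same 70% completion rate, two different test-set sizes.
small = wilson_interval(35, 50)
large = wilson_interval(140, 200)
```

With 35 of 50 tasks passing, the interval spans roughly 0.56 to 0.81; with 140 of 200 it narrows to roughly 0.63 to 0.76. A seven-point change is hard to distinguish from noise at 50 tasks and much easier at 200.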
Metric 2: Step Efficiency
How many steps did the agent take to complete the task, compared to the optimal number of steps?
Step efficiency = optimal steps / actual steps.
An agent that completes a task in 12 steps when 4 are sufficient is inefficient and expensive. Step efficiency captures this. It requires knowing the optimal step count, which in practice usually means having human experts solve each task and record the fewest steps they needed.
For tasks where the optimal path is unclear, a proxy metric works: compare the agent's step count to the median of other agents solving the same task.
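Both the formula and the median proxy are short enough to sketch directly. The function names below are illustrative, and the peer step counts would come from your own runs.

```python
from statistics import median


def step_efficiency(optimal_steps: int, actual_steps: int) -> float:
    """Step efficiency = optimal steps / actual steps."""
    if optimal_steps <= 0 or actual_steps <= 0:
        raise ValueError("step counts must be positive")
    return optimal_steps / actual_steps


def relative_step_efficiency(agent_steps: int, peer_steps: list[int]) -> float:
    """Proxy when no optimal count is known: median peer steps / agent steps."""
    if agent_steps <= 0 or not peer_steps:
        raise ValueError("need a positive step count and at least one peer")
    return median(peer_steps) / agent_steps


# The 12-step agent on a 4-step task from the text above.
efficiency = step_efficiency(optimal_steps=4, actual_steps=12)
```

A value above 1.0 from `step_efficiency` suggests the recorded "optimal" count was not actually optimal and the reference should be revisited.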
Metric 3: Trajectory Correctness
Did the agent take the right path, even if it reached the right destination?
Trajectory correctness evaluates the intermediate steps, not just the final answer. An agent that debugs correctly identifies the root cause before fixing it. An agent that happens to fix the bug while chasing a wrong diagnosis has a correct final output but an incorrect trajectory.
Evaluating trajectory correctness requires human annotation or a trusted oracle. For each task, define the key trajectory checkpoints: what should the agent have done at step 3? At step 7? Automated checks against these checkpoints give a trajectory correctness score.
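One way to automate checkpoint checks is to express each checkpoint as a predicate over a step and require the checkpoints to be hit in order. This is a sketch under assumptions: the `Step` shape, the tool names, and the four debugging checkpoints are all hypothetical, and real checkpoints would come from your annotators or oracle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    tool: str       # hypothetical tool name, e.g. "run_tests"
    argument: str


@dataclass
class Checkpoint:
    name: str
    matches: Callable[[Step], bool]


def trajectory_score(steps: list[Step], checkpoints: list[Checkpoint]) -> float:
    """Fraction of checkpoints hit in order along the trajectory."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    hit = 0
    cursor = 0  # index of the first step not yet consumed by a checkpoint
    for checkpoint in checkpoints:
        for i in range(cursor, len(steps)):
            if checkpoint.matches(steps[i]):
                hit += 1
                cursor = i + 1
                break
        # A missed checkpoint leaves the cursor where it was.
    return hit / len(checkpoints)


# Hypothetical checkpoints for a debugging task.
DEBUG_CHECKPOINTS = [
    Checkpoint("reproduce the failure", lambda s: s.tool == "run_tests"),
    Checkpoint("inspect the code", lambda s: s.tool == "read_file"),
    Checkpoint("apply a fix", lambda s: s.tool == "edit_file"),
    Checkpoint("verify the fix", lambda s: s.tool == "run_tests"),
]

lucky = [Step("edit_file", "parser.py"), Step("run_tests", "")]
careful = [
    Step("run_tests", ""),
    Step("read_file", "parser.py"),
    Step("edit_file", "parser.py"),
    Step("run_tests", ""),
]
```

Here both trajectories could end with passing tests, but `lucky` hits only one of the four checkpoints in order while `careful` hits all four.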
Metric 4: Cost Per Task
How many tokens did the agent consume to complete the task? How does that translate to dollars?
Cost per task is a business metric, but it is also an agent quality metric. An agent that consumes 100k tokens per task may not be production-ready if the value of the task does not justify that spend, regardless of task completion rate. Cost efficiency is part of agent design, not an afterthought.
Track tokens consumed per step, per tool call, and per task completion. Break down cost by input tokens vs output tokens (output is typically more expensive). Set a cost budget per task type and flag runs that exceed it.
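A minimal sketch of that accounting follows. The per-million-token prices are placeholders, not any provider's real rates, and the `StepUsage` record is a hypothetical shape for whatever your agent framework logs.

```python
from dataclasses import dataclass

# Placeholder prices in dollars per million tokens (assumed, not real rates).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass
class StepUsage:
    input_tokens: int
    output_tokens: int


def task_cost(
    steps: list[StepUsage],
    input_price: float = INPUT_PRICE_PER_MTOK,
    output_price: float = OUTPUT_PRICE_PER_MTOK,
) -> float:
    """Dollar cost of one task, summed over all of its steps."""
    input_total = sum(s.input_tokens for s in steps)
    output_total = sum(s.output_tokens for s in steps)
    return (input_total * input_price + output_total * output_price) / 1_000_000


def over_budget(cost: float, budget: float) -> bool:
    """Flag a run whose cost exceeds the budget for its task type."""
    return cost > budget


run = [StepUsage(10_000, 500), StepUsage(12_000, 800)]
cost = task_cost(run)
flagged = over_budget(cost, budget=0.05)
```

Because each step typically resends the growing context, input tokens often dominate the total even when output tokens are priced higher, which is why the breakdown matters.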
Metric 5: Error Recovery Rate
When the agent encounters an error (tool failure, unexpected output, context limit), does it recover and continue, or does it fail catastrophically?
Error recovery rate = tasks recovered after error / tasks that encountered an error.
This metric requires injecting errors deliberately: return a tool failure, inject malformed data, truncate a context. Measure whether the agent detects the error, handles it gracefully, and continues toward the goal.
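Error injection can be as simple as wrapping a tool so that one chosen call fails. The sketch below assumes tools are plain callables taking and returning strings, which is a simplification; `InjectedToolError` and both function names are hypothetical.

```python
from typing import Callable


class InjectedToolError(RuntimeError):
    """Raised by the wrapper to simulate a tool failure."""


def fail_on_call(tool: Callable[[str], str], fail_at: int) -> Callable[[str], str]:
    """Wrap a tool so that its Nth call (1-indexed) raises instead of running."""
    calls = 0

    def wrapped(argument: str) -> str:
        nonlocal calls
        calls += 1
        if calls == fail_at:
            raise InjectedToolError(f"injected failure on call {calls}")
        return tool(argument)

    return wrapped


def error_recovery_rate(runs: list[tuple[bool, bool]]) -> float:
    """runs holds (encountered_error, completed_task) pairs, one per task."""
    outcomes = [completed for encountered, completed in runs if encountered]
    if not outcomes:
        raise ValueError("no run encountered an error")
    return sum(outcomes) / len(outcomes)


# Three runs hit an injected error; two of them still completed the task.
rate = error_recovery_rate([(True, True), (True, False), (False, True), (True, True)])
```

Runs that never encountered an error are excluded from the denominator, matching the definition above.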
Agents with high task completion rates but low error recovery rates are brittle in production, where errors are routine rather than exceptional.
Evaluation Frameworks
SWE-bench is the most widely used benchmark for coding agents. It presents 2,294 real GitHub issues from Python repositories and requires agents to produce patches that pass the associated tests. SWE-bench Verified, a human-validated subset of 500 issues, is the most reliable split for comparisons.
WebArena evaluates agents on realistic web tasks across self-hosted sites: shopping on an e-commerce store, posting to a forum, managing repositories, administering a CMS. Tasks are scored largely by programmatic checks on the outcome (did the order actually get placed?), which makes results more reproducible than human evaluation.
AgentBench covers a broader range of agent tasks, including operating system tasks, database queries, knowledge graph queries, web shopping, and web browsing. It is useful for measuring general agent capability rather than domain-specific performance.
Building Your Own Evaluation Suite
Public benchmarks measure general capability. You need domain-specific evaluation to measure capability on your actual tasks.
Step 1: Define your task taxonomy. What are the 5-10 types of tasks your agent handles? Define them precisely.
Step 2: Create test cases with deterministic success criteria. For each task type, start with 20-50 test cases and grow the set toward the 50+ per category needed for a stable completion rate. Each test case includes: input, expected output, and a deterministic check function. Do not rely on LLM judges for the primary success metric.
Step 3: Create deterministic test environments. If your agent calls external APIs, mock them. If it queries a database, use a test database with known contents. Deterministic environments make evaluation reproducible.
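Steps 2 and 3 can be sketched together: a test case carries its own check function, and each run gets a fresh in-memory stand-in for the external service. Everything named here (`EvalCase`, `FakeInvoiceAPI`, the agent signature) is hypothetical and only illustrates the shape.

```python
from dataclasses import dataclass
from typing import Callable


class FakeInvoiceAPI:
    """In-memory stand-in for an external invoicing service."""

    def __init__(self) -> None:
        self.invoices: dict[str, dict] = {}

    def create_invoice(self, customer: str, amount_cents: int) -> str:
        invoice_id = f"inv-{len(self.invoices) + 1}"
        self.invoices[invoice_id] = {"customer": customer, "amount_cents": amount_cents}
        return invoice_id


@dataclass
class EvalCase:
    task_type: str
    prompt: str
    check: Callable[[FakeInvoiceAPI], bool]  # deterministic check on final state


def run_case(case: EvalCase, agent: Callable[[str, FakeInvoiceAPI], None]) -> bool:
    api = FakeInvoiceAPI()  # fresh environment per run keeps results reproducible
    agent(case.prompt, api)
    return case.check(api)


case = EvalCase(
    task_type="invoicing",
    prompt="Invoice Acme Corp for $120.00",
    check=lambda api: any(
        inv == {"customer": "Acme Corp", "amount_cents": 12_000}
        for inv in api.invoices.values()
    ),
)


def scripted_agent(prompt: str, api: FakeInvoiceAPI) -> None:
    """Stand-in for a real agent, used only to exercise the harness."""
    api.create_invoice("Acme Corp", 12_000)
```

The check inspects the environment's final state rather than the agent's text output, so an agent that merely claims to have created the invoice fails the case.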
Step 4: Measure consistently. Run the full eval suite on every significant agent change. Track results over time. A regression from 78% to 71% task completion is a signal to investigate before shipping.
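Tracking results over time can start as a comparison of per-task-type completion rates between a baseline and a candidate. The 3-point threshold below is an arbitrary assumption; choose one that reflects the noise in your own suite.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.03,
) -> dict[str, float]:
    """Return task types whose completion rate fell by more than max_drop."""
    regressions = {}
    for task_type, old_rate in baseline.items():
        if task_type not in candidate:
            continue  # task type was not evaluated in the candidate run
        drop = old_rate - candidate[task_type]
        if drop > max_drop:
            regressions[task_type] = drop
    return regressions


baseline = {"invoicing": 0.78, "scheduling": 0.90}
candidate = {"invoicing": 0.71, "scheduling": 0.89}
flagged = find_regressions(baseline, candidate)
```

In this example only the invoicing drop from 78% to 71% is flagged; the one-point change in scheduling falls under the threshold.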
The Oracle Problem
The oracle problem is: who decides what "correct" looks like?
For coding tasks, a passing test suite is the oracle. For web tasks, a confirmed database entry is the oracle. For research tasks or writing tasks, there is no clean oracle. The answer depends on who is evaluating.
When a clean oracle is unavailable, use a rubric with multiple human evaluators and measure inter-rater agreement. If two humans cannot agree on whether the agent succeeded, the task definition is probably ambiguous.
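For two evaluators giving pass/fail verdicts, Cohen's kappa is one common agreement measure; using it here is an assumption, and other statistics exist for more than two raters. A minimal sketch:

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pass_a = sum(rater_a) / n
    pass_b = sum(rater_b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        # Both raters gave one identical verdict to every task; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


full_agreement = cohens_kappa([True, True, False, False], [True, True, False, False])
chance_level = cohens_kappa([True, True, False, False], [True, False, True, False])
```

A kappa near 1 means the raters agree well beyond chance; a value near 0 means their agreement is about what random verdicts would produce, which usually points back to an ambiguous task definition.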
LLM judges (using an LLM to evaluate another LLM's output) are useful as a secondary signal for tasks without clean oracles, but they have their own error rates and biases. Never use an LLM judge as the sole evaluation signal.