Agents that perform well in development fail in predictable ways in production. The failures are not random: they cluster around a set of known failure modes that appear repeatedly across different agent implementations and use cases. Understanding these failure modes before deployment is the difference between a reliable production system and an on-call nightmare.
Failure Mode 1: Context Overflow on Long Tasks
Agents lose track of their goals when the conversation grows beyond what fits comfortably in the context window. This does not require hitting the hard limit: performance degrades significantly when the context is 70-80% full, because the model attends less strongly to early instructions.
A long task with many tool calls accumulates context fast. A search tool that returns 2,000 tokens per call eats through a 128k context window in 60-70 calls. At that point, the agent may still "remember" the goal in a literal sense (it is still in the context), but its behavior changes: it starts repeating steps it already took, loses track of intermediate results, and generates reasoning that does not account for what it already learned.
What helps: Summarize completed steps before they push critical context out of the attention window. After every 10 tool calls, inject a structured summary: "Steps completed: X, Y, Z. Current goal: A. Remaining unknowns: B." This keeps the goal and progress visible near the current position in context.
Also set hard limits on tool calls per task. If a task requires more than 30 tool calls, it is either the wrong task for an agent or it needs to be decomposed into subtasks with separate context windows.
Failure Mode 2: Tool Call Loops
An agent gets stuck retrying a failed tool call without recognizing that it has already tried and failed. This happens when: the error message is ambiguous, the agent does not have a way to signal "I cannot complete this task," or the max iteration limit is too high.
The symptom is easy to diagnose in logs: repeated identical or near-identical tool calls with the same arguments, returning the same error.
What helps: Track tool call history and inject it into context. Before each new tool call, the runtime checks if the same call was made in the last 5 attempts. If yes, it injects: "Warning: you have already tried this tool call 3 times with the same arguments and received the same error. Try a different approach or call report_failure."
Also include an explicit failure-reporting tool (see the tool use design patterns post). Agents that have a structured exit path use it. Agents that do not have one loop until they hit the max iteration limit.
Failure Mode 3: Hallucinated Tool Parameters
The agent calls a tool with arguments that do not exist or are in the wrong format. This is most common with tools that accept complex nested parameters, optional parameters with tricky defaults, or parameters that require specific value formats (UUIDs, date strings, enum values).
What helps: JSON Schema validation on tool inputs before execution. If the schema check fails, return a structured error: "Invalid parameter: date must be in YYYY-MM-DD format, received May 15." Models recover from explicit format errors reliably.
Also simplify tool schemas wherever possible. Every optional parameter with a non-obvious default is a hallucination risk. If a parameter has fewer than 5 valid values, list them in the description explicitly.
Failure Mode 4: Timeout Handling
An external tool takes too long and the agent stalls. This is especially common with web search, external APIs with rate limits, and any tool that makes network calls.
Without explicit timeout handling, an agent waiting on a stuck tool call hangs indefinitely. The entire task is blocked.
What helps: Every tool that makes network calls must have a timeout. Return a structured timeout error ("Tool timed out after 30 seconds") rather than hanging. The agent then decides whether to retry, try an alternative tool, or report failure.
Set different timeouts for different tool categories: fast local tools (< 1 second), medium network tools (< 10 seconds), slow external APIs (< 60 seconds). Surface these in monitoring so you can identify which tools are causing timeouts.
Failure Mode 5: Cost Overruns
Open-ended agents with no cost ceiling consume unlimited tokens on complex tasks. A task that should cost $0.50 in LLM calls consumes $15 because the agent took a circuitous path with many tool calls and long context.
What helps: Set a token budget per task type. Track tokens consumed in the runtime layer, not in the agent. When the budget is 80% consumed, inject a message: "You have used most of your token budget. Wrap up the task with what you have or report that more budget is needed." When the budget is exceeded, terminate and log.
Report token cost per task in your monitoring dashboard. Anomaly detection on cost per task catches runaway agents before the bill arrives.
Monitoring: What to Track
Every production agent deployment needs these metrics:
Steps per task: average and 95th percentile. A spike in steps-per-task is usually a loop or context overflow.
Cost per task: by task type. Enables anomaly detection and trend monitoring.
Failure rate by tool: which tools fail most often. Indicates reliability issues with external dependencies.
Hallucination rate: how often tool calls have invalid parameters. Tracked via schema validation failure rate.
Task completion rate: by task type and over time. Regressions indicate model updates or environment changes.
Time to completion: latency for each task type. Useful for identifying slow tools and context overflow (both cause latency spikes).
What Helps Most
In order of impact:
- Hard limits on iterations and tokens. Non-negotiable. Every production agent must have these.
- Explicit failure-reporting tool. Agents need a structured exit path.
- Idempotent tools wherever possible. If a tool can be called multiple times without side effects, tool call loops become harmless.
- Step-by-step logging of every tool call and result. Required for debugging.
- Human escalation path. Define what happens when an agent cannot complete a task: who gets notified, what they see, how they intervene.
The agents that work reliably in production are the ones where the failure modes were designed around, not discovered after deployment.
Keep Reading
- How to Build an AI Agent — foundational architecture that determines which failure modes you are exposed to
- Tool Use in LLMs: Design Patterns for Reliable Agent Actions — design patterns that prevent hallucinated parameters and loops
- How to Evaluate AI Agents — how to build eval suites that catch production failures before deployment
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.