The Observability Gap in AI Agents
When a traditional API fails, you look at logs. When an AI agent produces wrong output or fails mid-task, the failure is harder to diagnose: was it the LLM's reasoning? A tool call that returned bad data? An error in how results were parsed? AgentOps fills this gap by recording everything an agent does in a structured, replayable session.
Quick Setup
import agentops
from openai import OpenAI

# Start an AgentOps session before creating the client
agentops.init(api_key="YOUR_AGENTOPS_API_KEY")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# All subsequent OpenAI calls are automatically tracked
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Plan a 3-day trip to Tokyo"}],
)

# Close the session and record its end state
agentops.end_session("Success")
Every LLM call in the session is logged with its model name, prompt, completion, latency, token counts, and estimated cost. Sessions appear in the AgentOps dashboard immediately.
Session Recording
A session captures the complete execution trace of an agent run:
- LLM calls — input messages, output, model, latency, tokens, estimated cost
- Tool calls — which tool was called, what arguments were passed, what it returned
- Agent decisions — the chain of reasoning that led to each action
- Errors — exceptions, API failures, parsing errors with full stack traces
- End state — Success, Fail, or Indeterminate
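The end state in the last bullet is the value the caller passes when closing the session, as in the `agentops.end_session("Success")` call in Quick Setup. Below is a minimal sketch of choosing that value from the outcome of a run. `run_with_end_state` and its `end_session` parameter are hypothetical names, and treating a `None` result as Indeterminate is an assumption of this sketch. The callable is injected so the example runs without AgentOps installed; in real code you would presumably pass `agentops.end_session`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_end_state(
    task: Callable[[], T],
    end_session: Callable[[str], None],
) -> Optional[T]:
    """Run task and close the session with an end state matching the outcome.

    Hypothetical helper: end_session stands in for agentops.end_session.
    """
    try:
        result = task()
    except Exception:
        # Any exception from the agent run marks the session as failed
        end_session("Fail")
        raise
    if result is None:
        # Assumption for this sketch: no result means the outcome is unclear
        end_session("Indeterminate")
    else:
        end_session("Success")
    return result


# Usage with a stand-in recorder instead of a live AgentOps session
recorded = []
run_with_end_state(lambda: "itinerary", recorded.append)
```

Re-raising after recording "Fail" keeps the original exception visible to the caller, so the session status and the program's own error handling stay consistent.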
CrewAI Integration
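from crewai import Agent, Task, Crew
import agentops

agentops.init(api_key="YOUR_KEY")

researcher = Agent(
    role="Researcher",
    goal="Find the latest AI developments",
    backstory="Expert research analyst",
)
# Minimal placeholder task so the crew has something to run (illustrative values)
research_task = Task(
    description="Summarize the latest AI developments",
    expected_output="A short bullet-point summary",
    agent=researcher,
)

# AgentOps automatically tracks all LLM calls made by CrewAI agents
crew = Crew(agents=[researcher], tasks=[research_task])
result = crew.kickoff()
agentops.end_session("Success")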
Cost Attribution
AgentOps tracks per-session cost broken down by model and call type. For multi-agent workflows running at scale, this lets you identify which agents or tasks account for most of your LLM spend, which is often the first step toward optimization.
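To illustrate what such a breakdown amounts to, the sketch below groups per-call costs by agent and by model, then reports the largest contributor. The record fields (`agent`, `model`, `cost_usd`) and the sample figures are hypothetical, invented for this example. They are not AgentOps's export schema or real prices.

```python
from collections import defaultdict

# Hypothetical per-call records; field names and costs are made up
calls = [
    {"agent": "researcher", "model": "gpt-4o", "cost_usd": 0.030},
    {"agent": "researcher", "model": "gpt-4o", "cost_usd": 0.050},
    {"agent": "writer", "model": "gpt-4o-mini", "cost_usd": 0.002},
    {"agent": "writer", "model": "gpt-4o", "cost_usd": 0.018},
]


def cost_by(records, key):
    """Sum cost_usd per distinct value of `key`, highest spend first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_agent = cost_by(calls, "agent")
by_model = cost_by(calls, "model")

# Share of total spend attributable to the most expensive agent
top_agent, top_cost = by_agent[0]
share = top_cost / sum(c["cost_usd"] for c in calls)
print(f"{top_agent} accounts for {share:.0%} of spend")
```

Sorting by spend puts the optimization candidates first, so the head of each list is where a cheaper model or a shorter prompt would save the most.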
Error Replay
When an agent run fails, AgentOps records the exact state at failure: what was in context, what tool was being called, what the error was. You can replay the failed session in the dashboard and step through the execution to identify the root cause.
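To make "the exact state at failure" concrete, here is a small sketch of the kind of record a replay needs: the context, the tool being called, and the error with its stack trace. `FailureSnapshot`, `call_tool`, and `lookup_flight` are hypothetical names for illustration, and they do not reflect how AgentOps stores or captures this internally.

```python
import traceback
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class FailureSnapshot:
    """Hypothetical record of agent state at the moment a tool call failed."""
    context: List[dict]        # messages in context at failure
    tool_name: str             # which tool was being called
    tool_args: Dict[str, Any]  # arguments passed to the tool
    error: str                 # exception type and message
    stack_trace: str           # full formatted traceback


def call_tool(
    name: str,
    fn: Callable[..., Any],
    args: Dict[str, Any],
    context: List[dict],
) -> Tuple[Any, Optional[FailureSnapshot]]:
    """Call a tool; on failure return a snapshot instead of raising."""
    try:
        return fn(**args), None
    except Exception as exc:
        snapshot = FailureSnapshot(
            context=list(context),  # copy so later mutation does not alter the record
            tool_name=name,
            tool_args=dict(args),
            error=f"{type(exc).__name__}: {exc}",
            stack_trace=traceback.format_exc(),
        )
        return None, snapshot


def lookup_flight(code: str) -> dict:
    raise KeyError(code)  # simulate a tool that cannot find the record


context = [{"role": "user", "content": "Plan a 3-day trip to Tokyo"}]
result, snapshot = call_tool("lookup_flight", lookup_flight, {"code": "ABC123"}, context)
```

Copying the context and arguments at the moment of failure matters because the agent loop may keep mutating them afterwards; a replay is only useful if it shows what the agent saw at that moment.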
AgentOps vs LangSmith
LangSmith is deeply integrated with the LangChain ecosystem and excels at tracing LangChain chains and agents. AgentOps is framework-agnostic (it works equally well with CrewAI, AutoGen, and direct OpenAI calls) and has stronger session-level cost tracking. If you're using LangChain heavily, LangSmith is the natural choice. For multi-framework or framework-free agent work, AgentOps provides better coverage.