How to Monitor an AI Agent in Production
AI agents fail silently in ways uptime dashboards miss. A five-step playbook for tracing, evaluating, and alerting on autonomous agents in production.
Your agent returned a confident answer. Your logs show a 200 OK. Your uptime dashboard is green. And yet the agent hallucinated a SQL clause, silently deleted the wrong record, and moved on.
That is the core problem with applying traditional monitoring to autonomous AI agents. Normal APM tells you whether a request succeeded. It cannot tell you whether the agent did the right thing. In 2026, with agents calling external APIs, writing to databases, and spawning sub-agents, the gap between "the process finished" and "the process did what we intended" is where production incidents live.
This playbook gives you five concrete steps to close that gap.
Why agent monitoring is different
Classic application monitoring rests on a simple model: a request comes in, code runs, a response goes out. Determinism is assumed. Errors produce exceptions. Latency is measurable per hop.
Agents break all three assumptions:
- Non-determinism. The same input on two different runs can produce structurally different tool-call sequences. Comparing run A to run B is not meaningful without an eval framework that understands intent, not just output shape.
- Multi-step traces. A single user message may trigger a planner call, three tool calls, a sub-agent handoff, and a synthesis step. Each step can fail, hallucinate, or amplify an earlier error. A single span is not enough.
- Silent failures. Correct and incorrect agent runs can produce traces that look identical at the HTTP layer. The agent returns 200, uses valid JSON, and costs the expected number of tokens. The only signal that something is wrong lives inside the content, not the envelope.
Effective agent monitoring therefore needs three layers that classic APM lacks: structured span tracing across every reasoning step, continuous evaluation of outputs against ground truth, and runtime signals that catch behavioral drift before users do.
Step 1: Instrument full-trace spans
Start with OpenTelemetry-compatible span tracing across every agent step, not just the top-level request.
The target trace shape looks like this:
run_id: abc123
  span: planner_call (model=gpt-4o, tokens=1,840, latency=1.2s)
  span: tool_call::search (query="Q3 revenue", status=ok, latency=420ms)
  span: tool_call::sql (query="SELECT...", status=ok, rows=12)
  span: synthesis_call (model=gpt-4o, tokens=640, latency=0.9s)
Each span should carry: the model name and version, token counts (prompt + completion), latency, tool arguments and return values, and any error codes.
LangSmith generates this trace structure automatically for LangChain agents and ships an Insights Agent that clusters traces by failure pattern. Langfuse (acquired by ClickHouse in January 2026, still open-source) does the same with a self-hosted option, which matters if you process regulated data that cannot leave your VPC. AgentOps supports 400+ frameworks and adds time-travel debugging so you can replay any historical trace step by step.
The instrumentation rule is simple: if a step can fail, it needs a span.
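To make the "one span per step" rule concrete, here is a minimal sketch that uses only the Python standard library. It is a stand-in for a real OpenTelemetry exporter or a vendor SDK, not a replacement for one. The names `Span`, `RunTrace`, and the attribute keys are hypothetical and chosen to mirror the trace shape above.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One agent step. Field names are illustrative, not a standard schema."""
    name: str
    run_id: str
    attributes: dict = field(default_factory=dict)
    status: str = "ok"
    latency_ms: float = 0.0


class RunTrace:
    """Collects one span per agent step for a single run."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = Span(name=name, run_id=self.run_id, attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # A failed step still produces a span, marked as an error.
            record.status = "error"
            record.attributes["error"] = repr(exc)
            raise
        finally:
            record.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)


trace = RunTrace()
with trace.span("planner_call", model="gpt-4o") as s:
    s.attributes["tokens"] = 1840  # would come from the model response
with trace.span("tool_call::sql", query="SELECT...") as s:
    s.attributes["rows"] = 12  # would come from the tool result
```

Because the span is appended in the `finally` branch, a step that raises is still recorded, which is the property the rule above asks for.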
Step 2: Define evals and wire up LLM-as-judge
Spans tell you what the agent did. Evals tell you whether it did the right thing.
For each agent use case, define at least three eval dimensions:
- Task completion - Did the agent accomplish the stated goal? (binary or 0-1 score)
- Faithfulness - Are factual claims in the output grounded in the retrieved context, or did the agent invent them?
- Tool selection accuracy - Did the agent call the correct tools in a reasonable order, or did it invoke a write tool when a read was appropriate?
For structured tasks (data extraction, code generation), deterministic evals with expected outputs are faster and cheaper. For open-ended tasks (summarization, reasoning), LLM-as-judge works well: a second model scores the output against a rubric.
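The two eval styles can be sketched side by side. This is an illustration under assumptions: `exact_match_eval`, `judge_eval`, and `RUBRIC` are hypothetical names, and the `judge` argument is any callable that sends a prompt to a second model and returns its text reply. No specific vendor API is implied.

```python
from typing import Callable

# Hypothetical rubric text; a real one would be tuned per use case.
RUBRIC = (
    "Score from 0 to 1 how well the answer is supported by the context. "
    "Reply with the number only."
)


def exact_match_eval(output: dict, expected: dict) -> float:
    """Deterministic eval: fraction of expected fields that match exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if output.get(key) == value)
    return hits / len(expected)


def judge_eval(output: str, context: str, judge: Callable[[str], str]) -> float:
    """LLM-as-judge eval: ask a second model for a 0-1 score against the rubric."""
    prompt = f"{RUBRIC}\n\nContext:\n{context}\n\nAnswer:\n{output}\n\nScore:"
    raw = judge(prompt)
    try:
        score = float(raw.strip())
    except ValueError:
        return 0.0  # an unparseable judge reply is treated as a failed eval
    return min(1.0, max(0.0, score))  # clamp to the 0-1 range
```

In practice `judge` would wrap a model call. Passing a plain function such as `lambda prompt: "0.9"` is enough to test the plumbing offline.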
Braintrust is purpose-built for this loop. It runs evals in development, CI/CD, and production simultaneously so you catch regressions before they ship. Arize AI ships built-in eval primitives with drift detection and is OpenTelemetry-native via OpenInference, making it the natural fit if you are already on an OTel stack.
A practical starting point: run LLM-as-judge on 100% of traces for your two or three highest-risk agent actions (anything that writes data or calls external APIs), and run deterministic evals on every trace for cheaper operations.
Step 3: Set up runtime monitoring and alerting
Evals that only run in CI will miss production drift. You need continuous evaluation on live traffic with alert thresholds.
The four runtime signals that matter most:
Tool-call error rate. Track failures per tool per time window. A spike in failures on a specific tool (say, your CRM write tool) usually means an upstream API change broke the schema your agent was sending. Alert at 2x baseline over a 10-minute window.
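A sliding-window check for this signal might look like the sketch below. It assumes one monitor instance per tool and a baseline error rate you have already measured. The class name, the `min_calls` guard, and the default values are assumptions for illustration.

```python
from collections import deque


class ToolErrorRateMonitor:
    """Alerts when a tool's windowed error rate reaches multiplier x baseline."""

    def __init__(self, baseline_rate, window_seconds=600, multiplier=2.0, min_calls=20):
        self.baseline_rate = baseline_rate
        self.window_seconds = window_seconds
        self.multiplier = multiplier
        self.min_calls = min_calls  # avoid alerting on a handful of calls
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, timestamp, failed):
        self.events.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_alert(self):
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total >= self.multiplier * self.baseline_rate
```

The `min_calls` guard is a design choice rather than part of the rule above: without it, a single failure on a rarely used tool would trip the alert.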
Faithfulness score drop. If your LLM-as-judge faithfulness scores drop from 0.92 to 0.71 overnight, something changed: a model version update, a retrieval pipeline degradation, or a prompt regression. Alert when the rolling 1-hour average falls below your defined threshold.
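The rolling-average check can be sketched in a few lines. The class name and the 0.85 threshold in the usage note are hypothetical. The only inputs assumed are a timestamp in seconds and a faithfulness score between 0 and 1 for each evaluated trace.

```python
from collections import deque


class RollingScoreAlert:
    """Tracks a rolling average of eval scores over a time window."""

    def __init__(self, threshold, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, score) pairs, oldest first

    def add(self, timestamp, score):
        self.samples.append((timestamp, score))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(score for _, score in self.samples) / len(self.samples)

    def breached(self):
        avg = self.average()
        return avg is not None and avg < self.threshold
```

For example, with `RollingScoreAlert(threshold=0.85)`, a stream of 0.92 scores stays quiet, and the alert fires once enough 0.71 scores pull the one-hour average under the threshold.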
Unexpected tool-call sequences. Define the expected call graph for each agent task type. Flag any run that deviates significantly (e.g., an agent that is supposed to read-then-write instead calls write twice). This is one of the clearest signals of a misbehaving or compromised agent.
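One way to express an expected call graph is as a set of allowed transitions between tool calls. The sketch below assumes that representation. The task type `update_record`, the tool names, and the `START`/`END` markers are all hypothetical.

```python
# Allowed (previous_step, next_step) transitions per task type.
# START and END are synthetic markers for the run boundaries.
EXPECTED_TRANSITIONS = {
    "update_record": {
        ("START", "read"),
        ("read", "read"),
        ("read", "write"),
        ("write", "END"),
    }
}


def sequence_violations(task_type, tool_calls):
    """Return every transition in the run that the call graph does not allow."""
    allowed = EXPECTED_TRANSITIONS[task_type]
    path = ["START", *tool_calls, "END"]
    return [(a, b) for a, b in zip(path, path[1:]) if (a, b) not in allowed]


ok_run = sequence_violations("update_record", ["read", "write"])
bad_run = sequence_violations("update_record", ["write", "write"])
```

Here `ok_run` is empty, while `bad_run` reports both the write that happened without a preceding read and the repeated write, which gives the alert something specific to show a reviewer.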
Context window saturation. As conversations grow, agents approaching their context limit start truncating retrieved documents or ignoring earlier instructions. Track prompt token counts per run and alert as sessions approach 80% of the model's context window.
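The saturation check itself is a single ratio. In the sketch below the function name is hypothetical, and the context window size has to come from your model's documentation. The 128,000 figure in the usage note is only a placeholder.

```python
def context_saturation(prompt_tokens, context_window, alert_ratio=0.8):
    """Return (ratio, should_alert) for one run's prompt size."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    ratio = prompt_tokens / context_window
    return ratio, ratio >= alert_ratio
```

With a placeholder window of 128,000 tokens, `context_saturation(64_000, 128_000)` reports a ratio of 0.5 and no alert, while a 110,000-token prompt crosses the 80% line.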
Helicone captures these signals with zero SDK changes by proxying your LLM API calls through its endpoint. Note: as of March 2026 Helicone has been in maintenance mode after its founders joined Mintlify, so evaluate it for existing deployments rather than new ones.
For alerting, route signals to your existing incident tooling (PagerDuty, Slack, OpsGenie). The key is treating agent behavioral alerts with the same severity as infrastructure alerts.
Step 4: Track cost and latency per step
An agent that is accurate but costs $4 per conversation is not a sustainable production deployment.
Track cost and latency at the span level, not just the session level. Span-level data reveals the expensive steps: often a single synthesis call consumes 60-70% of session cost, while a cheaper reranker call could replace it for most queries.
Key metrics to instrument:
- Cost per successful task completion (not per session, which penalizes complex queries)
- p50/p95/p99 latency per span type (planner calls are slower than retrieval calls; treat them differently)
- Token efficiency ratio (output tokens divided by total tokens; low ratios indicate prompt bloat)
- Cache hit rate (semantic caching on repeated queries can cut costs 30-50% on high-traffic agents)
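Three of the metrics above reduce to short formulas. The sketch below is one possible reading of them. In particular, it assumes that cost per successful task means total spend (failed runs included) divided by the number of successful runs. The function names and the run record fields `cost_usd` and `succeeded` are hypothetical.

```python
import statistics


def cost_per_successful_task(runs):
    """Total spend, failed runs included, divided by the number of successes."""
    successes = sum(1 for run in runs if run["succeeded"])
    total_cost = sum(run["cost_usd"] for run in runs)
    return total_cost / successes if successes else float("inf")


def token_efficiency(prompt_tokens, completion_tokens):
    """Output tokens divided by total tokens; low values suggest prompt bloat."""
    total = prompt_tokens + completion_tokens
    return completion_tokens / total if total else 0.0


def latency_percentiles(latencies_ms):
    """p50/p95/p99 for one span type; needs at least two samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Compute `latency_percentiles` separately for each span type (planner, retrieval, synthesis) so that slow planner calls do not hide regressions in the faster steps.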
Set budget alerts at both the run level (e.g., kill any run exceeding $0.50) and the daily aggregate level. Runaway agent loops, where an agent retries indefinitely due to a malformed tool response, can burn through API quota in minutes without a hard cap.
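A per-run hard cap can be enforced with a small guard that every model or tool call charges against. This is a sketch under assumptions: `RunBudget`, `BudgetExceeded`, and the step limit are hypothetical names and defaults, and the per-call cost would come from your own token accounting.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its cost or step limit."""


class RunBudget:
    def __init__(self, max_usd=0.50, max_steps=25):
        self.max_usd = max_usd
        self.max_steps = max_steps
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, cost_usd):
        """Record one call's cost; raise if the run is now over budget."""
        self.spent_usd += cost_usd
        self.steps += 1
        if self.spent_usd > self.max_usd or self.steps > self.max_steps:
            raise BudgetExceeded(
                f"run stopped after {self.steps} steps, ${self.spent_usd:.2f} spent"
            )


budget = RunBudget(max_usd=0.50)
calls_made = 0
try:
    while True:  # simulated runaway retry loop
        budget.charge(0.12)
        calls_made += 1
except BudgetExceeded:
    pass  # in production: abort the run and emit an alert
```

The step limit is there as a second line of defense, since a loop of very cheap calls can run for a long time before it reaches the dollar cap.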
Step 5: Watch for signals of a compromised or misbehaving agent
In 2026, the worry that logs cannot tell you when an AI agent acts on its own is not hypothetical. Prompt injection attacks, where malicious content in retrieved documents hijacks agent behavior, are a documented production threat.
The runtime signals that flag a compromised or genuinely misbehaving agent:
- Instruction drift: The agent's stated reasoning in a chain-of-thought span contradicts its configured system prompt. Embed a lightweight classifier that checks whether final actions are consistent with the agent's defined role.
- Out-of-scope tool calls: An agent configured for customer support suddenly calls a billing write API it has access to but was never intended to use. Alert on any tool call that has not appeared in the agent's historical trace baseline.
- Anomalous external requests: If your agent makes HTTP tool calls, track the domains it contacts. A new domain appearing in production traces (especially one not in your allow-list) is a prompt injection red flag.
- Self-modification attempts: Any agent that attempts to modify its own system prompt, alter its tool list, or spawn more sub-agents than its configured maximum should trigger an immediate human-review alert.
These signals require that your tracing captures tool arguments at the full payload level, not just success/failure status. Partial logging makes forensic analysis impossible after an incident.
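Two of the signals above, out-of-scope tool calls and anomalous external requests, can be checked directly from logged tool arguments. The sketch below assumes you keep a set of tools seen in the agent's historical baseline and a domain allow-list. The function names, tool names, and domains are hypothetical.

```python
from urllib.parse import urlparse


def out_of_scope_calls(tool_calls, baseline_tools):
    """Tool names used in this run that never appear in the historical baseline."""
    return [tool for tool in tool_calls if tool not in baseline_tools]


def unlisted_domains(urls, allowed_domains):
    """Hosts contacted in this run that are not covered by the allow-list."""
    flagged = set()
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Accept an exact match or a subdomain of an allowed domain.
        if not any(host == d or host.endswith("." + d) for d in allowed_domains):
            flagged.add(host)
    return sorted(flagged)
```

Matching on `"." + d` rather than a bare suffix is deliberate: it keeps a look-alike host such as `notexample.com` from passing an allow-list entry for `example.com`.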
Putting it together: a minimal production stack
For a team shipping its first production agent:
- Tracing: LangSmith or Langfuse for span capture and trace storage
- Evals: Braintrust or Arize for LLM-as-judge pipelines and regression testing
- Runtime alerts: Threshold alerts on tool-call error rate and faithfulness score wired to your incident tooling
- Cost guardrails: Hard per-run budget caps at the SDK level
- Behavioral anomaly detection: Allow-list of expected tool calls per agent role; alert on deviations
The minimum viable commitment is one week of instrumentation before you ship to production. Agents that run unobserved in production are not autonomous systems. They are a liability.
Where to go next
For a side-by-side breakdown of the tools mentioned above and newer entrants in the space, see the Best AI agent observability tools guide on Toolradar, which tracks the full category with pricing, feature comparisons, and community reviews.