Your Agent Isn't Crashing. That's the Problem.

Six weeks. That's how long a support team ran an AI classification agent before someone thought to audit its outputs — not because anything looked wrong, but because a manager got suspicious that the urgent ticket queue felt lighter than it should. What they found: 23% of high-priority tickets had been quietly routed to a low-priority queue since day one. No exception was ever raised. No alert fired. Error rate: 0.0%. The dashboard was green the entire time.

The agent wasn't crashing. The agent was wrong. Those are different problems, and almost every monitoring system in production today is built to catch only one of them.

Why Agents Fail Differently

Traditional code fails loudly. A function called with the wrong type throws a TypeError. A database connection that drops returns an error code. An HTTP 500 propagates up the stack, hits your error rate alert, and pages someone at 2 AM. The failure is detectable because it manifests as an event — an exception, a status code, a timeout — that existing tooling was designed to surface.

The working assumption underneath all of this is that if no exception occurs, something like correctness has been achieved. That assumption holds reasonably well for deterministic code. Because the same input produces the same output every time, a wrong result is reproducible and a test can pin it down. Test coverage formalizes this: write enough cases and you build confidence that the absence of errors implies the presence of correctness.

Agents break this contract completely.

An AI classification agent takes a ticket description as input and produces a category label as output. If the label is wrong, the call still returns 200. The latency is normal. The response body is valid JSON. From the perspective of every monitoring primitive that exists — latency, throughput, error rate — the call succeeded. The failure is in the semantics of what was returned, and semantics is exactly what APM tools have never had an opinion about.

The failure also doesn't reproduce cleanly. A bug in deterministic code produces the same failure on the same input every time, which is what makes test-driven development tractable. An agent classifying tickets might get 95% right and systematically mishandle a specific subset: tickets written in a particular register, tickets describing a specific product category, tickets submitted late at night, a period the training data covered thinly. That pattern looks like statistical noise until someone sits down and aggregates the outputs. By then, six weeks have passed.
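To make "aggregates the outputs" concrete, here is a minimal sketch of slicing an audited sample by one field. The record shape, the field names (`product`, `predicted`, `actual`), and the sample data are all hypothetical; the point is only that an overall accuracy number can look tolerable while one slice is badly wrong.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Group audited records by one field and compute accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        key = rec[slice_key]
        totals[key] += 1
        if rec["predicted"] == rec["actual"]:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}

# Hypothetical audited sample: the agent's label next to a human-assigned label.
audited = [
    {"product": "billing", "predicted": "high", "actual": "high"},
    {"product": "billing", "predicted": "low", "actual": "low"},
    {"product": "billing", "predicted": "high", "actual": "high"},
    {"product": "hardware", "predicted": "low", "actual": "high"},
    {"product": "hardware", "predicted": "low", "actual": "high"},
    {"product": "hardware", "predicted": "high", "actual": "high"},
]

overall = sum(r["predicted"] == r["actual"] for r in audited) / len(audited)
per_product = accuracy_by_slice(audited, "product")

print(f"overall accuracy: {overall:.2f}")
for product, acc in sorted(per_product.items()):
    print(f"{product}: {acc:.2f}")
```

The same function can be pointed at any other field an audit records (hour of submission, language, channel) to check whether errors cluster there.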

The Monitoring Gap

Every major APM tool — Datadog, New Relic, Sentry — was built around three primitives: latency, throughput, error rate. These map cleanly onto deterministic systems. An HTTP request either completes within acceptable bounds or it doesn't. An error either propagates to your collector or it doesn't. The tooling is excellent at what it was designed for.

LLM calls introduce a fourth primitive that none of these tools track: semantic correctness. Was the output actually right? Not syntactically valid — semantically correct. Did the classification match the intent? Did the summary preserve the key detail? Did the routing decision reflect the actual priority? Current tooling has no opinion on any of this. It sees an HTTP 200 and records a success.

The observability ecosystem has started to notice the gap. Honeycomb's LLM observability spec, updated in early 2025, added dedicated span types for model invocations — capturing what model was called, with what parameters, and what it returned. That's a meaningful step. It answers "what did the agent say" in structured, traceable form. It still doesn't answer "was what it said correct." The capture is there. The evaluation layer is missing.

Braintrust and Langfuse have moved furthest toward filling that gap. Both support storing evaluation scores alongside traces — so a span doesn't just record the model's output, it records a verdict about whether that output was right. Langfuse, open-source and self-hostable, lets teams define their own evaluation functions and attach the results to production traces in real time. Braintrust builds a continuous eval loop directly into the observability model. Neither is a full solution, but both are operating in the right conceptual territory: monitoring that has an opinion about meaning, not just execution.

Most teams deploying agents today have neither. They have request logs and a dashboard that's never been wrong, because it was never measuring the thing that matters.

What Semantic Instrumentation Actually Looks Like

The pattern that works is a secondary evaluation pass running immediately after each inference. The primary model produces its output. A fast, cheaper secondary model — or a deterministic rule set, depending on the task — evaluates that output and returns a verdict. Did the output match an expected category? Did it preserve required information? Was the reasoning coherent? That verdict gets logged alongside the trace.
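A minimal sketch of that pattern, with both the primary model and the evaluator passed in as plain callables. Everything named here (`Trace`, `classify_with_evaluation`, the toy classifier, the keyword evaluator) is a stand-in invented for illustration; a real evaluator would more likely be a cheaper model call or a task-specific rule set.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    ticket_text: str
    label: str
    verdict: str = "unevaluated"  # assumed values: "correct", "incorrect", "ambiguous"

def classify_with_evaluation(ticket_text, classify, evaluate):
    """Run the primary classifier, then a secondary evaluation pass on its output."""
    label = classify(ticket_text)
    trace = Trace(ticket_text=ticket_text, label=label)
    # The verdict is stored on the trace so it is logged with the output it judges.
    trace.verdict = evaluate(ticket_text, label)
    return trace

def toy_classifier(text):
    """Stand-in for the primary model: always answers 'low'."""
    return "low"

def keyword_evaluator(text, label):
    """Stand-in for the evaluator: a deterministic rule keyed on urgency words."""
    urgent = any(word in text.lower() for word in ("outage", "down", "data loss"))
    if urgent and label != "high":
        return "incorrect"
    return "correct"

trace = classify_with_evaluation(
    "Production is down for all users", toy_classifier, keyword_evaluator
)
print(trace.label, trace.verdict)
```

Keeping the evaluator as an injected function means the same wrapper works whether the check is a rule, a small model, or a sampled human review.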

This turns a trace from "the agent returned X" into "the agent returned X, and X was assessed as correct/incorrect." That signal can be aggregated, alerted on, and trended over time. If correctness scores drop from 97% to 82% over a weekend, you know before a manager audits the queue.
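One way to turn those verdicts into an alertable signal is a rolling correctness rate. The sketch below is an assumption-laden illustration: the class name, the window size, the 90% threshold, and the minimum sample count are all made-up values that a team would tune for its own traffic.

```python
from collections import deque

class CorrectnessMonitor:
    """Track the share of 'correct' verdicts over the most recent evaluations."""

    def __init__(self, window_size=500, threshold=0.90, min_samples=50):
        self.verdicts = deque(maxlen=window_size)  # old verdicts fall off the left
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, is_correct):
        self.verdicts.append(bool(is_correct))

    def rate(self):
        """Return the rolling correctness rate, or None if nothing is recorded."""
        if not self.verdicts:
            return None
        return sum(self.verdicts) / len(self.verdicts)

    def should_alert(self):
        # Stay quiet until there are enough samples for the rate to mean something.
        if len(self.verdicts) < self.min_samples:
            return False
        return self.rate() < self.threshold

monitor = CorrectnessMonitor(window_size=100, threshold=0.90, min_samples=50)
for is_correct in [True] * 97 + [False] * 3:
    monitor.record(is_correct)
print(monitor.rate(), monitor.should_alert())

# A later window in which quality has slipped.
for is_correct in [True] * 82 + [False] * 18:
    monitor.record(is_correct)
print(monitor.rate(), monitor.should_alert())
```

A fixed-size window keeps memory bounded and makes the rate reflect recent behavior rather than the lifetime average, which is what a drift alert needs.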

The economics require judgment. Running a full evaluation on every inference adds latency and cost that scale with volume. Some teams run semantic evaluation on a 5–10% sample in production, which is enough to detect statistical drift but not enough to catch every individual failure. Others use deterministic rules for structured tasks: if the output is supposed to be one of five category labels, check whether it is one of those five labels and flag anything that isn't before it affects downstream systems. That's not semantic evaluation in the rich sense, but it catches the structural precondition for correctness.
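Both cheaper options fit in a few lines. In this sketch the five labels are invented, and hashing a trace ID to choose the sample is one possible approach rather than an established practice; it has the advantage that the same trace is always in or out of the sample, no matter which service asks.

```python
import hashlib

# Hypothetical label set; a real deployment would load its own.
ALLOWED_LABELS = {"billing", "bug", "feature_request", "account", "other"}

def is_valid_label(label):
    """Structural check: the output must be exactly one of the allowed labels."""
    return label in ALLOWED_LABELS

def in_eval_sample(trace_id, sample_rate=0.10):
    """Deterministically select roughly sample_rate of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate

def needs_attention(trace_id, label, sample_rate=0.10):
    """Flag malformed labels always; send a sample of well-formed ones to evaluation."""
    if not is_valid_label(label):
        return "flag_invalid_label"
    if in_eval_sample(trace_id, sample_rate):
        return "run_semantic_eval"
    return "skip"

print(needs_attention("trace-0001", "Billing "))  # not an exact match for any label
```

The structural check runs on every call because it costs almost nothing; only the expensive semantic evaluation is sampled.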

For higher-stakes applications — medical triage, financial classification, legal document routing — some teams run synchronous evaluation on every call and treat an ambiguous verdict as a fallback condition: route to a human queue rather than auto-classify. The latency cost is real. So is the alternative.
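A sketch of that fallback, assuming an evaluator that returns one of three verdict strings. The queue name, the stub classifier, and the stub evaluator are hypothetical. Note one design choice that goes slightly beyond the text above: this version sends every verdict other than "correct" to a human, not only the ambiguous ones.

```python
HUMAN_QUEUE = "human_review"  # hypothetical queue name

def route_ticket(ticket_text, classify, evaluate):
    """Classify, evaluate synchronously, and auto-route only on a 'correct' verdict."""
    label = classify(ticket_text)
    verdict = evaluate(ticket_text, label)  # blocks the request: this is the latency cost
    queue = label if verdict == "correct" else HUMAN_QUEUE
    return {"queue": queue, "label": label, "verdict": verdict}

def stub_classifier(text):
    """Stand-in for the primary model: always answers 'routine'."""
    return "routine"

def stub_evaluator(text, label):
    """Stand-in evaluator with one hard rule and one 'too little to judge' rule."""
    if "chest pain" in text.lower():
        return "correct" if label != "routine" else "incorrect"
    if len(text.split()) < 3:
        return "ambiguous"
    return "correct"

for text in ("Patient reports chest pain", "Help", "Requesting a prescription refill"):
    print(text, "->", route_ticket(text, stub_classifier, stub_evaluator)["queue"])
```

The original label is kept in the result even when the ticket goes to a human, so reviewers can see what the agent proposed and the disagreement can be logged.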

The tooling infrastructure for this is still young. There's no standard evaluation SDK that plugs into OpenTelemetry the way tracing instrumentation does for HTTP calls. Teams building semantic instrumentation today are largely building it themselves: writing evaluation functions, attaching scores to spans manually, and aggregating results in custom dashboards. LLM streaming architectures compound the challenge. When output arrives as a token stream rather than a single discrete response, capturing the full text for evaluation requires additional buffering that many teams haven't implemented.
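The buffering problem is small once it is named. The sketch below wraps a token stream in a generator that passes tokens through untouched while keeping a copy, then attaches an evaluation score afterward. The span is a plain dictionary standing in for whatever tracing object a team uses, and the attribute names (`output`, `eval.score`) are invented here, not taken from any tracing standard.

```python
def stream_and_capture(token_stream, span):
    """Yield tokens unchanged while buffering them so the full output can be evaluated."""
    buffer = []
    try:
        for token in token_stream:
            buffer.append(token)
            yield token
    finally:
        # Runs when the stream ends or the consumer closes the generator early.
        span["output"] = "".join(buffer)

def attach_eval(span, evaluate):
    """Score the captured output and store the result on the same span."""
    span["eval.score"] = evaluate(span["output"])
    return span

span = {"name": "classify_ticket"}
fake_stream = iter(["pri", "ority", ": ", "high"])  # stand-in for a model's token stream

for token in stream_and_capture(fake_stream, span):
    pass  # a real caller would forward each token to the client here

attach_eval(span, lambda text: 1.0 if text.endswith("high") else 0.0)
print(span)
```

Because the wrapper is a generator, the client still sees tokens as they arrive; the evaluation only needs to wait for the stream to finish.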

The lack of a standard playbook is a real friction point. But the teams doing it are finding something consistent: their agents were wrong more often than their dashboards suggested. The evaluation pass didn't reveal a catastrophe — it revealed a gap between "appears to be working" and "is actually working" that had been invisible because no one had built the instrument to see it.

The Conceptual Problem Underneath the Tooling Gap

The monitoring gap isn't fundamentally a tooling failure. The tools could be written. The gap persists because most engineering teams are operating under a mental model that was correct for the systems they built before.

"No alert means working." That model was built for code that fails loudly. Exceptions propagate. Status codes carry meaning. Latency spikes are detectable. The heuristic was reliable enough that it became reflex: if the dashboard is green, the system is healthy.

Agents fail quietly. They produce outputs that are structurally valid, statistically reasonable, and semantically wrong, and they do it consistently, in ways that accumulate across thousands of transactions before the aggregate pattern becomes undeniable. The 23% misrouting rate in the opening scenario didn't happen because something broke. It happened because something was wrong from the start, and "wrong" isn't an event that existing monitoring detects.

Shifting the mental model requires adding a new instrument: one that evaluates meaning, not just execution. This is genuinely new territory. There's no fifteen-year-old standard practice to inherit, no playbook that's been validated across enough production deployments to call mature. Teams deploying agents while skipping the evaluation layer are discovering this the same way the support team discovered their misrouted tickets — by investigating a hunch, auditing outputs manually, and finding a failure that the dashboard had been hiding for weeks.

The discipline of semantic instrumentation exists. It's just not common yet.

Your monitoring is only as good as your definition of "correct." For agents, you probably haven't written that definition yet.