Your LLM Is Failing in Production. You Have No Idea Where.

A user files a support ticket. "The AI gave me completely wrong information about my refund status." Your engineer pulls the logs. Request in at 14:03:22. Response out at 14:03:26. HTTP 200. Latency: 3.8 seconds.

What went wrong? You don't know. The retrieval step returned documents — which ones, in what order, with what relevance scores? Gone. The prompt was constructed from a template — what did it look like after assembly, before the model saw it? Gone. The model responded — which context did it actually attend to, and where did its reasoning go sideways? Gone. You have a response and a status code. The four seconds in between are a black box.

That's the observability situation for most LLM deployments in 2026.

The Gap Nobody Fixed Before Launch

The industry built increasingly complex AI applications — RAG pipelines, multi-agent orchestration, tool-calling chains — and instrumented almost none of them. When something breaks, debugging means manually re-running the sequence, adding log statements, and making educated guesses about where the failure occurred.

This isn't a technical limitation. The tooling exists. It's a habit gap: because LLMs entered engineering pipelines as API calls, most teams never treated them as infrastructure components requiring traces.

That mindset works during prototyping. It stops working once your product is live, your failure modes are user-facing, and "unexpected LLM response" appears in your incident reports multiple times a week without a clear root cause.

The thesis is simple: LLM observability is the same discipline as database query tracing and network call monitoring — applied to a nondeterministic system where the need is arguably greater, because the failure modes are harder to reproduce.

What Tracing Actually Means for an LLM Pipeline

In conventional backend systems, distributed tracing captures every operation in a request's lifecycle — database calls, service-to-service requests, cache lookups, external API calls — as timestamped spans with attributes. When a request fails, you navigate the trace tree to find which span errored, how long it took, and what context it carried.
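For reference, a single conventional span looks roughly like this with the OpenTelemetry Python SDK (provider and exporter setup omitted; the service name, query, and helper are invented for illustration):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def fetch_orders(user_id: int) -> list[dict]:
    # Stand-in for the real database call.
    return [{"order_id": 17, "status": "refunded"}]

# One operation in the request's lifecycle becomes one timestamped span,
# carrying attributes you can inspect later when you walk the trace tree.
with tracer.start_as_current_span("db.query.orders") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE user_id = $1")
    rows = fetch_orders(user_id=42)
    span.set_attribute("db.rows_returned", len(rows))
```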

LLM pipelines have equivalent components that can and should be traced:

  • Retrieval — which chunks were fetched, from which index, with what relevance scores
  • Prompt rendering — the final text assembled from templates and retrieved context, exactly as it was sent to the model
  • Model inference — token counts (input and output), model version, temperature, stop reason
  • Output parsing — what the parser received, what it returned, whether it fell back to a default
  • Downstream actions — tool calls the model triggered, database writes, external API calls made as a consequence of the response
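Here is a minimal sketch of what those stages can look like as nested spans under one request, again with the OpenTelemetry Python API. Every span name, attribute key, and pipeline function below is invented for illustration; the stubs stand in for your actual retriever, template, model client, and parser, and a downstream-actions span would follow the same pattern.

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

# Stand-ins for the real pipeline stages; replace with your own components.
def retrieve(question):
    return [{"id": "policy.md#refunds", "score": 0.82, "text": "Refunds take 5 business days."}]

def render_prompt(question, chunks):
    return f"Context: {chunks[0]['text']}\n\nQuestion: {question}"

def call_model(prompt):
    return {"model": "example-model", "text": "About 5 business days.",
            "input_tokens": 212, "output_tokens": 9, "stop_reason": "stop"}

def parse(text):
    return text.strip(), False  # (parsed_output, used_fallback)

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("user.question", question)

        # Retrieval: which chunks came back, and how relevant the store thought they were.
        with tracer.start_as_current_span("rag.retrieval") as span:
            chunks = retrieve(question)
            span.set_attribute("retrieval.chunk_ids", [c["id"] for c in chunks])
            span.set_attribute("retrieval.scores", [c["score"] for c in chunks])

        # Prompt rendering: the exact text the model will see, post-assembly.
        with tracer.start_as_current_span("rag.prompt_render") as span:
            prompt = render_prompt(question, chunks)
            span.set_attribute("prompt.rendered", prompt)

        # Model inference: model id, token counts, stop reason.
        with tracer.start_as_current_span("rag.inference") as span:
            result = call_model(prompt)
            span.set_attribute("llm.model", result["model"])
            span.set_attribute("llm.input_tokens", result["input_tokens"])
            span.set_attribute("llm.output_tokens", result["output_tokens"])
            span.set_attribute("llm.stop_reason", result["stop_reason"])

        # Output parsing: record whether the parser had to fall back to a default.
        with tracer.start_as_current_span("rag.output_parse") as span:
            text, used_fallback = parse(result["text"])
            span.set_attribute("parser.used_fallback", used_fallback)

        return text
```

When a ticket like the one above lands, the trace for that request tells you immediately whether the retrieval span carried irrelevant chunks or the inference span went sideways despite good context.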

Without traces, you can't distinguish between "the model reasoned incorrectly" and "the retrieval gave it garbage to reason with." Those are different problems with different fixes. Treating them as the same thing is how you spend a week debugging the wrong layer.

In late 2024, the OpenTelemetry project published GenAI semantic conventions — standardized attribute names for tracing LLM calls — landing in OTel 1.29.0. These formalized what a trace for a model invocation should look like: input/output token counts, model identifier, temperature, stop reasons, error codes. Before this, every team invented their own schema and couldn't compare notes.
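For a sense of what that standardization buys you, here is a model-invocation span annotated with a handful of those attributes. The names follow the published GenAI conventions, which were still marked experimental, so check the current spec before wiring dashboards to them; the values are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-client")

# Attribute keys from the OpenTelemetry GenAI semantic conventions.
with tracer.start_as_current_span("chat example-model") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "example-model")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    span.set_attribute("gen_ai.usage.input_tokens", 212)
    span.set_attribute("gen_ai.usage.output_tokens", 9)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])
```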

The Failure Modes That Stay Invisible

Three categories of LLM failure are nearly impossible to diagnose without tracing:

Silent degradation. The model produces responses that are subtly wrong — factually off by a margin, tonally incorrect, missing a required step in a workflow — but the call never errors. The HTTP response is 200. The user gets output. The issue surfaces in user feedback weeks later, and by then you can't tie it back to a specific input, pipeline state, or document corpus condition.

Retrieval drift. In RAG systems, the documents returned by the vector store change over time as embeddings drift or the corpus shifts. If you're not capturing which chunks were retrieved for each request, you can't detect when retrieval quality degrades. The model may be responding accurately to what it received — the problem is what it received. These two things look identical from the outside.
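If the per-request scores are being captured (as in the retrieval span sketched earlier), detecting drift can be as simple as comparing a recent window against a baseline. The trace-store query is elided and the data shape is assumed; the point is that the comparison only exists because the scores were recorded.

```python
from statistics import mean

def retrieval_drift_report(traces: list[dict], baseline_mean: float,
                           tolerance: float = 0.1) -> dict:
    """Compare recent top-k retrieval scores against a baseline.

    `traces` is whatever your trace store returns for recent requests, each
    assumed to carry the per-request `retrieval.scores` attribute.
    """
    recent_scores = [s for t in traces for s in t.get("retrieval.scores", [])]
    recent_mean = mean(recent_scores) if recent_scores else 0.0
    return {
        "baseline_mean": baseline_mean,
        "recent_mean": round(recent_mean, 3),
        "drifted": recent_mean < baseline_mean - tolerance,
    }

# Example: last month's traces averaged 0.81; this week's scores sit lower.
print(retrieval_drift_report(
    traces=[{"retrieval.scores": [0.64, 0.58, 0.61]},
            {"retrieval.scores": [0.70, 0.66]}],
    baseline_mean=0.81,
))  # {'baseline_mean': 0.81, 'recent_mean': 0.638, 'drifted': True}
```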

Prompt injection via context. A user-supplied input that lands inside a retrieved document can alter the prompt structure in ways that affect model behavior. Without capturing the rendered prompt for each request, you can't audit whether injection happened in a specific incident — or assess how frequently it's happening in general.

All three surface as "the AI gave wrong information" in support tickets. Only tracing tells you which one you're dealing with.

The Tools That Exist Right Now

Three tools have emerged as the practical options for production LLM observability:

LangSmith provides trace capture for LangChain-based pipelines, with a UI built for prompt inspection, token accounting, and regression testing. If you're on LangChain, onboarding is a few environment variables and an API key. The evaluation layer lets you build test sets from production traces — a useful loop for catching regressions after model or prompt changes.
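For a rough sense of the setup cost, it looks something like the following. The environment variable names and the `@traceable` decorator reflect LangSmith's Python SDK and docs at the time of writing; confirm them against current documentation before relying on this.

```python
import os

# Tracing for a LangChain app is mostly configuration.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "refund-assistant"   # hypothetical project name

# Code outside LangChain can still be traced explicitly with the SDK decorator.
from langsmith import traceable

@traceable(name="render_refund_prompt")
def render_refund_prompt(question: str, docs: list[str]) -> str:
    return "\n\n".join(docs) + f"\n\nQuestion: {question}"
```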

Langfuse is the open-source alternative — self-hostable, provider-agnostic, increasingly the choice for teams that can't route data through a third-party vendor. It captures traces via SDK, supports multimodal logging, and has a built-in evaluation layer for systematic testing. If data residency or vendor lock-in is a constraint, this is the path.
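A comparable sketch with Langfuse's decorator-based Python SDK. The import path and decorator have shifted between SDK major versions, so treat this as indicative rather than copy-paste; credentials come from the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and (for self-hosted deployments) LANGFUSE_HOST environment variables.

```python
from langfuse.decorators import observe

@observe()
def retrieve(question: str) -> list[str]:
    # Stand-in retriever; each decorated call is recorded as an observation.
    return ["Refunds are processed within 5 business days."]

@observe()
def answer(question: str) -> str:
    # Nested decorated calls appear as child observations on the same trace.
    docs = retrieve(question)
    return f"Based on policy: {docs[0]}"
```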

OpenLLMetry extends OpenTelemetry to LLM calls. If your team already uses Honeycomb, Datadog, or Grafana for service observability, this is the lowest-friction option — your LLM traces appear in the same system as your service traces. One dashboard. One alert configuration. One place to look during incidents.
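The wiring for that is close to a one-liner. The package and init call below follow the OpenLLMetry (traceloop-sdk) docs as published; the app name is made up.

```python
# OpenLLMetry ships as the traceloop-sdk package. One init call turns on
# auto-instrumentation for supported LLM and vector-store client libraries,
# and the resulting spans flow out over OTLP to whatever backend your
# OpenTelemetry setup already points at.
from traceloop.sdk import Traceloop

Traceloop.init(app_name="refund-assistant")
```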

The common thread: all three capture what happened inside the LLM pipeline, not just the HTTP wrapper around it.

The Unobservable Component Problem

The irony is that the rest of the stack is already instrumented. Most teams deploying LLMs in 2026 have distributed tracing wired to their services, databases, and message queues. They can trace a failed payment across six microservices. They know the p99 latency for their Postgres queries. They get paged when cache hit rates drop below threshold.

But the LLM call — the component most likely to produce unexpected behavior, the one that's nondeterministic by design — has no spans. Because it was introduced as "an API call," it got treated like any other HTTP request to a third-party service: log the status code, move on.

This connects to a broader pattern. "Your AI feature has no tests. You just don't know it yet" is one version of the gap — no evaluation framework, no systematic way to detect regressions. Observability is the runtime face of the same gap. Evals tell you what to expect in controlled conditions. Traces tell you what happened when a real user hit a path you didn't anticipate. You need both.

Most enterprise AI deployments look more production-ready than they are. Observability is what makes the gap concrete. A production LLM without tracing is a system where you can claim it's working only because you have no instrument that would tell you when it isn't.

The Question That Should Be Keeping You Up

When your LLM responds incorrectly to a user, how long does it take you to identify which component caused it?

If the honest answer is "we manually re-run it and try to reproduce the issue," you don't have observability. You have logs and hope.

The discipline has been solved for conventional systems for fifteen years. Applying it to LLMs isn't reinventing anything — it's extending a mature practice to a new component type. The technical cost is low. The organizational cost is recognizing that "it's just an API call" was never a sufficient reason to skip instrumentation.

You wouldn't deploy a payment service with no error logging. An LLM in production deserves at least as much scrutiny as a database query — and considerably more, given that its failure modes are quieter and its outputs are harder to validate at runtime.

Photo by panumas nikhomkhai via Pexels.