The Multi-Agent Dream Runs Great in Demos. Then You Ship It.

June 29, 2026

The demo ran in twelve seconds. Three agents: a planner, a researcher, a synthesizer. Each one focused, each one accountable to the next. The final answer was crisp, confident, correct. We shipped it.

Three weeks later, production had burned through four times the token budget we'd modeled. One pipeline had been returning plausible but wrong answers for five days before anyone noticed. The "trace" showing us what actually happened looked like a plate of spaghetti that had been stepped on.

The multi-agent architecture didn't fail. It just showed us that most of what we thought we'd solved, we'd moved.

The conventional take on multi-agent systems borrows from human organizational theory: specialization creates value. Give each agent a focused job, let them compose, and the whole exceeds what any single model could do. That framing isn't wrong. It's just radically incomplete.

What it ignores is that every handoff between agents is a trust boundary. Every trust boundary is a new place where errors can enter, compound, and travel downstream undetected. The more agents you add, the more non-deterministic hops exist between your input and your output. You're not building a team. You're building a chain of potentially-mistaken colleagues, each generating plausible-sounding text, none of them able to verify what the previous one actually meant.

Multi-agent is overhyped for production use right now — not because the architecture is conceptually wrong, but because the tooling to run it safely doesn't exist at the maturity level the pitch implies. The gap between what's demoed and what actually ships isn't a deployment problem. It's an observability, cost, and failure-mode problem that gets worse when you add agents, not better.

Why Multi-Agent Observability Is Harder Than Single-Model

In a single-model call, you have one request and one response. The trace is a line. If the answer is wrong, you inspect the prompt, the context, the response. You reason backward from a known failure point.

In a three-agent pipeline, you have three separate context windows, three separate inference calls, three separate points where output can drift from intent. When the final answer is wrong, "where did this go wrong?" becomes a real question with no clean answer. Was it the planner's decomposition? The researcher's retrieval? The synthesizer's interpretation of what the researcher produced? All of the above in combination?

This is not a logging problem. You can log every agent's input and output and still not know what happened, because each agent's reasoning is itself non-deterministic. The researcher's output looked reasonable. The synthesizer's interpretation was internally consistent. The error compounded across hops in a way no single log line captures.

The prompt injection surface makes this worse. When agents call external tools — web search, code execution, document retrieval — each tool call is a potential injection vector. An injected instruction embedded in document B that Agent B retrieves gets included verbatim in what Agent C sees as its context. Agent C doesn't know that instruction came from a retrieved document rather than the orchestrator. I wrote about this attack surface in detail here — in multi-agent systems, the same vector compounds across every agent that touches external tools. You don't get an exception when this happens. You get a confident, well-formatted answer that happens to be wrong or compromised, and a trace that doesn't tell you why.

How Multi-Agent Cost Spirals Actually Happen

The demo cost model is wrong in a specific way: it assumes agents take the shortest path to an answer. They don't. Agents explore. They over-generate. They hedge. A planner that could route a task in 200 tokens often routes it in 800, because hedging is free in the demo — nobody's watching the meter.

A rough real-world example: a single summarization call might cost $0.004 using a capable frontier model. A three-agent pipeline — planner, researcher, synthesizer — doing nominally the same job can cost $0.04 to $0.07 per run, depending on tool calls and context passing. That's 10–18x. Run that workflow 10,000 times a month and you've shifted from $40/month to $400–700/month. For a feature that might not measurably outperform the single-call version in user testing.

The spiral has a specific mechanism: orchestrators pass full upstream context to each downstream agent. Agent B receives everything Agent A produced, including Agent A's reasoning chain, plus the original task prompt. Agent C receives all of that plus Agent B's output. By the third hop, you're paying for a large volume of tokens that are just repetition of upstream context — and without hard token budgets enforced at the workflow level, this compounds silently until the invoice arrives.

The worse part: agents don't fail loudly when they're expensive. They succeed. The pipeline returns output. The meter just runs.

What Silent Partial Failure Looks Like in Practice

This is the failure mode nobody demos because it doesn't look like a failure.

Agent A finishes successfully. Agent B produces output that's internally consistent, plausible, and slightly wrong — maybe it misread a data point, maybe it hallucinated a specific number, maybe it solved a subtly different problem than the one specified. Agent C receives B's output as ground truth, builds on it, and delivers a polished, confident final answer constructed on a flawed foundation.

No exception is raised. No confidence score drops below threshold. Nothing in the logs says "Agent B was wrong."

A team I watched was running a multi-agent data extraction pipeline — a researcher agent pulling financial figures from documents, a synthesizer building summaries. For a subset of documents with an unusual formatting pattern, the researcher was pulling the wrong row: one line off, systematically. The synthesizer built clean summaries. The output passed human review because it looked right at a glance — the numbers were plausible, the formatting was impeccable. The error ran for a week before a spot-check against source documents caught it.

The failure wasn't detectable at the agent level. Each agent had done something defensible. The corruption was in the gap between them — in the assumption that the researcher's extraction was correct before the synthesizer acted on it. That assumption was never tested. The non-determinism problem that makes single-model testing hard compounds at every additional hop.

What Actually Works in Production Right Now

Not every multi-agent use case fails this way. The ones that work in production share a few properties:

Bounded orchestration with a defined hop limit. If your workflow requires more than three agent handoffs, either the task is too complex for reliable agentic execution at current tooling maturity, or the decomposition is wrong. Orchestrators that can recursively delegate without limit are demos. Set a maximum, enforce it in code, and design your agent graph with that constraint in mind.

Hard cost caps per workflow run — enforced, not advisory. Not "we expect this to cost $0.05." If the workflow exceeds $0.10, it errors out and notifies. Without infrastructure-enforced limits, you are flying blind. Soft guidance doesn't survive contact with production traffic.

Structured output with validation at every handoff. If Agent A produces JSON that Agent B consumes, validate that JSON before Agent B runs. Schema validation, range checks, basic consistency assertions — whatever the domain allows. The added latency is worth it. Silent corruption downstream is not.

Treat each agent's output as input from an untrusted source. Not hostile — just non-deterministic and potentially wrong. Agent B's output is a draft that needs review before it influences anything downstream. This changes your verification design: you're not debugging failures after the fact, you're building verification into the handoff protocol itself. The discipline that applies to external API responses applies equally to upstream agents.

Multi-agent isn't a dead end. It's genuinely the right shape for a certain class of problem — complex, decomposable tasks where specialization adds more than coordination costs. But we're shipping it like the coordination costs don't exist, because the demos don't surface them.

The tooling for multi-agent observability — real traces across agent boundaries, cost attribution per workflow hop, intermediate output validation — is nascent at best. Until it matures, every additional agent you add is a place where things can go wrong in ways you won't immediately see.

The agents are collaborating. The question is whether you'll know when they're collaborating to be wrong.

Photo by Tara Winstead on Pexels

The Multi-Agent Dream Runs Great in Demos. Then You Ship It.

Why Multi-Agent Observability Is Harder Than Single-Model

How Multi-Agent Cost Spirals Actually Happen

What Silent Partial Failure Looks Like in Practice

What Actually Works in Production Right Now

Confidence Is a Design Token. Your Design System Doesn't Know That Yet.

Everyone Claims to Support Psychological Safety. Almost No One Creates It.

AI Features Don't Fail Tests. They Fail Differently Every Time.