When Your AI Crosses Modalities, Hallucinations Become Structured Fiction

May 20, 2026


A researcher pastes a video transcript, two screenshots, and three PDF excerpts into a multimodal prompt and asks for a literature synthesis. The model responds with six citations. Four of them don't exist. Two of the fake ones have real-looking DOIs, real-sounding journal names, and conclusions that fit the argument perfectly.

Nobody catches it on the first read.

This is the hallucination problem that single-modality benchmarks were never designed to find.

What Single-Modality Hallucination Testing Misses

When an LLM generates text from text, hallucinations usually have a certain shape: a name slightly off, a statistic slightly wrong, a quote paraphrased into something the source never quite said. Experienced reviewers learn to flag those. Tools like citation checkers and fact-verification pipelines are built around that failure pattern.

Cross-modal reasoning breaks that model.

When a system processes video, images, and text simultaneously — weaving them into a single output — the hallucination isn't a deviation from a single source. It's a synthesis across multiple inputs, none of which fully specified the answer. The model fills gaps between modalities with invented connective tissue. And because each input is genuinely there, the result carries the confidence of something grounded.

The Mu-SHROOM shared task at SemEval 2025 identified cross-lingual and multimodal reasoning as persistent hallucination hotspots, significantly harder to detect than single-modality errors. The reason is structural: the model isn't confabulating from thin air. It's confabulating from real fragments, and the fabricated parts are shaped to bridge them.

When Hallucinations Get Architecture

There's a specific failure mode that emerges under cross-modal reasoning that doesn't have a common name yet. Call it structured fiction.

Structured fiction is a hallucination that passes surface-level credibility checks because it's coherent — internally consistent, correctly formatted, and plausible given the inputs provided. It doesn't read like a hallucination. It reads like the synthesis you asked for.

The CCHall dataset (ACL 2025) specifically examined hallucinations in image-captioning and visual question-answering systems, and found that the most dangerous outputs weren't obviously wrong — they were appropriately confident about things the model couldn't actually verify from the visual input. The fabricated detail wasn't random noise. It was the expected kind of detail for that context.

In a document + image + audio workflow, this compounds. The model knows what a real citation looks like. It knows what a plausible study conclusion looks like for the domain. It knows the format of a DOI. When it can't find the actual source, it builds one from that knowledge — and the result is structured enough that it survives the quick pass a time-pressed reviewer will give it.

This is categorically different from a text-only model such as GPT-4 making up a statistic in response to a text prompt. That kind of hallucination is increasingly caught by the downstream audience. Cross-modal structured fiction is shaped, unintentionally, to survive.

Why Leaderboards Don't Surface This

AI benchmarks measure what you point them at. As discussed in the benchmark theater problem, leaderboards tend to reward confident accuracy on well-defined tasks rather than calibrated uncertainty on messy ones.

OpenAI's September 2025 paper on evaluation methodology noted that current LLM leaderboards systematically reward confident guessing over calibrated uncertainty: models that say "I don't know" score worse than models that guess plausibly. Applied to multimodal evaluation, this creates selection pressure toward confident, well-structured fabrication.
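The incentive can be made concrete with a little arithmetic. The sketch below is illustrative only: the scoring rule (1 point for a correct answer, 0 for abstaining, an optional penalty for a wrong answer) and the 25% confidence figure are assumptions chosen for the example, not numbers taken from the paper.

```python
def expected_guess_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score for answering when the answer is right with probability p_correct.

    Assumed scoring rule: a correct answer earns 1 point, a wrong answer
    loses `wrong_penalty` points, and abstaining ("I don't know") scores 0.
    """
    return p_correct - (1.0 - p_correct) * wrong_penalty


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which guessing and abstaining have the same expected score."""
    return wrong_penalty / (1.0 + wrong_penalty)


# Binary grading (no penalty): a 25%-confident guess still beats abstaining (0).
binary = expected_guess_score(0.25, wrong_penalty=0.0)

# Penalized grading: the same guess now scores below an abstention.
penalized = expected_guess_score(0.25, wrong_penalty=1.0)

print(binary, penalized, break_even_confidence(1.0))  # 0.25 -0.5 0.5
```

Under the no-penalty rule, any nonzero chance of being right makes guessing the better strategy, which is the selection pressure described above. With a penalty of 1, guessing only pays off above 50% confidence.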

The SemEval findings reinforce this: multimodal tasks remain hallucination hotspots specifically because they involve cross-referencing real but incomplete inputs. The model doesn't have a "this crosses modalities and might be unreliable" mode. It has a "here's the synthesis" mode, and it applies it uniformly.

If your team evaluates multimodal outputs with the same spot-check workflow you use for text outputs, you're running the wrong test.

What Single-Source Review Doesn't Catch

The standard advice for hallucination management is: check the sources. For single-modality text, this roughly works. When a model cites a paper, you can search for the paper. When it quotes a statistic, you can trace the statistic.

With structured fiction from cross-modal reasoning, there are three failure modes that this advice misses:

The synthesis gap. The model's conclusion isn't drawn from any single source — it's inferred from the relationship between sources. That inference might be wrong in ways no individual source check will reveal.

The plausible fake. The fabricated citation isn't random. It has the structure of a real citation for the field. A reviewer checking whether a paper "sounds real" will sometimes miss it. Only resolving the actual DOI settles the question, and that takes more time than most review workflows budget.

The distributed error. The hallucination isn't localized in one sentence. It's woven through the synthesis — a false connection here, an overstated conclusion there. Finding it requires re-running the task, not just verifying a footnote.

The chain-of-thought problem is adjacent here: the model's reasoning trace looks coherent, but coherence in the trace doesn't mean accuracy in the conclusion. Structured fiction is what happens when chain-of-thought confidence extends beyond what the underlying data supports.

What Cross-Modal Auditing Actually Requires

Teams deploying multimodal pipelines in production need to treat cross-modal outputs differently from single-modality ones.

Decompose the synthesis. For any multimodal output that makes factual claims, trace which claims come from which input. Claims that can only be explained as inferences between inputs need separate verification.
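One way to make this decomposition auditable is to record, for each claim, which inputs a reviewer found supporting it. The sketch below is a minimal illustration; the `Claim` and `Risk` types, the `needs_combination` flag, and input labels such as "transcript" or "screenshot-1" are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    SINGLE_SOURCE = "single-source"  # traceable to one input
    CROSS_INPUT = "cross-input"      # only explained by combining inputs
    UNSUPPORTED = "unsupported"      # no input supports it


@dataclass
class Claim:
    text: str
    # IDs of the inputs a reviewer found supporting the claim (hypothetical labels).
    sources: set[str] = field(default_factory=set)
    # True when the claim only holds if several inputs are combined.
    needs_combination: bool = False


def classify(claim: Claim) -> Risk:
    if not claim.sources:
        return Risk.UNSUPPORTED
    if claim.needs_combination:
        return Risk.CROSS_INPUT
    return Risk.SINGLE_SOURCE


def review_queue(claims: list[Claim]) -> list[Claim]:
    """Claims that need separate verification, riskiest first."""
    order = {Risk.UNSUPPORTED: 0, Risk.CROSS_INPUT: 1}
    flagged = [c for c in claims if classify(c) in order]
    return sorted(flagged, key=lambda c: order[classify(c)])


claims = [
    Claim("The speaker cites a 2019 trial.", {"transcript"}),
    Claim("The chart confirms the speaker's figure.",
          {"transcript", "screenshot-1"}, needs_combination=True),
    Claim("A follow-up study replicated the result."),
]
for c in review_queue(claims):
    print(classify(c).value, "|", c.text)
```

The single-source claim drops out of the queue because an ordinary source check covers it. What remains is the material that a footnote-by-footnote review would not reach.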

Treat citation format as a red flag, not a green flag. A correctly-formatted citation is evidence that the model knows what citations look like. It's not evidence the citation exists. Structured fiction is often well-formatted precisely because formatting is learnable.
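A rough sketch of checking existence rather than format follows. It rests on assumptions: the DOI regex is a common heuristic and not a complete grammar, and `doi_is_registered` assumes the doi.org handle lookup at `https://doi.org/api/handles/<doi>` returns JSON with `responseCode` 1 for a registered DOI and HTTP 404 for an unknown one. The resolver is passed in as a parameter so the logic can be exercised offline. The DOIs in the example are made up.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

# Heuristic pattern for DOI-shaped strings; not a full DOI grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def _clean(doi: str) -> str:
    """Drop sentence punctuation and an unbalanced closing parenthesis."""
    doi = doi.rstrip(".,;:")
    if doi.endswith(")") and doi.count("(") < doi.count(")"):
        doi = doi[:-1].rstrip(".,;:")
    return doi


def extract_dois(text: str) -> list[str]:
    return [_clean(m) for m in DOI_PATTERN.findall(text)]


def doi_is_registered(doi: str, timeout: float = 10.0) -> bool:
    """Network call. Assumes the doi.org handle lookup behaves as described above."""
    url = "https://doi.org/api/handles/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("responseCode") == 1
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other failures are not evidence that the DOI is fake


def unresolved_dois(text: str,
                    resolver: Callable[[str], bool] = doi_is_registered) -> list[str]:
    """DOIs in the text that the resolver could not confirm."""
    return [doi for doi in extract_dois(text) if not resolver(doi)]


# Offline example with a stand-in resolver that only "knows" one prefix.
output = "See Smith et al. (doi:10.1234/abcd.5678). Also 10.5555/zzz-999, cited twice."
print(unresolved_dois(output, resolver=lambda d: d.startswith("10.1234/")))
```

A DOI that resolves only shows that something is registered under that identifier. Confirming that the title, authors, and conclusion match what the model wrote is still a separate step.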

Use inter-modal consistency checks. If the model synthesizes a video transcript and an image, ask whether the text claim is verifiable from the image alone. If it isn't, the claim crosses a modality boundary and should be treated as higher-risk.

Build uncertainty into the prompt layer. Explicitly instruct the model to distinguish between claims directly supported by the provided inputs and claims inferred from them. Some models will do this; many won't without explicit instruction. But the instruction creates a structured output you can audit, rather than a seamless narrative you can't.
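As a sketch of what that instruction and its audit step might look like: the wording of `AUDIT_INSTRUCTION`, the JSON field names (`claims`, `support`, `inputs`), and the helper functions are all assumptions made for illustration, not a standard schema or any vendor's API.

```python
import json

# Hypothetical instruction text appended to the multimodal prompt.
AUDIT_INSTRUCTION = (
    "Answer using only the provided inputs. Return JSON with a 'claims' list. "
    "Each item needs 'text', 'support' ('direct' or 'inferred'), and 'inputs' "
    "(the IDs of the inputs it relies on). Use 'inferred' for anything that is "
    "not stated outright in one of the inputs."
)

ALLOWED_SUPPORT = {"direct", "inferred"}


def parse_audit(raw: str) -> list[dict]:
    """Parse the model's reply and reject items that skip the labeling."""
    claims = json.loads(raw)["claims"]
    for claim in claims:
        if claim.get("support") not in ALLOWED_SUPPORT:
            raise ValueError(f"unlabeled claim: {claim.get('text')!r}")
        if claim["support"] == "direct" and not claim.get("inputs"):
            raise ValueError(f"direct claim without an input: {claim.get('text')!r}")
    return claims


def needs_review(claims: list[dict]) -> list[dict]:
    """Inferred claims go to a human or a second verification pass."""
    return [c for c in claims if c["support"] == "inferred"]


# Example reply, written by hand for illustration.
reply = json.dumps({"claims": [
    {"text": "The PDF reports a sample of 40.", "support": "direct", "inputs": ["pdf-1"]},
    {"text": "The chart matches the speaker's figure.", "support": "inferred",
     "inputs": ["transcript", "screenshot-1"]},
]})
for claim in needs_review(parse_audit(reply)):
    print(claim["text"], claim["inputs"])
```

The model's own labels are not ground truth, since a model can mark a fabricated claim as direct. The value is that each claim now names the input it supposedly came from, which gives a reviewer something specific to check.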

None of this is a solved problem. The multimodal hallucination space as of mid-2026 has good benchmark tooling (Mu-SHROOM, CCHall) and almost no production-grade mitigation tooling. The gap between knowing the problem exists and having reliable ways to catch it at scale remains large.

The Underlying Constraint

Cross-modal reasoning is not going away. Models will continue to integrate more modalities, handle longer contexts across them, and produce outputs that blend inputs more thoroughly. The hallucination profile will evolve with the capability.

The teams that will handle this well are the ones that accept now that cross-modal output quality cannot be evaluated with the same workflows built for text. The adversarial case isn't a hallucinating LLM making obvious errors. It's a confident, well-structured synthesis that happens to include things that aren't true.

That's a harder problem. It deserves a harder solution.