Your 200K Context Window Is Being Wasted — And Nobody's Tracking It


The codebase is large. 200,000 tokens of context — finally enough room. You paste in the full repo, the error logs, three related files, and your question. The model reads all of it. Then gives you a confident, detailed answer that solves the wrong problem.

You try again with more context. The answer gets worse.

This is context rot. Chroma's 2025 research study tested 18 frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5 among them — and found performance degradation happening at every single input length increment. Not just when the window fills up. Continuously, from token one. The longer your context, the more the model works against you.

Most teams don't know this is happening. Nobody's building dashboards for it.

The Window Is Not a Warehouse

The core misconception is storage. Developers treat the context window like a folder: put everything relevant in, the model retrieves it. That's not how transformers work.

LLMs don't store and retrieve — they attend. Every token in the context attends to every other token via self-attention, and the cost of this scales quadratically. At 10,000 tokens, you have roughly 100 million attention relationships. At 200,000 tokens, you have 40 billion.
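The arithmetic behind those numbers is easy to check. A back-of-the-envelope sketch, counting query-key interactions for a single attention pass and ignoring heads, layers, and caching:

```python
# Pairwise attention relationships grow with the square of sequence length:
# every token attends to every other token, so an n-token context implies
# roughly n * n query-key interactions per pass.

def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (10_000, 100_000, 200_000):
    print(f"{n:>7,} tokens -> {attention_pairs(n):>16,} pairwise interactions")

# 10,000 tokens  ->      100,000,000  (~100 million)
# 200,000 tokens ->   40,000,000,000  (~40 billion)
```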

More tokens don't just add cost — they add noise. The model has to allocate attention across more relationships while maintaining coherent reasoning. And attention is not uniform across the sequence.

Chroma tested retrieval accuracy as input length increases. The results were not subtle. Degradation started immediately and continued consistently across all 18 models. The flagship models — the ones marketed on their 200K and 1M token context windows — all showed measurable performance loss before reaching 10% of their stated capacity.

The window is a budget. You're spending it whether you think about it or not.

Three Mechanisms That Break Your Model

The Chroma study identified three distinct failure modes. Understanding them separately matters because they hit at different points in the context and require different mitigations.

Lost in the Middle

The most documented mechanism. Information placed mid-context gets systematically under-attended relative to content at the beginning and end. Chroma's data — consistent with prior work by Liu et al. at Stanford (2023) — found accuracy drops exceeding 30% for information positioned mid-context compared to equivalent information at the edges.

The model doesn't ignore the middle. It reads everything. But when constructing an answer, it disproportionately draws on early and late content — a serial position effect analogous to human memory. If you're stuffing a 50,000-token codebase into context and your question concerns a module buried in the middle, you've structurally disadvantaged the model before the session even starts.

Attention Dilution

With every token added, the total attention budget spreads thinner. For each query token, attention weights are softmax-normalized to sum to one across the whole sequence, so every token you add takes its share from the rest. In a short context, any given token receives concentrated attention from the full sequence. At 100,000 tokens, the same token receives only a fraction of that signal from each of the others.

This matters most on reasoning tasks. Multi-hop inference — where the model must connect fact A to fact B to fact C — becomes harder as context grows because the associative paths between distant tokens weaken. The model can still "see" everything, but the signal linking non-adjacent content decays.

Chroma found this effect most pronounced on reasoning benchmarks compared to simple retrieval. Ask a model to find a specific phrase: dilution hurts less. Ask it to reason across multiple sections of a long document: dilution accelerates failure.

Distractor Interference

The third mechanism is the most controllable, and the most frequently ignored in practice. Irrelevant content in the context window actively misleads the model — it doesn't just occupy space passively.

Chroma's tests showed that adding thematically adjacent but task-irrelevant content produced worse answers than leaving the context shorter. The model attends to the distractors. It incorporates their statistical patterns. It answers questions shaped by content you didn't mean to make relevant.

This is why "paste everything that might be useful" is the wrong instinct. Context is not a collection of useful things. It's a signal. Noise in the signal corrupts the output.

What Good Context Management Actually Looks Like

The industry talks about context windows in marketing terms: bigger is better, more is more. The Chroma data reframes this. A 200K context window is not 200K usable tokens — it's 200K available tokens with degrading utility at every step.

This has practical engineering implications that most teams aren't acting on yet.

Position critical information at the edges. Given the primacy/recency advantage the Chroma data confirms, content you need the model to reliably access should appear at the beginning or end of the context — not buried mid-document. Restructure dynamically when you can. This alone closes a large fraction of the accuracy gap in mid-context retrieval failures.
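In a retrieval pipeline this can be an explicit ordering step rather than a hope. A minimal sketch, assuming chunks arrive with precomputed relevance scores; the Chunk structure and the question-repetition trick are illustrative, not part of Chroma's toolkit:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    relevance: float  # higher = more relevant to the query (assumed precomputed)

def order_for_edges(chunks: list[Chunk]) -> list[Chunk]:
    """Place the most relevant chunks at the start and end of the context,
    pushing the weakest material toward the middle, where attention is lowest."""
    ranked = sorted(chunks, key=lambda c: c.relevance, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        # Alternate the strongest chunks between the front and the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # weakest chunks meet in the middle

def build_context(chunks: list[Chunk], question: str) -> str:
    ordered = order_for_edges(chunks)
    body = "\n\n".join(c.text for c in ordered)
    # Repeat the question at the end so it sits in the recency window.
    return f"{question}\n\n{body}\n\n{question}"
```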

Filter before you stuff. Semantic chunking, relevance scoring, and summarization passes are not premature optimizations. They're the actual work. Running a 200K context through a model without a curation pass is like handing someone a full meeting transcript when they asked for the decisions. The signal is there. But so is everything else.
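What a curation pass can look like in miniature: score candidates against the query, keep only what clears a relevance bar, and stop at a token budget well short of the window. The lexical-overlap scorer below is a stand-in for embedding similarity or a reranker, and the threshold and budget values are illustrative:

```python
def lexical_overlap(query: str, text: str) -> float:
    """Stand-in relevance score: fraction of query terms present in the chunk.
    In a real pipeline this would be embedding cosine similarity or a reranker."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / max(len(q_terms), 1)

def filter_chunks(
    query: str,
    chunks: list[str],
    min_score: float = 0.2,     # illustrative threshold, tune per task
    token_budget: int = 8_000,  # spend a fraction of the window, not all of it
) -> list[str]:
    """Keep only chunks that clear the relevance bar, within a token budget."""
    scored = sorted(chunks, key=lambda c: lexical_overlap(query, c), reverse=True)
    kept, spent = [], 0
    for chunk in scored:
        cost = len(chunk.split())  # crude token estimate; use a real tokenizer
        if lexical_overlap(query, chunk) < min_score or spent + cost > token_budget:
            continue
        kept.append(chunk)
        spent += cost
    return kept
```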

Measure context length against output quality in production. Most benchmark evaluations run at short context lengths — 2K, 5K, maybe 10K tokens. Your production system runs at 50K or 200K. The numbers don't transfer. You need evals that match your production context distribution, or you're measuring a different system than the one your users are hitting.
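One way to close that gap is to run the same eval cases at production-shaped context lengths and report quality per length bucket instead of a single aggregate. A sketch, where run_model and grade are placeholders for your model call and your scoring logic:

```python
from collections import defaultdict
from statistics import mean
from typing import Callable

def eval_by_context_length(
    cases: list[dict],                     # each: {"context", "question", "expected"}
    run_model: Callable[[str, str], str],  # your model call (placeholder)
    grade: Callable[[str, str], float],    # your scorer, 0.0-1.0 (placeholder)
    buckets: tuple[int, ...] = (2_000, 10_000, 50_000, 200_000),
) -> dict[int, float]:
    """Average quality score per context-length bucket, so degradation is visible."""
    scores = defaultdict(list)
    for case in cases:
        n_tokens = len(case["context"].split())  # crude estimate; use your tokenizer
        bucket = next((b for b in buckets if n_tokens <= b), buckets[-1])
        answer = run_model(case["context"], case["question"])
        scores[bucket].append(grade(answer, case["expected"]))
    return {b: mean(s) for b, s in sorted(scores.items())}
```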

Treat distractors as defects. Every token of irrelevant content costs twice: once in dilution, once in interference. Context hygiene — knowing what to exclude — is as important as knowing what to include. This is an engineering discipline most teams haven't built yet.

The Gap Nobody's Measuring

Teams building on LLMs almost universally treat context length as a concern. Almost none track it as a metric with associated quality signals. They know the window is finite. They don't know how much of their context budget they're using per request, how information is distributed across that context, or what their effective distractor ratio looks like.
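Those three numbers are cheap to compute per request once you decide to track them. A sketch of the accounting; the field names and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ContextMetrics:
    """Per-request context accounting: the numbers worth putting on a dashboard."""
    total_tokens: int     # tokens sent this request
    window_tokens: int    # the model's advertised window
    relevant_tokens: int  # tokens in chunks judged relevant to the task
    middle_tokens: int    # tokens positioned in the middle third of the prompt

    @property
    def budget_used(self) -> float:
        return self.total_tokens / self.window_tokens

    @property
    def distractor_ratio(self) -> float:
        return 1 - (self.relevant_tokens / max(self.total_tokens, 1))

    @property
    def middle_share(self) -> float:
        return self.middle_tokens / max(self.total_tokens, 1)

# Example: 120K of a 200K window, only 30K of it judged relevant.
m = ContextMetrics(total_tokens=120_000, window_tokens=200_000,
                   relevant_tokens=30_000, middle_tokens=50_000)
print(f"budget used: {m.budget_used:.0%}, distractors: {m.distractor_ratio:.0%}, "
      f"middle share: {m.middle_share:.0%}")
```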

The Chroma research is public. The toolkit is on GitHub. The methods reproduce. Teams that take this seriously will characterize their context patterns, build retrieval pipelines that position information deliberately, and measure what they're actually shipping.

Teams that don't will keep wondering why their expensive frontier model gives worse answers the more information they give it.

The 200K window didn't solve the problem. It just made the problem harder to see.

If you're already thinking about this at the production level, agent failure modes is the companion read — context rot is one of the infrastructure failures that shows up as agent unreliability.

Photo by Tima Miroshnichenko via Pexels.