Your RAG System's Search Quality Is Degrading. You Won't Know Until It's Bad.

The demo works. Semantic search finds exactly what it should. You add a new document and retrieval improves. You ship it.
Six months later, a user complains that the search is returning junk. You investigate and find the retrieval quality is fine for recently indexed content and quietly terrible for anything older. Nobody changed the query logic. Nobody changed the prompts. But somewhere between the demo and production, the system started lying to you.
That's embedding drift. It's one of the most underdiagnosed failure modes in production RAG systems, and the teams running into it are often the ones who built the most carefully — they just didn't build for time.
Why Semantic Space Isn't Stable
Embeddings work by projecting text into a high-dimensional space where semantically similar content lands near each other. When you ask a question, the system converts it to a vector and finds the nearest neighbors in that space. Simple enough.
The catch: that space is only coherent when all the vectors in it were generated by the same model, at the same version, under the same normalization scheme. Every embedding model creates its own universe. OpenAI embeddings, Cohere embeddings, Voyage embeddings — they're not on the same map. Even different versions of the same model produce vectors that don't cleanly coexist.
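A toy sketch makes the point. The two "models" below are just random projections rather than real encoders, which dramatically understates how much real model versions differ, yet even a random rotation is enough to destroy comparability:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# A random orthogonal matrix, standing in for the difference between
# two embedding models (real models differ far more than a rotation).
rotation = np.linalg.qr(rng.normal(size=(DIM, DIM)))[0]

def model_a(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x)

def model_b(x: np.ndarray) -> np.ndarray:
    # Same norm, different orientation: a different universe.
    return (rotation @ x) / np.linalg.norm(x)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

doc = rng.normal(size=DIM)                  # a "document" vector
query = doc + 0.01 * rng.normal(size=DIM)   # a near-identical "query"

# Same space: near-duplicate content scores near 1.0, as it should.
print(cosine(model_a(query), model_a(doc)))
# Mixed spaces: the same near-duplicate pair scores an arbitrary value.
print(cosine(model_a(query), model_b(doc)))
```

The math runs without complaint in both cases. Only the first number means anything.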
In a static system — index once, search forever — this isn't a problem. But production systems aren't static. Documents get updated. New chunks get added. The model gets upgraded. When that happens, your vector collection becomes a mix of spatial regions that were never designed to coexist. Old chunks and new chunks are no longer in the same neighborhood even when they should be. The search runs. The math checks out. The results are wrong.
Research from Chroma and teams at MIT has quantified what this looks like in practice: collections above roughly 10,000 documents start showing measurable retrieval precision degradation once they've accumulated chunks from multiple embedding model runs. Above 50,000, the problem compounds.
The Staging Problem
Staging environments almost never reveal this. Your staging collection was built in a day. All the vectors came from the same model run. They're coherent. Search works beautifully.
Production collections age. They accumulate. The model gets bumped because the vendor released a better version. A batch of documents gets re-embedded after an edit. Someone migrates 20,000 records from a legacy system and re-indexes them with the current model — but the 80,000 records already in the collection were indexed with the previous one.
There's no error. No exception. The similarity scores all fall within normal-looking ranges. The system is doing exactly what you told it to do — finding the nearest neighbors in a space that's now incoherent.
This is worth distinguishing from the context rot problem we've written about, where retrieval accuracy degrades because of how LLMs process long contexts. Embedding drift is upstream of that — it's retrieval quality degrading before the content even reaches the model. The problem lives in your vector database, not in the LLM's attention mechanism.
The Observability Blind Spot
The practical problem is that teams have almost no standard monitoring for retrieval quality. Application-level error rates stay flat. Latency stays flat. The embedding model returns vectors on every request. Nothing triggers an alert.
What you'd actually need to catch embedding drift is ongoing retrieval precision measurement: periodically running known-good queries and checking whether the right documents still surface. That requires maintaining a golden test set, a collection of queries with known correct answers, and running it against production on a schedule.
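A minimal sketch of what that measurement could look like. `GOLDEN` and `search` are placeholders for your own golden set and retrieval call; precision here is simply the fraction of golden queries whose expected documents still appear in the top k:

```python
from typing import Callable

# Placeholder golden set: each known-good query maps to the document
# IDs that should appear in the top-k results.
GOLDEN: dict[str, set[str]] = {
    "how do I rotate API keys?": {"doc-142", "doc-87"},
    "refund policy for annual plans": {"doc-310"},
    # ... the 20-50 queries you know the right answers for
}

def precision_at_k(search: Callable[[str, int], list[str]], k: int = 3) -> float:
    """Fraction of golden queries whose expected docs surface in the top k."""
    hits = 0
    for query, expected in GOLDEN.items():
        retrieved = set(search(query, k))
        if expected & retrieved:  # at least one expected doc came back
            hits += 1
    return hits / len(GOLDEN)
```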
Very few teams have this. The ones who do usually built it after they got burned.
The signal that does surface is user behavior — queries that used to return useful results start returning mediocre ones. Support tickets mention the search. Someone notices that the product's knowledge base retrieval has gotten "worse lately." By the time you have this signal, you've already got a trust problem.
What's Happening in Production
A few specific patterns recur among teams dealing with this:
The version bump. The embedding model the team was using got deprecated or superseded. They upgraded to the newer version and re-embedded new content — but didn't re-embed the existing collection. Now the collection has two populations of vectors from incompatible model versions. Similarity comparisons across the boundary are meaningless.
The migration import. Legacy data gets imported and re-embedded in a batch. If the batch process used a different chunking strategy or normalization scheme than the original indexing, the vectors land in different spatial neighborhoods even when the underlying content is closely related.
The incremental drift. No single event causes it. Over months, different parts of the system have been indexed and re-indexed at slightly different times, with slightly different configurations. The collection accumulates inconsistency gradually, the way a codebase accumulates technical debt.
The technical debt analogy is accurate. Embedding drift is a form of retrieval debt — it compounds quietly, it's hard to attribute to any single decision, and the cost of fixing it grows with the size of the collection.
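Because none of these events throws an error, confirming a suspicion after the fact is a statistics problem. One heuristic, assuming you can split the collection into rough indexing cohorts (by import batch or date range), is to compare similarity distributions within each cohort against the cross-cohort distribution. This is a diagnostic sketch, not a feature of any particular vector database:

```python
import numpy as np

def cohort_similarity_gap(old: np.ndarray, new: np.ndarray,
                          samples: int = 1000) -> float:
    """old, new: (n, dim) arrays of L2-normalized vectors from two cohorts."""
    rng = np.random.default_rng(0)

    def mean_pairwise(a: np.ndarray, b: np.ndarray) -> float:
        # Mean cosine similarity over random pairs (dot product, since
        # the vectors are normalized).
        i = rng.integers(0, len(a), samples)
        j = rng.integers(0, len(b), samples)
        return float(np.mean(np.sum(a[i] * b[j], axis=1)))

    within = (mean_pairwise(old, old) + mean_pairwise(new, new)) / 2
    across = mean_pairwise(old, new)
    return within - across  # near zero: coherent; large: suspect mixed models
```

A coherent collection shows no meaningful gap. Two populations from incompatible model runs usually show a pronounced one.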
Fixing It Before It Breaks
The architectural answer is embarrassingly simple: version your embeddings. Every chunk in your collection should carry metadata indicating which model version — and which chunking configuration — was used to generate it. When you upgrade the model, you know exactly which chunks need to be re-embedded.
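A sketch of what that metadata could look like, using Chroma since it supports metadata filters; any vector store with filterable metadata works the same way. The version strings and chunking label below are placeholders for your own conventions:

```python
import chromadb

EMBED_MODEL = "text-embedding-3-small"  # whichever model you actually use
EMBED_VERSION = "2024-01"               # bump on any model or config change
CHUNK_CONFIG = "recursive-512-64"       # chunking strategy, size, overlap

client = chromadb.Client()
collection = client.get_or_create_collection("kb")

chunk_text = "..."               # the chunk being indexed
embedding = [0.0] * 8            # placeholder for the real vector from EMBED_MODEL

collection.add(
    ids=["doc-42#0"],
    embeddings=[embedding],
    documents=[chunk_text],
    metadatas=[{
        "embed_model": EMBED_MODEL,
        "embed_version": EMBED_VERSION,
        "chunk_config": CHUNK_CONFIG,
    }],
)

# After a model upgrade, the stale chunks identify themselves:
stale = collection.get(where={"embed_version": {"$ne": EMBED_VERSION}})
```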
This sounds obvious. It's surprisingly rare. Most teams start with a collection that has no such metadata, because the problem wasn't visible when they built it.
The operational answer is collection discipline. Decompressed's production data suggests keeping collections under 10,000 documents per index when you want to maintain semantic coherence over time. Above that, segment your collections by temporal cohort or document type and implement scheduled full re-embeddings when the model changes.
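Once chunks carry version metadata, the scheduled re-embedding becomes a straightforward sweep. A sketch against the Chroma-style schema above; `embed_batch` is a stand-in for your batch embedding call:

```python
def reembed_stale(collection, embed_batch, current_version: str,
                  batch: int = 256) -> None:
    """Re-embed every chunk whose embed_version predates current_version."""
    stale = collection.get(
        where={"embed_version": {"$ne": current_version}},
        include=["documents", "metadatas"],
    )
    ids, docs, metas = stale["ids"], stale["documents"], stale["metadatas"]
    for i in range(0, len(ids), batch):
        vectors = embed_batch(docs[i:i + batch])  # one vector per document
        for m in metas[i:i + batch]:
            m["embed_version"] = current_version
        collection.update(
            ids=ids[i:i + batch],
            embeddings=vectors,
            metadatas=metas[i:i + batch],
        )
```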
The monitoring answer is a golden retrieval test set. Define 20–50 queries where you know what the top-3 results should be. Run these against production weekly. Precision below your baseline is an early warning that something has changed in the retrieval layer — either drift, or a chunking problem, or a model-level change — before it becomes visible to users.
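Wiring that into a schedule is a few lines on top of the precision_at_k sketch from earlier. The baseline and tolerance below are illustrative; measure your own while the system is known-good:

```python
BASELINE = 0.95   # precision@3 measured when retrieval was known-good
TOLERANCE = 0.05  # how much degradation you tolerate before paging

def weekly_retrieval_check(search, alert) -> None:
    """Run from cron or any scheduler; `alert` is your notification hook."""
    p = precision_at_k(search, k=3)
    if p < BASELINE - TOLERANCE:
        alert(
            f"Retrieval precision@3 dropped to {p:.2f} "
            f"(baseline {BASELINE:.2f}). Check for embedding drift, "
            f"a chunking change, or a model bump."
        )
```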
None of these are hard. They're the kind of operational discipline that feels unnecessary until the moment it becomes obviously necessary.
The Deeper Question
If your retrieval layer has been running for more than six months and you've re-indexed or updated your embedding model at any point, the right question isn't whether you have drift. It's how much, and how much it's costing you in retrieval quality.
The harder question is what a passing test actually looks like for a search system that needs to keep working the way it did last quarter. Staging tests tell you the system functions. They don't tell you the system still returns what it's supposed to return, for the queries users actually run, against the collection as it exists today.
That's the test most teams never write. It's also the one that would catch this before the user complaints start.