RAG Looks Simple in the Demo. Production Is Where It Falls Apart.

A fintech team spent eight weeks building a document Q&A system. Internal users loved it. The demo worked perfectly. Every test question returned accurate, cited answers from the right policy documents.
Six weeks after shipping to 40,000 customers, support tickets started arriving about confidently wrong answers. Not hallucinations in the classic sense — the system was retrieving real documents. It just wasn't retrieving the right ones. A customer asking about international wire transfer fees got back three paragraphs about domestic wire limits. Accurate text. Wrong context. The system had no idea.
This is the RAG failure pattern nobody puts in the launch post.
What RAG Is Actually Supposed to Do
Retrieval-Augmented Generation works by inserting relevant external content into the LLM's context before asking it to respond. The idea: anchor the model's outputs to specific, verifiable sources rather than letting it generate from training data alone.
It works. In controlled conditions, RAG substantially reduces hallucination rates. The 2024 survey by Gao et al. on RAG for large language models (arXiv:2312.10997) documents retrieval-augmented architectures as the dominant approach for knowledge-intensive tasks precisely because they outperform parametric models on factual accuracy.
The problem is that "retrieval-augmented" contains a retrieval step. And that step fails in ways that look nothing like the failures teams expect.
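Mechanically, the whole pattern fits in a few lines of glue code, which is part of why the retrieval step gets so little scrutiny. A minimal sketch, assuming a hypothetical `vector_index.search()`, `embed()`, and `generate()`; the names are placeholders for whatever vector database and LLM client you actually use:

```python
def rag_answer(question, vector_index, embed, generate, k=4):
    """Retrieval-augmented generation, reduced to its skeleton:
    embed the question, fetch the k nearest chunks, and generate
    an answer grounded in (only) those chunks."""
    query_vector = embed(question)
    chunks = vector_index.search(query_vector, top_k=k)  # nearest-neighbor lookup

    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below. Cite the passage you relied on. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Every downstream property of the system depends on what `search()` hands back, and that is exactly the part most teams never look at.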
The Retrieval Step Is the Failure Point Nobody Audits
Teams building RAG systems spend most of their evaluation budget on the generation side — measuring whether the LLM is staying faithful to the provided context, not making things up, and producing sensible responses.
They spend almost nothing auditing what got retrieved.
This is backwards. The language model, given a well-formed context, is the reliable part. It will faithfully summarize whatever you put in front of it. If you give it the wrong document, it will faithfully summarize the wrong document with complete confidence. There is no "I don't think this is what you meant" signal from the model.
The retrieval layer fails in predictable ways:
Cosine similarity is not relevance. High vector similarity between a query and a chunk means the text is statistically related. It does not mean it contains the answer. A customer asking "what happens if I miss a payment?" may retrieve chunks about payment processing, payment methods, and payment confirmation — all cosine-similar, none of them answering the actual question.
Multi-hop questions break single-chunk retrieval. Many real user questions require synthesizing information from multiple documents. "Does my premium plan include the same international coverage as last year?" needs two pieces of information from potentially different documents, possibly with a temporal comparison. A top-k retrieval of the most similar chunks returns the closest single passages. It does not assemble the answer.
Chunk boundaries cut semantic units. Fixed-size chunking — splitting documents every 500 tokens regardless of meaning — routinely separates context from conclusion, conditions from outcomes, questions from answers. The chunk that answers a user's question may begin with a sentence that depends entirely on the previous chunk for its meaning.
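You can see the first failure mode for yourself in a few lines against any embedding model. The sketch below uses the open-source sentence-transformers package with an arbitrary small model and the miss-a-payment example from above; which chunk it ranks first will depend on the model, and that is the point. The score it ranks by is geometric closeness in embedding space, and nothing in it checks whether the text answers the question.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What happens if I miss a payment?"
chunks = [
    "Payments can be made by ACH transfer, debit card, or wire.",
    "Payment processing typically completes within two business days.",
    "A confirmation email is sent once your payment has been received.",
    "Accounts more than 30 days past due accrue a late fee and may be reported.",
]

# Encode query and chunks; normalize so dot product equals cosine similarity.
vectors = model.encode([query] + chunks)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
scores = vectors[1:] @ vectors[0]

for score, chunk in sorted(zip(scores, chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")
```

Whether the past-due chunk lands at the top depends on the embedding model. Nothing in the number distinguishes "topically about payments" from "answers this question."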
Chunk Size Decides Everything — And Nobody Tells You the Right Size
The single most consequential implementation decision in a RAG system is how you split your documents. This gets treated as a configuration detail.
Fixed-size chunking with overlapping windows (e.g., 512 tokens, 50-token overlap) is the default in most tutorials and library examples. It works well enough on the test set. Real-world documents are not structured like tutorial datasets.
Policy documents, legal agreements, and technical manuals have hierarchical structure: sections, subsections, definitions, conditions, exceptions. Fixed-size chunking treats all of this as a flat token stream. A paragraph-level conditional ("The following applies only if the account was opened before January 2024") may end up in a different chunk from the condition it governs.
Semantic chunking — splitting at natural semantic boundaries rather than fixed token counts — produces substantially better retrieval quality. The trade-off is compute cost and implementation complexity. Small chunks improve precision but lose context. Large chunks preserve context but reduce precision.
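For a concrete sense of the difference, here is a minimal sketch of both strategies: the fixed-size default, and one simple structure-aware alternative that splits on section boundaries instead of raw token counts. The 512/50 numbers are the common tutorial defaults mentioned above, not a recommendation, and the section format is an assumed output of whatever document parser you use.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Split a flat token list into overlapping windows, ignoring structure."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

def structure_aware_chunks(sections, max_tokens=512):
    """A crude structure-aware alternative: keep each section intact,
    falling back to fixed-size splitting only when a section is too long.
    `sections` is a list of (heading, token_list) pairs from a document parser."""
    chunks = []
    for heading, tokens in sections:
        if len(tokens) <= max_tokens:
            chunks.append((heading, tokens))
        else:
            for piece in fixed_size_chunks(tokens, chunk_size=max_tokens):
                chunks.append((heading, piece))
    return chunks
```

Even the structure-aware version has to fall back to fixed splitting when a section runs long, which is exactly where the precision-versus-context trade-off reappears.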
There is no universal right answer. The right answer depends on your document structure, your query distribution, and how much retrieval precision you need before generation quality degrades. Most teams discover this by observing production failures, not by reasoning about it in advance.
Relevance Scores Are Not Accuracy Scores
The score your vector database returns with each retrieved chunk tells you how similar that chunk is to the query embedding, measured in a particular embedding space. It does not tell you:
- Whether the chunk contains the answer
- Whether the answer in the chunk is correct for this user's specific context
- Whether a more accurate answer exists elsewhere in the corpus
Teams build dashboards tracking average similarity scores on retrieval queries and interpret rising scores as system improvement. They're measuring the wrong thing.
RAGAS — the Retrieval-Augmented Generation Assessment framework published by Shahul Es et al. in 2024 — introduced metrics for evaluating both retrieval quality and generation faithfulness independently. Context precision, context recall, answer faithfulness, answer relevance. These are the numbers that matter. Most production RAG systems never compute them.
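None of this requires adopting the full framework on day one. The sketch below computes crude stand-ins for two of the retrieval-side metrics over a labeled evaluation set; RAGAS itself derives the judgments with an LLM and defines the metrics more carefully, so treat this as the shape of the measurement, not the framework's implementation.

```python
def context_precision(question, retrieved_chunks, judge_relevant):
    """Fraction of retrieved chunks judged relevant to the question.
    `judge_relevant(question, chunk) -> bool` may be a human label or an LLM judge."""
    if not retrieved_chunks:
        return 0.0
    relevant = sum(judge_relevant(question, c) for c in retrieved_chunks)
    return relevant / len(retrieved_chunks)

def context_recall(retrieved_chunks, required_facts, fact_in_context):
    """Fraction of the facts a correct answer needs that appear somewhere in the
    retrieved context. `fact_in_context(fact, chunks) -> bool` is the judge."""
    if not required_facts:
        return 1.0
    found = sum(fact_in_context(f, retrieved_chunks) for f in required_facts)
    return found / len(required_facts)
```

Even crude versions like these, computed over a labeled query set, measure what a similarity dashboard cannot: whether retrieval is actually surfacing answer-bearing context.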
The teams that discover their retrieval quality is poor do so by reading their support tickets, not their dashboards.
The Problem Is Evaluation, Not Implementation
The gap between RAG demos and RAG production is an evaluation gap. Demos are tested with questions that were written knowing the answer exists in a small, clean test corpus. Production queries are written by users who don't know how the system works, are asking in their own words about their own situations, and have no idea their question has eight relevant documents and a retrieval pipeline that may return none of them.
This is a design problem: teams don't build evaluation infrastructure before they build the system. They build the system, ship it, and let customers do the evaluation for them.
The standard evaluation suite — a handful of golden Q&A pairs tested against the full corpus — measures retrieval performance on questions you already know the system should handle. It measures nothing about the questions you didn't anticipate.
Robust RAG evaluation requires:
- A test set of adversarial queries — questions where the answer is in the corpus but requires non-obvious retrieval
- Failure mode analysis — what types of queries consistently retrieve irrelevant chunks
- Per-document-type coverage — does retrieval quality degrade on specific document formats or lengths
- Human evaluation of a random sample of production queries, not just the golden set
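A minimal harness for the first two items can be as small as the sketch below: a set of hard queries, each labeled with the document that actually answers it, scored by whether retrieval surfaces that document in the top k. The function names and the labeled-set format here are illustrative assumptions, not a standard.

```python
from collections import Counter

def audit_retrieval(labeled_queries, retrieve, k=5):
    """labeled_queries: list of dicts with 'query', 'answer_doc_id', and 'tag'
    (e.g. 'multi-hop', 'paraphrase', 'long-document').
    `retrieve(query, k)` returns chunks as dicts carrying a 'doc_id' key."""
    hits, misses_by_tag = 0, Counter()
    for item in labeled_queries:
        retrieved_ids = {chunk["doc_id"] for chunk in retrieve(item["query"], k)}
        if item["answer_doc_id"] in retrieved_ids:
            hits += 1
        else:
            misses_by_tag[item["tag"]] += 1
    recall_at_k = hits / len(labeled_queries)
    return recall_at_k, misses_by_tag
```

Run on every chunking or embedding change, the per-tag miss counts double as the failure-mode analysis from the list above.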
This is not complicated. It's just invisible until there are customers in the loop, at which point it becomes urgent.
What Production RAG Actually Requires
RAG is an information retrieval problem that teams keep treating as a prompt engineering problem. The skills required are closer to search engineering — query understanding, index design, retrieval ranking, relevance evaluation — than to LLM fine-tuning or prompt optimization.
Teams that ship reliable RAG systems treat the retrieval pipeline as a first-class engineering problem. They iterate on chunking strategies with offline retrieval metrics before they touch the generation layer. They run regular retrieval audits on samples of production queries. They separate retrieval evaluation from generation evaluation and staff both.
The system that worked in the demo and fell apart in production had one problem: the demo was designed to succeed. Production wasn't.
Related: Your LLM's 200K Context Window Is Mostly Theater — on why effective context length and advertised context length are different problems. And LLM Evals Have a Production Gap — on why evaluation infrastructure tends to lag the systems it's supposed to evaluate.
Cover photo by Brett Sayles via Pexels.