Your AI Feature Has No Tests. You Just Don't Know It Yet.

You shipped the AI feature three months ago. Users haven't complained. The team moved on to the next thing. And somewhere in the background, the model that powers it got updated, the prompt template got tweaked in a hotfix, and the retrieval system's embedding model was quietly upgraded. The outputs are different now. You have no idea.
This is how most AI features work in production. Not with dramatic failures — with slow, undetected drift that nobody's measuring.
The Gap Nobody Talks About
There's a version of the AI deployment problem that gets a lot of attention: the gap between piloting an AI system and actually shipping it. That story is real. The numbers on enterprise AI deployment are stark — most announced AI initiatives don't survive contact with production.
But there's a smaller, quieter problem that comes after deployment: the gap between "it's live" and "we know it's working." The first gap is about whether you ship. The second is about whether you know what you shipped.
Most teams close the first gap through sheer determination. They get the agent live, the chatbot answering, the code reviewer running. Then they declare victory and move on. The second gap — the one that tells you if your AI feature is getting better, worse, or silently broken — gets left open indefinitely.
That's not a moral failing. It's a hard problem nobody solved for you.
What Testing Means for LLMs
Unit tests work on deterministic systems. You write a function, you assert the output, you ship with confidence. LLMs aren't deterministic. The same prompt, run twice, produces different outputs. The "correct" answer for a generative task is often subjective. What does it even mean to test this?
The answer is: you test what you can define. Not every output, but the outputs that matter most.
OpenAI published their evals framework as open source in March 2023. The repository now has over a thousand contributed evaluation sets covering everything from code generation to factual accuracy. The fact that OpenAI open-sourced evals — rather than treating them as proprietary infrastructure — tells you something: they know the problem is widespread, and they want the ecosystem to solve it.
What a production eval suite for an AI feature actually looks like:
Golden sets — a fixed collection of inputs where you know what "good" looks like. Not perfect outputs, but tagged examples: does this response answer the question? Is this code syntactically valid? Does this summary capture the key point? You run your system against these periodically and track whether performance is stable or drifting.
Regression suites — specific inputs that previously caused failures, kept as permanent tests. Every time a user reports a bug and you fix it, the bug becomes a regression test. This sounds obvious. Almost no AI teams do it.
Behavioral smoke tests — basic sanity checks you run after any system change. "Does the response still stay under 200 words?" "Does it still refuse this category of request?" "Does this edge case still resolve correctly?" These aren't comprehensive, but they catch the most common failure modes from model updates and prompt changes.
The academic equivalent of this is HELM — the Holistic Evaluation of Language Models from Stanford — a systematic evaluation framework covering accuracy, calibration, robustness, fairness, and efficiency across dozens of scenarios. The point isn't that you should run HELM on your product. The point is that the people building these models think systematically about evaluation, and most teams shipping products built on those models don't.
The Decay Problem
Here's the specific failure mode that kills AI features slowly:
You build a customer support bot on GPT-4. It works well. Six months later, the model behind that API endpoint gets updated. The provider doesn't tell you — that's considered normal — and the outputs are subtly different. Your prompts, written for the old model's behavior, now produce slightly worse results. Your golden-set scores would show a drop. But you don't have golden-set scores. So you find out when a user tweets about it.
This isn't hypothetical. It's the documented experience of every team that's shipped AI features and then gone back to analyze what went wrong. The vector isn't a dramatic failure. It's a gradual shift in tone, accuracy, or relevance that nobody catches because nobody was watching for it.
The same thing happens with retrieval systems. If you're running a RAG pipeline and the embedding model gets updated (or you switch providers), your similarity scores change. Documents that used to rank at the top of retrieval now don't. The LLM still generates fluent text — it just generates it about the wrong context. Your unit tests pass. Your users get worse answers.
Prompt drift compounds this. Someone tweaks a prompt to fix a specific user complaint. The fix works for that case and creates a regression for three others. Without a golden set covering those cases, the regression ships. This is comprehension debt at the system level — nobody understands the full behavior of the thing they built, so every change is a guess.
Why Nobody Builds Evals
If the problem is this clear, why don't teams build evaluation infrastructure?
Three reasons, in descending order of honesty.
One: evals are expensive to build correctly. Creating a meaningful golden set requires domain expertise, human annotation, and careful design. You have to decide what "good" looks like before you can measure it. For subjective tasks — summarization, tone, helpfulness — this is genuinely hard. Teams know they could do it wrong and get false confidence, so they don't do it at all.
Two: the return is invisible. An eval suite that catches a regression before it ships — you never see what you prevented. The value is entirely counterfactual. It's very easy to deprioritize infrastructure that doesn't have a visible payoff, especially when you're understaffed and shipping under pressure.
Three: most teams aren't sure what "correct" looks like. This is the honest one. You can only test against a definition of success, and many AI features get shipped before anyone has formally defined what success looks like beyond "the demo seemed good." If you don't know what the right output is, you can't test for it.
None of these are permanent obstacles. They're just reasons why evaluation infrastructure, like documentation and test coverage, gets deferred until something breaks.
The Standard Worth Holding
The companies building the models you're using are obsessed with evals. Anthropic publishes detailed model cards and internal evaluation methodology. OpenAI's evals repo. Google's BIG-bench. The model providers treat systematic evaluation as table stakes — not because they're cautious by nature, but because they learned the hard way that you cannot know what you have without measuring it.
The teams shipping products on top of these models inherit the outputs but not the methodology. That gap is the real AI reliability problem. Not misalignment. Not hallucination as an abstract risk. The practical problem that your feature degraded six weeks ago and you have no way to know.
The minimum viable eval infrastructure for an AI feature isn't complicated: twenty hand-labeled examples covering your core use cases, a script that runs your system against them and outputs pass/fail, and a calendar reminder to run it after any system change. That's it. It won't catch everything. It will catch most things.
The question isn't whether you have time to build evals. It's whether you have time for the incident when you find out you needed them.
Photo by Daniil Komov via Pexels.