AI Features Don't Fail Tests. They Fail Differently Every Time.


In 2022, Air Canada's chatbot told a grieving customer that they could book a full-price ticket and get a bereavement discount applied retroactively. That wasn't true. The company's policy didn't work that way. When the dispute reached a British Columbia tribunal in 2024, Air Canada's defense was that the chatbot was a separate entity and the airline wasn't responsible for what it said.

They lost. They had to pay.

Somewhere in that company's engineering pipeline, there were tests. The tests passed. The chatbot shipped. The hallucination made it into production and into a courtroom.

This is what happens when you build AI features and test them like regular software.

Why Your Tests Pass and Your Feature Still Fails

Traditional software testing works because the code is deterministic. Given the same inputs, it produces the same outputs. You write a test against an expected output, you run it a thousand times, and it either passes or it doesn't. The moment something changes, the test catches it.

LLMs don't work this way.

A March 2026 paper introducing BrittleBench (arXiv 2603.13285) found greater than 70% performance variance across semantically equivalent prompts — prompts that mean exactly the same thing, worded differently. The model produces different quality outputs not because the prompt changed in substance, but because it changed in phrasing.
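One way to make phrasing sensitivity visible in your own feature is to run the same behavioral check across several paraphrases of one question. The sketch below is illustrative only: `fake_model`, `meets_contract`, and the paraphrase list are hypothetical stand-ins invented for this example, not anything from the BrittleBench paper, and the stub is deliberately rigged to be phrasing-sensitive.

```python
# Hypothetical stand-in for a real model call, kept deterministic so the
# sketch runs offline. It does NOT reflect how any real model behaves.
def fake_model(prompt: str) -> str:
    # Pretend the model is only correct when the word "refund" appears.
    if "refund" in prompt.lower():
        return "No. Bereavement refunds must be requested before travel."
    return "Yes, you can claim the discount after your trip."


def meets_contract(output: str) -> bool:
    """Behavioral check: the answer must not promise a post-travel discount."""
    return "after your trip" not in output.lower()


# Made-up paraphrases that a human would read as the same question.
PARAPHRASES = [
    "Can I get a bereavement refund after I fly?",
    "Is a bereavement discount available once my trip is over?",
    "After travelling, may I request a bereavement refund?",
    "Do you apply bereavement fares retroactively?",
]


def paraphrase_report(model, prompts):
    """Return the pass rate across prompts and the prompts that failed."""
    failed = [p for p in prompts if not meets_contract(model(p))]
    pass_rate = 1 - len(failed) / len(prompts)
    return pass_rate, failed


rate, failures = paraphrase_report(fake_model, PARAPHRASES)
print(f"pass rate across paraphrases: {rate:.2f}")
for prompt in failures:
    print("failed on:", prompt)
```

A test written against only the first paraphrase would pass here while half of the equivalent phrasings fail, which is the kind of gap a single-prompt test cannot see.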

It gets stranger. Research published at NAACL 2025 documented accuracy swings of 30 to 70 percentage points between models on identical benchmark prompts, depending on output routing and model internals. A 2024 study on production LLM pipelines (arXiv 2404.09398) found flaky test rates of 11 to 27 percent — tests that pass sometimes and fail other times on the same inputs, not because the code is broken but because the model isn't repeating itself.

The test suite that gives you a green checkmark is measuring whether the model produced the right answer this time. It tells you nothing about what happens next time, or in production, or when the model provider silently updates the weights behind the same API endpoint.
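To see why a single green run says so little, it can help to treat each test as a sample rather than a verdict. The following is a minimal sketch, assuming a hypothetical `flaky_model` stub that gives a wrong answer roughly 20 percent of the time (a number chosen for illustration, not taken from the studies above). It compares one trial with many.

```python
import random


def flaky_model(prompt: str, rng: random.Random) -> str:
    # Hypothetical stub: wrong about 20% of the time, by construction.
    if rng.random() < 0.2:
        return "Yes, the discount can be applied retroactively."
    return "No, bereavement fares must be requested before travel."


def is_acceptable(output: str) -> bool:
    # Simplified check for this sketch: acceptable answers start with "No".
    return output.lower().startswith("no")


def pass_rate(prompt: str, trials: int, seed: int = 0) -> float:
    """Run the same prompt `trials` times and return the fraction that pass."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    passes = sum(is_acceptable(flaky_model(prompt, rng)) for _ in range(trials))
    return passes / trials


QUESTION = "Can I claim a bereavement fare after flying?"
single = pass_rate(QUESTION, trials=1)
many = pass_rate(QUESTION, trials=500)
print(f"1 trial: {single:.2f}   500 trials: {many:.3f}")
```

The single trial can only ever report 0 or 1. The repeated run estimates how often the feature behaves, which is the number you would actually want to track between releases.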

What Teams Are Actually Doing (and Not Doing)

A ZenML survey of 1,200 production deployments in 2025 produced one of the more damning observations about the state of AI testing: most teams test LLM features less rigorously than login forms.

Login forms get integration tests, regression suites, edge case coverage, accessibility checks. The AI feature that generates legal summaries or medical disclaimers or customer-facing responses? It often gets a few manual spot-checks before shipping and a Slack channel for reporting issues.

The reason isn't laziness. It's that the standard testing playbook doesn't translate. Developers reach for the testing infrastructure they know, discover that exact-match assertions don't fit non-deterministic output, and freeze. The problem is real. The alternative isn't obvious.

A separate 2025 survey by deepsense.ai found that 96% of organizations are using generative AI features in some capacity. More than 25% named testing and compliance as active deployment blockers. That gap — between shipping AI features and having any serious quality framework around them — is where the Air Canada situations come from.

What Eval-Driven Development Actually Means

The most useful framing I've seen comes from a November 2024 paper by Xia et al. (arXiv 2411.13768), which defines eval-driven development as a formal methodology where evaluation functions as a continuous governing mechanism rather than a terminal checkpoint.

The key word is continuous. In traditional development, tests run at commit time or in CI. They're a gate. In eval-driven development, evaluations run throughout the development cycle, in production, and feed evidence directly back into iteration decisions. You're not asking "did this pass?" You're asking "does this behave the way we intend, across the cases that matter, over time?"

The practical implementation has three pieces:

Behavioral contracts over output matching. Instead of asserting that the output equals a specific string, you define what the output must do. It must contain an acknowledgment of the user's question. It must not claim a policy that doesn't exist. It must stay under 200 words. These are testable against almost any output variation.

Golden datasets. A curated set of representative inputs with documented good and bad response patterns. Not expected exact outputs — documented criteria for what constitutes an acceptable response. You run new model versions, new prompts, and new configurations against the golden dataset and track how the behavioral contracts score.

LLM-as-judge scoring. Using a separate model to evaluate whether an output meets defined rubrics. A 2025 paper on rubric-based scoring (arXiv 2605.30568) describes criterion-separated approaches where each quality dimension gets its own judge prompt, calibrated against human ratings. The judge model doesn't produce the feature output — it evaluates whether the feature output met the intent.
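The first two pieces can be sketched in a few lines. What follows is an illustration under stated assumptions, not the method from Xia et al.: the `Contract` class, the keyword-based checks, the golden inputs, and the two `prompt_v*` stubs are all hypothetical, and a real "acknowledges the question" check would need something better than a keyword match (a judge model, for instance).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Contract:
    name: str
    check: Callable[[str, str], bool]  # (user_input, output) -> passed?


# Hypothetical phrases that would signal an invented policy.
FORBIDDEN_CLAIMS = ("retroactive", "after your trip")

CONTRACTS = [
    Contract("under_200_words", lambda q, out: len(out.split()) <= 200),
    Contract(
        "no_invented_policy",
        lambda q, out: not any(c in out.lower() for c in FORBIDDEN_CLAIMS),
    ),
    # Crude keyword proxy for "acknowledges the user's question".
    Contract("acknowledges_question", lambda q, out: "bereavement" in out.lower()),
]

# A tiny made-up golden dataset: inputs only, no expected output strings.
GOLDEN_INPUTS = [
    "Can I get a bereavement fare after I have already flown?",
    "My grandmother died. How do bereavement fares work?",
    "Is the bereavement discount applied at booking or later?",
]


def prompt_v1(question: str) -> str:
    # Stub for one configuration; the answer is a deliberate hallucination.
    return "Bereavement discounts can be applied retroactively within 90 days."


def prompt_v2(question: str) -> str:
    # Stub for a revised configuration.
    return "Bereavement fares must be requested before travel. Please contact us."


def evaluate(model, inputs, contracts) -> dict:
    """Score each contract as the fraction of golden inputs that satisfy it."""
    outputs = [(q, model(q)) for q in inputs]  # call the model once per input
    return {
        c.name: sum(c.check(q, out) for q, out in outputs) / len(outputs)
        for c in contracts
    }


scores_v1 = evaluate(prompt_v1, GOLDEN_INPUTS, CONTRACTS)
scores_v2 = evaluate(prompt_v2, GOLDEN_INPUTS, CONTRACTS)
print("v1", scores_v1)
print("v2", scores_v2)
```

Neither stub is compared against an expected string. Each configuration gets a score per contract, so a prompt change that fixes length but starts inventing policy shows up as a drop in one specific number.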

This isn't a perfect system. It introduces its own failure modes, including the judge model's own biases and inconsistencies. But a system with known failure modes that you're actively tracking is better than a green checkmark that tells you nothing about what actually happens when a grieving customer asks about bereavement fares.
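Tracking the judge's own failure modes can start with something as simple as comparing its verdicts with human ratings on a shared sample. The sketch below computes raw agreement and Cohen's kappa, a standard chance-corrected agreement statistic. The two label lists are made up purely for illustration, and using kappa here is this example's choice rather than something prescribed by the paper cited above.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on the same ten outputs: 1 = meets rubric, 0 = does not.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa(human, judge)
print(f"raw agreement: {agreement:.2f}   kappa: {kappa:.2f}")
```

Raw agreement alone can look flattering when most outputs are acceptable, since a judge that always says "pass" would agree often. Kappa discounts that, which makes it a more honest number to watch when you change the judge prompt or the judge model.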

The Frameworks That Exist

You don't have to build this from scratch. The tooling has matured considerably since 2023.

RAGAS, which came out of Y Combinator's W24 batch and has an EACL 2024 demo paper, focuses on retrieval-augmented generation evaluation with metrics for faithfulness, answer relevancy, and context precision. It's open source with 4,000+ GitHub stars and designed to run in CI pipelines.

DeepEval (Confident AI) ships more than 50 evaluation metrics and supports G-Eval, a framework where you write evaluation criteria in plain English and the system converts them into scoring rubrics. The UK AI Safety Institute's Inspect framework includes 200+ pre-built evaluations for safety and capability testing. Braintrust handles production observability with trace-based evaluation for multi-turn agents.

Anaconda's engineering team documented what happened when they moved from ad-hoc testing to systematic eval-driven development: success rates rose from a baseline range of 0 to 13 percent to a range of 63 to 100 percent after they implemented structured evaluation frameworks. That's not a marginal improvement. That's the difference between a feature that mostly fails and one that mostly works.

What You're Actually Testing When You Test AI

The gap between AI behavior and developer intent is harder to close than the gap between code behavior and expected output, because intent is semantic and outputs are variable. A test that checks the string is checking the wrong thing.

What you're actually trying to verify is whether the model does what you meant for it to do — in the cases that matter, often enough, with the failure modes you can tolerate. That's a fundamentally different question from "does this output equal this string."
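"Often enough" and "failure modes you can tolerate" become concrete once they are written down as thresholds. Here is a minimal sketch of a release gate, assuming per-contract pass rates have already been measured. The contract names, the threshold values, and the split between blocking and non-blocking checks are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    min_pass_rate: float  # how often is "often enough"
    blocking: bool        # whether a miss should stop the release


# Hypothetical tolerances: invented policy is never acceptable,
# while an occasional long answer is only worth a warning.
THRESHOLDS = {
    "no_invented_policy": Threshold(1.00, blocking=True),
    "acknowledges_question": Threshold(0.95, blocking=True),
    "under_200_words": Threshold(0.90, blocking=False),
}


def gate(scores, thresholds):
    """Split contracts that miss their threshold into blockers and warnings."""
    blockers, warnings = [], []
    for name, threshold in thresholds.items():
        observed = scores.get(name, 0.0)  # a missing score counts as a miss
        if observed < threshold.min_pass_rate:
            (blockers if threshold.blocking else warnings).append(name)
    return blockers, warnings


# Example pass rates, as they might come out of an evaluation run.
measured = {
    "no_invented_policy": 0.98,
    "acknowledges_question": 0.97,
    "under_200_words": 0.85,
}
blockers, warnings = gate(measured, THRESHOLDS)
print("release blocked by:", blockers)
print("warnings:", warnings)
```

A 98 percent pass rate would look excellent on a dashboard, yet it blocks the release here because the tolerance for invented policy was set to zero. Writing the table forces that decision to be made explicitly rather than discovered in production.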

The reason eval-driven development represents a real shift in practice is that it forces you to define what you meant before you discover what the model produced. That definition — the behavioral contract, the rubric, the golden dataset — is the actual specification. The model's output is evidence about whether the specification is being met.

A test suite that never fails for an AI feature isn't a sign that the feature works. It's usually a sign that the tests don't know what they're measuring.

