Nobody Can Reproduce Your Model's Results. Compliance Just Noticed.

Cover Image for Nobody Can Reproduce Your Model's Results. Compliance Just Noticed.

An examiner asks your team to reproduce the exact score your model gave a loan applicant six months ago. Same inputs, same model version, same everything on paper. You run it. The number is different. Not wildly different — close enough that nobody would have noticed if nobody had asked. But the examiner did ask, and "close enough" is not an answer that survives a hearing.

This is not a hypothetical edge case. It's the default state of most production ML pipelines, and almost nobody has budgeted for it.

The 50% Number Nobody Wants to Own

In April 2022, Odd Erik Gundersen, Kevin Coakley, and Christine Kirkpatrick published "Sources of Irreproducibility in Machine Learning: A Review" on arXiv, building a taxonomy of exactly where ML pipelines lose determinism between one run and the next. Their motivation was a number that had been circulating for years in ML research circles without anyone systematically explaining it: across multiple surveyed subfields, roughly half of published results couldn't be reproduced by independent teams using the authors' own described methods.

The paper's contribution wasn't the number — it was refusing to let the field blame the number on sloppiness. Gundersen, Coakley, and Kirkpatrick built a framework that separates irreproducibility into method-level causes (a paper failing to specify a hyperparameter) and implementation-level causes (the actual execution nondeterminism baked into modern deep learning stacks: nondeterministic GPU kernel scheduling, floating-point summation order that changes with thread count, library version drift between CUDA and cuDNN releases, and random seeds that get set for the model but not for the data loader). Their point: even a perfectly documented paper can be irreproducible, because the hardware and software stack underneath it isn't deterministic by default.

That distinction matters because it kills the easy fix. You cannot solve implementation-level irreproducibility by writing a better methods section. It requires pinning dependency graphs, controlling for hardware-specific kernel behavior, and logging enough about the training run that a different engineer, a year later, can tell you why the number moved — even if they can't make it not move.

Research Labs Could Shrug. Production Can't.

For a decade, this was an academic hygiene problem. A paper that doesn't reproduce is embarrassing, occasionally career-damaging, but rarely expensive. That calculus inverted the moment ML models started making decisions with legal consequences attached — credit underwriting, fraud flags, clinical triage, hiring screens.

Financial services and healthcare didn't invent model-risk management because they read the Gundersen paper. They invented it because regulators like the Federal Reserve (SR 11-7 guidance, still the operative US standard for model risk management) have required for over a decade that banks be able to explain, validate, and reproduce the outputs of any model used in a material decision. SR 11-7 predates the current wave of deep learning entirely — it was written for logistic regression and scorecards, models that are deterministic by construction. Banks are now running that same reproducibility bar against gradient-boosted ensembles and fine-tuned transformers, which are not deterministic by construction, and the gap between the compliance requirement and the technical reality is where the exposure lives.

The tell for how seriously an org is treating this: ask what happens when a model's output is challenged eighteen months after the fact. If the honest answer involves "we'd have to retrain and hope it lands close," that's not a documentation gap. That's an audit finding waiting for an auditor.

What the Fix Actually Costs

The instinct is to reach for full determinism — pin every seed, force deterministic CUDA kernels, disable the nondeterministic-but-faster GPU ops. It's technically possible. It's also frequently a 10-30% throughput hit on training, because the deterministic code paths for many GPU operations are the ones the vendor didn't bother optimizing, since almost nobody asks for them.

Teams that have actually priced this out land somewhere more pragmatic: they don't chase bit-exact reproducibility, they chase bounded reproducibility. That means logging the full environment (library versions, hardware type, seed values, data snapshot hash) at training time, defining an acceptable output-variance tolerance for the model class, and treating any run that falls outside that tolerance as a flagged event requiring investigation — not a training artifact to shrug at. It's the difference between "we can't tell you why the number moved" and "here's the documented range this model is allowed to move within, and here's proof this run stayed inside it."

That framing — bounded variance with a documented tolerance, rather than false claims of exact determinism — is also what survives contact with an examiner. Nobody credible expects bit-for-bit reproducibility from a stochastic training process. They expect you to have measured your own nondeterminism and be able to say, with evidence, that it's within a range you defined in advance rather than one you're discovering under oath.

The Debt Compounds in the Dark

Reproducibility debt behaves like every other kind of technical debt with one cruel difference: it's invisible until someone with subpoena power asks you to pay it down. A team that skips environment logging today isn't choosing to fail an audit — they're choosing not to know whether they'd pass one. The gap between those two positions is enormous, and it closes at the exact moment you have the least room to fix it: mid-examination, model already in production, applicant already denied.

The org that treats reproducibility as an ML-ops nice-to-have and the org that treats it as a documented, tested, tolerance-bound engineering requirement are running the same models. Only one of them can answer the question when it gets asked. If you're building anything that touches a regulated decision, the question is not whether that gap gets found. It's whether you're the one who finds it.

If you're already thinking about what else in your AI pipeline nobody's tested for drift, see production monitoring gaps in reference-dataset staleness.