Machine Learning Models Don't Get a Launch. They Get a Rollout.

The model passed evaluation. Accuracy was up. NDCG looked strong. The team was confident.
Two days after the full production launch, a monitoring alert fired. The model was performing significantly worse on a specific user segment — one that happened to be underrepresented in the offline evaluation dataset. By the time the issue was identified and reversed, it had touched tens of thousands of users.
This is the standard ML deployment failure pattern. And it has nothing to do with the quality of the model.
The Offline-Online Gap That Teams Keep Discovering
In 2015, engineers at Google published "Hidden Technical Debt in Machine Learning Systems" — a paper that named problems practitioners had been running into for years without a shared vocabulary. One of the most expensive: the assumption that offline metrics predict online behavior.
They don't. Not reliably. Not in ways you can predict in advance.
The gap exists for several compounding reasons. Your evaluation dataset, however carefully constructed, is a snapshot of historical data filtered and labeled under conditions that don't fully match what real users actually do. Real users arrive with edge cases your dataset didn't weight correctly. They come with usage patterns that shift seasonally, after product changes, and in response to external context you didn't model. They interact with your model in ways your offline experiments didn't simulate.
The result: a model with excellent offline metrics can have neutral, mediocre, or actively harmful online performance. Teams that have shipped ML at scale have all experienced this. Teams new to it are about to.
The question isn't whether the offline-online gap exists. It's whether your deployment process accounts for it.
The Step Most Teams Skip: Shadow Mode
Before routing live user traffic to a new model, teams that ship ML well run the model in shadow mode. The new model receives the same inputs as the production model and runs its predictions — but those predictions are not served to users. Both the current model's outputs and the new model's outputs are logged and compared.
Shadow mode evaluation surfaces things offline metrics cannot:
- How the new model's predictions compare to the current model on actual production inputs, including the inputs your eval set didn't cover
- Whether the prediction distribution has shifted in ways your test set didn't surface
- Whether the model handles high-traffic periods, rare edge cases, and unusual inputs correctly
- What latency looks like at real production scale, not just test infrastructure
This is expensive infrastructure work. It requires running two models simultaneously, logging their outputs, building tooling to make the comparison legible, and defining what you're looking for in the comparison. Teams skip it because it feels like delay — weeks of infrastructure work when the model already passed eval.
It is delay. It's the delay that prevents the production incident.
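The core of shadow mode can be sketched in a few lines. This is a minimal illustration, not a production design: `ShadowRunner`, `ShadowRecord`, and `disagreement_rate` are hypothetical names, the models are assumed to be plain callables, and the log is an in-memory list standing in for real logging infrastructure. A real system would usually run the shadow call asynchronously so it cannot add latency to the user-facing request.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ShadowRecord:
    """One logged comparison between the production and shadow models."""
    inputs: Any
    prod_output: Any
    shadow_output: Any      # None if the shadow model raised
    shadow_error: str = ""  # exception text, empty when the call succeeded


@dataclass
class ShadowRunner:
    prod_model: Callable[[Any], Any]    # hypothetical: any callable model
    shadow_model: Callable[[Any], Any]
    log: List[ShadowRecord] = field(default_factory=list)

    def predict(self, inputs: Any) -> Any:
        prod_output = self.prod_model(inputs)
        # The shadow call must never affect what the user sees, so any
        # failure is caught and logged instead of propagated.
        try:
            shadow_output, error = self.shadow_model(inputs), ""
        except Exception as exc:
            shadow_output, error = None, repr(exc)
        self.log.append(ShadowRecord(inputs, prod_output, shadow_output, error))
        return prod_output  # only the production prediction is served

    def disagreement_rate(self) -> float:
        """Fraction of logged requests where the two models differ."""
        if not self.log:
            return 0.0
        differing = sum(r.prod_output != r.shadow_output for r in self.log)
        return differing / len(self.log)
```

Exact-match disagreement only makes sense for discrete outputs. For scores or rankings, the comparison would more likely be a distance or overlap measure chosen for the task.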
Traffic Splitting and What to Measure
Shadow mode tells you whether your model produces reasonable outputs on production inputs. It doesn't tell you how users respond to those outputs, which is the actual question.
The next step is traffic splitting: routing a small percentage of real user traffic to the new model while keeping the majority on the current model. Teams typically start at 1-5%, with a defined ramp schedule and specific metrics that determine whether to continue ramping or roll back.
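One common way to implement the split is deterministic hashing, so that a given user always sees the same model during the experiment. The sketch below assumes string user IDs and a per-experiment name used as a salt. The function name `assign_variant` and the choice of 10,000 buckets are illustrative, not taken from any particular library.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_pct: float) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing (experiment, user_id) keeps a user's assignment stable across
    requests, and raising treatment_pct only moves users from control
    into treatment, never the other way.
    """
    if not 0.0 <= treatment_pct <= 100.0:
        raise ValueError("treatment_pct must be between 0 and 100")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 10,000 buckets = 0.01% resolution
    return "treatment" if bucket < treatment_pct * 100 else "control"
```

Salting with the experiment name means the 1% of users who land in treatment for one rollout are not the same 1% for the next one.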
The metrics that matter for traffic splitting are not the same as the metrics that drove offline evaluation. You're not measuring model accuracy against a held-out dataset. You're measuring user behavior: did users who received the new model's outputs do the thing you want them to do? Did they convert, engage, purchase, return? Did they contact support, abandon, express frustration?
This distinction matters because optimizing offline metrics and optimizing user outcomes are not the same problem. The recommendation model that scores better on NDCG might produce recommendations that users find less relevant in practice. The classifier with higher F1 might generate outputs that create downstream problems in your product flow. You won't know until you measure user behavior with real traffic.
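For a binary user outcome such as conversion, one simple way to compare the two arms is a two-proportion z-test. The sketch below is an assumption about how a team might do this, not a prescribed method. It uses the normal approximation, which needs reasonably large samples in both arms, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two_sided_p) for the difference in conversion rates b - a.

    conv_* are counts of users who converted; n_* are users exposed.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both arms all-convert or none-convert: no evidence
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A negative `z` with a small p-value means the new model's arm converts worse than control. Checking this repeatedly during a ramp inflates the false-positive rate, so the decision thresholds should account for how often the test is run.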
The same class of problem shows up in design systems that codify patterns at scale before validating them with real users: assumptions encoded early spread everywhere before anyone has confirmed they are correct. ML deployment without traffic splitting does the same thing. The model's assumptions about the world reach all users before anyone has checked what those assumptions actually produce.
Covariate Shift: Why Production Is Not Staging
One major cause of the offline-online gap has a specific name: covariate shift. The input distribution in production drifts away from the distribution the model was trained on, so the model's behavior degrades in ways that offline metrics, computed on the old distribution, don't catch.
Covariate shift happens constantly. User populations change. Product features change. External context changes. A recommendation model trained on behavior from six months ago is now operating on different inputs than it was designed for. A content classifier trained before a platform policy changed is encountering language patterns it hasn't seen. A pricing model trained on last year's demand signals is applying those signals to this year's market.
The teams that handle covariate shift well monitor it continuously: comparing the distribution of production inputs to training inputs, tracking model output distributions over time, alerting when distributions drift past defined thresholds. This is not a launch-and-forget problem. It's an ongoing operational responsibility that begins the moment the model goes live.
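One widely used drift measure for a single numeric feature is the population stability index (PSI), which compares binned proportions of the training sample and a production sample. The sketch below is a minimal version under stated assumptions: equal-width bins taken from the training range, and a small floor on empty bins. The function name and defaults are illustrative.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population stability index of `actual` relative to `expected`.

    Bin edges are equal-width over the range of `expected` (the training
    sample); production values outside that range fall in the end bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A commonly quoted rule of thumb treats PSI above roughly 0.2 as a shift worth investigating, but that cutoff is a convention rather than a guarantee. Alert thresholds should be calibrated against how much each feature normally moves.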
This is the framing shift that separates teams that ship ML reliably from teams that don't: production ML is not a deployment. It's an ongoing system that requires ongoing monitoring, defined retraining triggers, and practiced rollback conditions. Teams that treat model shipping like software shipping — launch, move on, revisit when something breaks — are running a system they don't understand.
The Rollback Plan You Haven't Written
Most teams have some version of success metrics and traffic ramp criteria. Few have a rollback runbook.
A rollback runbook is not documentation of how rollback works in principle. It's a specific document that answers: when exactly does rollback trigger, who is authorized to trigger it, what are the exact steps, how long does a full rollback take, and what is the user impact during the rollback window?
The reason this matters is that problems in production do not arrive during business hours with time to think. They arrive at 3am on a Saturday, discovered by an on-call engineer who did not design the model, who is looking at alerts that weren't set up to be self-explanatory, who needs to make a judgment call about whether this is a real problem and what to do about it.
The team with a written rollback runbook, practiced at least once before the production launch, handles that incident in twenty minutes. The team that figures out rollback in real time during an incident handles it in several hours with more user impact and more post-incident anxiety about what else they might not have thought of.
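The "when exactly does rollback trigger" part of a runbook can be written as data rather than prose, which leaves the on-call engineer less to interpret. The sketch below is one hypothetical shape for that. The metric names and thresholds in the usage note are made-up examples, and treating missing data as a breach is an assumption a team may or may not want.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RollbackRule:
    metric: str       # name of the monitored metric
    threshold: float  # value at which rollback triggers
    direction: str    # "above" or "below"

    def breached(self, value: float) -> bool:
        if self.direction == "above":
            return value > self.threshold
        if self.direction == "below":
            return value < self.threshold
        raise ValueError(f"unknown direction: {self.direction!r}")


def breached_rules(rules: List[RollbackRule], metrics: Dict[str, float]) -> List[str]:
    """Return human-readable reasons to roll back; empty means keep serving.

    A metric missing from `metrics` counts as a breach, on the assumption
    that losing visibility is itself a reason to fall back to the old model.
    """
    reasons = []
    for rule in rules:
        if rule.metric not in metrics:
            reasons.append(f"{rule.metric}: no data")
        elif rule.breached(metrics[rule.metric]):
            reasons.append(
                f"{rule.metric}={metrics[rule.metric]} is {rule.direction} {rule.threshold}"
            )
    return reasons
```

For example, rules such as `RollbackRule("p95_latency_ms", 250, "above")` and `RollbackRule("conversion_rate", 0.02, "below")` turn an ambiguous alert into a list of stated reasons that can go straight into the incident channel.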
Building the Right Primitives
The practical shape of ML deployment tooling is different from feature flag tooling for software. A software feature flag routes users to one code path or another. An ML deployment primitive needs to:
- Route a percentage of queries to a new model while keeping the rest on the current model
- Log both models' outputs for every query in the experiment
- Compute the comparison metrics you defined (user behavior, not just model accuracy) in near-real time
- Support a ramp schedule — not just manual percentage adjustments, but automatic ramp on success or automatic rollback on failure
- Make it easy to export the experiment logs for deeper analysis after the fact
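The ramp-schedule requirement in the list above can be sketched as a small state machine. This is an illustrative design, not a reference implementation: the class name `RampController`, the default schedule, and the two boolean inputs (`guardrails_ok`, `enough_data`) are all assumptions, and a real system would derive those inputs from the experiment's logged metrics.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RampController:
    schedule: Sequence[float] = (1.0, 5.0, 25.0, 50.0, 100.0)  # percent of traffic
    step: int = 0
    rolled_back: bool = False

    @property
    def treatment_pct(self) -> float:
        return 0.0 if self.rolled_back else self.schedule[self.step]

    def evaluate(self, guardrails_ok: bool, enough_data: bool) -> float:
        """Advance, hold, or roll back; returns the new traffic percentage."""
        if self.rolled_back:
            return 0.0               # rollback is sticky until a human resets it
        if not guardrails_ok:
            self.rolled_back = True  # failure: send all traffic to the current model
        elif enough_data and self.step < len(self.schedule) - 1:
            self.step += 1           # success with enough evidence: next stage
        return self.treatment_pct
```

Making rollback sticky is a deliberate choice in this sketch: an automated system that ramps back up on its own after a failure can oscillate, so re-enabling the experiment is left to a person.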
Teams that have built this infrastructure treat it as foundational MLOps tooling, on the same level as model training pipelines and serving infrastructure. Teams that haven't tend to jury-rig something each time — A/B test infrastructure that isn't quite right, feature flags that don't log what you need, manual comparisons in notebooks.
The jury-rigged approach works well enough for the first few deployments. It creates compounding overhead as the team ships more models, and it's the reason "we need to do a proper rollout" keeps getting pushed to the next sprint.
The model probably works. But "probably works" is not a deployment strategy. It's a prelude to a 3am alert.