AI Coding Agents Don't Save Time — They Move the Bill

A team ships a feature on a Thursday. The AI agent wrote most of it, a human reviewed it in eleven minutes because the diff looked clean, and it merged before lunch. Sixty-one days later, a security researcher finds an authorization check that only fires on the happy path. Nobody in that Thursday standup did anything wrong. Nobody will connect the incident back to that PR, either, because by the time it blows up, the sprint it came from is three retros in the past.

That gap — between when the cost gets created and when it gets paid — is the whole story of AI-assisted coding right now, and almost nobody is measuring it.

The conventional pitch is simple: AI coding agents make developers faster, therefore teams ship more, therefore this is unambiguously good. Individual throughput numbers back this up: developers using coding agents complete tasks roughly 35 to 45% faster than they did before, according to industry data collected across enterprise engineering orgs in Anthropic's 2026 Agentic Coding Trends Report. That's a real number and I'm not going to pretend otherwise. But it's an individual number, measured at the wrong altitude. Zoom out to the team level and the picture inverts: pull requests opened per team are up roughly 98%, review cycle time is up roughly 91%, and code churn (the rate at which recently written code gets rewritten or reverted) has climbed from around 3.1% to 5.7%. Faster individual output, slower and messier collective delivery. That's not a paradox. That's a redistribution.

Why "35% Faster" Doesn't Mean What You Think It Means

Here's where it gets uncomfortable, because the individual-speed claim itself doesn't survive contact with controlled measurement. METR ran a randomized trial in early 2025 on experienced open-source developers working real issues in their own repositories: 16 developers, 246 completed tasks, with each task randomly assigned to either allow or disallow AI tools. The result, documented directly by METR: when AI was allowed, developers took 19% longer to finish, not 19% less. Before the study, those same developers expected AI to speed them up by 24%. After finishing the study, after living the slowdown, they still believed AI had made them 20% faster.

Sit with that. The perception gap didn't close after the experience. It survived the experience.
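To put a size on that gap, here is a minimal sketch using only the three percentages quoted above. It assumes all three can be read as a percent change in task completion time (negative means faster, positive means slower); that sign convention and the function name are illustrative choices, not METR's own analysis.

```python
# The three figures quoted in the text, read as percent change in completion time.
# Assumption: negative = faster, positive = slower.
FORECAST_CHANGE = -24.0  # what developers expected before the study
BELIEVED_CHANGE = -20.0  # what they still believed after the study
MEASURED_CHANGE = 19.0   # what the trial measured


def perception_gap(believed: float, measured: float) -> float:
    """Percentage-point distance between measured and believed change."""
    return measured - believed


gap_before = perception_gap(FORECAST_CHANGE, MEASURED_CHANGE)
gap_after = perception_gap(BELIEVED_CHANGE, MEASURED_CHANGE)

print(f"gap before the study: {gap_before:.0f} points")  # 43 points
print(f"gap after the study:  {gap_after:.0f} points")   # 39 points
```

Under that reading, living through the slowdown moved the developers' belief by four points out of forty-three.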

This is the piece the "35% faster" framing skips. Vendor-reported velocity numbers usually come from self-report, from lines-of-code proxies, or from time-to-first-commit, all metrics that flatter agent output because agents are extremely good at producing volume fast. What they don't measure is how much rework, review, and cleanup that volume goes on to need. JetBrains' 2026 research on developer workflows found something structurally similar: AI users showed roughly 100 additional code deletions per month compared to about 7 for non-users, despite over 80% of surveyed developers self-reporting a productivity gain and half perceiving improved code quality. The work is getting redone. The developers doing it don't feel the redo as a cost, because the redo is spread across weeks and folded into "just editing," not logged as "fixing what the agent got wrong."

Speed, in other words, is being measured at the exact resolution that hides its own cost.

The Team-Level Numbers Nobody Puts In The Sprint Review

Individual velocity is the number that gets celebrated in standup. Team-level churn is the number nobody screenshots for the all-hands deck, because it doesn't flatter anyone. But it's the number that predicts what your engineering org will look like in a quarter.

Code churn climbing from 3.1% to 5.7%, a relative increase of about 84%, means more of what gets shipped gets touched again soon after, which is the textbook signature of code that wasn't right the first time. A 98% increase in PRs opened per team sounds like throughput until you pair it with a 91% increase in review cycle time: reviewers aren't reviewing more PRs faster, they're drowning in more PRs that each take longer to get through, because the volume of AI-authored diffs has outpaced the team's actual capacity to verify what's inside them. I wrote about this exact failure mode a few weeks back: the developer who reviews AI code faster is not the better developer, and this is the macro version of that same trap. Fast review of AI output isn't rigor, it's abdication wearing rigor's clothes. Scale that abdication across a whole team and you get exactly this: more PRs, slower cycles, more churn, and a review process that's technically "keeping up" while actually rubber-stamping.
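The arithmetic behind those team-level figures is short enough to check by hand. The sketch below uses only the percentages quoted in this piece. The combined "review load" figure rests on a simplifying assumption that is mine, not the source data's: that the 91% longer cycle time applies evenly to every one of the additional PRs. Both function names are made up for illustration.

```python
def relative_increase(before: float, after: float) -> float:
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100.0


def review_load_multiplier(pr_increase_pct: float, cycle_increase_pct: float) -> float:
    """Rough multiplier on total PR-time sitting in review.

    Assumes (hypothetically) that every PR sees the same cycle-time increase,
    so the two effects simply multiply.
    """
    return (1 + pr_increase_pct / 100.0) * (1 + cycle_increase_pct / 100.0)


churn_jump = relative_increase(3.1, 5.7)
load = review_load_multiplier(98, 91)

print(f"churn is up about {churn_jump:.0f}% in relative terms")  # about 84%
print(f"PR-time in review is roughly {load:.2f}x the old level")  # roughly 3.78x
```

Treat the second number as a back-of-envelope bound, not a measurement: cycle time includes waiting, not just reviewer effort.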

None of this shows up in a sprint velocity chart. Velocity charts count story points closed, not stability six weeks out. That's the whole mechanism by which this cost stays invisible for as long as it does.

Where The 2.74x Vulnerability Multiplier Actually Bites

Now the part that should actually worry you if you run a security or platform team. Research analyzing large-scale coding-agent session data — including the kind of failure-mode analysis catalogued in the arXiv paper on developer-agent misalignment across 20,574 real-world sessions — has converged on a figure that keeps recurring across independent studies of AI-generated code: roughly 2.74 times more security vulnerabilities than comparable human-written code. Not obviously broken code. Not code that fails a linter. Code that passes review, passes tests, ships, and sits quietly wrong.

The part that makes this dangerous instead of merely annoying is timing. These vulnerabilities don't fail at merge. They typically surface 30 to 90 days post-deploy: after the feature has shipped, after the team has moved three sprints past it, after the person who reviewed the PR has genuinely forgotten which lines they skimmed and which they actually read. Coding agents are disproportionately good at producing code that satisfies the immediate, visible spec (the tests you wrote, the happy path you described) while quietly missing the adjacent case nobody specified: the auth check that only guards one route variant, the input validation that covers the form but not the API endpoint behind it, the race condition that only shows up under load the staging environment never simulated.

A vulnerability that surfaces in code review costs a comment and a re-push. A vulnerability that surfaces 60 days post-deploy costs an incident, a postmortem, and — if you're doing this right — a very uncomfortable conversation about why your fastest quarter also produced your worst outage.

So Actually, This Isn't a Productivity Story

Here's the reframe, and it's not a twist, it's just what the numbers actually describe once you stop reading them at the individual level. This isn't AI making software development faster. It's AI moving cost from one place in the timeline to another — from the moment of writing to the moment of discovery, from the individual contributor's afternoon to the team's next quarter, from a visible line item to an invisible one.

Every organization already understands this pattern under a different name: technical debt. What's different here is the interest rate and the disguise. Traditional tech debt is usually visible to the person who takes it on — you know you're cutting a corner. AI-generated tech debt is invisible to the person taking it on, because the agent produced confident, plausible, test-passing code, and confidence reads as correctness even when it isn't. You're not knowingly borrowing. You're borrowing without knowing you signed anything, and the note comes due on someone else's sprint.

The teams that will come out ahead here aren't the ones with the highest individual velocity numbers. They're the ones treating AI-generated code as a liability class that needs its own accounting — tracked churn, tracked time-to-vulnerability-discovery, tracked review depth, not just review speed. Everyone else is going to keep celebrating the 35% and wondering, two months from now, why the roadmap keeps getting eaten by fires nobody saw coming.
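What that accounting might look like, as a minimal sketch: one record per merged PR and three numbers computed over the AI-authored subset. Every field and function name here is hypothetical, and the sample rows are invented for illustration (the first one mirrors the opening anecdote: an eleven-minute review, a vulnerability found sixty-one days later). A real version would pull these fields from your own version control and incident tooling.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import List, Optional


@dataclass
class MergedChange:
    """One merged pull request. All field names are hypothetical."""
    merged_on: date
    lines_added: int
    lines_reworked: int  # lines rewritten or reverted within your tracking window
    review_minutes: float
    ai_authored: bool
    vuln_found_on: Optional[date] = None  # set once a vulnerability is traced to this PR


def churn_rate(changes: List[MergedChange]) -> float:
    """Share of added lines that were later rewritten or reverted."""
    added = sum(c.lines_added for c in changes)
    reworked = sum(c.lines_reworked for c in changes)
    return reworked / added if added else 0.0


def review_minutes_per_100_lines(changes: List[MergedChange]) -> float:
    """Review depth: minutes of review per 100 added lines."""
    added = sum(c.lines_added for c in changes)
    minutes = sum(c.review_minutes for c in changes)
    return 100.0 * minutes / added if added else 0.0


def median_days_to_vuln_discovery(changes: List[MergedChange]) -> Optional[float]:
    """Median days from merge to vulnerability discovery, or None if none found."""
    delays = [(c.vuln_found_on - c.merged_on).days for c in changes if c.vuln_found_on]
    return median(delays) if delays else None


# Invented sample data, not measurements.
changes = [
    MergedChange(date(2026, 1, 8), 400, 40, 11, True, date(2026, 3, 10)),
    MergedChange(date(2026, 1, 9), 100, 2, 30, False),
]
ai_changes = [c for c in changes if c.ai_authored]

print(f"AI churn rate: {churn_rate(ai_changes):.1%}")
print(f"AI review depth: {review_minutes_per_100_lines(ai_changes):.2f} min per 100 lines")
print(f"AI median days to vuln discovery: {median_days_to_vuln_discovery(ai_changes)}")
```

The design choice that matters is splitting every metric by authorship. A blended churn or review-depth number would hide exactly the liability class this piece argues you need to see.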

What does your team's dashboard actually measure — how fast the code went out, or how long it stayed fixed?

Cover photo by Alberlan Barros via Pexels.