AI Wrote the Code. Now Nobody Can Review It.

Three months after a team adopts an AI coding assistant, they're merging twice as many pull requests. Eighteen months in, their review queue is a disaster. Senior engineers are drowning in PRs they didn't ask for, junior engineers can't evaluate code they didn't write, and the thing everyone calls a "bottleneck" turns out to be something nobody budgeted to fix: human judgment.
This is the pattern METR documented in their 2025 study on AI-experienced open-source developers. AI-assisted developers completed 21% more tasks and merged 98% more pull requests. PR review time increased 91%. All three numbers are true. The productivity story and the crisis story are the same story.
The 91% Number Everyone Is Glossing Over
The industry consensus on AI coding tools has settled into a comfortable narrative: developers feel faster, and with the right habits, they actually are faster. That's true for writing code. It's not true for shipping code.
The METR study is the most rigorous look at this gap yet. It measured experienced developers — not students, not developers on their first week with Copilot — on real open-source projects. Even with experienced users, the review bottleneck grew. A 2026 analysis by LinearB found something even starker: developers using AI tools reported feeling 20% faster while being, by objective measurement, 19% slower end-to-end. The 39-point gap between perception and reality sits almost entirely in review and integration overhead.
Nobody talks about this because the "98% more PRs merged" headline is the good number. The 91% increase in review time is the number that means you need to hire more senior engineers or redesign how you work.
Coding Was Never the Bottleneck
In March 2026, InfoQ covered Agoda's finding after deploying AI coding tools across their engineering org: delivery velocity barely moved. The team had expected the AI to accelerate the part of engineering that absorbed the most visible effort: writing code. It did. That part just wasn't the bottleneck.
The actual constraint in software delivery is understanding — understanding what the code does, whether it does it correctly, what it might break, and whether the approach is the right one. Those questions require human judgment. A faster typist doesn't answer them faster. Neither does an AI coding assistant.
What AI changed is the volume of output that judgment now has to contend with. Before AI tools, a developer writing a feature would produce a manageable PR: 200 lines, maybe 400, in code they wrote themselves and understood intimately. After AI tools, the same developer produces a PR that might be 600 lines of code they steered but didn't write — and the person reviewing it has to understand all of it, not just the parts the human contributed.
Review was always the hard part. AI made it the broken part.
The Verification Tax
There's a specific reason AI-assisted code demands more review, not less.
When you write code yourself, you understand its structure before anyone else reviews it. You know why you made that tradeoff on line 47. You know what you didn't test. When an AI writes the code and you steer it, that implicit understanding is thinner. You know what the code is supposed to do. You may not know what it actually does.
Reviewers face the same problem at scale. An AI-generated PR isn't wrong more often than a human-written one — but the failure modes are different. AI code tends to be locally coherent and globally off: it solves the immediate problem while introducing subtle coupling, edge-case fragility, or architectural drift that a human who understood the codebase would have caught before writing the first line.
That means reviewers need to hold more context, check more edge cases, and spend more time on code that looks right. The 91% increase in review time isn't random noise. It's the measurable cost of the verification tax.
Who Can Actually Review AI-Generated Code
Here's the organizational problem nobody planned for: AI tools raise the level of the code before they raise the level of the reviewer.
A junior developer using Copilot or Claude can produce code that looks — and often functions — like the work of someone two levels above them. That's the tool working as intended. But the junior developer can't review that code the way a senior would. They don't know the codebase well enough to spot the drift, the coupling, the edge cases the AI missed.
The result is a review pipeline that only works at the top. Senior engineers review junior code that was actually written by AI at a senior-adjacent level. Their cognitive load per PR is higher, their review count is higher, and the queue is growing faster than headcount can fix it.
Some teams have responded by making PRs smaller — constraining AI output to more bounded tasks so that reviewers face less surface area per request. This works. The Sonar 2026 development summit identified small PRs, distributed review load, and automated quality gates as the three interventions with real evidence behind them. Not more AI in the pipeline — structural process redesign.
Redesigning Review Before the Queue Wins
The teams that have avoided the 91% trap have one thing in common: they redesigned the review process before deploying AI tools, not after.
The practical interventions aren't complicated, but they require an intentional choice to treat review capacity as a finite resource you design around rather than a free tax on senior engineers.
Scope discipline: AI tools encourage expansive code generation. Counter this by scoping tasks explicitly before involving the AI — not "refactor this module" but "change this function to handle null inputs, nothing else." The constraint is on the human, not the tool.
Automated quality gates: Linting, test coverage requirements, and type checking should catch the low-value review burden before it reaches a human. This isn't new advice, but AI-assisted codebases that skip it pay a steeper price because the surface area of potential issues is larger.
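As one concrete shape this can take, here's a minimal sketch of a gate script a CI step might run before a PR can request human review. It assumes a Python project with ruff, mypy, and pytest-cov installed and source under src/; the specific tools and the 80% coverage floor are illustrative placeholders, not recommendations from the METR or Sonar work.

```python
"""Minimal pre-review quality gate.

Illustrative sketch: assumes a Python project with ruff, mypy, and
pytest-cov installed, with source code under src/. Swap in whatever
linter, type checker, and coverage threshold your stack actually uses.
"""
import subprocess
import sys

# Each gate is a command that must exit 0 before a human sees the PR.
GATES = [
    ["ruff", "check", "src"],                        # lint
    ["mypy", "src"],                                 # type check
    ["pytest", "--cov=src", "--cov-fail-under=80"],  # tests + coverage floor
]


def main() -> int:
    for cmd in GATES:
        print(f"Running: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Gate failed: {' '.join(cmd)} - blocking review request")
            return result.returncode
    print("All gates passed - PR can be routed to a reviewer")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The point isn't the specific tools; it's that everything mechanical gets rejected before a senior engineer spends attention on it.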
Context documentation: When AI generates a non-trivial implementation, the developer who steered it should document the key design decisions in the PR description — not what the code does, but why this approach, and what alternatives were considered. This transfers the implicit understanding that would have come from writing the code yourself.
Graduated review responsibility: Not all PRs need senior review. Define what does — system interfaces, security-relevant code, database migrations, public APIs — and let everything else move through a lighter process. The failure mode to avoid is defaulting to "everything goes to a senior" when volume doubles.
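To make the graduation concrete, here is a rough sketch of path-based routing. The path patterns are hypothetical examples of what a team might flag as senior-review territory; most teams would encode the same policy in their review tooling or a CODEOWNERS file rather than a script, but the logic is the same: a small, explicit list of high-risk surfaces gets senior eyes, and everything else takes the lighter path.

```python
"""Sketch of graduated review routing (illustrative patterns only)."""
from fnmatch import fnmatch

# Changes matching any of these patterns require senior review.
SENIOR_REVIEW_PATTERNS = [
    "migrations/*",   # database migrations
    "api/public/*",   # public API surface
    "*/auth/*",       # security-relevant code
    "infra/*",        # system interfaces / deployment config
]


def review_level(changed_files: list[str]) -> str:
    """Return 'senior' if any changed file hits a high-risk pattern,
    otherwise 'standard' so the PR takes the lighter process."""
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in SENIOR_REVIEW_PATTERNS):
            return "senior"
    return "standard"


# A PR touching only feature code goes through the lighter process;
# a migration gets routed to a senior reviewer.
print(review_level(["services/search/ranking.py"]))     # standard
print(review_level(["migrations/0042_add_index.sql"]))  # senior
```

The list of patterns is the policy decision; keeping it short and explicit is what stops "everything goes to a senior" from becoming the default again.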
The deeper issue is that AI coding tools were adopted as a productivity intervention in isolation, without asking what the downstream effects would be on the people who have to verify the output. That question has an answer now. The answer is that review is the actual constraint, it doesn't scale automatically, and the teams that don't redesign for it are going to find out the hard way.
The METR number to hold onto isn't 21% more tasks completed. It's 91% more review time. That's where the leverage is.
If you've been tracking how AI tools are changing what senior engineers actually do, this post on AI code accountability covers what happens when the review process breaks down entirely. And for the broader picture on AI coding productivity — why the speed perception gap exists — this post on the productivity paradox lays out the underlying mechanism.
Photo by cottonbro studio via Pexels.