AI Writes the Code. You Review It. The Debt Ships Anyway.

304,000 commits. That's the dataset. MSR 2026's Mining Challenge tracked AI-generated code changes across real production repositories — not a controlled lab — and found something the velocity metrics leave out: 24% of the technical debt those commits introduced is still unresolved. Passed review. Passed CI. Sitting in production.
The conversation about AI-generated code fixates on output. GitHub's 2026 Octoverse reports AI writes 41% of commercial code. Stack Overflow's developer survey puts AI tool adoption at 76% of active engineers. These numbers circulate as proof the technology is working. What they don't measure: by February 2026, the repositories in the Mining Software Repositories study had accumulated over 110,000 unresolved issues introduced by AI coding agents.
The problem isn't that AI makes mistakes. Every developer makes mistakes. The problem is that AI makes convincing mistakes: code that looks correct, compiles cleanly, passes linters, and fails in ways that stay invisible until load, an edge case, or a library update exposes them.
The MSR 2026 Findings, Without the Hype
The Mining Software Repositories Conference's 2026 challenge paper, "Characterizing Self-Admitted Technical Debt Generated by AI Coding Agents," studied commits from teams using AI coding agents across real, production-facing repositories. Peer-reviewed research on actual codebases — not a developer sentiment survey.
The headline number — 24% of AI-introduced debt surviving to HEAD — matters less than what the study found beneath it. The debt isn't random. It clusters around three code areas: error handling, test coverage breadth, and external API interactions. These are, not coincidentally, the areas where code reviews tend to be shallowest — where a reviewer confirms a pattern is present without verifying the behavior is correct.
A separate arXiv analysis from March 2026 tracked the same repositories over time and found that AI-generated code accumulates debt faster than human-authored code, with a different failure signature that existing review practices weren't designed to catch. Stack Overflow's engineering blog put the dynamic plainly in January 2026: AI tools remove friction in exactly the places where friction happened to catch errors.
What AI-Generated Debt Actually Looks Like
This is what matters most and gets discussed least.
Human developers write bad code in recognizable patterns. Duplicated logic. Functions with six responsibilities. Magic numbers. Comments that describe what but not why. Reviewers know these patterns because they've written them. Code review evolved alongside the failure modes.
AI-generated debt looks different. It follows project conventions because it learned them from your codebase. It doesn't have structural problems that jump out. Three patterns recur in the MSR data:
Optimistic error handling. AI generates try/catch blocks that handle the right exception types but either swallow errors silently or log and continue when the correct behavior is to halt and escalate. The syntax is correct. The semantics fail under load. The reviewer sees "error handling present" and moves on — which is exactly the pattern the code was waiting for.
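A minimal sketch of the pattern. The `PriceClient`, `get_price`, and `sku` names are hypothetical stand-ins, not from the study; the point is the contrast between a handler that swallows and one that escalates:

```python
import logging

class PriceClient:
    """Hypothetical downstream dependency; raises when the service is down."""
    def get_price(self, sku):
        raise ConnectionError("pricing service unavailable")

def fetch_price_optimistic(client, sku):
    # The pattern the MSR data flags: right exception type, wrong semantics.
    try:
        return client.get_price(sku)
    except ConnectionError as exc:
        logging.warning("price lookup failed: %s", exc)
        return 0.0  # silent default: callers can't tell "down" from "free"

def fetch_price_strict(client, sku):
    # Halt and escalate: log with context, then let the caller decide.
    try:
        return client.get_price(sku)
    except ConnectionError:
        logging.exception("price lookup failed for sku=%s", sku)
        raise
```

Both versions pass a "catches ConnectionError" review check. Only the second one fails loudly enough to get fixed.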
Surface-level test coverage. AI writes tests that pass. They test function behavior on the inputs the AI was given — the happy path and obvious edge cases. They don't test the edge cases a developer with domain knowledge would invent. Coverage metrics look healthy. Actual coverage of failure modes is not.
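The gap is easiest to see side by side. A sketch using a hypothetical `apply_discount` helper: the generated test exercises the inputs present at generation time, while the reviewer-added test surfaces an unguarded case:

```python
def apply_discount(price, percent):
    """Hypothetical function under review."""
    return price * (1 - percent / 100)

# The test AI tools tend to generate: the inputs it was shown.
def test_happy_path():
    assert apply_discount(100.0, 50) == 50.0   # passes; coverage looks healthy

# The test a reviewer with domain knowledge invents: what does 150% off mean?
def test_unguarded_input():
    # Also "passes": a negative price quietly escapes into downstream code.
    assert apply_discount(100.0, 150) == -50.0
```

Both tests are green, so the coverage dashboard stays green. Only one of them tells you the function needs a guard.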
Stale API assumptions. AI-generated code calls external libraries with syntax learned from training data. If a library updated its contract since the training cutoff, the call compiles and works under normal conditions while silently using deprecated behavior that breaks on the next major version.
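A concrete instance of the shape this takes, from Python's own standard library: `datetime.utcnow()` was deprecated in Python 3.12 in favor of `datetime.now(timezone.utc)`, yet code generated from older training data still compiles and runs:

```python
from datetime import datetime, timezone

# The call older training data teaches: still runs, but deprecated on
# Python 3.12+ and returns a naive (timezone-unaware) timestamp.
stamp_old = datetime.utcnow()

# The current contract: an aware timestamp.
stamp_new = datetime.now(timezone.utc)

assert stamp_old.tzinfo is None          # naive: comparing it against an
assert stamp_new.tzinfo is timezone.utc  # aware datetime raises TypeError
```

Nothing here fails in review or CI under default settings. The breakage arrives later, as a TypeError in timezone arithmetic or a removal in a future major version.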
None of these fail a linter. All three are debt.
Why Code Review Misses Them
Code review is a human process calibrated for human failure modes.
When a reviewer encounters a try/catch block that catches the right exception type, their mental model registers "error handling present." That's the check. They confirm the pattern exists and move on. AI-generated code exploits precisely the heuristic the reviewer is applying.
The MSR paper documents a particularly sharp example: AI coding agents sometimes flag their own debt in comments. "TODO: handle the case where X returns null" appears in the AI-generated block. The reviewer reads the comment, notes the flag, approves the PR — assuming someone will follow up. By February 2026, 110,000 such notes were unresolved in the study repositories. Nobody followed up.
This isn't reviewer negligence. It's calibration failure. Review discipline was built for code written by humans with domain knowledge. When AI generates code that looks domain-aware but isn't, the review heuristics pass over the gap without registering it.
A Review Discipline Calibrated for AI
The fix isn't slower AI adoption. It's adjusting what reviewers specifically check in AI-generated code.
Three concrete shifts:
Edge case interrogation. For any AI-generated function, actively probe failure modes: what if this input is null? What if the downstream service is unavailable? What if this runs under concurrency? AI-generated code is optimistic — it handles the cases it was shown during generation. Reviewers need to supply the pessimism.
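The interrogation can be partly mechanical. A sketch against a hypothetical AI-generated helper, where a short probe list supplies the pessimism the generation lacked:

```python
def normalize_username(raw):
    """Hypothetical AI-generated helper: correct for the inputs it was shown."""
    return raw.strip().lower()

# Reviewer-supplied probes: null, empty, whitespace, non-ASCII, wrong type.
probes = [None, "", "   ", "ÅSA", 42]
failures = []
for p in probes:
    try:
        normalize_username(p)
    except (AttributeError, TypeError) as exc:
        failures.append((p, type(exc).__name__))

# None and 42 blow up (no .strip on NoneType or int). "" sails through and
# yields an empty username; whether that is valid is a domain question the
# generated code never asked.
```

Two probes crash, one produces a suspicious value, and none of this shows up in a review that only confirms the function exists and reads cleanly.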
Error path tracing. Follow every error path from AI-generated exception handling. Not "is there error handling?" — trace what happens when it actually throws. Does it propagate correctly? Recover to a valid state? Log enough context to debug? Silent failures are the most common pattern in the MSR data, and they're findable when you look.
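The trace can be written down as an executable check: instead of confirming that handling exists, assert the documented failure behavior. A sketch with hypothetical names (`charge`, `PaymentError`, `DownGateway` are illustrations, not from the study):

```python
class PaymentError(Exception):
    pass

def charge(gateway, amount):
    # Contract under review: on gateway failure, raise PaymentError with
    # enough context to debug; never return a success-shaped value.
    try:
        return gateway.charge(amount)
    except ConnectionError as exc:
        raise PaymentError(f"charge of {amount} failed: {exc}") from exc

class DownGateway:
    """Stub that simulates the downstream service being unavailable."""
    def charge(self, amount):
        raise ConnectionError("gateway unreachable")

# The review question, made executable: when it throws, what does the caller see?
try:
    charge(DownGateway(), 42)
except PaymentError as err:
    assert "42" in str(err)                            # context preserved
    assert isinstance(err.__cause__, ConnectionError)  # chain intact
```

If the equivalent assertions can't be written for an AI-generated handler, that is the review finding.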
API contract verification. Any external library call in AI-generated code should be checked against current documentation — not assumed correct because it compiles. AI training data has a cutoff. Library maintainers don't.
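Checking documentation doesn't fully automate, but one guard does: most Python libraries signal contract drift through `DeprecationWarning`, which default settings largely hide. A sketch of promoting it to a hard failure in CI; the `old_api` function is a stand-in for a real library call:

```python
import warnings

def old_api():
    # Stand-in for a library call whose contract moved on after the
    # model's training cutoff.
    warnings.warn("old_api() is deprecated; use new_api()", DeprecationWarning)
    return "legacy result"

# Under default filters the call "compiles and works" -- the trap.
result = old_api()

# CI-style configuration: the stale assumption fails loudly at test time.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        old_api()
        surfaced = False
    except DeprecationWarning:
        surfaced = True
```

Running the test suite with deprecations promoted to errors (for example, `python -W error::DeprecationWarning`) turns "works under normal conditions" into a reviewable signal instead of a time bomb.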
These checks take more time per PR than the standard pass. Less time than debugging a production incident six months from now, with 110,000 unresolved notes in the codebase and no clear record of when each one was introduced.
The Audit Question
If your team has used AI coding tools for six months or more, the MSR data makes a reasonable prediction: you've accumulated technical debt that passed review because the review process wasn't calibrated for how AI fails.
The useful audit isn't "find everything AI wrote and recheck it." Too slow. It's targeted: identify the files where AI contributed most code in the past six months. Focus on the three patterns above — error paths, test coverage of failure modes, API call currency. That's where the debt lives.
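One way to make the targeting concrete. This sketch ranks files by how often they appear in AI-attributed commits, under the assumption (label it loudly: an assumption, adjust to your tooling) that those commits carry a `Co-authored-by:` trailer. Feed it the output of `git log --since="6 months ago" --name-only`:

```python
from collections import Counter

def rank_ai_touched_files(log_text, marker="Co-authored-by:"):
    """Count file appearances in commits whose message contains `marker`.

    Rough sketch, not a full git-log parser: merge commits and unusual
    message formats may need extra handling.
    """
    counts = Counter()
    current_files = []
    is_ai = False
    for line in log_text.splitlines():
        if line.startswith("commit "):
            if is_ai:
                counts.update(current_files)   # flush the previous commit
            current_files, is_ai = [], False
        elif marker in line:
            is_ai = True                       # trailer found in the message
        elif line and not line.startswith((" ", "Author:", "Date:", "Merge:")):
            current_files.append(line.strip()) # unindented lines are paths
    if is_ai:
        counts.update(current_files)           # flush the last commit
    return counts.most_common()
```

The top of that ranking, cross-checked against the three patterns above, is the audit's starting worklist.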
A few hours of focused audit against the right checklist will find more than weeks of general code review passes. The debt that looks correct is harder to find than debt that looks wrong. Knowing the failure signature tells you where to start.
The question isn't whether to use AI coding tools. It's whether your team has a specific answer to: "what does AI-generated debt look like in our codebase, and where would we find it?"
Most teams don't have that answer yet.