The 80% Problem in Agentic Coding
Managing comprehension debt when leaning on AI to code

Andrej Karpathy said something this week that made me pause:
“I rapidly went from about 80% manual+autocomplete coding and 20% agents to 80% agent coding and 20% edits+touchups. I really am mostly programming in English now.”
The inversion happened over a few weeks in late 2025. While this may apply more to new (greenfield) or personal projects than to existing or legacy apps, I suspect AI still takes you further than it did a year ago. You can thank improving models, specs, skills, MCPs and workflows.
Boris Cherny, creator of Claude Code, has recently echoed similar sentiments:
“Pretty much 100% of our code is written by Claude Code + Opus 4.5. For me personally it has been 100% for two+ months now, I don’t even make small edits by hand. I shipped 22 PRs yesterday and 27 the day before, each one 100% written by Claude. I think most of the industry will see similar stats in the coming months - it will take more time for some vs others.”
Some time ago I wrote about “the 70% problem” - where AI coding took you to 70% completion, then left the final 30% - the last mile - to humans. That framing is now evolving. The percentage may shift to 80% or higher for certain kinds of projects, but the nature of the problem has changed more dramatically than the numbers suggest.
Armin Ronacher’s poll of 5,000 developers complements this story: 44% now write less than 10% of their code manually. Another 26% are in the 10-50% range. We’ve crossed a threshold. But here’s what the triumphalist narrative misses: the problems didn’t disappear, they shifted. And some got worse.
I want to caveat: I’ve definitely felt the shift to 80%+ agent coding on new side projects. It’s very different in large or existing apps, especially where teams are involved. Expectations differ, but this is a taste of where we’re headed.
The mistakes changed
AI errors have evolved from syntax bugs to conceptual failures - the kind a sloppy, hasty junior might make under time pressure.
Karpathy catalogs what still breaks:
“The models make wrong assumptions on your behalf and run with them without checking. They don’t manage confusion, don’t seek clarifications, don’t surface inconsistencies, don’t present tradeoffs, don’t push back when they should. They’re still a little too sycophantic.”
Assumption propagation: The model misunderstands something early and builds an entire feature on faulty premises. You don’t notice until you’re five PRs deep and the architecture is cemented. It’s a two-steps-back pattern.
Abstraction bloat: Given free rein, agents can overcomplicate relentlessly. They’ll scaffold 1,000 lines where 100 would suffice, creating elaborate class hierarchies where a function would do. You have to actively push back: “Couldn’t you just...?” The response is always “Of course!” followed by immediate simplification. They’re optimizing for looking comprehensive, not for maintainability.
Dead code accumulation: They often don’t clean up after themselves. Old implementations linger. Comments get removed as side effects. Code they don’t fully understand gets altered anyway because it was adjacent to the task.
Sycophantic agreement: They don’t always push back. No “Are you sure?” or “Have you considered...?” Just enthusiastic execution of whatever you described, even if your description was incomplete or contradictory.
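To make the abstraction bloat pattern concrete, here’s a hypothetical sketch (all names invented for illustration): both versions read a setting from a dict with a fallback, but one is what an eager agent might scaffold and the other is what the task actually needed.

```python
from abc import ABC, abstractmethod

# --- What an eager agent might scaffold ---

class ConfigSource(ABC):
    """Abstract base class for pluggable config backends (only one ever exists)."""
    @abstractmethod
    def get(self, key: str):
        ...

class DictConfigSource(ConfigSource):
    """Wraps a plain dict behind the abstract interface."""
    def __init__(self, data: dict):
        self._data = data

    def get(self, key: str):
        return self._data.get(key)

class ConfigResolver:
    """Walks a list of sources and falls back to a default."""
    def __init__(self, sources: list, default=None):
        self._sources = sources
        self._default = default

    def resolve(self, key: str):
        for source in self._sources:
            value = source.get(key)
            if value is not None:
                return value
        return self._default

# --- What the task actually needed ---

def get_setting(config: dict, key: str, default=None):
    """Read one setting with a fallback."""
    return config.get(key, default)
```

For a single dict the two are behaviorally identical; the class hierarchy only pays off if multiple backends ever materialize, which is exactly the “Couldn’t you just...?” conversation.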
It’s possible to mitigate some of this via Skills if you know what to watch for.
These failure modes otherwise persist despite system prompts, despite CLAUDE.md instructions, despite plan mode. They’re not bugs to be fixed - they’re sometimes inherent to how these systems work.
Agents optimize for coherent output, not for questioning your premises.
I've watched this happen on my own teams - code that looks right in review but breaks three commits later when someone touches an adjacent system.
If you’re data-minded, recent survey data suggests a “verification bottleneck” has emerged: only 48% of developers consistently check AI-assisted code before committing it, even though 38% find that reviewing AI-generated logic actually requires more effort than reviewing human-written code. We’re generating correct code faster, but may be accumulating technical debt even faster.
Comprehension debt: a hidden cost we don’t track
Generation (writing code) and discrimination (reading code) are different cognitive capabilities. You can review code competently even after your ability to write it from scratch has atrophied. But there’s a threshold where “review” becomes “rubber stamping.”
Jeremy Twei coined the perfect term for this: comprehension debt. It’s certainly tempting to just move on when the LLM one-shotted something that seems to work. This is the insidious part. The agent doesn’t get tired. It will sprint through implementation after implementation with unwavering confidence. The code looks plausible. The tests pass (or seem to). You’re under pressure to ship. You move on.
Over time, you may understand less of your own codebase.
I caught myself doing this last week. Claude implemented a feature I’d been putting off for days. The tests passed. I skimmed it, nodded, merged. Three days later I couldn’t explain how it worked.
Yoko Li captured the addiction loop perfectly:
“The agent implements an amazing feature and got maybe 10% of the thing wrong, and you’re like ‘hey I can fix this if I just prompt it for 5 more mins.’ And that was 5 hrs ago.”
You’re always almost there. The final 10% feels tantalizingly close. Just one more prompt. Just one more iteration. The psychological hook is real.
Someone else put it differently:
“I spend most of my time babysitting agents. The AGI vibes are real, but so is the micromanagement tax. You’re not coding anymore, you’re supervising. Watching. Redirecting. It’s a different kind of exhausting.”
The dangerous part: it’s trivially easy to review code you can no longer write from scratch. If your ability to “read” doesn’t scale with the agent’s ability to “output,” you’re not engineering anymore. You’re hoping.
The productivity paradox: More code, same throughput
Individual output surged 98% in high-adoption teams, but PR review time increased by as much as 91%.
The data from Faros AI and Google’s DORA report are interesting:
Teams with high AI adoption merged 98% more PRs
Those same teams saw review times balloon 91%
PR size increased 154% on average
Code review became the new bottleneck
Atlassian’s 2025 survey found the paradox in stark terms: 99% of AI-using developers reported saving 10+ hours per week, yet most reported no decrease in overall workload. The time saved writing code was consumed by organizational friction - more context switching, more coordination overhead, managing the higher volume of changes.
We got faster cars, but the roads got more congested.
We're producing more code but spending more time reviewing it. The bottleneck just moved. When you make a resource cheaper (in this case, code generation), consumption increases faster than efficiency improves, and total resource use goes up.
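The induced-demand logic can be shown with a toy calculation. The numbers below are assumptions for illustration, not drawn from the Faros AI or DORA figures above: generation gets 5x cheaper and PR volume doubles, but larger AI-written PRs take twice as long to review.

```python
def team_hours(prs: int, write_h: float, review_h: float):
    """Engineer-hours a team spends writing vs. reviewing a week's PRs."""
    return prs * write_h, prs * review_h

# Before agents: 20 PRs/week, 3h to write each, 1h to review each.
write_before, review_before = team_hours(20, 3.0, 1.0)   # (60.0, 20.0)

# After agents (hypothetical): 2x the PRs, writing is 5x cheaper,
# but bigger PRs take 2x as long to review.
write_after, review_after = team_hours(40, 0.6, 2.0)     # (24.0, 80.0)

# Total hours rise from 80 to 104 even though writing got cheaper:
# review is now the dominant cost, i.e. the bottleneck moved.
```

Under these made-up numbers, review goes from a quarter of the effort to over three-quarters of it - cheaper generation didn’t reduce total work, it relocated it.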
We’re not writing less code. We’re writing vastly more code, and someone still has to understand much of it. There are, of course, developers who feel that understanding shouldn’t be necessary anymore if AI can do it for us.
Where the 80/20 split actually works
The 80% threshold is most accessible in greenfield contexts where you control the entire stack and comprehension debt stays manageable through small team size.
This actually works in a few contexts.
Personal projects where you control everything
MVPs where “good enough” is actually good enough
Startups in greenfield territory without legacy constraints
Teams small enough that comprehension debt stays manageable
In these environments, the agent’s weaknesses matter less. You can scaffold rapidly, refactor aggressively, throw away code without political friction. The pace of iteration outweighs occasional misdirection.
In mature codebases with complex invariants, the calculus inverts. The agent doesn’t know what it doesn’t know. It can’t intuit the unwritten rules. Its confidence scales inversely with context understanding.
Someone pointed out the obvious thing I was tiptoeing around: the first 90% might be easy, but the last 10% can take a long time. 90% accuracy is fine for non-mission-critical stuff. For the parts that actually matter, it's nowhere close. Self-driving cars work great until they don't, and that's why L2 is everywhere but L4 is still mostly vaporware.
For non-engineers, the wall is lower but still real. Tools like AI Studio, v0 and Bolt can turn sketches into working prototypes instantly. But hardening that prototype for production - handling real user data at scale, ensuring security and compliance - still requires engineering fundamentals. AI gets you 80% to an MVP; the last 20% requires patience, learning deeply or hiring engineers.