
Nobody can review a hundred agents

Every tool in the coding agent space is racing toward the same number, and the number is "more." More agents, more parallel sessions, more terminals in a grid. The marketing has quietly settled on a hundred as the magic figure, a hundred agents working for you at once, a personal engineering org in a desktop app.

I've been using these tools daily, and I can tell you exactly where the dream falls apart for me. It's not at a hundred agents. It's at about five. Not because the tools can't spawn more terminals. Spawning terminals is the easy part. It falls apart because I'm the one who has to read everything they produce, and I do not scale. Nobody's reviewer does.

The bottleneck moved and the tools didn't notice

For the entire history of software, the expensive part was writing the code. Typing it, debugging it, getting it to compile, getting the tests green. Review was the cheap step at the end, a colleague skimming your diff and leaving two comments. Every process we have is built on that cost structure.

Agents flipped it. Producing a plausible thousand-line diff now costs a few minutes and a few dollars. Verifying that the diff is actually correct, safe, and something you want to own in production still costs what it always cost: a competent human's full attention, which is the one resource that didn't get cheaper. When the cost of production collapses and the cost of verification doesn't, the verification step becomes the entire business. The agent tools are optimizing the step that stopped mattering.

[Figure: code produced grows with the number of agents while human review capacity stays flat; the gap between the lines is code merged without real review. Axes: number of agents running (x), work per day (y). Lines: "code the agents produce" and "code you can actually review". Gap label: "this gap gets merged on vibes".]
The blue line is what the tools advertise. The dark line is you. The red zone is where incidents come from.
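
If you want the chart as arithmetic, here's a toy model. Every number in it is an assumption I picked for illustration, not a measurement: each agent produces 1,000 lines of diff a day, and one human can carefully review 5,000. The function name and parameters are made up for this sketch.

```python
def unreviewed_lines(agents: int,
                     lines_per_agent: int = 1000,   # assumed output per agent per day
                     review_capacity: int = 5000    # assumed lines one human can really read per day
                     ) -> int:
    """Lines per day that get merged without real review."""
    produced = agents * lines_per_agent          # grows linearly with agent count
    return max(0, produced - review_capacity)    # capacity is flat, so the rest is the gap


for n in (1, 5, 20, 100):
    print(n, "agents ->", unreviewed_lines(n), "unreviewed lines/day")
```

Under those assumptions the gap is zero up to five agents and grows by a full agent's output for every agent after that. Change the constants and the crossover moves, but the shape doesn't.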

What five agents actually feels like

Let me describe the real experience, because the demos never show it. You've got five sessions going. Agent one finishes and wants a review. While you're reading its diff, agents two and three finish. Agent four asks a clarifying question that stalls it until you answer. Agent five went sideways twenty minutes ago and has been confidently building the wrong thing since, which you'd know if you'd been watching, but you were reading agent one's diff.

You are now context-switching between five half-finished mental models of five different changes. Every switch costs you the thing review actually needs, which is a loaded picture of what the change is supposed to do. Somewhere around the third diff of the hour, something in your brain quietly downshifts. You stop reading the code and start skimming the shape of it. Green tests, plausible file names, the summary sounds right. Approve. You've stopped being a reviewer and become a person who presses the approve button at a certain rhythm.

That downshift is the most dangerous moment in this entire workflow, and no tool I've used even tries to detect or prevent it. The tools measure agents running. Nobody measures reviewer attention, which is the actual scarce resource in the whole system.

Merging on vibes is how you lose your codebase

Here's the compounding problem. Code you merged without really reading is code nobody understands. Not the agent, which has no memory of it by tomorrow. Not you, because you skimmed it. So the next task builds on foundations nobody's actually inspected, and the agents, which are magnificent yes-and machines, will happily extend whatever weirdness is already there. A few weeks of vibes-merging and you have a codebase where the answer to "why is it built this way" is that nobody decided, it accreted.

Then something breaks in production and you're debugging code with your name on the merge commit that you are reading, genuinely, for the first time. Every engineer who's used these tools hard has felt some version of that cold moment. It's the moment the productivity was revealed to be a loan, and review was the interest you didn't pay.

The honest number versus the marketing number

To their credit, some of the people building these tools admit the gap when they're being candid. The comfortable number today is a handful of agents, five-ish, seven maybe, and the hundred-agent stuff is a goal, not a description of the present. I don't think that's lying, I think it's a roadmap wearing a t-shirt. But it matters, because the two numbers imply opposite products. If the real number is five, the product should be about making five agents' output trivially safe to absorb. If the number is a hundred, the product is a grid of terminals and good luck. Everyone's building the second product while living the first number.

And the thing is, the constraint was never the software. A hundred terminals is a for loop. The constraint is that a hundred agents produce more decisions per hour than a human can make with integrity. Until something absorbs that decision load, adding agents past your review capacity doesn't add output. It adds unreviewed code, which is worse than no code, because it looks like progress.

Review is a product problem, and nobody's building it

What does the review experience look like in these tools today? A diff viewer. Split or unified, stage, commit. That's roughly the same experience I'd get reviewing a human's PR, except the volume is 10x and my familiarity with the change is zero, because I didn't watch it get written and I can't grab the author for coffee. The tools scaled production 10x and shipped the same review surface as before. That's the whole indictment, honestly.

Meanwhile the agent that made the change knows everything a reviewer needs and throws it away. It knows what it tried and abandoned. It knows which part it was unsure about. It knows which three files carry the actual decision and which seventeen are mechanical fallout. All of that dies in the session log, and I get handed twenty raw files, cold.

[Figure: review today hands you one giant raw diff; a review layer that scales hands you a summary, the risky parts, test evidence, and the raw diff last. Left panel, "review today": one raw diff, 20 files, cold, good luck; effort lives with the human. Right panel, "review that scales": what changed and why, in prose; the 3 files that matter, first; "here's what I wasn't sure about"; raw diff, last, if you want it; effort lives with the agent.]
The agent already knows all of this. Today it throws it away and hands you the pile.

What I actually want from a review layer

Start with the summary, written by the agent, in prose: here's what I changed, here's why, here's the approach I rejected and the reason. Then the decision surface: these three files carry the real logic, read them first, the rest is renames and imports. Then the confession: here's the part I wasn't confident about, look here hardest. Then evidence instead of assertions: tests it wrote and ran, and for anything with a UI, a preview environment already spun up so I can click the thing instead of imagining it. The raw diff stays available at the bottom for when I want to go deep. Reviewing should feel like reading a well-prepared brief, not like doing an autopsy.
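
As a data structure, that brief might look something like the sketch below. This is not any existing tool's format. `ReviewBrief`, its field names, the file paths, and the preview URL are all hypothetical, chosen only to show the ordering: prose first, raw diff last.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewBrief:
    """Hypothetical shape of what an agent hands a reviewer."""
    summary: str                  # what changed and why, in prose
    rejected_approach: str        # what it tried and abandoned, and the reason
    decision_files: list[str]     # the few files that carry the real logic
    mechanical_files: list[str]   # renames, imports, other fallout
    unsure_about: list[str]       # the confession: look here hardest
    test_evidence: list[str]      # tests it wrote and ran
    raw_diff: str                 # still available, but last
    preview_url: Optional[str] = None  # only for changes with a UI

    def render(self) -> str:
        # Section order is the whole point: cheapest-to-read context first.
        sections = [
            ("Summary", self.summary),
            ("Rejected approach", self.rejected_approach),
            ("Read these first", "\n".join(self.decision_files)),
            ("Mechanical fallout", "\n".join(self.mechanical_files)),
            ("Not confident about", "\n".join(self.unsure_about)),
            ("Evidence", "\n".join(self.test_evidence)),
        ]
        if self.preview_url:
            sections.append(("Preview", self.preview_url))
        sections.append(("Raw diff", self.raw_diff))
        return "\n\n".join(f"{title}\n{body}" for title, body in sections)


# Made-up example values, for illustration only.
brief = ReviewBrief(
    summary="Moved retry logic out of the HTTP client into a wrapper.",
    rejected_approach="Retrying inside the client; it hid timeouts from callers.",
    decision_files=["retry.py", "client.py", "config.py"],
    mechanical_files=["tests/conftest.py", "docs/usage.md"],
    unsure_about=["Backoff cap in retry.py may be too low for batch jobs."],
    test_evidence=["test_retry.py: 12 passed"],
    raw_diff="diff --git a/retry.py b/retry.py ...",
    preview_url="https://preview.example.invalid/pr-123",
)
text = brief.render()
print(text)
```

None of these fields require the agent to learn anything new. It's the stuff that currently dies in the session log, written down in the order a reviewer would want to read it.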

And not every change deserves the same ceremony. A dependency bump with green tests and no API changes doesn't need what a schema migration needs. Review effort should scale with blast radius, and the system should know the difference, with a real audit trail either way and rollback that's one action, not a scavenger hunt. Give me that, and my honest capacity goes from five agents to maybe twenty. Not because I read faster, but because each review costs a fraction of the attention it costs today.
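
A minimal sketch of what "scale with blast radius" could mean in practice. The `Change` type, the `kind` labels, and the three tier names are assumptions I made up for this example. A real system would need a much richer notion of risk than three fields.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical description of a change an agent wants merged."""
    kind: str          # e.g. "dependency_bump", "schema_migration", "feature"
    tests_green: bool
    api_changed: bool


def review_tier(change: Change) -> str:
    """Pick how much ceremony a change gets, based on its blast radius."""
    # Schema migrations are hard to roll back: always the full treatment.
    if change.kind == "schema_migration":
        return "full"
    # A dependency bump with green tests and no API changes is low risk.
    if change.kind == "dependency_bump" and change.tests_green and not change.api_changed:
        return "light"
    # Anything else gets the normal brief-based review.
    return "standard"


print(review_tier(Change("dependency_bump", tests_green=True, api_changed=False)))
print(review_tier(Change("schema_migration", tests_green=True, api_changed=False)))
```

The design choice that matters is the default: anything the rules don't recognize as low risk falls through to a real review, never to the light one.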

The pretend version of this problem

I'll say the quiet part: a lot of the hundred-agent talk is aspiration marketing, and chasing it has the same smell as startups that build for a billion users before they have a hundred. The fantasy of scale gets prioritized over the experience of the actual, present-tense user, who is one person, with one brain, trying to safely absorb the output of five sessions before dinner. Make that person's life dramatically better and the bigger numbers follow, because review capacity is the thing that was capping them all along.

Where this leaves the whole category

This is the third piece I've written about these tools, and the three problems are really one problem. The tools scope agents to a repo instead of a product, they scope agent lifetimes to your laptop lid, and they scope success to how many agents are running instead of how much work a human can trust and absorb. In each case the tool optimized the thing that was easy to build, and the actual unit of value, the task, the outcome, the reviewed and trusted change, got left as an exercise for the user.

The agents are ready for more than this. The models get better every quarter, and none of what I've described is a model problem. It's a tooling problem, which is good news, because tooling problems are just decisions someone hasn't made yet. I'm spending a lot of my time thinking about what this category looks like when someone makes them. If you're feeling these same walls, I'd genuinely love to hear from you.
