15 May 2026 · Nick Finch
The line moved. The discipline did not.
Simon Willison admitted he has stopped reviewing every line of agent-written code, even on production work, and Anthropic shipped its answer the same day. Here is where the discipline actually lives now, and what is still missing.
On 6 May, Simon Willison, co-creator of the Django web framework and one of the most-read independent voices on LLMs, published “Vibe coding and agentic engineering are getting closer than I’d like.” For seven months he has been the loudest practitioner voice insisting that responsible, disciplined AI-assisted development, what is now called agentic engineering, is fundamentally different from vibe coding. He has been the keeper of that line. The 6 May post is him admitting the line is moving in his own work. As the coding agents get more reliable, he wrote, he is “not reviewing every line of code that they write anymore, even for my production level stuff.” He called it the normalisation of deviance, the idea that every time a model quietly writes the right code without close monitoring, the risk grows that he will trust it at the wrong moment and get burned. He ended in discomfort, not in prescription.
The same day, in San Francisco, Anthropic ran Code w/ Claude 2026 and shipped Claude Code Code Review. Boris Cherny, the creator of Claude Code, said it is now used by every team at Anthropic because reviews were the bottleneck. Code output per Anthropic engineer is up 200% this year. The first reviewer of a Claude Code pull request is now Claude Code.
Two signals, one problem, one day. The keeper of the distinction and the platform that hosts the agent admitted the same thing within hours of each other. Line-by-line human review is no longer where the work happens. The remaining question is what replaces it. Willison did not answer it. We have been grappling with the same discomfort for six months, and iterating on what to do about it.
The afternoon we stopped reading every line
For us the moment was Studio, November last year. Opus 4.5 had just shipped and the models were suddenly capable enough to build large pieces of code if they were well specified. We were generating code at a rate where reading every line was no longer practical. We got nervous about pushing it live.
The instinct in that moment is to retreat to handcraft. Slow down, read everything, treat the agent as a junior whose every commit you inspect. We did not do that. We asked a different question. How do we feel confident about what we are shipping without reading every line of it?
We built the automated pen test suite in an afternoon. OWASP ZAP and Nuclei, orchestrated through a Python script that handled the Cognito auth, the spidering, the active scans, the Nuclei templates, and produced one merged HTML report with remediation notes. A couple of weeks later Claude Code Security shipped and slotted in alongside it. We had crossed the line Willison just described. We had also worked out, by accident, where the new line should be.
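For concreteness, this is roughly the shape of that afternoon's script. It is a stripped-down sketch rather than the production code: the target URL, Cognito client id, and credentials are placeholders, and the auth-header injection and report merging are reduced to comments.

```python
# Sketch of the pen test orchestration, not the production script.
# Assumes a local ZAP daemon on :8080, nuclei on PATH, and placeholder
# Cognito and target identifiers.
import subprocess
import time

import boto3
from zapv2 import ZAPv2

TARGET_URL = "https://studio.example.com"          # placeholder
ZAP_PROXY = "http://127.0.0.1:8080"                # local ZAP daemon

zap = ZAPv2(apikey="changeme", proxies={"http": ZAP_PROXY, "https": ZAP_PROXY})


def cognito_id_token(username: str, password: str) -> str:
    """Log in against the Cognito user pool so the scans hit authenticated pages."""
    idp = boto3.client("cognito-idp")
    resp = idp.initiate_auth(
        ClientId="YOUR_APP_CLIENT_ID",             # placeholder
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    return resp["AuthenticationResult"]["IdToken"]


def zap_spider_and_scan() -> str:
    """Spider the target, run the active scan, return ZAP's HTML report."""
    scan_id = zap.spider.scan(TARGET_URL)
    while int(zap.spider.status(scan_id)) < 100:
        time.sleep(5)
    scan_id = zap.ascan.scan(TARGET_URL)
    while int(zap.ascan.status(scan_id)) < 100:
        time.sleep(10)
    return zap.core.htmlreport()


def nuclei_findings() -> str:
    """Run Nuclei against the same target and capture its output."""
    result = subprocess.run(
        ["nuclei", "-u", TARGET_URL, "-severity", "medium,high,critical"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    token = cognito_id_token("pentest-user", "********")
    # The real script injects the token into ZAP's requests (for example via
    # the replacer add-on) and merges both outputs into one HTML report with
    # remediation notes; this sketch just writes them to disk side by side.
    with open("zap-report.html", "w") as fh:
        fh.write(zap_spider_and_scan())
    with open("nuclei-findings.txt", "w") as fh:
        fh.write(nuclei_findings())
```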
Where the discipline went
The discipline did not disappear when we stopped reading every line. It moved. Some of it moved upstream, into the artefacts the agent reads on the way in. Some of it moved downstream, into the gates that verify what was built before it ships.
Upstream first. Every module in our codebase carries a structured header. Purpose, Responsibilities, Consumers, Dependencies, Notes. Every exported function carries a docstring with a summary, the parameters, the returns, the side effects, and the list of every caller across the project. The Consumers field is not a guess, it is verified against the imports with grep. We codified the standard into a /document skill the agent applies automatically to new code, and we told the agent to write the standard into its own instructions file so it enforces the rule the next time it opens the repo. None of this is documentation for human convenience. It is embedded context for the agent that makes the next change.
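To make the standard concrete, here is what it looks like on a module. This is a minimal sketch rather than an excerpt: the field set is the one described above, but the file names, callers, and dependencies are invented for illustration.

```python
"""retrieval/fusion.py

Purpose: Merge vector and keyword search results for the expert system.
Responsibilities: Rank fusion, relevance filtering, deduplication.
Consumers: pipeline/answer_builder.py, api/search_routes.py  (verified with grep)
Dependencies: retrieval/vector_store.py, retrieval/keyword_index.py
Notes: Thresholds in this module are tuned for fused scores, not raw cosine similarity.
"""


def fuse_results(vector_hits: list[str], keyword_hits: list[str]) -> list[str]:
    """Fuse the two ranked hit lists into one relevance-ordered list.

    Parameters: vector_hits and keyword_hits, each ordered best first.
    Returns: a single list of document ids ordered by fused score.
    Side effects: none.
    Callers: pipeline/answer_builder.py::build_answer,
             api/search_routes.py::search
    """
    ...
```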
We learned why that context matters the hard way. An early Studio build produced a dashboard designer that worked perfectly for the one sample dashboard in the specification and could not build anything else. Massive overfitting. The runtime barely called the language model at all. The model had drifted to a degenerate solution because it had no structural context about what the system was for. Two things fixed it. The models got better, and the documentation standard went in. The drift stopped. The header is not bureaucracy, it is the thing that stops the agent collapsing to whatever satisfies the nearest example.
Before any of that, the specification. We hand the model the spec against a clean context and ask it to tear it apart. Find the security holes, find the things we have not thought through, find the gaps. We do this four, five, six times, until the reservations the model raises feel like housekeeping rather than architecture. The discipline that used to live in line-by-line review now lives in the tear-down pass before a line of code exists.
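The mechanics of that loop are small. A minimal sketch, assuming the Anthropic Python SDK; the model name, prompt wording, and spec path are placeholders rather than the exact ones we use.

```python
# Sketch of the spec tear-down loop. Each pass is a fresh messages call, so
# the model never sees its previous objections: anything it raises again is
# a real problem, not an echo.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
spec = open("docs/studio-spec.md").read()   # placeholder path

TEAR_DOWN_PROMPT = (
    "Tear this specification apart. Find the security holes, the things "
    "we have not thought through, and the gaps. Be specific."
)

for pass_number in range(1, 7):
    response = client.messages.create(
        model="claude-sonnet-4-5",            # illustrative model name
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": f"{TEAR_DOWN_PROMPT}\n\n<spec>\n{spec}\n</spec>"}],
    )
    print(f"--- pass {pass_number} ---")
    print(response.content[0].text)
    # Stop when the reservations read as housekeeping rather than architecture.
```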
Here is the part that turns this from theory into evidence. In the OpenEdge expert system we use hybrid retrieval with reciprocal rank fusion. RRF produces small fused scores, so the relevance threshold has to be small, around 0.015, or it filters out everything. The specification said exactly that. The code did not. An agent refining the retrieval logic changed the threshold to a value typical of vector similarity, somewhere up around 0.65, because that is what looks right if you are reasoning about cosine distance and have not read the spec. It silently filtered out every result. We caught it in dev-environment testing.
The fix was not better review. We were not reading every line, and even if we had been, the new value looks perfectly reasonable in isolation. The fix was to move the constraint into the code, next to the value, with the explanation of the RRF mathematics sitting right there in the comment the agent reads before it touches the line. It has not recurred. The structural artefact now does the job line-by-line review used to do, and it does it for the agent, which is the thing that actually changes the file next.
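The shape of that fix is worth showing. A minimal sketch with illustrative names, not an excerpt from the OpenEdge codebase: the point is that the arithmetic sits in the comment directly above the constant, where the agent reads it before editing.

```python
RRF_K = 60  # conventional reciprocal rank fusion constant

# DO NOT raise this towards cosine-similarity territory (0.6 to 0.8).
# RRF scores are sums of 1 / (RRF_K + rank): a document ranked first by
# both retrievers scores only 2 / 61, roughly 0.033, and most useful hits
# score far less. A threshold around 0.015 keeps them; 0.65 filters out
# every result.
RELEVANCE_THRESHOLD = 0.015


def rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Fuse several ranked lists of document ids into one fused score per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def filter_relevant(scores: dict[str, float]) -> dict[str, float]:
    """Drop fused scores below the RRF-scale relevance threshold."""
    return {d: s for d, s in scores.items() if s >= RELEVANCE_THRESHOLD}
```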
Downstream, the gates. The pen test suite and the Claude Code Security scan both run before deployment. Both are still manual. We have not automated them, not because there is a reason to leave them manual, but because there has not been time.
The 6 May announcements made automating them cheaper than continuing to defer. Anthropic also shipped Claude Code Routines, scheduled or triggered agent sessions that wake, do the work, and land a pull request ready to merge. The gates we had been running by hand are exactly the kind of work a routine runs. That work is happening this quarter.
The bottleneck nobody named
The industry has a tidy phrase for this. Human oversight shifts from reviewing everything to reviewing what matters. Cherny’s framing was that review is the bottleneck because output is up 200%. The platform answer is to put an agent team in front of every pull request. That is the right answer to the review bottleneck. It is not the answer to the bottleneck most disciplined teams are actually hitting.
That bottleneck is cognitive load. When you produce artefacts at agent velocity, your mental model of what has been built cannot stay current. The documentation is correct. The system is working. The human cannot keep up with the rate at which working systems are being added. Before coding agents, I would spend months on a single detail and remember it for years. Now I spend a day, maybe a few hours, and two weeks later my memory of it is hazy. The code is correct. My recollection of why it is correct, and what assumptions it rests on, is not.
We have lived this one too. We designed and shipped several strategies for the platform to improve itself. Gap detection that generates interview requests when the knowledge base is thin. Expert feedback recorded as semantic patches against the vector database chunks. A living knowledge base that gets better the more it is used. We documented all of it in detail. The discipline was there.
The problem is not the documentation. The problem is that without surfaced metrics and user-facing elements showing the effects, a mechanism like that can quietly degrade and nobody notices. Not because anyone was careless. Because nobody is holding the whole system in their head closely enough to notice a slow drift in retrieval quality across a knowledge base that is patching itself. The documentation being correct does not help. We are producing it faster than anyone can read it, and even if we could read it, reading is not the same as watching the behaviour. A working system you have stopped watching is a future incident you have not met yet.
This is the half of the answer the practitioner discourse has not named. Willison named the discomfort. Cherny named the review bottleneck and shipped a fix for it. Nobody named the cognitive load bottleneck, and it is the one that bites disciplined teams hardest, precisely because they are the ones shipping working systems fast enough to lose track of.
Routines are not just a way to automate the gates we already run. They are the runtime for the measurement layer the human can no longer maintain at velocity. A weekly knowledge base health routine that connects to the repo and the database, pulls the metrics, flags every chunk patched more than twice, compares retrieval quality to last week, and lands the report as a pull request comment. An undocumented mechanism routine that runs on push to main and scans for new pipeline stages or modules with no corresponding documentation. A UI surface coverage routine that holds a registry of measured mechanisms and flags any that have measurement but no user-facing surface, because a mechanism nobody can see is a mechanism nobody will notice failing.
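These are small programs wrapped in scheduling, not platforms. As a flavour, here is a sketch of the core of the undocumented mechanism check, assuming the module header standard described earlier and a conventional src/ layout; the paths and exit-code convention are illustrative.

```python
# Core of the undocumented-mechanism check: flag any module whose header is
# missing one of the required fields. In a routine, a non-empty list becomes
# the body of a pull request comment.
from pathlib import Path

REQUIRED_FIELDS = ("Purpose:", "Responsibilities:", "Consumers:",
                   "Dependencies:", "Notes:")


def modules_missing_headers(root: str = "src") -> list[str]:
    """Return modules whose header lacks any of the required fields."""
    missing = []
    for path in Path(root).rglob("*.py"):
        # Crude check: the structured header should appear near the top of the file.
        head = path.read_text(encoding="utf-8")[:2000]
        if not all(field in head for field in REQUIRED_FIELDS):
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    offenders = modules_missing_headers()
    for module in offenders:
        print(f"undocumented mechanism: {module}")
    raise SystemExit(1 if offenders else 0)
```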
The discipline has not just moved. It has acquired a new layer. The measurement layer the humans cannot maintain at the pace of the build, the agents can.
Two answers, two bottlenecks
Anthropic’s same-day answer was Code Review, an agent team in front of every pull request. That is the right answer to the review bottleneck. Our answer is to move the discipline upstream into the artefacts and downstream into the routines that watch the system. That is the right answer to the cognitive load bottleneck. The two answers are not competitors. They address different points where the work used to happen and no longer can. A team can do both, and most should.
I want to be precise about what is still aspirational on our side. The gates are still manual. The routines are designed, not shipped. The cognitive load problem is named here more clearly than it is solved in our own stack. The argument is not that the inmydata pattern is finished. The argument is narrower and harder than that. The line has moved, the discipline has moved with it, and the practitioner who is still trying to hold the line at the level of “did I read every line” is defending a position that no longer exists. Better to admit where the discipline actually lives now and build for it.
Willison ended in discomfort. The platform answered the bottleneck it could see. The practitioner answer is to specify every interface, document every consumer, run the tear-down passes until the objections are housekeeping, gate the deployment, and then run the routines that watch the system you can no longer hold in your head. The line moved. The discipline did not. It is in a different place, doing a different job, for a different consumer. The teams that work this out will keep shipping. The teams that do not will meet the normalisation of deviance the way you always meet it, on the day it finally costs you something.