Measure what you'll promise, not what you typed

Our code output is up roughly fiftyfold this year. Fifty times the lines, twelve times the commits, measured straight off the repositories, not estimated. By the standard most teams reach for, that is a triumph.

It is also close to meaningless.

Lines of code is the metric that rewards lockfiles, generated migrations, scaffolding and the boilerplate an agent expands without thinking. Some of that fiftyfold is simply more projects started against a quiet base year. Most of it tells you nothing about whether we shipped more software, and nothing at all about whether any single piece of work got done faster.
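A minimal sketch of how much of a raw line count can be generated noise, assuming a list of per-file added-line counts already pulled from the repository. The glob patterns, file paths and numbers are all hypothetical, chosen only for illustration. Filtering narrows the noise; it does not turn lines of code into a measure of delivery.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files that are generated rather than written.
GENERATED_PATTERNS = ["*.lock", "*package-lock.json", "*/migrations/*"]


def is_generated(path: str) -> bool:
    """True if the path matches any of the generated-file patterns."""
    return any(fnmatchcase(path, pattern) for pattern in GENERATED_PATTERNS)


def split_added_lines(changes):
    """Sum added lines into generated and hand-written buckets.

    `changes` is a list of (path, lines_added) pairs.
    """
    totals = {"generated": 0, "hand_written": 0}
    for path, added in changes:
        key = "generated" if is_generated(path) else "hand_written"
        totals[key] += added
    return totals


# Illustrative figures only, not taken from any real repository.
CHANGES = [
    ("web/package-lock.json", 12000),
    ("app/migrations/0042_auto.py", 300),
    ("app/billing.py", 85),
]

totals = split_added_lines(CHANGES)
print(totals)  # {'generated': 12300, 'hand_written': 85}
```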

We tooled up this year. We did not necessarily speed up. And I think most teams claiming their developers are 20 percent faster are looking at the first number and feeling the second.

The studies stopped disagreeing

This year the evidence converged, and not where the tooling vendors hoped.

METR ran a randomised controlled trial in 2025. Sixteen experienced developers, working on repositories they had maintained for years, completed 246 real tasks with and without AI. They predicted the tools would make them 24 percent faster. Afterwards they believed they had been 20 percent faster. They were measured at 19 percent slower.

That gap between felt and measured is the thing nobody has explained away. METR has since redesigned the experiment and the slowdown signal weakened and widened, so treat the precise figure as a 2025 result rather than a current one. The gap between perception and measurement survived the redesign. The point estimate was never the point.

The lab finding was narrow. The organisational evidence is not. Faros AI’s telemetry, spanning thousands of developers, found the heaviest adopters merging far more pull requests while review time rose 91 percent and bug counts rose 9 percent. In the highest-adoption organisations, the share of pull requests merged with no review at all, human or agentic, climbed past 30 percent. Organisational delivery metrics did not move.

Individual velocity went up. Throughput did not. The work did not disappear, it moved downstream into review, integration and verification, and that stage was never sized for the new inflow.

The bottleneck moved, it did not vanish

A recent paper on what its authors call the productivity-reliability paradox frames this through the theory of constraints. Optimise a step that was never the bottleneck, writing code, and system throughput does not move while the real constraint runs at human speed. Its argument is that writing code is no longer the primary bottleneck. Governing it is.
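The theory-of-constraints point can be put in a few lines. This is a sketch under a simplifying assumption, that delivery is a chain of stages and the chain moves at the pace of its slowest stage. The stage names and capacities are hypothetical, not figures from the paper or from any study cited here.

```python
def throughput(stages: dict[str, float]) -> float:
    """System throughput is capped by the slowest stage."""
    return min(stages.values())


def bottleneck(stages: dict[str, float]) -> str:
    """Name of the stage with the lowest capacity."""
    return min(stages, key=stages.get)


# Hypothetical capacities, in changes per day.
before = {"write": 10.0, "review": 6.0, "integrate": 8.0}
after = {**before, "write": 50.0}  # writing gets five times faster

print(throughput(before), bottleneck(before))  # 6.0 review
print(throughput(after), bottleneck(after))    # 6.0 review
```

Writing was never the minimum, so speeding it up leaves the output unchanged and the queue in front of review longer.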

The 2025 DORA report, the largest study in the field, lands in the same place and gives the cleanest frame. AI is an amplifier. It does not improve delivery on its own, it multiplies the engineering conditions you already have. Disciplined teams pull ahead. Fragmented teams accelerate into instability.

I wrote in May that when you stop reading every line, the discipline does not vanish, it relocates, upstream into specification and downstream into gates. The data has now caught up with that claim. The bottleneck relocated, and the studies show where it went. You can read that earlier argument in The line moved. The discipline did not.

Why fiftyfold output did not become fiftyfold chaos

Here is where our own numbers get interesting, and where the honest version of them helps rather than embarrasses.

If you looked at our repositories you would notice something that ought to be a scandal. Most of them do not use pull request review. Of around thirty active repositories, fewer than a third route work through pull requests, and where they do, the same person often opens and merges the branch. By the convention that review means a second human reading the diff, we barely review our own code.

That is not negligence. It is the whole point.

If our discipline lived in pull request review, the git history would show review. It does not. The gate lives somewhere git cannot see it, and it is automated.

Every push triggers a pipeline that runs Claude Code Security and an automated code review before anything can ship. Whatever fails blocks the deploy and has to be put right first. There is no route to production that skips the check, and no human standing at the gate deciding to wave a change through.

Past that, the code carries its own context. Every module we ship has a header recording its purpose, responsibilities, consumers and dependencies. Every exported function lists its callers, verified against the imports. That embedded context is what keeps the next agent’s pass cheap instead of expensive, which is the direct answer to a review stage the studies found getting slower under load.
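One way such a header could be checked against the imports is sketched below. The header layout, module names and function name are all hypothetical; the post does not describe the real format or tooling. The sketch only handles `from module import name` and is meant to show the shape of the check, not a complete one.

```python
import ast


def imports_name(source: str, module: str, name: str) -> bool:
    """True if `source` contains `from <module> import <name>`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module == module:
            if any(alias.name == name for alias in node.names):
                return True
    return False


def unverified_callers(declared, sources, module, name):
    """Declared callers whose source does not actually import the function."""
    return [
        caller
        for caller in declared
        if not imports_name(sources.get(caller, ""), module, name)
    ]


# Hypothetical module header, held here as data.
HEADER = {
    "purpose": "Rank retrieved passages for a query",
    "responsibilities": ["score passages", "apply the relevance threshold"],
    "consumers": ["api.search", "jobs.reindex"],
    "dependencies": ["embeddings"],
}

# Hypothetical source text of the declared consumers.
SOURCES = {
    "api.search": "from ranking import score_passages\n",
    "jobs.reindex": "import os\n",
}

stale = unverified_callers(HEADER["consumers"], SOURCES, "ranking", "score_passages")
print(stale)  # ['jobs.reindex']
```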

Then the tests. Our retrieval platform ships behind 243 of them, including isolation tests that block the release outright if a tenant boundary leaks. When an agent recently changed a relevance threshold from 0.015 to around 0.65, because the value read to it like a cosine distance, it silently filtered out every result. No human caught it in a diff. The test suite caught it, in development, before it reached anyone.
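The kind of test that catches this is small. A minimal sketch, assuming a filter that keeps results at or above a threshold; the function names, documents and scores are hypothetical, and only the two threshold values come from the incident above.

```python
def filter_by_relevance(scored, threshold):
    """Keep (doc, score) pairs whose score meets the threshold."""
    return [(doc, score) for doc, score in scored if score >= threshold]


# Hypothetical scores for a query known to have good matches.
# The scale is small, which is why 0.65 is far out of range for it.
SCORED = [("doc-a", 0.031), ("doc-b", 0.022), ("doc-c", 0.009)]


def check_known_query(threshold):
    """Fail loudly if a known-good query comes back empty."""
    kept = filter_by_relevance(SCORED, threshold)
    assert kept, f"threshold {threshold} filtered out every result"
    return kept


print(len(check_known_query(0.015)))  # 2

try:
    check_known_query(0.65)
except AssertionError as error:
    print(error)  # threshold 0.65 filtered out every result
```

The assertion says nothing about which threshold is right. It only refuses to let an empty result set for a known-good query pass silently, which is all the gate needs to do.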

Every deploy to staging then gets a penetration test, and that one I will be honest about. It is manual today and narrower than I would like, as I have written about in The Friday afternoon pen test. The engineering to automate it is in motion, and the next piece we are building is the part that matters. The models will build scripted authorisation tests from the specification itself, checking whether a low-privilege user can reach a high-privilege action, and those tests will run automatically and gate the deploy like the rest.
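A sketch of what a specification-derived authorisation test might look like, assuming the specification can be reduced to a map from action to minimum role. Everything here is hypothetical, including the roles, the actions and the `is_allowed` stand-in; a real test would call the deployed service rather than a local function.

```python
# Hypothetical specification: action -> minimum role required.
SPEC = {
    "view_report": "viewer",
    "edit_report": "editor",
    "delete_tenant": "admin",
}
RANK = {"viewer": 0, "editor": 1, "admin": 2}


def is_allowed(role, action):
    """Stand-in for the system under test."""
    return RANK[role] >= RANK[SPEC[action]]


def escalation_cases(spec, rank):
    """Every (role, action) pair where the role sits below the required level."""
    return [
        (role, action)
        for action, required in spec.items()
        for role in rank
        if rank[role] < rank[required]
    ]


def failed_cases(cases, allowed):
    """Cases where a low-privilege role was let through."""
    return [(role, action) for role, action in cases if allowed(role, action)]


cases = escalation_cases(SPEC, RANK)
failures = failed_cases(cases, is_allowed)
print(len(cases), failures)  # 3 []

# A deliberately leaky implementation shows the gate firing.
leaks = failed_cases(cases, lambda role, action: True)
print(len(leaks))  # 3
```

A non-empty `failures` list is what would block the deploy.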

That is the gate. Structural, automated, layered from the push through to the staging deploy, and it is what let fiftyfold code volume through without fiftyfold chaos behind it.

The automated pipeline gets code to staging in under forty minutes, more than once on a typical working day. It was built this year, so treat that as evidence the gate runs hot, not evidence we got faster. I am holding our own numbers to the same bar I am asking you to apply.

None of this is an argument against agentic coding. We delegate almost all of our coding to agents, and we are not going back. The argument is narrower and harder than that. The speed is only real if the gate is already there to absorb what the speed produces.

What the gate actually buys you

Here the effect goes somewhere the studies do not measure. A gate you trust does more than catch bugs. It changes what you are willing to commit to.

When you know your downstream verification will catch the drift, you stop padding estimates. You stop hedging scope. You say yes to work you would have declined a year ago. That confidence to commit is throughput showing up where it actually matters, not in keystrokes saved but in what you are prepared to promise a client.

The industry is busy counting the first thing. The thing that moved is the second.

The proof git cannot hold

So how would I back a claim that we deliver more, having just spent several paragraphs dismantling our own metrics?

Not with a number scraped from a repository. With a contract.

We recently delivered two fixed-price statements of work for a database-tooling client. We scoped both at a fraction of what we would have estimated a year ago, and we delivered under the cap on both. The client has since handed us a larger piece of work.

That client is the cleanest example, not the only one. We have delivered for other customers this year on the same footing, scoped tighter than we once would have dared and delivered inside the estimate.

I want to be precise about which half of that is felt and which is measured, because that distinction is the whole post. “Scoped at a fraction” is a judgement. It compares our estimate now against a hypothetical estimate we would have given a year ago, and that is the soft, subjective part. “Delivered under the fixed-price cap, twice, and the client came back with more” is the hard part. You do not come in under a fixed-price cap twice by feel.

That is the like-for-like comparison the git history could not give me. It does not live in the commit layer. It lives in the commercial layer, and that was always the only place a real before-and-after was going to be found.

The company’s financial position bears it out. I will not claim revenue proves we are faster, because revenue moves for reasons that have nothing to do with engineering. What the financials do is rule out the catastrophe. If our gate had buckled under fifty times the code volume, we would be drowning in rework and missing dates, not delivering ahead of them. The numbers do not prove the good story. They rule out the bad one.

Where this does not apply, and where it does

One honest limit. On small or greenfield work the downstream queue does not exist yet, so the velocity gain is real and you should enjoy it. The relocation bites hardest on mature systems with real review and integration load. The discipline scales down to fit the work. It does not disappear.

There is a flip side, and we are living it. We carry a large legacy product, our original analytics platform, where this velocity is hard to reach. The problem it solves, we would solve very differently today. It was built for a human reading dashboards, and that assumption has dated.

So we are re-engineering it, and the interesting part is what we are not touching. The backend stays. A data layer fast enough to summarise hundreds of thousands of rows in under a second, a row- and column-level security framework, and a metadata layer built to present data in a form a model can understand. We built all three for an analytics product years ago. They turn out to be exactly the foundation an agentic system needs: speed for a conversational interface, security for an agent acting on someone’s behalf, and structure an LLM can read.

What we are rebuilding is the ingestion, now agent-driven, and what sits on top, data presented to agents rather than served as pure analytics. A year ago that scope would have been reckless. With agentic velocity and a gate we trust to catch what the velocity produces, it is now the easier and better path. That is the confidence to commit turned inward. The same trust in the gate that lets us promise more to a client now lets us take on our own codebase at a scale that used to be unthinkable.

This is the call I made in decades of business logic, drawn inside a single product rather than around it. The line was never old versus new. It is durable versus dated. You keep what still earns its place, even when it is years old, and you re-engineer what was built for a world that has moved on.

Everywhere else, the lesson holds. Your tools will count every line you produced and stay completely blind to whether you delivered. No velocity chart or deployment-frequency dashboard will show you the number that matters. It turns up as a signature on a statement of work.

So before you buy another coding seat to make your developers feel faster, look at what they are willing to promise. Measure what you will commit to and deliver, not what you typed. That is the only figure that was ever the point.
