28 May 2026 · Nick Finch
Your AI can't add up. So we stopped asking it to.
The same model that planned, built, and shipped a cross-service feature before lunch could not reliably total a column of numbers. That contradiction tells you exactly where the reliability of an AI system actually lives.
This morning I started planning a change to a RAG platform we are building for a client. The plan was finished by ten past ten. It threads token capture through four separate services, adds a database migration, builds a charting interface in a frontend that had no chart library, and ships a runtime model switcher with an audit trail and the multi-task consistency problem solved properly. The kind of change that would take a careful engineer a few days. The tested pull request landed at twenty to twelve. It passed our security gate and deployed just after half past twelve. Plan to production in three hours, over a coffee.
The model that did that, Opus 4.7, is the most capable thing I have ever built software with. And yet, on a different project, the same model could not be trusted to add up a column of numbers.
Not a hard column. A profit and loss statement. Sum the rows, write the total at the bottom. The total was sometimes wrong. Not always, more often when there were many rows, sometimes off by a little and sometimes by a lot. A model that can architect a cross-service feature before lunch, defeated by primary school arithmetic.
That is not a difficulty gradient. Summing twenty numbers is trivially easier than the feature I shipped this morning. The easy thing was the unreliable thing. Difficulty and reliability had come apart, and understanding why is the whole point.
Spiky, not smart
Andrej Karpathy has a phrase for this. Spiky intelligence. These models are not uniformly capable in the way the word intelligence implies. They are brilliant in some places and unreliable in others, and the two sit millimetres apart. The mistake almost everyone makes, including people who use these tools every day, is to treat capability as a single dial. If the model can do the hard thing, surely it can do the easy thing. It cannot, and assuming it can is how you ship a wrong number to a parent company.
The arithmetic failure is the clearest window into the shape of the spike, so it is worth understanding mechanically rather than waving at.
The report in question was a Tagetik P&L, a profit and loss statement that goes to the parent company in a strictly prescribed format. Around 150 line items across twenty sections, with seven layers of running totals. Gross profit, then industrial gross margin, logistics margin, distribution margin, EBIT, EBITDA, net result. Percentages and per-kilogram ratios at each level. A real corporate reporting obligation, the kind where a wrong figure has consequences that a chatbot getting a film recommendation wrong does not.
The first fix was the one everyone reaches for. Strengthen the prompt. Tell the model to double and triple check the totals, to think carefully, to guard against errors. It did not work, and the diagnosis of why was more illuminating than the fix would have been.
Telling the model to check carefully does not cause it to do more computation. It causes it to generate text that sounds like checking. The model produces tokens left to right. By the time it writes the total, it has already committed to the rows above. Summing many numbers accurately in a single forward pass is unreliable no matter how emphatic the instruction. Worse, we had told it not to show its working, which removed the one thing known to actually help: the visible step-by-step arithmetic that gives the model somewhere to do the sum other than in a single confident leap.
This is the part that should change how you think. The model does not pause and recompute when you ask it to be careful. It cannot. It writes the next token. If that token is a wrong total, no amount of preceding self-talk about diligence has changed the arithmetic. And it delivers the wrong total in a flawless table, with exactly the same authority it shows when the total is right. That is the dangerous bit. The failure is invisible at the moment it matters, because confidence and correctness are not connected either. A wrong sum in a perfect table is a hallucination wearing a suit.
Equipping around the spike
Once you accept the spike, the obvious move is to give the model a better tool for the thing it is bad at. We tried two.
First, a calculator. Any number in the final answer that results from arithmetic must come from a tool call, never from the model’s own head. Correctness-wise, this worked. The totals were exact. Cost-wise, it was a disaster. The P&L has dozens of derived values per column, every subtotal, every running total, every percentage, every ratio, across actuals, comparison, budget and differences. Each one became its own tool call, each call dragging the full conversation context along for the ride. Hundreds of round trips per question, and the tokens per answer went up severalfold. The fix worked. The economics did not.
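As a minimal sketch of that rule, here is what such a tool might look like. The `calculator` name, the string-expression interface, and the use of `Decimal` are assumptions for illustration, not the project's implementation. It evaluates plain arithmetic only, so every figure in the answer can be traced to a call rather than to generated text.

```python
import ast
import operator
from decimal import Decimal

# Only the four basic operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> Decimal:
    """Hypothetical calculator tool: evaluate +, -, *, / exactly with Decimal."""

    def evaluate(node):
        # Numeric literals become Decimals (booleans are not numbers here).
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return Decimal(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


subtotal = calculator("1200.50 + 349.50 - 50")
```

Each subtotal, percentage, and ratio in the report would be one such call, which is where the round trips come from.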
Then, code execution. Let the model write a short Python snippet and run it, doing many calculations at once. Correctness held, code is exact, and token usage was reasonable. But the latency was unacceptable. The answer needed several execution rounds, serialised because the later calculations depended on the earlier ones, and each round carried the overhead of spinning up a sandbox and the model reading results to decide what next. End to end, the user waited noticeably longer than with the original, sometimes wrong, version. For an interactive assistant, a hard sell.
Two architecturally sound solutions, each failing on a different axis. The calculator was correct and cheap per call but expensive per question. Code execution was correct and economical in tokens but slow per question. Both treated calculation as something the model had to do, just better equipped to do it. That was the assumption worth questioning. What if the model did not have to do the calculation at all?
Make the data return the answer
The Tagetik P&L is a fixed report. The same calculations every month, every quarter, every year. The structure of the running totals is stable, and the accounts contributing to each total are knowable in advance. If a value is going to be computed every single month by a model that sometimes gets it wrong, perhaps it should not be computed by the model at all. Perhaps the data platform should hand it over as a value you simply ask for by name.
That was the inflection point. Everything after it was about moving calculation out of the model and into the platform.
Our BI layer exposes curated subjects with named dimensions and metrics. The running totals were the thing the model kept getting wrong, and there was no way to retrieve them directly. So we built one. We added boolean columns to the account dimension, one per running total. In gross result. In industrial gross margin. In logistics margin. In distribution margin. In operating cost EBIT. In net result. Each flag is true for the accounts that roll up into that total. A filtered sum over a flag now returns a running total as a single value. Ask for EBIT, get EBIT, in one row, no formula chain.
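A minimal sketch of the flag idea, in plain Python rather than in the BI layer itself. The column names (`in_gross_result`, `in_operating_cost_ebit`), the accounts, and the amounts are all hypothetical; only the mechanism, a sum filtered on a boolean column, comes from the description above.

```python
from decimal import Decimal

# Hypothetical slice of the account dimension joined to one period's amounts.
# Amounts are signed: income positive, costs negative.
ROWS = [
    {"account": "Net sales", "amount": Decimal("1000.00"),
     "in_gross_result": True, "in_operating_cost_ebit": True},
    {"account": "Cost of goods sold", "amount": Decimal("-600.00"),
     "in_gross_result": True, "in_operating_cost_ebit": True},
    {"account": "Operating costs", "amount": Decimal("-250.00"),
     "in_gross_result": False, "in_operating_cost_ebit": True},
]


def running_total(rows, flag):
    """Sum the signed amounts of every account whose flag column is true."""
    return sum((row["amount"] for row in rows if row[flag]), Decimal("0"))


gross_result = running_total(ROWS, "in_gross_result")
ebit = running_total(ROWS, "in_operating_cost_ebit")
```

The model's job shrinks to naming the flag. The roll-up logic lives in the data, where it is defined once and can be checked once.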
The effect was immediate and slightly uncanny. The EBITDA sign error that had survived every prompt rule we threw at it simply disappeared on the next run. Not because we finally wrote the sign rule correctly, but because the model no longer did the calculation. EBITDA is EBIT minus a depreciation adjustment, and both were now retrieved as signed values. One subtraction on two numbers, instead of a five-term chain accumulating sign confusion. We had not made the model better at the sum. We had removed the sum.
There was a cost. A budget and comparison question went from eleven steps to twenty-nine, because each running total was now its own filtered call. Correctness first, then compression. We promoted the flag filters to named computed metrics, each running total expressed once as a metric with its filter logic baked in, so a single call returns all six at once. The step count fell from twenty-nine back down to fourteen. The report was the same shape it had been at the start. The model was doing less and less work at every iteration.
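The promotion from flag filters to named metrics can be sketched like this. The `METRICS` registry, the `query_metrics` function, and the data are hypothetical stand-ins for whatever the BI layer actually provides; the point is that the filter logic is written once and one call returns several totals.

```python
from decimal import Decimal

# Hypothetical metric registry: metric name -> the flag column it filters on.
METRICS = {
    "gross_result": "in_gross_result",
    "ebit": "in_operating_cost_ebit",
}

# Hypothetical signed amounts with their flag columns.
ROWS = [
    {"amount": Decimal("1000"), "in_gross_result": True, "in_operating_cost_ebit": True},
    {"amount": Decimal("-600"), "in_gross_result": True, "in_operating_cost_ebit": True},
    {"amount": Decimal("-250"), "in_gross_result": False, "in_operating_cost_ebit": True},
]


def query_metrics(rows, names):
    """Return every requested metric from one call, in a single pass over rows."""
    unknown = [name for name in names if name not in METRICS]
    if unknown:
        raise KeyError(f"unknown metrics: {unknown}")
    totals = {name: Decimal("0") for name in names}
    for row in rows:
        for name in names:
            if row[METRICS[name]]:
                totals[name] += row["amount"]
    return totals


result = query_metrics(ROWS, ["gross_result", "ebit"])
```

Compared with one filtered call per total, the caller no longer needs to know which flag backs which metric, and the number of round trips stops growing with the number of totals requested.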
Reliability lives in the layer, not the model, and that cuts both ways. The hardest we ever blamed the model was a stretch where it kept sending empty queries and we rewrote the instructions three times, certain it was misbehaving. It was not. We had added the new metrics but the schema the model validates against did not list them yet, so it was obediently dropping every field name it could not see, exactly as instructed. Two minutes to fix the schema, and the calls landed. The model was being scrupulous about a world we had described incorrectly. Before you decide the model is wrong, check that the data layer it is looking at represents reality.
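The failure is easy to reproduce in miniature. In this sketch the function names, field names, and schema contents are all hypothetical; it shows only how a stale schema plus a "drop what you cannot see" rule produces an empty query, and how a simple drift check would have pointed at the real cause.

```python
def validate_fields(requested, schema_fields):
    """Drop any requested field the schema does not list, as instructed."""
    return [name for name in requested if name in schema_fields]


def missing_from_schema(platform_metrics, schema_fields):
    """Metrics the platform serves that the schema fails to advertise."""
    return sorted(set(platform_metrics) - set(schema_fields))


# Hypothetical state: the platform has the new metrics, the schema does not.
platform_metrics = {"gross_result", "ebit", "net_result"}
stale_schema = {"account", "period", "amount"}

# The model asks for the new metrics and obediently drops all of them.
sent = validate_fields(["ebit", "net_result"], stale_schema)

# A drift check names the actual problem: the schema, not the model.
drift = missing_from_schema(platform_metrics, stale_schema)

# Once the schema lists the metrics, the same request goes through intact.
fixed = validate_fields(["ebit", "net_result"], stale_schema | platform_metrics)
```

Running a check like `missing_from_schema` whenever metrics are added is one cheap way to catch this class of mismatch before anyone starts rewriting instructions.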
But the next model will just do the maths
This is the obvious objection, and it is the right one. Models keep getting better. The next one will sum 150 rows without breaking a sweat, and then all this data scaffolding looks like an elaborate workaround for a problem that solved itself. I have argued before on this blog against over-engineering around the current limitations of models, because a new release so often makes the workaround redundant. So I am obliged to take this seriously.
It does not hold, and the reason it does not hold is the strongest part of the whole approach.
Asking for a value by name is cheaper and faster than computing it, even for a model that computes flawlessly. The running totals are defined once, when the metric is added to the platform, and the query engine serves them with whatever pre-aggregation and indexing it already has. A perfect model still has to do the work at request time if you ask it to compute. A named retrieval skips the work entirely. That gap does not close as models improve. It widens, because the rest of the system keeps getting faster while request-time computation stays expensive.
The proof is in how the project ended. Once the platform was doing the arithmetic, we switched the answer model from Opus to Sonnet 4.6 and the cost dropped again with no loss of quality. The better the architecture got, the less model we needed. A workaround gets thrown away when the model improves. This gets cheaper, because a smaller model doing less work is the entire goal.
There is a real boundary, and it is worth being precise about. The calculator and the code execution were not wrong. They are exactly right for genuinely ad hoc analysis, the unforeseeable question that needs arbitrary arithmetic over the data. You cannot predefine an answer you have never been asked for. The Tagetik P&L is not that kind of question. It is the same fixed report run on repeat. The discipline is knowing which kind of problem you have. For the ad hoc kind, equip the model to compute. For the fixed kind, and most enterprise reporting is the fixed kind, bake the answer into the platform and let the model ask for it.
The reliability is in the layer
The capability of the model matters less than the shape of what sits behind the interface. Every named metric, every boolean flag, every directly retrievable total we added was the model’s equivalent of a calculator, letting it skip a step it would otherwise have performed and potentially got wrong. The instructions did not get cleverer over this work. They got simpler. The platform got more capable, and the model got correspondingly more reliable.
That is the lesson hiding inside the contradiction I opened with. The same model ships a cross-service feature before lunch and cannot total a column, because intelligence is spiky and reliability is not the same thing as capability. You do not fix the spike by demanding the model be uniformly excellent. You map where the troughs are, and you build the platform to cover them. For the P&L, the trough was multi-step arithmetic in a single forward pass, and the platform covered it by making the answers retrievable.
We stopped asking the model to add up. We gave it a platform that already knew the totals. It got faster, cheaper, and right. The model was never the thing that needed to improve.