Stop rebuilding the retriever. Your corpus is what rotted.

The enterprises pouring serious money into AI mostly have nothing to show for it. Writer’s 2026 adoption survey, twelve hundred executives and twelve hundred employees, found that only 29 percent of organisations see significant return from generative AI and only 23 percent from agents. Fifty-nine percent are spending more than a million a year. Forty-eight percent of executives now call the whole thing a massive disappointment, up from 34 percent a year ago. The money went out. The return did not come back.

So the industry went looking for the broken part, and it decided the broken part was retrieval. In the first quarter of 2026, retrieval optimisation overtook evaluation as the top enterprise investment priority for the first time. Buyer intent for hybrid retrieval tripled in three months, from around 10 percent to 33. The reasoning is sound enough. RAG is failing at scale, retrieval is the part of RAG everyone understands, so rebuild the retriever.

That is half right and expensively half wrong. A better retriever pointed at a rotting corpus retrieves rot faster.

Two failures that look identical from the outside

When an agent gives a confident, wrong answer, two completely different things could have gone wrong, and from the outside they look the same.

The first is retrieval. The system failed to find the chunk that held the right answer, or it found the wrong one and ranked it top. This is the failure the rebuild is aimed at, and it is real. Agentic workloads make orders of magnitude more retrieval calls than a human ever did, and at that volume retrieval architecture genuinely matters. Hybrid search, reranking, fusion, all of it earns its keep.

The second is the corpus. Retrieval worked perfectly. It found exactly the chunk it was supposed to find. The chunk was stale, or thin, or contradicted by something newer, or never quite right in the first place. The retriever did its job and handed the model a confident, well-cited, wrong answer.

The failure data is measuring the second problem. Multiple analyses through the first half of 2026 converge on data quality and freshness as the dominant driver of production RAG failure, with the figures landing in the low to mid sixties. Gartner’s cleaner adjacent number, 57 percent of organisations saying their data is not AI-ready, points the same way. The spending wave is aimed at the first problem. That gap, between where the failures are and where the budget is going, is a large part of why the ROI numbers are flat.

Retrieval architecture decides whether you find the right chunk. Corpus maintenance decides whether the right chunk is still true. The first is necessary. It is nowhere near sufficient.

What rot actually looks like

We hit this early in the DBA expert system we are building with White Star Software, the OpenEdge consultancy. We had put together a voice demo so one of their engineers could show the system at conferences: ask it a database question out loud and get a spoken answer back. During one run the engineer asked about buffer pool sizing, and the agent recommended growing the buffer pool to a frankly ridiculous size, the kind of number that would make any DBA wince.

We traced the retrieval. Nothing was wrong with it. The system had surfaced the relevant chunks, the ones genuinely about buffer pool configuration. The problem was that those chunks were vague. They talked about the concept without the specifics, the thresholds, the practical ceilings, the do-not-go-past-this detail that a real DBA carries in their head. The vagueness was a gap, and the model did what models do with a gap. It filled it, confidently.

This is the corpus failure in miniature, and it is worth sitting with, because it is not the failure people picture. Nothing was stale. Nothing was out of date. Retrieval succeeded. The corpus was simply thin in exactly the wrong place, and a better retriever would have found the same thin chunks faster. The only reason the absurd recommendation went nowhere is that a knowledgeable human heard it and caught it. Nothing in the system would have. There was no feedback mechanism at that point. The system had no way to know it had just been wrong.

Catching the contradiction at write time

That buffer pool moment is why we built the feedback loop.

Now when a White Star engineer testing the system hits something inaccurate, they correct it in the moment, and the correction does not just get logged somewhere for a human to triage later. It flows through a pipeline that extracts the actual claim, finds the chunks it bears on, and writes a semantic patch against them. The corpus updates from use, with full provenance, so every change traces back to who made it and why.

The part that matters most is what happens when corrections collide. The conflict detection runs at the moment the patch is created, not at inference time. If a new correction would contradict a patch that is already in place, the system does not silently overwrite the old one. It flags the conflict and routes it to a human, who decides whether to accept both, supersede one with the other, or reject the new one outright. We never patch over an existing patch without a person reading both and deciding which is right.
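A minimal sketch of that write-time check, to make the shape concrete. This is not the production pipeline: every name here (Patch, PatchStore, Resolution) is hypothetical, and the contradicts function is passed in by the caller because the source does not describe how contradiction is actually judged.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Resolution(Enum):
    ACCEPT_BOTH = "accept_both"
    SUPERSEDE = "supersede"
    REJECT_NEW = "reject_new"


@dataclass(frozen=True)
class Patch:
    chunk_id: str
    claim: str
    author: str  # provenance: who made the correction
    reason: str  # provenance: why they made it


@dataclass
class PatchStore:
    # Assumption: the caller supplies the contradiction test.
    contradicts: Callable[[str, str], bool]
    applied: dict = field(default_factory=dict)       # chunk_id -> list of Patch
    review_queue: list = field(default_factory=list)  # (existing, new) pairs

    def submit(self, new: Patch) -> bool:
        """Apply the patch, or queue it for a human if it collides."""
        existing = self.applied.get(new.chunk_id, [])
        clashes = [p for p in existing if self.contradicts(p.claim, new.claim)]
        if clashes:
            # Never overwrite silently: a person reads both sides first.
            self.review_queue.extend((old, new) for old in clashes)
            return False
        self.applied.setdefault(new.chunk_id, []).append(new)
        return True

    def resolve(self, old: Patch, new: Patch, decision: Resolution) -> None:
        """Record the human decision for one queued conflict."""
        self.review_queue.remove((old, new))
        patches = self.applied.setdefault(new.chunk_id, [])
        if decision is Resolution.SUPERSEDE:
            patches.remove(old)
        if decision is not Resolution.REJECT_NEW and new not in patches:
            patches.append(new)
```

The point of the design is that submit never mutates the applied set when there is a clash. The only path by which a conflicting patch lands is resolve, which requires an explicit decision.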

This is the honest answer to the question everyone asks about expert knowledge systems. What happens when two experts disagree? The wrong answer is that the most recent write wins, because that means your corpus quietly contradicts itself and you find out in front of a customer. The right answer is that the contradiction surfaces in the editorial pipeline, where someone qualified looks at it before it can ever reach a user. Catch it at write time or discover it at inference time. Those are the only two options, and one of them is far cheaper.

A gap should become a work item, not a hallucination

The next failure we hit had a different shape. During internal testing the system invented a ProTop option that does not exist. ProTop is the leading monitoring tool in the OpenEdge world, and the agent described a feature of it with complete confidence. The feature was fiction.

The trace was unambiguous this time. We had not yet loaded the full ProTop help material into the knowledge base. There was nothing there to retrieve, so the model reached past the empty space and made something up. A pure gap, and a pure hallucination on top of it.

The obvious fix was to load the missing material, and we did. The fix that mattered was structural. We hardened the system so that a detected gap, a question the retrieval pipeline cannot answer from the corpus, triggers an interview request to the relevant expert. The missing knowledge gets captured, processed, and fed back into the base, stamped with provenance back to the gap that prompted it. A gap stops being a silent hole that the model papers over and becomes a work item that gets filled. The next person who asks the same question gets a grounded answer instead of a confident invention.
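As a rough sketch of that gap-to-work-item flow, assuming a gap is detected when no retrieved chunk clears a relevance threshold. The threshold, the GapTracker and InterviewRequest names, and the dict shape of a corpus entry are all illustrative assumptions, not the real system.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class InterviewRequest:
    gap_id: int
    question: str
    expert: str
    status: str = "open"


@dataclass
class GapTracker:
    min_score: float = 0.5  # hypothetical relevance threshold
    requests: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def answer_or_capture(self, question, hits, expert):
        """hits is a list of (chunk_text, score) pairs from retrieval.

        Returns the chunks good enough to ground an answer, or None
        after opening an interview request for the expert.
        """
        grounded = [text for text, score in hits if score >= self.min_score]
        if grounded:
            return grounded
        request = InterviewRequest(next(self._ids), question, expert)
        self.requests.append(request)
        return None  # the caller should decline to answer, not improvise

    def fill(self, gap_id, answer, corpus):
        """Feed the captured knowledge back, stamped with its gap."""
        request = next(r for r in self.requests if r.gap_id == gap_id)
        corpus.append({"text": answer, "source": request.expert, "gap_id": gap_id})
        request.status = "filled"
```

Returning None rather than an empty list is deliberate in this sketch: it forces the calling code to handle the gap explicitly instead of passing an empty context to the model.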

That is the loop the freshness literature keeps describing and rarely shows. Not “documents go stale, you should refresh them”, but a specific mechanism that notices the absence and goes and fixes it.

Why targeted beats brute force

The standard objection to all of this is simpler. Just re-embed everything more often. Re-index the whole corpus on a schedule and freshness takes care of itself.

It does not, and it is expensive. One Pinecone user reported spending twelve thousand dollars a month to re-embed a one-terabyte corpus weekly, purely to chase freshness. Treat that as illustrative rather than a benchmark, but the shape is right. Brute-force re-embedding pays the full price of the entire corpus every cycle to fix the small fraction of it that actually changed. And after all that money, it still does not detect a contradiction between two chunks, and it still cannot tell you that a chunk is retrieved constantly but almost never contributes to a good answer.

Targeted maintenance costs in proportion to the change, not the size of the corpus. Detecting a gap and analysing it runs to a few thousand tokens, which costs pennies. An expert interview to fill it costs a dollar or two. A semantic patch costs almost nothing. You pay for what moved, not for what stood still. At any real corpus size that is not a marginal saving. It is a different cost model entirely, and it catches the failures, contradiction and low grounding, that re-embedding is blind to.
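The difference between the two cost models fits in a few lines. This is an illustrative sketch only: the function names are made up, and the numbers in the usage note are arbitrary placeholders in abstract cost units, not measurements from any system.

```python
def brute_force_cost(corpus_units, cost_per_unit, cycles=1):
    """Re-embed everything each cycle: cost scales with corpus size."""
    return corpus_units * cost_per_unit * cycles


def targeted_cost(changed_units, cost_per_change, cycles=1):
    """Fix only what moved: cost scales with the amount of change."""
    return changed_units * cost_per_change * cycles


def break_even_fraction(cost_per_unit, cost_per_change):
    """Share of the corpus that must change per cycle before
    brute force becomes the cheaper option."""
    return cost_per_unit / cost_per_change
```

With placeholder values of 1,000,000 units at 1 cost unit each, against 500 changes at 20 cost units each, brute force costs 1,000,000 per cycle and targeted maintenance costs 10,000. Even with each targeted fix priced at twenty times a single re-embed, brute force only wins once more than one twentieth of the corpus changes every cycle.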

What is actually running, and what is not

I want to be precise about maturity, because the prescription is only worth anything if the claims behind it are honest.

The feedback loop, the semantic patching, and the conflict detection at write time are deployed and in use. They are not slideware. The buffer pool failure and the ProTop hallucination are not hypotheticals. They are the failures that drove the build, and the machinery that answered them is running.

The next layer is built and deployed and has started gathering telemetry, but it is not yet running autonomously at production scale across live customers. That includes the weekly consolidation pass that proposes merges and splits across the knowledge base, and the grounding counters that separate how often a chunk is retrieved from how often it actually grounds an answer, which is how you find the chunk that looks busy but is quietly useless. The retirement switch exists too, a single predicate that takes a flagged chunk out of every retrieval path at once. What is missing is real-world volume, not the code. We will know more when it has run against a large corpus under real load, and I would rather say that than pretend the production-scale telemetry already exists.
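To show what those two pieces amount to, here is a minimal sketch of grounding counters and a single retirement predicate. The GroundingStats name, its methods, and the thresholds are hypothetical, and the sketch assumes something upstream can already say which retrieved chunks actually grounded an answer, which the source does not spell out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GroundingStats:
    retrieved: Counter = field(default_factory=Counter)
    grounded: Counter = field(default_factory=Counter)
    retired: set = field(default_factory=set)

    def record(self, retrieved_ids, grounding_ids):
        """Log one answer: what was fetched, and what was actually used."""
        self.retrieved.update(retrieved_ids)
        self.grounded.update(grounding_ids)

    def busy_but_useless(self, min_retrievals, max_ratio):
        """Chunks fetched often that rarely ground an answer."""
        return sorted(
            chunk_id
            for chunk_id, n in self.retrieved.items()
            if n >= min_retrievals and self.grounded[chunk_id] / n <= max_ratio
        )

    def is_live(self, chunk_id):
        """The one predicate every retrieval path filters on."""
        return chunk_id not in self.retired
```

Because every retrieval path asks is_live, adding a chunk id to retired removes it everywhere at once, with no per-index cleanup to forget.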

The layer where the ROI is hiding

The retrieval rebuild is real. At agentic scale, with that many calls, retrieval architecture matters and the investment is not wasted. But putting the whole budget there is a bet that finding the chunk is the hard part, and the failure data says it is not.

Stop tuning the retriever in isolation. Build the maintenance layer underneath it. Catch contradictions when the correction is written, not when a customer hits one. Turn gaps into capture requests instead of hallucinations. Measure whether a chunk grounds answers, not just whether it gets retrieved. Retire the chunks that fail, cleanly, everywhere at once. This is governance as truth maintenance, not governance as access control, and it is the layer almost nobody is instrumenting.

The flat ROI numbers are not hiding in your retriever. They are hiding in the corpus the retriever points at. Retrieval decides whether you find the chunk. Maintenance decides whether the chunk was worth finding.

Stop rebuilding the retriever. Your corpus is what rotted.
