31 March 2026 · Nick Finch
Stop tuning your prompts. Start engineering your context.
Every frontier model degrades as you fill the context window. The bottleneck was never capacity. It was curation. Here's what that means for enterprise AI, and what we're doing about it.
Here is something counterintuitive. Giving an AI model more information often makes it worse.
Chroma Research tested 18 frontier models and found that every single one degrades as input length increases. A model with 200,000 tokens of capacity can start producing worse answers at 50,000. The researchers call it “context rot.” The more you put in, the less the model can use. Position matters. Size matters. Simply having the right information somewhere in the window is not enough. The model has to be able to find it and use it, and the longer the window gets, the harder that becomes.
The industry spent the last three years racing to make context windows bigger. Bigger was supposed to be better. It turns out that bigger, filled carelessly, is actively worse.
This matters because the gap between AI that impresses in a demo and AI that works in production is almost entirely a context problem. And most enterprises are not doing the work to solve it.
The production gap is a context gap
Seventy-eight percent of enterprises now have at least one AI agent pilot running. Only fourteen percent have scaled an agent to organisation-wide operational use. That gap is not about model capability. The models are good. It is about what the models can see.
Foundation models know everything about public libraries and nothing about your internal deprecation decisions. They will confidently suggest using a library your team retired six months ago. They will generate code that violates security policies they have never seen. They will produce answers that sound authoritative but are grounded in generic training data rather than the specific reality of your business.
Stack Overflow reported in March that their Internal Knowledge APIs became, in their words, "very hot" as enterprises plugged verified internal knowledge into AI assistants. The models needed grounding. The companies that provided it got production value. The ones that did not got impressive demos and thin results.
This is the context engineering problem. Not “how do I write a better prompt” but “how do I design the entire information ecosystem that surrounds the model so it can actually do useful work.”
Context engineering is an infrastructure problem
Gartner formally defined context engineering in Q1 2026 as designing and structuring the relevant data, workflows, and environment so AI systems can understand intent, make better decisions, and deliver enterprise-aligned outcomes. It is now listed as a top emerging technology skill for the year.
The term is new. The discipline is not. It is the same lesson the database industry learned decades ago. You do not solve performance problems by buying bigger hardware. You solve them with proper schema design, connection pooling, caching layers, and query optimisation. The architecture matters more than the machine.
The same principle applies to AI. You do not solve the context problem with bigger context windows. You solve it with retrieval architecture, memory systems, data formatting, and structured knowledge pipelines. The work is not glamorous. It is the kind of plumbing that never makes it into a press release. But it is the work that determines whether your agent scales or stalls.
Andrej Karpathy put it well when he endorsed the term. Context engineering, he said, is "the delicate art and science of filling the context window with just the right information for the next step." Not all the information. Not as much information as you can fit. The right information.
What this looks like in practice
We have been building agentic RAG systems at inmydata for months now, and the context engineering problem is the problem we spend the most time on. Not model selection. Not prompt tuning. Context.
Our retrieval pipeline uses two parallel search strategies. One is a vector search, which finds content by meaning. If a user asks about transaction alerts, vector search will surface documentation about alerting systems even if the exact words do not match. The other is a keyword search, which finds content by exact terms. If a user mentions a specific command or error code, keyword search catches it even when the semantic meaning is ambiguous.
We combine the results from both searches using a scoring system that rewards chunks appearing in both lists. A piece of documentation that ranks highly on meaning and on keywords gets a strong signal. Something that only appears in one list gets a weaker score. This fusion gives us a ranked set of candidates, ordered by confidence that they are genuinely relevant to the question.
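The fusion step described above can be sketched with reciprocal rank fusion, a standard way to combine two ranked lists. This is an illustrative assumption, not our production scoring formula; the chunk IDs and the constant `k` are placeholders.

```python
# Sketch of hybrid-retrieval fusion via reciprocal rank fusion (RRF).
# A chunk appearing high in BOTH lists accumulates a larger score than
# one appearing in only a single list, which matches the behaviour
# described above. The exact formula here is an assumption.

def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60):
    """Combine two ranked lists of chunk IDs into one fused ranking."""
    scores = {}
    for ranked in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(ranked):
            # Each appearance contributes 1/(k + rank); higher ranks add more.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results for "how do I set up transaction alerts?"
vector_hits = ["alerts-overview", "alert-setup", "dashboards"]
keyword_hits = ["alert-setup", "cli-reference"]
fused = reciprocal_rank_fusion(vector_hits, keyword_hits)
```

Here "alert-setup" ranks first because it appears in both lists, even though neither individual search put it top.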
Then comes the quality gate. We set a threshold. If the best candidate scores above it, we inject that context into the agent’s window. If nothing scores high enough, we tell the agent honestly that no relevant documentation was found, and instruct it not to guess.
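The quality gate reduces to a few lines. The threshold value and the wording of the fallback note are assumptions for illustration; the essential behaviour is the honest empty case.

```python
# Minimal quality-gate sketch. THRESHOLD is an illustrative value,
# tuned per corpus in practice, not a real production setting.

THRESHOLD = 0.03

def apply_quality_gate(ranked_candidates, threshold=THRESHOLD):
    """Return chunks to inject, or an explicit 'nothing found' signal."""
    passing = [(cid, score) for cid, score in ranked_candidates
               if score >= threshold]
    if passing:
        return {"context": [cid for cid, _ in passing], "grounded": True}
    # Tell the agent plainly that nothing cleared the bar, rather than
    # injecting borderline, possibly misleading chunks.
    return {"context": [], "grounded": False,
            "note": "No relevant documentation found. Do not guess."}
```

The point of returning an explicit note rather than an empty list alone is that the agent is told, in-band, why its context is empty.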
This is where things get interesting, and where the context engineering lesson hit us hardest.
The hallucination that changed our approach
During a live demo of one of our expert systems, a user asked about navigating to a transaction screen. Our retrieval pipeline ran. It found relevant documentation. But the top result scored just below our quality threshold, high enough to be topically related, not high enough to inject with confidence.
The quality gate did its job. It filtered the borderline results and told the agent no relevant context was available. The agent had been explicitly instructed not to guess.
It guessed anyway.
Over several turns, the agent fabricated navigation instructions. It told the user to press a key that did something completely different. It described a menu bar that did not exist. When challenged, it admitted it was guessing and apologised. Two turns later, the user asked a different question that happened to match strongly in both search lists. The retrieval scored well above the threshold. Thirty-eight chunks of context were injected. The agent gave a correct, grounded, well-sourced answer.
The contrast was stark. With good context, the agent was excellent. Without it, no amount of prompting could prevent hallucination.
The fix was architectural, not linguistic
The obvious response would have been to rewrite the prompt. Tell the model more firmly not to guess. Add more emphatic instructions. We will do that too, because prompts matter. But the root cause was not the prompt. The prompt was already clear. The root cause was that relevant context existed in our knowledge base and was not making it into the agent’s window.
Lowering the quality threshold globally was not the answer either. The borderline results at that score included chunks that were topically adjacent but factually misleading. Same domain, same terminology, completely different tool. Injecting those would have caused the agent to give a wrong answer with a citation, which is worse than an unsupported guess.
So we built a secondary step. When the quality gate finds nothing above the threshold but there are candidates in a grey zone just below it, a lightweight language model evaluates each candidate against the specific question. Is this chunk directly relevant to what the user asked, or is it merely in the same neighbourhood? The model classifies each one. Relevant chunks get injected with a note that the confidence is lower. Irrelevant chunks get discarded.
If this secondary step fails for any reason, it defaults to injecting nothing. The safe failure mode is always silence, never noise.
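The grey-zone fallback and its fail-safe can be sketched like this. The threshold values are illustrative, and `classify_relevance` stands in for the lightweight language-model call; its name and signature are assumptions.

```python
# Sketch of the grey-zone rescue step. Candidates scoring just below the
# main gate get a second look from a lightweight classifier. The numeric
# bounds and the `classify_relevance` callable are illustrative assumptions.

GATE = 0.03        # main quality threshold (illustrative value)
GREY_FLOOR = 0.02  # bottom of the grey zone (illustrative value)

def rescue_grey_zone(candidates, question, classify_relevance):
    """Return lower-confidence chunks judged directly relevant, else nothing."""
    grey = [(cid, s) for cid, s in candidates if GREY_FLOOR <= s < GATE]
    try:
        return [cid for cid, _ in grey
                if classify_relevance(question, cid) == "relevant"]
    except Exception:
        # If the secondary classifier fails for any reason, inject nothing:
        # the safe failure mode is silence, never noise.
        return []
```

Chunks this step rescues would be injected with a note that confidence is lower; anything above `GATE` never reaches this code path, and anything below `GREY_FLOOR` is discarded outright.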
This is context engineering. Not prompt tuning. Not hoping a bigger window solves it. Designing the retrieval architecture so that the right information, at the right confidence level, reaches the model at the right time.
Prompts matter. Context matters more.
I want to be clear about one thing, because the narrative around context engineering can tip into a false dichotomy. Prompts are not irrelevant. The instructions you give the model matter. We are refining our system prompts alongside this architectural work, because when an agent is told clearly not to guess, it should not guess, and making that instruction more robust is worthwhile.
But optimising the prompt while ignoring the context pipeline is like optimising SQL queries while your database schema is a mess. You will see marginal improvements that mask the structural problem. The prompt sets intent. The context supplies the situational awareness that makes intent actionable. Both matter. Enterprises are dramatically underinvesting in the latter.
The forty to sixty percent of RAG implementations that fail to reach production are not failing because the prompts are poorly written. They are failing because the retrieval is noisy, the data is poorly structured, the quality controls are absent, and nobody has done the architectural work of deciding what should and should not reach the model.
Context is becoming a hardware concern
NVIDIA announced the Inference Context Memory Storage platform as part of its Rubin architecture. Dedicated hardware for managing agent context. Up to 16 terabytes of context memory per GPU. Five times higher throughput than traditional storage.
Context has become important enough to warrant its own silicon. The models will get better at handling longer windows. Context rot will diminish as architectures improve. But even a model that handles a million tokens flawlessly still needs to be fed the right million tokens. The retrieval, the curation, the quality decisions: those do not go away when the model improves. They become more valuable, because a better model does more with better input.
The work most teams are not doing
There is a temptation in this industry to over-engineer solutions around the current limitations of models. We have seen it before. Teams build elaborate workarounds, and six months later a new model release makes the workaround redundant. That is a real risk, and anyone building agentic systems should be honest about it.
Context engineering is not that. It is not a workaround for models that cannot handle long windows. It is the foundational work of deciding what your agent should know, when it should know it, and how confident you are in the information you are providing. That work pays dividends regardless of how capable the model underneath becomes. A better model with better context produces better outcomes. A better model with noisy, unstructured, carelessly assembled context still underperforms.
The organisations building this infrastructure today (retrieval pipelines, quality gates, memory systems, structured knowledge bases) are accumulating advantages that compound. Every edge case they solve, every quality threshold they tune, every architectural decision they make feeds into institutional knowledge that makes everything they build next faster and more reliable.
We see this every day at inmydata. The work that makes our agents reliable is not the model. It is the context architecture underneath. The retrieval strategies, the scoring systems, the quality gates, the fallback classifiers, the structured data pipelines that feed operational reality into the agent’s window alongside expert knowledge. That infrastructure is what turns a demo into a production system.
The model interprets. The context is the magic.