15 April 2026 · Nick Finch
Why constraining your agents makes them better
Token prices have fallen 280x. Enterprise AI bills have tripled. The difference is architecture. Here's what we learned building three agentic systems where every design decision was a cost decision.
Token prices have collapsed. GPT-3.5-level inference fell from twenty dollars to seven cents per million tokens in eighteen months. Frontier models keep getting cheaper. And yet enterprise AI spending has moved in the opposite direction. The average AI budget has grown from $1.2 million a year to $7 million. Inference now accounts for 85% of that spend.
The economists have a name for this: the Jevons paradox. When a resource becomes dramatically cheaper, total consumption increases so fast that aggregate spending rises. It happened with steam engines and coal. It happened with compute and cloud bills. It is happening now with tokens and AI budgets.
But the economics lesson is not the interesting part. The interesting part is where the cost actually lives. It is not in the token price. It is in the architecture. Every decision you make about what goes into the context window, how many times the agent loops, and what background work the system does to stay healthy is a token decision. And if you are not making those decisions deliberately, you are making them expensively.
We have built three agentic systems at inmydata over the past year. Each one taught us the same lesson in a different way.
The schema that broke the budget
inmydata Studio is an agent-driven dashboard designer. Users describe what they want in natural language, and the agent builds it. The business model is straightforward. Around twenty pounds a month for access, with a token allocation that governs how heavily you can use the system.
During internal testing, we ran the numbers on what that token allocation actually bought. The answer was two or three dashboards a month. That is not a product. That is a demo.
We dug into where the tokens were going and found the problem immediately. The data schema. Studio connects to whatever data sources the customer has exposed, and those schemas can be large. Dozens of subjects, each with a hundred or more columns. On every request, the full schema was being sent into the context window. Every turn in the conversation, the entire thing went back and forth.
The fix was architectural. We give the agent a summary of the schema first, enough to understand what data is available and make a choice about which area is relevant. Then we narrow progressively. As the conversation develops and the dashboard takes shape, the agent works with a focused subset of the schema rather than the whole thing.
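The shape of that fix can be sketched in a few lines. This is a hypothetical illustration, not inmydata's actual implementation: the first pass sends the agent only subject names and column counts, and later passes send full column detail only for the subjects the agent chose.

```python
# Hypothetical sketch of progressive schema narrowing.
# A "schema" here is just subject name -> list of column names.

def summarise_schema(schema: dict[str, list[str]]) -> str:
    """First pass: subject names and column counts, not the columns themselves."""
    return "\n".join(f"{subject}: {len(cols)} columns" for subject, cols in schema.items())

def narrow_schema(schema: dict[str, list[str]], chosen: set[str]) -> dict[str, list[str]]:
    """Later passes: full column detail only for the subjects the agent picked."""
    return {subject: cols for subject, cols in schema.items() if subject in chosen}

# Illustrative schema roughly the size described above.
schema = {
    "sales": [f"col_{i}" for i in range(120)],
    "inventory": [f"col_{i}" for i in range(90)],
    "hr": [f"col_{i}" for i in range(75)],
}

summary = summarise_schema(schema)          # three short lines enter the context window
focused = narrow_schema(schema, {"sales"})  # 120 columns instead of 285
```

The point of the sketch is the asymmetry: the summary costs a handful of tokens on every turn, and the full column lists are paid for only once the conversation has committed to a subject.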
The result was fifteen to twenty dashboards per month. A fundamental shift in the economics of the product. And it is still not fully solved. Customer schemas vary wildly in size and complexity. We are still iterating on how aggressively we can narrow without losing important context. But the principle held. The cost driver was not the model. It was what we were putting into the model’s window.
The loop that was not worth building
Alongside our analytics platform, we build living knowledge bases. These are not static RAG systems where documents go in, get chunked, and freeze. They are designed to improve through use. When an expert corrects an answer, the correction feeds back into the knowledge base itself. When the system encounters a question it cannot answer, it identifies the gap and reaches out to fill it. Performance monitoring tracks which content is actually contributing to good answers and which is just noise.
That last piece is where the token economics get interesting. An automated quality agent analyses retrieval performance, identifies underperforming content, and decides what to do about it. Some chunks are near-duplicates of richer content elsewhere and can be flagged as irrelevant. Some contain useful material but are too vague, and the agent can enrich them by pulling in specific detail from related sources. Some need genuine domain expertise to evaluate, and those get escalated to a human reviewer. Each of these decisions has a token cost, but the return is compounding. A knowledge base that actively improves means better retrieval on every subsequent query.
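The triage logic can be sketched as a simple decision function. Everything here is illustrative: the field names, the thresholds, and the stats themselves are assumptions standing in for whatever retrieval telemetry the real quality agent consumes.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    FLAG_IRRELEVANT = "flag"     # near-duplicate of richer content elsewhere
    ENRICH = "enrich"            # useful but too vague; pull in related detail
    ESCALATE = "escalate"        # needs domain expertise; send to a human
    KEEP = "keep"

@dataclass
class ChunkStats:
    retrieval_hits: int          # how often this chunk was retrieved
    answer_contributions: int    # how often it actually helped an answer
    similarity_to_best: float    # overlap with the richest related chunk, 0..1

def triage(stats: ChunkStats) -> Action:
    # Thresholds are illustrative, not tuned values.
    if stats.similarity_to_best > 0.95:
        return Action.FLAG_IRRELEVANT
    hit_rate = (stats.answer_contributions / stats.retrieval_hits
                if stats.retrieval_hits else 0.0)
    if stats.retrieval_hits >= 20 and hit_rate < 0.1:
        return Action.ENRICH     # retrieved often, rarely helps: probably vague
    if stats.retrieval_hits < 5:
        return Action.ESCALATE   # too little signal to judge automatically
    return Action.KEEP
```

Each branch has a different token cost: flagging is nearly free, enrichment burns tokens pulling in related sources, and escalation spends human attention instead. The compounding return comes from the fact that every branch improves retrieval for all subsequent queries.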
Then we considered adding a second loop. A verification agent that would review the quality agent’s decisions on a weekly cycle. Did it correctly flag that chunk as low-value? Was the enrichment sensible? The idea was sound. A check on the checker.
We decided not to build it. The cost-benefit did not justify it. The token cost of running a second agent over every decision the first agent made, every week, was significant. And we had a better signal available. When an expert later disagrees with a quality agent decision through the normal feedback loop, that produces a correction that effectively reverses the decision. The rate of those reversals is a direct measure of the quality agent’s accuracy, and it costs nothing extra to track. We improved the first loop based on that signal rather than adding a second agent on top.
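The cheaper signal amounts to a single ratio that the existing feedback loop already produces as a side effect. A minimal sketch, with illustrative numbers:

```python
def reversal_rate(decisions: int, expert_reversals: int) -> float:
    """Share of quality-agent decisions later reversed by expert feedback.

    This is a free proxy for the agent's accuracy: the corrections already
    flow through the normal feedback loop, so tracking the ratio costs no
    extra tokens, unlike a second verification agent run weekly.
    """
    return expert_reversals / decisions if decisions else 0.0

# Illustrative figures: 400 decisions in a period, 12 later reversed.
rate = reversal_rate(decisions=400, expert_reversals=12)  # 0.03
```

A rising rate says the first loop needs work; a stable low rate says a second loop would be paying tokens to confirm what the experts are already confirming for free.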
The lesson is simple. Loops on top of loops compound cost without proportional benefit. There is always another verification step you could add. The discipline is knowing when the next loop is not worth the tokens, and finding cheaper signals that achieve the same goal.
The agent that would not stop asking questions
We built a chat-based expert system for a coffee manufacturing business in the Netherlands. The system had access to 23 data sources. Demand data from the ERP. Demand predictions. Crop forecasts for raw coffee. Weather forecasts. Currency exchange rates. Market intelligence reports. Users could ask questions like “what coffee buying opportunities are there over the next few months” or “what issues do we need to address.”
The agent was instructed to query these data sources iteratively. See the question, assess which data sources are relevant, pull data, assess what it has learned, decide if it needs more, pull again. Build its own understanding of the situation before answering.
Without explicit iteration limits, the agent would keep going. Not because it was broken. Because it was doing what we asked. There is always another data source that might be relevant. Another angle to check. Another forecast to cross-reference. The agent always responds eventually, but the cost of that response scales with how many queries it makes along the way.
We capped the iterations. The agent works within a cost envelope. It still queries multiple sources, still builds knowledge iteratively, still gives thorough answers. But it does so within bounds we set deliberately.
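The cost envelope can be sketched as a bounded loop with two circuit breakers: an iteration cap and a token budget. The function names (`pick_next_source`, `query_source`, `enough`) are hypothetical stand-ins for the agent's own reasoning steps, and the limits are illustrative.

```python
from typing import Callable, Optional

def investigate(question: str,
                pick_next_source: Callable,   # agent decides what to query next
                query_source: Callable,       # returns (result, token_cost)
                enough: Callable,             # agent decides if it can answer
                max_iters: int = 6,
                token_budget: int = 50_000):
    """Iterative querying within a deliberate cost envelope."""
    findings: list = []
    spent = 0
    for _ in range(max_iters):                       # circuit breaker 1: iterations
        source: Optional[str] = pick_next_source(question, findings)
        if source is None or enough(findings):
            break                                    # agent decided it has enough
        result, cost = query_source(source, question)
        findings.append(result)
        spent += cost
        if spent >= token_budget:                    # circuit breaker 2: tokens
            break
    return findings, spent

# Illustrative run: every query costs ~20k tokens, so the budget
# stops the loop before the iteration cap does.
findings, spent = investigate(
    "coffee buying opportunities?",
    pick_next_source=lambda q, f: "erp_demand",
    query_source=lambda s, q: (f"data from {s}", 20_000),
    enough=lambda f: False,
)
```

Note that the loop never fails to answer; it answers with whatever it has when a breaker trips, which is exactly the "respond eventually, but within bounds" behaviour described above.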
The important decision here was what we did not do. We did not downgrade to a cheaper model to manage costs. The frontier model gave us materially better answers. When you are asking an agent to synthesise demand predictions, crop forecasts, and currency movements into actionable buying recommendations, model capability matters. The constraint was architectural, not capability-based. Spend on the best model. Architect the discipline around it.
The pattern
Three systems. Three different cost dynamics. Context window bloat in Studio. Infrastructure maintenance loops in our RAG platform. User-facing reasoning chains in the coffee buying expert.
In every case, the fix was the same. Not a cheaper model. Not a bigger context window. Not hoping that token prices would fall further. Architectural decisions that constrain token consumption intentionally.
And here is the counterintuitive payoff. Constraining agents does not just make them cheaper. It makes them better.
An agent that gets the full schema on every pass is not well-informed. It is overwhelmed. The same insight from our context engineering work applies here. The model is only as good as what it sees, and showing it everything is worse than showing it the right thing.
An agent that queries 23 data sources without limits is not being thorough. It is being undisciplined. Bounded reasoning is more focused, more predictable, and more useful than unbounded reasoning. The iteration cap does not make the coffee buying expert worse. It forces the agent to prioritise, which is exactly what you want from an expert.
A verification loop that checks every maintenance decision sounds rigorous. In practice, it compounds cost without proportionally improving outcomes. The discipline is knowing when enough verification is enough.
Constraint is not the opposite of capability
The organisations struggling with AI costs are, overwhelmingly, the ones that built agents like chatbots. Fire a prompt, get a response, worry about the bill later. Agentic systems do not work that way. Every architectural decision is a token decision, and those decisions compound across every user, every session, every background process.
The organisations getting this right are treating token economics as an engineering discipline, not a finance problem. Not “how do we spend less on AI” but “how do we design systems that consume tokens intentionally.” The answer is almost never a cheaper model. It is a more disciplined architecture.
We see this every day at inmydata. The work that makes our agents affordable is the same work that makes them reliable. Constraint is not the opposite of capability. It is what makes capability viable.