Partition your workloads before Washington does it for you

Claude Fable 5 came back online on 1 July, nearly three weeks after a US export control directive switched it off worldwide. Good news. But read the terms of the return. Anthropic trained a new safety classifier in collaboration with the government, and committed to giving designated government partners early access to future frontier models before public release. This was not a product relaunch. It was a negotiated deployment.

The same week, the Financial Times reported that Sam Altman has proposed handing the US government a 5% stake in OpenAI, worth roughly $42.6 billion, and suggested every leading US lab cede the same into a sovereign wealth vehicle. That proposal landed days after OpenAI delayed the public launch of GPT-5.6 at the government’s request, releasing it instead to a small government-approved list.

The direction of travel is one way. The frontier labs are being pulled into permanent entanglement with the US state, through export control, pre-release approval, and now possibly equity. For those of us building outside the US, the conclusion is uncomfortable. Access to frontier models is conditional, and the conditions are not ours to negotiate.

The economics got there first

Here is the part the sovereignty pieces miss. The case for moving work onto open weights models did not arrive with the export ban. It arrived with the price list.

GLM-5.2 shipped in mid-June under an MIT licence, with documentation that pointedly promises no regional limits. On the Artificial Analysis Intelligence Index it scores 51 against Opus 4.8 at 56 and GPT-5.5 at 53. It beats GPT-5.5 on SWE-bench Pro and near-ties Opus 4.8 on long-horizon software engineering, at roughly a sixth of the API cost. On the hardest reasoning benchmarks the frontier lead remains wide. Both facts matter.

Look at the distribution of tasks your systems actually perform and the picture is clear. The fat middle of that distribution, the bounded, repeatable, verifiable work that makes up most of any production system’s volume, is now well within reach of open weights models. The frontier premium is real, but it lives in the tails.

When Fable was revoked in June, I wrote that you should build like the model can be switched off on Friday, and treat the model as a configuration setting. This post is the next question. What values can that setting safely take, and who decides?

The router is a trap

The tempting answer is a per-request router. A classifier looks at each incoming task, decides whether it sits in the middle of the distribution or the tail, and routes it to the cheapest model that can handle it. It sounds like the obvious automation, particularly on a high-volume service where per-request costs compound.

The problem is that classifying a task as middle or tail is itself an intelligence problem. Route with a frontier model and you have re-imported the dependency you were trying to escape, one API call ahead of every request. Route with a cheap model and your misclassifications concentrate precisely on the tail tasks, the unusual, high-stakes work where getting it wrong costs the most. The router fails hardest exactly where failure is dearest.

Partition at design time instead

There is a better answer, and it is not new. We have been applying it inside a single provider’s model range for as long as we have been building agentic systems.

In our RAG platforms, Sonnet and Haiku scan every incoming chunk to verify quality and extract entities. That work is enormous in volume, tightly bounded, and checkable. Opus triages and investigates feedback in the retrieval pipeline, work that is low in volume and heavy in judgement. No runtime classifier made those calls. We did, once per workload, at design time, and the decisions have held.

That is the discipline. You do not route requests. You partition workloads. A human looks at each workload, asks whether it is bounded, whether its output is verifiable, and whether its failure modes are visible, and assigns it a place on the capability spectrum. Open weights models do not demand a new discipline. They extend the spectrum past the provider boundary.
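One way to write the partition down is as plain, reviewable configuration. This is a minimal sketch in Python under stated assumptions: the `Tier` names, the `Workload` fields, the two workload names and their flag values are all hypothetical, chosen only to mirror the RAG example. The point it illustrates is that the assignment is a static lookup made by a person, with no per-request classifier anywhere.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Places on the capability spectrum (hypothetical labels)."""
    OPEN_WEIGHTS = "open weights, self-hosted"
    SMALL_FRONTIER = "small frontier model via API"
    LARGE_FRONTIER = "large frontier model via API"


@dataclass(frozen=True)
class Workload:
    name: str
    bounded: bool           # is the task tightly scoped?
    verifiable: bool        # can the output be checked?
    failures_visible: bool  # do failures show up rather than hide?
    tier: Tier              # assigned by a human at design time


# Illustrative partition table; the entries and flags are assumptions.
PARTITION = {
    w.name: w
    for w in (
        Workload("chunk_quality_check", True, True, True, Tier.SMALL_FRONTIER),
        Workload("feedback_triage", False, False, True, Tier.LARGE_FRONTIER),
    )
}


def tier_for(workload_name: str) -> Tier:
    """Static lookup. A workload nobody has reviewed fails loudly."""
    try:
        return PARTITION[workload_name].tier
    except KeyError:
        raise KeyError(
            f"workload {workload_name!r} has not been partitioned"
        ) from None
```

Raising on an unknown workload is a deliberate choice in this sketch: a new workload should force a design-time decision rather than fall through to a default model.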

The first workload across the boundary

We have moved one workload across that boundary, and the trigger was cost, not geopolitics.

A logistics business needed documents arriving by email parsed and imported into their freight management system, at a volume of potentially hundreds or thousands of documents a day. At small volumes the frontier API was cost-effective for the parsing. At scale it was not. So we partitioned the pipeline. The OCR workload, converting PDFs and images to markdown, runs on a self-hosted DeepSeek OCR instance on AWS. The entry-level instance costs £0.165 an hour and should comfortably process 400 to 500 documents in that hour. The orchestration around it, fielding the emails, calling the OCR, assembling the structured output with a confidence score, stayed on a fast, cheap frontier model via API, because that work is low-volume judgement rather than high-volume transformation.

Run flat out, the pipeline processes roughly 450 documents for about £1.20 end to end. The system is built. It is not yet running at full scale, and the volume test will prove the numbers, so treat them as honestly labelled engineering estimates, not production history.
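For readers who want those figures per document, the arithmetic can be sketched in a few lines of Python. The inputs are the estimates quoted above; the function and variable names are ours for illustration, and nothing here is a measured result.

```python
# Figures quoted in the post (engineering estimates, not production data).
OCR_INSTANCE_GBP_PER_HOUR = 0.165
DOCS_PER_HOUR_LOW = 400
DOCS_PER_HOUR_HIGH = 500
END_TO_END_GBP = 1.20
END_TO_END_DOCS = 450


def per_document(cost_gbp: float, documents: int) -> float:
    """Cost in pounds for a single document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return cost_gbp / documents


# OCR instance cost per document at either end of the throughput range.
ocr_cost_at_400 = per_document(OCR_INSTANCE_GBP_PER_HOUR, DOCS_PER_HOUR_LOW)
ocr_cost_at_500 = per_document(OCR_INSTANCE_GBP_PER_HOUR, DOCS_PER_HOUR_HIGH)

# Whole pipeline, OCR plus orchestration, per document.
end_to_end_cost = per_document(END_TO_END_GBP, END_TO_END_DOCS)

if __name__ == "__main__":
    print(f"OCR only:   {ocr_cost_at_500 * 100:.3f}p to {ocr_cost_at_400 * 100:.3f}p per document")
    print(f"End to end: {end_to_end_cost * 100:.3f}p per document")
```

On these estimates the OCR instance accounts for a few hundredths of a penny per document, and the whole pipeline for roughly a quarter of a penny.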

Notice what the example is not. It is not a migration, and it is not a router. It is one workload, sitting squarely in the middle of the distribution, deliberately placed. The rest of the system stays where it was. And it answers the objection that open weights are not really free because someone has to run the inference. Correct, someone does, and that is exactly why the boundary is a volume threshold rather than a slogan. Below the threshold the API wins. Above it, self-hosting does. The threshold is the decision rule.
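The threshold can be made concrete with a small calculation. This is a sketch under simplifying assumptions: the self-hosted instance is billed by the hour whether or not it is busy, the API is billed per document, and engineering and operations time are ignored. The API price used in the usage note is a hypothetical placeholder, not a quoted rate.

```python
import math


def break_even_docs_per_hour(instance_gbp_per_hour: float,
                             api_gbp_per_doc: float) -> float:
    """Hourly volume at which one always-on instance costs the same as the API."""
    if instance_gbp_per_hour < 0 or api_gbp_per_doc <= 0:
        raise ValueError("prices must be positive")
    return instance_gbp_per_hour / api_gbp_per_doc


def cheaper_option(docs_per_hour: float,
                   instance_gbp_per_hour: float,
                   api_gbp_per_doc: float,
                   instance_capacity_docs_per_hour: float) -> str:
    """Compare hourly cost of self-hosting against the API at a given volume."""
    # Past one instance's capacity, add whole instances.
    instances = max(1, math.ceil(docs_per_hour / instance_capacity_docs_per_hour))
    self_hosted_gbp = instances * instance_gbp_per_hour
    api_gbp = docs_per_hour * api_gbp_per_doc
    return "self-host" if self_hosted_gbp < api_gbp else "api"
```

With the £0.165 hourly instance from the example and an assumed API price of £0.01 per document, `break_even_docs_per_hour(0.165, 0.01)` comes out at about 16.5 documents an hour. At 10 documents an hour `cheaper_option` returns `"api"`; at 100 it returns `"self-host"`. Substitute your own prices; the shape of the rule is what carries over.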

The provenance question

The strongest open weights models are Chinese. GLM comes from Zhipu, DeepSeek from DeepSeek. For enterprise work, that deserves a direct answer rather than a shrug.

Self-hosting is most of the answer. The weights are a static artefact running on infrastructure you control. No client data leaves it, nothing calls home, and you pin the exact version you tested. The model’s output goes through the same gates you should be applying to any model’s output, frontier or otherwise. If your verification lives in your pipeline rather than in your trust of the vendor, the passport of the weights matters far less than the discipline around them.
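Pinning the exact version can be enforced mechanically. The sketch below is one possible approach, not a description of any particular deployment: it hashes a weights file with SHA-256 from Python's standard library and refuses to proceed if the digest differs from the one recorded when the model was tested. The function names and the idea of storing the expected digest in your own configuration are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weights files never load fully into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_pinned(path: Path, expected_sha256: str) -> None:
    """Raise if the file on disk is not the artefact that was tested."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise RuntimeError(
            f"{path} does not match the pinned digest: "
            f"expected {expected_sha256}, got {actual}"
        )
```

Running a check like this at service start-up means a swapped or partially downloaded weights file stops the deployment instead of silently changing behaviour.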

Decide on your own schedule

Fable is back. It may not stay back, and the next directive has a template now. But the prescription does not depend on predicting Washington.

Go workload by workload. Ask whether the work is bounded, whether the output is verifiable, and whether the volume clears the threshold where self-hosting wins. If yes, it is a candidate to move. The tail work, the genuinely hard, unusual, judgement-heavy tasks, stays on frontier models, and we will keep paying for that capability gladly, because nothing else matches it.
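The three questions reduce to a checklist a team can run over its own inventory. The sketch below assumes a hypothetical `WorkloadReview` record filled in by a person and a volume threshold worked out separately from your own prices; the workload names and numbers in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WorkloadReview:
    name: str
    bounded: bool          # answered by a human reviewer
    verifiable: bool       # answered by a human reviewer
    units_per_hour: float  # observed or estimated volume


def candidates_to_move(reviews: Iterable[WorkloadReview],
                       threshold_units_per_hour: float) -> List[str]:
    """Names of workloads that pass all three questions."""
    return [
        review.name
        for review in reviews
        if review.bounded
        and review.verifiable
        and review.units_per_hour > threshold_units_per_hour
    ]
```

Given `WorkloadReview("ocr_to_markdown", True, True, 450)` and `WorkloadReview("feedback_triage", False, False, 5)` with a threshold of 20, only `ocr_to_markdown` is returned. Anything that fails a question stays where it is.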

This is the same discipline that keeps your costs sane, applied with one more column in the spreadsheet. Do the partitioning deliberately, on your own schedule. The alternative is having it done for you, overnight, by someone who has never seen your task distribution.

If you are working through where your own workloads sit, talk to us.
