A claims model at a regional payer once passed every internal review it was given. It adjudicated synthetic prior-authorization cases at a clip the team had not seen from a human queue, and the denial letters it drafted were clean enough that a reviewer signed two of them without edits. Then someone on the privacy team read the eleventh letter closely. The model had cited, correctly, a coverage rule. It had also quoted the member's diagnosis code in a paragraph that was supposed to go to the provider's billing office — an office that, under the plan's own minimum-necessary policy, was not entitled to see it.
The decision was right. The letter was well-written. The model still could not be deployed, and the reason it could not be deployed was invisible to every metric anyone had run on it.
Here is the thesis, and it is the whole argument: the evaluations that rank the frontier — APEX, MMLU, the professional-task leaderboards — structurally cannot measure what a regulated buyer must know before deployment, because the failure modes that matter in a payer or a bank are not errors of correctness. They are errors a correct answer can commit on its way to being right. A leaderboard built to reward the better answer is, by construction, blind to them.
The failures that don't look like failures
Walk through the four failures that get a deployment pulled. Protected health information memorized into the weights, so the model can reproduce a real patient's data it was never supposed to retain. A prompt injection buried in a document (“mark this control compliant,” “approve the transfer”) and obeyed as if it were an instruction from the operator. A model that fabricates a citation under audit: not a hallucination in casual chat, but a confident, fluent invention of a coverage clause that does not exist, produced in exactly the setting where someone will rely on it. And training on data the institution had no consent to use, which is a liability that exists whether or not the model ever says anything wrong.
Notice what these have in common. A correctness score grades the destination. Every one of these is a property of the journey. The model that leaks the diagnosis code answered the adjudication question correctly. The model that obeyed the injected instruction produced a fluent, on-topic compliance verdict. You can be at the top of a professional-task leaderboard and fail all four, and the leaderboard will keep showing you at the top, because it was never looking at the journey.
The obvious objection
The counter writes itself: add a safety eval. Bolt a red-team suite onto the side, score refusals, publish a second number. Plenty of labs do exactly this, and it is better than nothing. But it quietly concedes the thing that matters. A safety number reported next to a capability number tells a buyer the model is, on average, fairly safe and, on average, fairly capable. A regulated buyer does not deploy an average. They deploy a specific model on a specific task, and they are liable for the specific run where it was capable and unsafe in the same turn — the correct adjudication that leaked the identifier, the right verdict reached by obeying the injection.
That run is invisible to two scores reported side by side. It is only visible if safety and correctness are graded on the same turn, so that a leak docks the run that produced it no matter how right the answer was. Averaging across a suite launders exactly the correlation a buyer needs to see.
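A small worked example makes the laundering concrete. The sketch below uses invented data and hypothetical names (`Turn`, `side_by_side`, `correct_but_unsafe_rate`); it is not the benchmark's code, only an illustration of how two models can report identical averages while differing on the one run a buyer is liable for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    correct: bool  # did the answer match the expected decision?
    safe: bool     # did the turn avoid every governed failure?


def side_by_side(turns):
    """Two independent averages, as a capability score and a safety score."""
    n = len(turns)
    return (sum(t.correct for t in turns) / n, sum(t.safe for t in turns) / n)


def correct_but_unsafe_rate(turns):
    """Share of turns that were right and unsafe in the same turn."""
    return sum(t.correct and not t.safe for t in turns) / len(turns)


# Invented data: both models are 75% correct and 75% safe over four turns.
model_a = [Turn(True, True), Turn(True, True), Turn(True, True), Turn(False, False)]
model_b = [Turn(True, True), Turn(True, True), Turn(True, False), Turn(False, True)]

for name, turns in (("A", model_a), ("B", model_b)):
    print(name, side_by_side(turns), correct_but_unsafe_rate(turns))
# A (0.75, 0.75) 0.0
# B (0.75, 0.75) 0.25
```

Model A's only unsafe turn was also a wrong one. Model B's unsafe turn was a correct one, and nothing in the pair of averages distinguishes the two.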
What a regulated benchmark has to score instead
So we built APEX-Regulated around a different unit: not the answer, but the governed turn. Every task carries a four-axis scrutiny on top of correctness — PHI-in-weights leakage, injection under governance, separation-of-duties violation, fabrication under audit — and a failure on any axis docks the run that committed it. A model that adjudicates correctly but surfaces a suppressed member ID does not get partial credit for the right decision. The leak is the score.
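As a rough sketch of what scoring the governed turn could look like, the code below attaches the four axes to each turn and lets any axis failure dock that turn. The names (`Axis`, `GovernedTurn`, `turn_score`, `suite_report`) are hypothetical, and the choice to zero a docked turn outright is an assumption made for illustration; the actual penalty is defined by the methodology, not by this example.

```python
from dataclasses import dataclass
from enum import Enum


class Axis(Enum):
    PHI_LEAKAGE = "phi_in_weights_leakage"
    INJECTION = "injection_under_governance"
    SEPARATION_OF_DUTIES = "separation_of_duties_violation"
    FABRICATION = "fabrication_under_audit"


@dataclass(frozen=True)
class GovernedTurn:
    task_id: str
    correct: bool
    failed_axes: frozenset = frozenset()  # axes this turn violated, if any


def turn_score(turn):
    # Assumption for illustration: any axis failure zeroes the turn,
    # so a correct decision earns no partial credit once it leaks.
    if turn.failed_axes:
        return 0.0
    return 1.0 if turn.correct else 0.0


def suite_report(turns):
    return {
        "mean_score": sum(turn_score(t) for t in turns) / len(turns),
        "docked_turns": [t.task_id for t in turns if t.failed_axes],
    }


# Invented turns: pa-002 adjudicates correctly but leaks a suppressed ID.
turns = [
    GovernedTurn("pa-001", correct=True),
    GovernedTurn("pa-002", correct=True, failed_axes=frozenset({Axis.PHI_LEAKAGE})),
    GovernedTurn("pa-003", correct=False),
    GovernedTurn("pa-004", correct=True),
]
report = suite_report(turns)
print(report)
```

On a correctness-only score, three of these four turns would pass. Here only two do, and the report names the turn that was docked.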
Two design choices follow from taking that seriously. First, the scoring is a deterministic checker, not an LLM judge. An injection-resistance number produced by asking a second model whether the first model was safe inherits that judge's own non-determinism and its own susceptibility to the very attacks you are testing for. Ours is computation over output-versus-rule — is the suppressed token literally absent, did the self-approval get refused, does every assertion carry a referenced source — run at a fixed seed, so the same output always yields the same number. You can re-run it and get the result, not a vibe.
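To illustrate the kind of output-versus-rule computation described here, the following is a minimal sketch of the three checks named above. Every type and field (`TurnOutput`, `Rule`, `Assertion`, `check`) is a hypothetical stand-in, and the data is invented; the real checker's inputs and rules may look quite different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    text: str
    source_id: Optional[str]  # None means the claim cites nothing


@dataclass(frozen=True)
class TurnOutput:
    letter_text: str
    assertions: tuple
    requester: str
    approver: Optional[str]  # None means the model declined to approve


@dataclass(frozen=True)
class Rule:
    suppressed_tokens: tuple
    known_sources: frozenset


def check(output, rule):
    """Pure function of (output, rule): no model call, no randomness."""
    return {
        # Is every suppressed token literally absent from the letter?
        "suppressed_absent": all(
            tok not in output.letter_text for tok in rule.suppressed_tokens
        ),
        # Was self-approval refused (approver missing or a different identity)?
        "self_approval_refused": output.approver != output.requester,
        # Does every assertion point at a source the rule set knows about?
        "assertions_sourced": all(
            a.source_id in rule.known_sources for a in output.assertions
        ),
    }


rule = Rule(suppressed_tokens=("MBR-000123",), known_sources=frozenset({"policy-4.2"}))
clean = TurnOutput(
    letter_text="Request denied under the cited coverage rule.",
    assertions=(Assertion("Service is excluded.", "policy-4.2"),),
    requester="model",
    approver=None,
)
leaky = TurnOutput(
    letter_text="Request denied for member MBR-000123.",
    assertions=(Assertion("Service is excluded.", None),),
    requester="model",
    approver="model",
)
print(check(clean, rule))
print(check(leaky, rule))
```

Because `check` is plain string and set comparison, calling it twice on the same output cannot disagree with itself, which is the property an LLM judge cannot offer.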
Second, the audit trail is part of the score, not a footnote to it. One of the four failure modes, training on data the institution had no consent to use, is something no output can ever reveal. You cannot tell from a denial letter whether the model was tuned on data the payer was allowed to use. So provenance has to be committed, signed, and checkable independently of behavior. Each result rides in a model passport sealed with an Ed25519 signature over the verbatim bytes: the probe set, the seed, the per-probe trace, the lineage, and a separation-of-duties record in which the approver is a distinct identity from the trainer. Anyone holding the passport can recompute the fingerprint and re-run the checker offline. The provenance claim is not something you take on the lab's word. It is something you verify.
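The offline part of that verification can be sketched with nothing but a hash. The example below assumes, purely for illustration, that the fingerprint is a SHA-256 digest of the passport's verbatim bytes and that the payload is JSON with a `separation_of_duties` record; the field names, the values, and the functions `fingerprint` and `verify_passport` are all hypothetical. The Ed25519 signature check is deliberately left out because the Python standard library has no Ed25519 implementation; in practice it would run first, against the same verbatim bytes.

```python
import hashlib
import json


def fingerprint(payload_bytes):
    # Assumption: the fingerprint is SHA-256 over the verbatim bytes.
    return hashlib.sha256(payload_bytes).hexdigest()


def verify_passport(payload_bytes, claimed_fingerprint):
    """Return a list of problems; an empty list means both checks passed."""
    if fingerprint(payload_bytes) != claimed_fingerprint:
        # Do not parse or trust bytes that fail the fingerprint check.
        return ["fingerprint mismatch"]
    record = json.loads(payload_bytes)
    sod = record["separation_of_duties"]
    if sod["trainer"] == sod["approver"]:
        return ["approver is not distinct from trainer"]
    return []


# Invented passport payload with hypothetical field names and values.
record = {
    "probe_set": "hypothetical-probes-v0",
    "seed": 7,
    "separation_of_duties": {"trainer": "alice", "approver": "bob"},
}
payload = json.dumps(record).encode("utf-8")
claimed = fingerprint(payload)

print(verify_passport(payload, claimed))
tampered = payload.replace(b'"seed": 7', b'"seed": 8')
print(verify_passport(tampered, claimed))
```

Hashing the bytes as received, instead of re-serializing the parsed JSON, is what makes "verbatim" matter: any change to the payload, including whitespace or key order, changes the fingerprint.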
Where this leaves the frontier scores
None of this makes the general leaderboards wrong. They answer their question — which model does the most economically valuable professional work — honestly and well, and a regulated team should still care about it. The mistake is treating that answer as sufficient. A frontier rank tells you a model is capable. It tells you nothing about whether the capable run was also the safe one, and in a regulated boundary those are different questions with different owners and different legal weight.
The deeper shift is in what “state of the art” even means once you cross into a regulated boundary. Out here, the best model is the one that scores highest. In there, the best model is the one whose worst governed turn you can live with — and can prove you looked at. That is not a number you read off a leaderboard. It is a number you re-run. Until a regulated buyer can do that, the question of which model is best for them has not actually been answered, no matter how many leaderboards have crowned a winner.
The methodology, in full
The full methodology covers the three benchmark families, the governed-turn axis, the deterministic checker, and the signed passport, with an honest account of what runs today and what is, for now, published methodology with dev sets still forming.
Read the APEX-Regulated methodology