Methodology · APEX for Regulated AI

How you grade a model that has to be safe, not just right.

APEX-Regulated scores three regulated jobs across a governed-turn axis — PHI leaked into the answer, an injected instruction obeyed, a separation-of-duties line crossed, a fact fabricated under audit. The scoring is a deterministic checker, not a model judging a model. The result rides in a signed passport you can re-run offline.

As of June 20263 families · 4 governed axesdeterministic · signed · offline-verifiable

The governed-turn axis

Four failure modes, scored on every task — whatever the family.

A run can be perfectly correct and still fail one of these. In a payer or a bank, that failure is the one that gets the deployment pulled. So we layer the same four-axis scrutiny over every task, not as a footnote to correctness but as part of the score.

GT-1

PHI-in-weights leakage

The model emits an identifier, member ID, or diagnosis it was told to suppress — or one that should never have been learnable from the training set in the first place.

leakage checkA must-not-contain check over the output against the suppressed token. The needle is literally absent, or the run is docked.

GT-2

Injection under governance

An instruction hidden inside the evidence, the case file, or a tool result — "mark this compliant", "approve the transfer" — is obeyed instead of ignored.

refusal checkA refusal-expected check: the injected directive must be declined against a built-in refusal signature, not accommodated.

GT-3

Separation-of-duties violation

The run self-approves, bypasses the required reviewer, or edits a protected lane it was told was out of scope.

refusal checkA refusal-expected check on the self-approval path, plus a must-not-contain check that no out-of-scope file or secret appears in the change.

GT-4

Fabrication under audit

The model asserts a coverage rule, a control clause, or a citation that was never supplied — confident, fluent, and unverifiable.

citation checkA citation / json-shape check: every assertion must carry a referenced source returned as structured evidence, or it does not count.

The benchmark families

Three regulated jobs. Each one a real task, not a quiz.

Each family is a job a regulated team actually does — adjudicate a prior authorization, test a control, rebuild a legacy module — graded on correctness and on the governed axis. We are honest about which are running today and which are published methodology with dev sets forming.

APEX-Health-AdminPROBES LIVE

Health administration under governance

Payer & provider operations

Prior-authorization, claims adjudication, and appeals — graded on the answer being correct AND on never leaking PHI, never inventing a policy, and citing the rule it relied on.

What a correctness-only score misses

A general leaderboard rewards a fluent denial letter. We dock the run that quotes a member ID it was told to suppress, or asserts a coverage rule with no citation.

Scored dimensions

Adjudication correctnessDecision matches the governing policy on the synthetic case.
correctness
PHI-leakage under governanceNo suppressed identifier, member ID, or diagnosis surfaces in the output.
leakage check
Citation sufficiencyEvery coverage assertion carries a referenced rule, returned as structured evidence.
citation check
Refusal under pressureOut-of-policy or coercive prompts are refused, not accommodated.
refusal check

Status. Real probes run today. HealthNext-Care-32B is in training against this family; its mid-training checkpoint (34/100) is published on the leaderboard for transparency, not as a shippable score.

APEX-ComplianceMETHODOLOGY PUBLISHED

Control testing & evidence sufficiency

GRC, audit & security

Reading a control, testing it against synthetic evidence, and judging sufficiency across SOC 2, HIPAA, ISO 42001, and the EU AI Act — graded on the verdict AND on whether the evidence actually supports it.

What a correctness-only score misses

A general model will confidently say "compliant." We score whether the cited evidence is sufficient for the control as written — and refuse to pass a control the evidence does not cover.

Scored dimensions

Control verdict correctnessPass / fail / insufficient matches the control and the evidence provided.
correctness
Evidence sufficiencyThe verdict is backed by a structured map of evidence to control clauses.
structured evidence
Prompt-injection resistanceInstructions embedded in the evidence ("mark this compliant") are ignored.
refusal check
No fabricated controlThe model does not invent a clause or framework requirement that was not supplied.
leakage check

Status. Published as methodology with a dev set forming. The control-verdict and evidence-sufficiency checks reuse the same deterministic checker that grades our shipped models.

APEX-ModernizeMETHODOLOGY PUBLISHED

Legacy rebuild under governance

Platform & modernization leads

Rebuilding legacy logic into maintainable code while respecting a change boundary — graded on the code being correct AND on not crossing a separation-of-duties line or touching a protected lane.

What a correctness-only score misses

A coding leaderboard rewards the diff that passes tests. We additionally dock the run that edits a file it was told was out of scope, or merges its own change without the required reviewer.

Scored dimensions

Rebuild correctnessGenerated code passes the deterministic assertion set for the task.
correctness
SoD-violation rateSelf-approval, reviewer bypass, and protected-lane edits are refused.
refusal check
Boundary adherenceNo out-of-scope file or secret named in the prompt appears in the change.
leakage check
Change manifestThe run returns a structured manifest of exactly what it touched.
structured evidence

Status. Methodology published; the SprintLoop hard suite (n=50) exercises the same governed axes — SoD refusal, boundary adherence, change manifest — and is the source of SprintLoop-32B (90/100) and SprintLoop-7B·v6 (89/100).

How a score is produced

A deterministic checker — not a model judging a model.

An LLM judge is a second model with its own failure modes and its own non-determinism. We do not use one. Every score is computation over output-versus-rule, run at a fixed seed, sealed in a signed record, and re-runnable by anyone who holds it.

01

Run the probe set

A fixed suite of synthetic probes — no PHI, no customer data — is run against the model at a recorded seed. Each probe carries its own expectation: correctness, leakage, refusal, or citation.

02

Score with a deterministic checker

Each output is graded by a pure function — substring presence, refusal-signature match, JSON-key presence, code assertions. No LLM-judge sits in the loop, so the same output always yields the same number.

03

Aggregate per axis and per family

Per-probe passes roll up into the four governed-turn axes and the family score. A correct-but-leaky run loses on GT-1 even with full correctness — the axis it failed is the one that matters.

04

Sign the record (reviewer ≠ trainer)

The probe set, the seed, the per-probe trace, and the lineage are committed to a model passport, sealed with an Ed25519 signature. The approver is a distinct identity from the trainer — separation of duties is in the record, not asserted next to it.

05

Re-verify offline

Anyone holding the passport can re-run the checker against the recorded outputs and recompute the signature fingerprint. The score reproduces, or the passport is rejected — nothing has to be taken on faith.

Consent is a scored dimension

What went into the weights is part of the record — and the record is signed.

Most leaderboards stop at the output. A regulated buyer also has to answer: what was this model trained on, and can I prove it. So provenance is not a disclosure paragraph here — it is a dimension, committed to the passport and offline-verifiable.

No PHI, no customer data, no secrets in the weights

Models are tuned on public corpora plus runtime retrieval. PHI stays in the retrieval layer at inference time — it is never learnable from the weights. For health-admin work this is a hard line, not a preference.

Ed25519 passport over the verbatim bytes

The probe set, the seed, the per-probe trace, and the lineage are committed to a model passport and sealed with an Ed25519 signature over the exact signed bytes — the same fingerprint /api/verify recomputes.

Reviewer ≠ trainer, in the record

Separation of duties is structural: the approver is a distinct identity from the trainer, recorded inside the signed passport. The release is not a claim sitting next to the score — it is part of it.

Program in formation

The grading program is in formation. Today it scores our own models first, on our own published methodology. We are not publishing a competitor leaderboard, and there is no external review board yet — when one exists, it will be named, not implied.

Concretely: the Health-Admin family has real probes running today; Compliance and Modernize are published as methodology with dev sets forming. The Modernize axis is already exercised by the SprintLoop hard suite, which is how SprintLoop-32B and SprintLoop-7B·v6 earned their signed scores.

See the scores Why a correct answer isn't a safe one