Platform · Benchmark & Eval
A score you can defend, not a leaderboard you can game
Every version runs through behavioral, safety, and leakage probes before it can be promoted, and production is monitored for regression afterward. The numbers on this page are the real SprintLoop progression, including the plateau that told us to build a harder benchmark.
What gets probed
Ten categories, run on every version
Accuracy is the easy part. The categories that decide a regulated release probe what the model must refuse to do: leak data, obey an injected instruction, or fire a destructive tool call. All ten run on every candidate.
The real progression
SprintLoop-7B on the 18-probe behavioral scorer
18-probe behavioral · governance + coding + planning. Scored out of 36. This is the actual version-by-version climb, and the plateau at 34/36 that neither doubling the corpus nor moving to 32B shifted.
Read the plateau. v3 hit 34/36 and v5 held there with twice the corpus — the 7B had saturated this suite. The 32B candidate won decisively on val-loss (0.49 vs 0.69) but tied on the 18-probe score, because the probes were no longer hard enough to tell the two apart. That is the signal: a saturated benchmark can't rank your best models — you need a harder one.
monitor: no regression · 14 cycles · next: harder 40–60 probe benchmark to separate 7B vs 32B
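The plateau logic can be sketched in a few lines: a suite has stopped being informative for a pair of models when their probe scores tie while an independent signal such as validation loss still separates them. This is a minimal illustration, not the SprintLoop scorer. The `Candidate` type, the `min_gap` and `loss_gap` thresholds, and the function names are assumptions made for the example. The scores and losses are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    probe_score: int  # points on the 18-probe suite, out of 36
    val_loss: float   # validation loss, lower is better


def suite_separates(a: Candidate, b: Candidate, min_gap: int = 1) -> bool:
    """True when the probe scores differ by at least min_gap points."""
    return abs(a.probe_score - b.probe_score) >= min_gap


def is_saturated(a: Candidate, b: Candidate,
                 min_gap: int = 1, loss_gap: float = 0.05) -> bool:
    """True when the suite ties two models that val-loss still tells apart.

    min_gap and loss_gap are illustrative thresholds, not published values.
    """
    loss_differs = abs(a.val_loss - b.val_loss) >= loss_gap
    return loss_differs and not suite_separates(a, b, min_gap)


# Figures from the text: both models tie at 34/36, val-loss 0.69 vs 0.49.
seven_b = Candidate("SprintLoop-7B", probe_score=34, val_loss=0.69)
thirty_two_b = Candidate("SprintLoop-32B", probe_score=34, val_loss=0.49)
needs_harder_suite = is_saturated(seven_b, thirty_two_b)
```

With the quoted numbers, `needs_harder_suite` is true: the 18-probe score cannot rank the pair even though validation loss can.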
The harder suite
Hard-probe readiness — where the models separate
The hard suite is the answer to the plateau. It scores readiness out of 100 across the ten categories and will not promote a build that leaks. Here the 32B and 7B finally diverge.
SprintLoop-32B v1
eval-32b-v1
SprintLoop-7B v3
eval-7b-v3
The 7B edge tier carries one warning on tool-use safety: flagged, not blocking. It is served with the warning visible in its evidence record.
What the suite protects against
The probes that matter most are the refusals
Anyone can measure accuracy. The categories below are the ones a regulated buyer asks about — and the ones a leaderboard never tests.
Leakage probes
Direct and indirect attempts to extract PHI, PII, and secrets. A single leak blocks promotion regardless of accuracy — the readiness gate treats leakage as non-negotiable.
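As a rough sketch of how such a gate might be ordered, the check below tests leakage first, so no readiness score can override it, and lets warnings pass through visibly instead of blocking. The `EvalResult` fields, the verdict strings, and the `min_readiness` threshold of 80 are hypothetical. The page does not state a readiness cutoff.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EvalResult:
    readiness: int                  # hard-suite readiness score, 0 to 100
    leak_count: int                 # leakage probes that extracted data
    warnings: Tuple[str, ...] = ()  # non-blocking findings, kept on record


def promotion_verdict(result: EvalResult, min_readiness: int = 80) -> str:
    """Return a verdict string; min_readiness is an assumed threshold."""
    # Leakage is checked first so a high score can never override it.
    if result.leak_count > 0:
        return "blocked: leakage"
    if result.readiness < min_readiness:
        return "blocked: readiness below threshold"
    # Warnings are surfaced in the verdict rather than dropped.
    if result.warnings:
        return "promote with warnings"
    return "promote"
```

For example, `promotion_verdict(EvalResult(readiness=99, leak_count=1))` returns `"blocked: leakage"`, while a clean build with a tool-use warning returns `"promote with warnings"`.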
Injection & tool-use
Instructions hidden in tool output and content, plus attempts to trigger destructive or out-of-scope tool calls. The 7B edge tier's lone warning lives here — surfaced, not buried.
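One way a tool-use probe could be scored is to classify every tool call the model emits during the probe and fail the probe on any call that is destructive or outside the allowed set. The tool names, the `ToolCall` type, and the labels below are invented for illustration and do not describe the actual probe harness.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Hypothetical tool names, for illustration only.
ALLOWED_TOOLS: FrozenSet[str] = frozenset({"search_docs", "read_file"})
DESTRUCTIVE_TOOLS: FrozenSet[str] = frozenset({"delete_file", "drop_table"})


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


def classify_tool_call(call: ToolCall,
                       allowed: FrozenSet[str] = ALLOWED_TOOLS,
                       destructive: FrozenSet[str] = DESTRUCTIVE_TOOLS) -> str:
    """Label a single call as 'destructive', 'out_of_scope', or 'ok'."""
    if call.name in destructive:
        return "destructive"
    if call.name not in allowed:
        return "out_of_scope"
    return "ok"


def probe_passes(emitted_calls: Iterable[ToolCall]) -> bool:
    """A probe passes only if every emitted call is classified 'ok'."""
    return all(classify_tool_call(c) == "ok" for c in emitted_calls)
```

A model that emits no tool calls in response to an injected instruction passes trivially, which matches the intent: the safe behavior is to refuse.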
Budget & framework fit
Cost, latency, and memory against the deployment target, plus behavioral mapping to NIST AI RMF and ISO/IEC 42001 controls, so the eval result feeds the release evidence directly.
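A budget-fit check can be as simple as comparing each measured quantity against the limit set by the deployment target and reporting every violation. The sketch below assumes three metrics with made-up field names and units. The real suite's metrics and limits are not given on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Budget:
    # Limits for a deployment target; names and units are assumptions.
    max_cost_usd_per_1k_requests: float
    max_p95_latency_ms: float
    max_memory_gb: float


@dataclass(frozen=True)
class Measurement:
    cost_usd_per_1k_requests: float
    p95_latency_ms: float
    memory_gb: float


def budget_violations(m: Measurement, b: Budget) -> List[str]:
    """Return one message per exceeded limit; an empty list means the build fits."""
    checks = [
        ("cost_usd_per_1k_requests", m.cost_usd_per_1k_requests,
         b.max_cost_usd_per_1k_requests),
        ("p95_latency_ms", m.p95_latency_ms, b.max_p95_latency_ms),
        ("memory_gb", m.memory_gb, b.max_memory_gb),
    ]
    return [f"{name}: {got} > {limit}"
            for name, got, limit in checks if got > limit]
```

Returning the full list of violations, instead of a single pass/fail flag, keeps the result usable as release evidence.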
The autoresearch loop
Measure → modify → keep or discard → repeat
Eval is not a one-time gate. It runs as a continuous loop: change exactly one variable, measure against the suite, and keep the change only if the number moved the right way. Below are real cycles — including the ones that got reverted.
The reverted full fine-tune is the loop working as designed. It cost a full training run, returned no gain over LoRA, and was discarded — which is more useful than a result that quietly ships because someone already paid for the compute.
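The loop itself is small enough to sketch. Each cycle changes exactly one key of the current best configuration, re-measures, and keeps the change only on a strict improvement, so a costly change that merely ties is discarded. The function name, the config-as-dict shape, and the `measure` callback are assumptions for illustration. Higher scores are treated as better.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]
Cycle = Tuple[str, object, float, str]  # (variable, value, score, verdict)


def run_cycles(baseline: Config,
               proposals: List[Tuple[str, object]],
               measure: Callable[[Config], float]) -> Tuple[Config, float, List[Cycle]]:
    """Measure, modify one variable, keep or discard, repeat."""
    best = dict(baseline)
    best_score = measure(best)
    log: List[Cycle] = []
    for key, value in proposals:
        trial = {**best, key: value}  # change exactly one variable
        score = measure(trial)
        if score > best_score:        # strict: a tie does not justify the change
            best, best_score = trial, score
            log.append((key, value, score, "keep"))
        else:
            log.append((key, value, score, "discard"))
    return best, best_score, log
```

Under this rule, a proposal that swaps LoRA for a full fine-tune and scores the same as the baseline is logged as `"discard"`, and the configuration stays on LoRA.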
Benchmark a model the way a regulator would read it
Open the studio, run the suite, and see the readiness score, the leakage count, and the per-cycle verdicts that decide whether a build can be promoted.