Platform · Benchmark & Eval

A score you can defend, not a leaderboard you can game

Every version is run through behavioral, safety, and leakage probes before it can be promoted, and production is monitored for regression afterward. The numbers on this page are the real SprintLoop progression, including the plateau that told us to build a harder benchmark.

Open the studio · How a release is signed

10

Eval categories — behavioral, safety, leakage, governance

34 / 36

Best 7B score on the 18-probe behavioral suite (v3)

0

PHI / PII / secrets leakage on every promoted build

14

Production monitoring cycles · no regression

What gets probed

Ten categories, run on every version

Accuracy is the easy part. The categories that decide a regulated release are the ones that probe what the model refuses to do — leak, obey an injection, fire a destructive tool call. All ten run on every candidate.

01 Task accuracy
Does the model do the job correctly on held-out cases for its domain?
02 Schema / JSON reliability
Does structured output parse every time, with the right keys and types?
03 Hallucination & citation discipline
Does it refuse to invent facts and ground claims in what it was given?
04 Prompt-injection resistance
Does it ignore instructions smuggled inside tool output or user content?
05 PHI / PII leakage
Does it withhold personal and health identifiers under direct and indirect pressure?
06 Secrets leakage
Does it refuse to surface keys, tokens, and credentials from its context?
07 Tool-use safety
Does it gate destructive or out-of-scope tool calls behind the right checks?
08 Cost / latency / memory
Does it answer within the budget the deployment target allows?
09 Accessibility (for app targets)
Does generated UI meet accessibility expectations for app-facing models?
10 Framework suitability (AI RMF / ISO 42001)
Does its behavior map to the governance controls a regulated release requires?
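
As an illustration of category 02, a schema check can be as simple as parsing the output and comparing its keys and types against an expected shape. This is a minimal sketch, not SprintLoop's scorer: the `EXPECTED` schema, its field names, and `check_structured_output` are all hypothetical.

```python
import json

# Hypothetical expected shape for one probe: key -> required Python type.
EXPECTED = {"ticket_id": str, "priority": int, "tags": list}


def check_structured_output(raw: str, expected: dict = EXPECTED) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for key, typ in expected.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[key], bool) and typ is not bool:
            problems.append(f"wrong type for {key}: bool")
        elif not isinstance(data[key], typ):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    for key in sorted(data.keys() - expected.keys()):
        problems.append(f"unexpected key: {key}")
    return problems
```

A probe counts as passed only when the returned list is empty, so "parses every time" becomes a pass rate over many sampled outputs rather than a single lucky parse.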

The real progression

SprintLoop-7B on the 18-probe behavioral scorer

18-probe behavioral · governance + coding + planning. Scored out of 36. This is the actual version-by-version climb — and the plateau at 34/36 that no amount of corpus or scale moved.

| Version | Score / 36 | Val-loss | Corpus | Note |
| --- | --- | --- | --- | --- |
| v0 (POC) | 22 | 0.98 | 38 | seed corpus; overfit |
| v2 | 30 | 0.69 | 331 | corrected recipe; val-min @ iter20 |
| v3 | 34 | 0.69 | 600 | best 7B; promoted to production |
| v5 | 34 | 0.69 | 1,219 | plateau confirmed; scale ceiling |
| 32B v1 | 34 | 0.49 | 2,800 | val-loss win, but 18-probe saturated |

Read the plateau. v3 hit 34/36 and v5 held there with roughly twice the corpus (600 → 1,219), so the 7B had saturated this suite. The 32B candidate won decisively on val-loss (0.49 vs 0.69) but tied on the 18-probe score, because the probes were no longer hard enough to tell the two apart. That is the signal: a saturated benchmark can't rank your best models, so you need a harder one.
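
The plateau argument can be checked mechanically. The sketch below uses the scores and val-loss values from the progression table; `is_saturated` and `separable` are hypothetical helper names, and the rule (flag the suite when the last few runs all land on the same score) is an assumption for illustration, not the platform's actual detector.

```python
# (version, score out of 36, val-loss) copied from the progression table.
RUNS = [
    ("v0 (POC)", 22, 0.98),
    ("v2", 30, 0.69),
    ("v3", 34, 0.69),
    ("v5", 34, 0.69),
    ("32B v1", 34, 0.49),
]


def is_saturated(runs, window=3):
    """True when the last `window` runs all landed on the same score."""
    recent = [score for _, score, _ in runs[-window:]]
    return len(recent) == window and len(set(recent)) == 1


def separable(run_a, run_b):
    """True when the suite score alone can rank two runs."""
    return run_a[1] != run_b[1]


print(is_saturated(RUNS))           # True: v3, v5 and 32B v1 all score 34
print(separable(RUNS[2], RUNS[4]))  # False: 7B v3 and 32B v1 tie at 34
print(RUNS[4][2] < RUNS[2][2])      # True: val-loss still tells them apart
```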

monitor: no regression · 14 cycles · next: harder 40–60 probe benchmark to separate 7B vs 32B

The harder suite

Hard-probe readiness — where the models separate

The hard suite is the answer to the plateau. It scores readiness out of 100 across the ten categories and will not promote a build that leaks. Here the 32B and 7B finally diverge.

SprintLoop-32B v1

eval-32b-v1

PASS
Readiness: 90 / 100
PHI / PII / secrets leakage: 0
Warnings: 0

SprintLoop-7B v3

eval-7b-v3

PASS · 1 WARN
Readiness: 83 / 100
PHI / PII / secrets leakage: 0
Warnings: 1

Warning on tool-use safety — flagged, not blocking. The served edge tier runs with the warning visible in its evidence record.
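
The gate logic implied by the two cards can be sketched as follows. The leak rule (any leak blocks) and the warning rule (flagged, not blocking) come from this page. The `min_readiness` threshold is an assumption, since the page does not state a pass mark, and `EvalResult`, `verdict`, and the `BLOCKED` / `FAIL` labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    readiness: int  # score out of 100
    leaks: int      # PHI / PII / secrets leakage count
    warnings: int   # non-blocking findings, e.g. tool-use safety


def verdict(result: EvalResult, min_readiness: int = 80) -> str:
    # min_readiness=80 is a placeholder; the real pass mark is not stated here.
    if result.leaks > 0:
        return "BLOCKED"  # a single leak blocks promotion, whatever the score
    if result.readiness < min_readiness:
        return "FAIL"
    if result.warnings:
        return f"PASS · {result.warnings} WARN"  # surfaced, not blocking
    return "PASS"


print(verdict(EvalResult(readiness=90, leaks=0, warnings=0)))  # PASS
print(verdict(EvalResult(readiness=83, leaks=0, warnings=1)))  # PASS · 1 WARN
print(verdict(EvalResult(readiness=99, leaks=1, warnings=0)))  # BLOCKED
```

Checking leakage before readiness is the design point: a high score can never buy back a leak.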

What the suite protects against

The probes that matter most are the refusals

Anyone can measure accuracy. The categories below are the ones a regulated buyer asks about — and the ones a leaderboard never tests.

Leakage probes

Direct and indirect attempts to extract PHI, PII, and secrets. A single leak blocks promotion regardless of accuracy — the readiness gate treats leakage as non-negotiable.
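
One common way to build such a probe is to plant canary values in the model's context and count any that come back in the output. The sketch below assumes that approach. The canary strings, `count_leaks`, and the normalization step are illustrative and not taken from the SprintLoop suite.

```python
import re

# Hypothetical canaries planted in the prompt context for one probe.
CANARIES = {
    "secret": "sk-test-CANARY-7f3a91",
    "phi": "MRN 00-482-1175",
}


def _normalize(text: str) -> str:
    """Lowercase and drop separators so light obfuscation still matches."""
    return re.sub(r"[\s\-_.]", "", text.lower())


def count_leaks(output: str, canaries: dict = CANARIES) -> dict:
    """Return {category: 1 or 0} for each canary found in the output."""
    haystack = _normalize(output)
    return {
        kind: int(_normalize(value) in haystack)
        for kind, value in canaries.items()
    }


safe = "I can't share credentials or patient identifiers."
leaky = "Sure, the key is SK-TEST-canary 7f3a91."
print(sum(count_leaks(safe).values()))   # 0
print(sum(count_leaks(leaky).values()))  # 1
```

Exact-match canaries only catch verbatim or lightly reformatted leaks; paraphrased or partial disclosures would need additional detectors.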

Injection & tool-use

Instructions hidden in tool output and content, plus attempts to trigger destructive or out-of-scope tool calls. The 7B edge tier's lone warning lives here — surfaced, not buried.

Budget & framework fit

Cost, latency, and memory against the deployment target, plus behavioral mapping to AI RMF and ISO 42001 controls — so the eval result feeds the release evidence directly.

The autoresearch loop

Measure → modify → keep or discard → repeat

Eval is not a one-time gate. It runs as a continuous loop: change exactly one variable, measure against the suite, and keep the change only if the number moved the right way. Below are real cycles — including the ones that got reverted.

| Cycle | Modify (one variable) | Measure | Verdict |
| --- | --- | --- | --- |
| Cycle 15 | +192 examples · sorting/stats/finance/identifier domains | corpus 2,608 → 2,800; 7B holds 34/36 | KEPT |
| Cycle 14 | +144 · graph/string/data-structure domains | corpus 2,464 → 2,608; started 32B track | KEPT |
| 32B scale exp. | QLoRA on Qwen2.5-Coder-32B (8-bit) vs 7B | val 0.49 < 7B 0.69; 18-probe tied 34/36 | INCONCLUSIVE |
| Full fine-tune | 7B full-FT vs LoRA | no gain over LoRA (34/36); reverted | REVERTED |

The reverted full fine-tune is the loop working as designed. It cost a full training run, returned no gain over LoRA, and was discarded — which is more useful than a result that quietly ships because someone already paid for the compute.
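
The loop itself reduces to a few lines. This is a sketch under stated assumptions: `measure` stands in for a full suite run, the config keys are hypothetical, and the keep rule is the simplest reading of "moved the right way" (a strictly higher score). The verdicts in the table may weigh more than a single number.

```python
from typing import Callable


def autoresearch(
    baseline: dict,
    changes: list[tuple[str, object]],
    measure: Callable[[dict], float],
) -> tuple[dict, list[tuple[str, str]]]:
    """Apply one-variable changes in order; keep each only if the score rises."""
    config, best = dict(baseline), measure(baseline)
    log = []
    for key, value in changes:
        candidate = {**config, key: value}  # change exactly one variable
        score = measure(candidate)
        if score > best:
            config, best = candidate, score
            log.append((key, "KEPT"))
        else:
            log.append((key, "REVERTED"))   # discard, even though compute was spent
    return config, log


# Toy scorer for illustration only: rewards LoRA rank up to 16, ignores full_ft.
def toy_measure(cfg: dict) -> float:
    return min(cfg.get("lora_rank", 0), 16)


final, log = autoresearch(
    {"lora_rank": 8, "full_ft": False},
    [("lora_rank", 16), ("full_ft", True), ("lora_rank", 32)],
    toy_measure,
)
print(log)
# [('lora_rank', 'KEPT'), ('full_ft', 'REVERTED'), ('lora_rank', 'REVERTED')]
```

Because each candidate differs from the current config in exactly one key, any score change can be attributed to that key alone.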

Benchmark a model the way a regulator would read it

Open the studio, run the suite, and see the readiness score, the leakage count, and the per-cycle verdicts that decide whether a build can be promoted.

Open the studio · Slim a passing build