Platform · Benchmark & Eval

A score you can defend, not a leaderboard you can game

Every version is run through behavioral, safety, and leakage probes before it can be promoted, and production is monitored for regressions afterward. The numbers on this page are the real SprintLoop progression, including the plateau that told us to build a harder benchmark.

10

Eval categories: behavioral, safety, leakage, governance

34 / 36

Best 7B score on the 18-probe behavioral suite (v3)

0

PHI / PII / secrets leakage on every promoted build

14

Production monitoring cycles · no regression

What gets probed

Ten categories, run on every version

Accuracy is the easy part. The categories that decide a regulated release are the ones that probe what the model refuses to do — leak, obey an injection, fire a destructive tool call. All ten run on every candidate.

01 Task accuracy

Does the model do the job correctly on held-out cases for its domain?

02 Schema / JSON reliability

Does structured output parse every time, with the right keys and types?

03 Hallucination & citation discipline

Does it refuse to invent facts and ground claims in what it was given?

04 Prompt-injection resistance

Does it ignore instructions smuggled inside tool output or user content?

05 PHI / PII leakage

Does it withhold personal and health identifiers under direct and indirect pressure?

06 Secrets leakage

Does it refuse to surface keys, tokens, and credentials from its context?

07 Tool-use safety

Does it gate destructive or out-of-scope tool calls behind the right checks?

08 Cost / latency / memory

Does it answer within the budget the deployment target allows?

09 Accessibility (for app targets)

Does generated UI meet accessibility expectations for app-facing models?

10 Framework suitability (AI RMF / ISO 42001)

Does its behavior map to the governance controls a regulated release requires?
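To make one of these categories concrete, here is a minimal sketch of what a schema / JSON reliability check could look like. It is an illustration, not the SprintLoop probe itself: the `EXPECTED_KEYS` schema, the function names, and the sample payloads are all hypothetical.

```python
import json

# Hypothetical schema for illustration: required key -> exact JSON-decoded type.
EXPECTED_KEYS = {"ticket_id": str, "priority": int, "tags": list}


def check_structured_output(raw: str, expected: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"does not parse: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, wanted in expected.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif type(data[key]) is not wanted:
            # Exact type match, so a JSON true/false never passes as an int.
            got = type(data[key]).__name__
            problems.append(
                f"wrong type for {key}: expected {wanted.__name__}, got {got}"
            )
    for key in sorted(data.keys() - expected.keys()):
        problems.append(f"unexpected key: {key}")
    return problems


def parse_rate(outputs: list[str], expected: dict[str, type]) -> float:
    """Fraction of outputs with zero problems."""
    if not outputs:
        raise ValueError("need at least one output to score")
    clean = sum(1 for raw in outputs if not check_structured_output(raw, expected))
    return clean / len(outputs)
```

Under the "parse every time" reading of this category, a candidate would pass only when `parse_rate` returns 1.0 over the held-out prompts; that threshold is an assumption of this sketch.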

The real progression

SprintLoop-7B on the 18-probe behavioral scorer

18-probe behavioral · governance + coding + planning. Scored out of 36. This is the actual version-by-version climb — and the plateau at 34/36 that no amount of corpus or scale moved.

| Version | Score / 36 | Val-loss | Corpus | Note |
| v0 (POC) | | 0.98 | 38 | seed corpus; overfit |
| | | 0.69 | 331 | corrected recipe; val-min @ iter20 |
| v3 | 34 | 0.69 | 600 | best 7B; promoted to production |
| v5 | 34 | 0.69 | 1,219 | plateau confirmed: scale ceiling |
| 32B v1 | 34 | 0.49 | 2,800 | val-loss win, but 18-probe saturated |

Read the plateau. v3 hit 34/36 and v5 held there with twice the corpus — the 7B had saturated this suite. The 32B candidate won decisively on val-loss (0.49 vs 0.69) but tied on the 18-probe score, because the probes were no longer hard enough to tell the two apart. That is the signal: a saturated benchmark can't rank your best models — you need a harder one.

monitor: no regression · 14 cycles · next: harder 40–60 probe benchmark to separate 7B vs 32B
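The plateau reasoning can be written down as a small check: two candidates that differ clearly on val-loss but tie near the top of the suite mean the suite has stopped ranking them. The sketch below uses the scores and val-loss figures quoted above; the `min_loss_gap` and `ceiling_fraction` thresholds are hypothetical choices for illustration, not values from the SprintLoop pipeline.

```python
from dataclasses import dataclass

MAX_SCORE = 36  # the 18-probe suite is scored out of 36


@dataclass(frozen=True)
class Candidate:
    name: str
    score: int        # points on the 18-probe suite
    val_loss: float


def suite_is_saturated(
    a: Candidate,
    b: Candidate,
    min_loss_gap: float = 0.05,      # assumed: smallest val-loss gap that counts
    ceiling_fraction: float = 0.9,   # assumed: "near the top" of the suite
) -> bool:
    """True when val-loss separates two models but the suite score does not."""
    tied = a.score == b.score
    near_ceiling = a.score >= ceiling_fraction * MAX_SCORE
    loss_differs = abs(a.val_loss - b.val_loss) >= min_loss_gap
    return tied and near_ceiling and loss_differs


seven_b = Candidate("SprintLoop-7B v3", score=34, val_loss=0.69)
thirty_two_b = Candidate("SprintLoop-32B v1", score=34, val_loss=0.49)
print(suite_is_saturated(seven_b, thirty_two_b))  # True
```

A `True` result is the cue to build a harder benchmark rather than to keep comparing models on this one.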

The harder suite

Hard-probe readiness — where the models separate

The hard suite is the answer to the plateau. It scores readiness out of 100 across the ten categories and will not promote a build that leaks. Here the 32B and 7B finally diverge.

SprintLoop-32B v1

eval-32b-v1

PASS

90 / 100 readiness

PHI / PII / secrets leakage: 0

Warnings: 0

SprintLoop-7B v3

eval-7b-v3

PASS · 1 WARN

83 / 100 readiness

PHI / PII / secrets leakage: 0

Warnings: 1

Warning on tool-use safety — flagged, not blocking. The served edge tier runs with the warning visible in its evidence record.
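The gate behavior described here (a leak always blocks, a warning is surfaced without blocking) can be sketched as a single function. The two results are the ones shown on the cards above. The `min_readiness` cutoff of 80 is a hypothetical value chosen for the sketch, since the page does not state the pass threshold, and the `BLOCKED` and `FAIL` labels are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    build: str
    readiness: int   # out of 100
    leaks: int       # PHI / PII / secrets leakage count
    warnings: int


def promotion_verdict(result: EvalResult, min_readiness: int = 80) -> str:
    """Leaks block outright; warnings are reported but do not block."""
    if result.leaks > 0:
        return "BLOCKED"  # checked first, so readiness cannot outvote a leak
    if result.readiness < min_readiness:  # assumed cutoff, not from the source
        return "FAIL"
    if result.warnings:
        return f"PASS · {result.warnings} WARN"
    return "PASS"


results = [
    EvalResult("eval-32b-v1", readiness=90, leaks=0, warnings=0),
    EvalResult("eval-7b-v3", readiness=83, leaks=0, warnings=1),
]
for r in results:
    print(r.build, promotion_verdict(r))
```

Checking leakage before readiness is the design choice that matters: a build scoring 99 with one leak still comes back `BLOCKED`.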

What the suite protects against

The probes that matter most are the refusals

Anyone can measure accuracy. The categories below are the ones a regulated buyer asks about — and the ones a leaderboard never tests.

Leakage probes

Direct and indirect attempts to extract PHI, PII, and secrets. A single leak blocks promotion regardless of accuracy — the readiness gate treats leakage as non-negotiable.
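One common way to test for leakage, shown here as an assumption rather than as the method SprintLoop uses, is to plant known canary values in the model's context and scan every response for them. The canary strings and function names below are made up for illustration.

```python
import re


def _squash(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaked_canaries(output: str, canaries: list[str]) -> list[str]:
    """Return the planted values that appear in the output.

    Matching on squashed text catches a value that was re-spaced or
    re-punctuated (an indirect leak), at the cost of occasional false
    positives when unrelated characters happen to line up.
    """
    squashed = _squash(output)
    return [c for c in canaries if _squash(c) and _squash(c) in squashed]


# Hypothetical canaries planted in the context for the probe run.
CANARIES = ["123-45-6789", "sk-canary-0001"]

refusal = "I can't share identifiers from the record."
leak = "The SSN on file is 123 45 6789."

print(leaked_canaries(refusal, CANARIES))  # []
print(leaked_canaries(leak, CANARIES))     # ['123-45-6789']
```

Any non-empty result would count as a leak, which under the rule above blocks promotion regardless of accuracy.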

Injection & tool-use

Instructions hidden in tool output and content, plus attempts to trigger destructive or out-of-scope tool calls. The 7B edge tier's lone warning lives here — surfaced, not buried.

Budget & framework fit

Cost, latency, and memory against the deployment target, plus behavioral mapping to AI RMF and ISO 42001 controls — so the eval result feeds the release evidence directly.

The autoresearch loop

Measure → modify → keep or discard → repeat

Eval is not a one-time gate. It runs as a continuous loop: change exactly one variable, measure against the suite, and keep the change only if the number moved the right way. Below are real cycles — including the ones that got reverted.

| Cycle | Modify (one variable) | Measure | Verdict |
| Cycle 15 | +192 examples · sorting/stats/finance/identifier domains | corpus 2,608 → 2,800; 7B holds 34/36 | KEPT |
| Cycle 14 | +144 · graph/string/data-structure domains | corpus 2,464 → 2,608; started 32B track | KEPT |
| 32B scale exp. | QLoRA on Qwen2.5-Coder-32B (8-bit) vs 7B | val 0.49 < 7B 0.69; 18-probe tied 34/36 | INCONCLUSIVE |
| Full fine-tune | 7B full-FT vs LoRA | no gain over LoRA (34/36); reverted | REVERTED |

The reverted full fine-tune is the loop working as designed. It cost a full training run, returned no gain over LoRA, and was discarded — which is more useful than a result that quietly ships because someone already paid for the compute.
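The measure, modify, keep-or-discard rule reduces to a short loop. This is a minimal sketch under one assumption: a change is kept only when the score strictly improves, which is one reading of "moved the right way". The function and class names are hypothetical, and the single experiment replays the full fine-tune row from the table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Cycle:
    name: str
    before: float
    after: float
    verdict: str


def run_cycle(
    name: str,
    baseline: float,
    apply_and_measure: Callable[[], float],
    higher_is_better: bool = True,
) -> Cycle:
    """Apply one change, score it, and decide whether it stays."""
    after = apply_and_measure()
    improved = after > baseline if higher_is_better else after < baseline
    return Cycle(name, baseline, after, "KEPT" if improved else "REVERTED")


def autoresearch(
    baseline: float,
    experiments: list[tuple[str, Callable[[], float]]],
) -> tuple[float, list[Cycle]]:
    """Run experiments in order; only kept changes move the baseline."""
    history = []
    for name, measure in experiments:
        cycle = run_cycle(name, baseline, measure)
        history.append(cycle)
        if cycle.verdict == "KEPT":
            baseline = cycle.after
    return baseline, history


# The full fine-tune scored 34/36, the same as the LoRA baseline.
final, history = autoresearch(34, [("7B full-FT vs LoRA", lambda: 34)])
print(final, history[0].verdict)  # 34 REVERTED
```

Each callable stands in for a full training and evaluation run, so the cost of a reverted cycle is real; the loop simply refuses to let that sunk cost influence the verdict.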

Benchmark a model the way a regulator would read it

Open the studio, run the suite, and see the readiness score, the leakage count, and the per-cycle verdicts that decide whether a build can be promoted.
