Platform · Slim & Quantize

A 14 GB model and a 4 GB model that behaves the same

Quantization is where a trained model becomes deployable — on a laptop, a CPU, an edge box. LeanLogix compresses to GGUF, AWQ, GPTQ, or FP8 and ships a quality-retention report with every artifact, so the lean build is a measured tradeoff, not a guess.

Open the studio · See the eval suite

~4 GB

Projected SprintLoop-7B Edge footprint at Q4_K_M (representative)

~97%

Representative Q4_K_M retention target — report attaches per build

4

Live slim formats — GGUF, AWQ, GPTQ, FP8

0

Customer data in any derived build

Why slim at all

The cheapest model is the one you can actually run

A 32B adapter scores well, but it needs a GPU to serve. Most production paths — edge devices, air-gapped sites, cost-sensitive batch — want a 7B that fits in 4 GB and answers in milliseconds on commodity hardware. Slimming is how you get there without throwing away the behavior you trained.

Run where the data is

On-device and CPU inference keep regulated data inside the boundary. A GGUF Q4_K_M build runs on a clinician's laptop or an on-prem box — no model weights leaving the network, no per-token cloud bill.

Cut cost and latency

4-bit weight-only formats cut weight memory to roughly a quarter of the fp16 footprint and lift throughput. The same governed model serves more requests per GPU, or moves off the GPU entirely.
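The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope estimate rather than a measurement: the 7e9 parameter count is the nominal "7B" figure, and the 4.5 effective bits per weight is an assumption that stands in for 4-bit weights plus per-group scale overhead.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9            # nominal "7B" parameter count
FP16_BITS = 16            # half-precision parent
Q4_EFFECTIVE_BITS = 4.5   # ASSUMPTION: 4-bit weights plus scale overhead

fp16_gb = weight_footprint_gb(N_PARAMS, FP16_BITS)
q4_gb = weight_footprint_gb(N_PARAMS, Q4_EFFECTIVE_BITS)

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB, ratio: {q4_gb / fp16_gb:.2f}")
```

The estimate covers weights only; runtime memory also includes the KV cache and activations, which quantizing the weights does not shrink.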

Keep the evidence intact

Every slim build inherits its parent's lineage and re-runs the eval suite. The compressed artifact is a first-class registry entry — not an untracked export someone copied to a USB stick.

The four formats

Pick the format for the hardware, not the hype

There is no universal best quantization. The right format is a function of where the model runs and how much accuracy you can spend. LeanLogix exposes all four and tells you the tradeoff up front.

Format | Bits | Target hardware | Quality retention
GGUF | 2–8 bit (Q4_K_M typical) | CPU · Apple Silicon · llama.cpp | ~97% @ Q4_K_M
AWQ | 4-bit weight-only | GPU · vLLM / TGI | ~99% (activation-aware)
GPTQ | 3–4 bit weight-only | GPU · vLLM / ExLlama | ~98%
FP8 | 8-bit float (E4M3) | GPU · Hopper / Ada (H100, L40S) | ~99.5%

GGUF: k-quants mix precision across tensors; the standard for on-device and CPU inference.

AWQ: Identifies the roughly 1% of weights that matter most from activation statistics and rescales them before quantization so they lose less precision; strong accuracy at 4-bit on GPU.

GPTQ: One-shot, layer-by-layer rounding with error correction; fast to produce, with mature tooling.

FP8: Near-lossless on supported hardware; the highest-fidelity tier with real throughput gains.

Retention figures are representative ranges; the report attached to each build carries the measured number for that artifact.
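The "function of where the model runs and how much accuracy you can spend" can be sketched as a small lookup. This is an illustrative helper, not a LeanLogix API: the hardware labels (`cpu`, `apple-silicon`, `gpu`, `gpu-fp8`) and the function name are hypothetical, and the retention values are the representative figures from the table, not measured numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlimFormat:
    name: str
    hardware: frozenset   # hypothetical hardware class labels
    retention: float      # representative figure from the table, not measured


# "gpu-fp8" stands for GPUs with native FP8 support (Hopper / Ada).
FORMATS = (
    SlimFormat("FP8", frozenset({"gpu-fp8"}), 0.995),
    SlimFormat("AWQ", frozenset({"gpu", "gpu-fp8"}), 0.99),
    SlimFormat("GPTQ", frozenset({"gpu", "gpu-fp8"}), 0.98),
    SlimFormat("GGUF Q4_K_M", frozenset({"cpu", "apple-silicon"}), 0.97),
)


def candidate_formats(hardware: str, min_retention: float) -> list[str]:
    """Formats that target `hardware` and meet the retention floor, best first."""
    matches = [
        f for f in FORMATS
        if hardware in f.hardware and f.retention >= min_retention
    ]
    return [f.name for f in sorted(matches, key=lambda f: f.retention, reverse=True)]


print(candidate_formats("gpu", 0.985))
print(candidate_formats("cpu", 0.99))
```

An empty result is the useful case: it says no quantized format meets the accuracy floor on that hardware, which is the point at which pruning or distillation comes into the picture.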

Worked example · illustrative

SprintLoop-7B · Edge — a Q4_K_M build, end to end

An illustrative walkthrough of slimming the production SprintLoop-7B v3 adapter for on-device and CPU serving. The numbers below are representative targets — the signed retention report attaches to the artifact when the build is produced and the eval suite re-runs against it.

Artifact: SprintLoop-7B · Edge
Parent: SprintLoop-7B v3 (production)
Method: GGUF · Q4_K_M
Quality retention: ~97% (representative target)
Build version: q4-v1
Channel: candidate · planned
Customer data: none
Report: attaches when the build is produced

# quality-retention report — projected shape
# attaches when sprintloop-7b-edge is built

parent: sprintloop-7b "v3"
method: "GGUF Q4_K_M"
size_gb: "~4 (target)"
retention: "~0.97 (target)" # vs fp16 parent
eval_rerun: "18-probe behavioral (on build)"
leakage: "target 0"
signed: false # set true when the report is produced

Every build ships with one

The quality-retention report

A slim build with no retention number is a liability. LeanLogix re-runs the parent's eval suite against the compressed artifact and writes the delta into a signed report — so the person who deploys it knows exactly what they traded for the smaller footprint.

Same probes, recompressed

The quantized model faces the identical behavioral, safety, and leakage probes the parent passed. Retention is measured against that baseline, not a generic benchmark.

Delta, not a vibe

The report records the per-category change. A 3-point drop on task accuracy with leakage held at zero is a different decision than the reverse — and the report makes that decision legible.
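A minimal sketch of how a per-category delta turns into a decision, assuming a report that stores one score per category for parent and slim build. The category names, the scores, and the budget are all illustrative placeholders, not LeanLogix fields or measured results.

```python
# ASSUMPTION: hypothetical category names; leakage counts up when it gets worse.
HIGHER_IS_BETTER = {"task_accuracy": True, "refusal": True, "leakage": False}


def regressions(parent: dict, slim: dict, budgets: dict) -> list:
    """Categories where the slim build got worse by more than its budget."""
    failed = []
    for category, baseline in parent.items():
        delta = slim[category] - baseline
        # Normalize so that a positive loss always means "got worse".
        loss = -delta if HIGHER_IS_BETTER[category] else delta
        if loss > budgets.get(category, 0.0):   # no budget means no regression allowed
            failed.append(category)
    return failed


# Illustrative numbers only.
parent = {"task_accuracy": 90.0, "refusal": 98.0, "leakage": 0.0}
accuracy_spent = {"task_accuracy": 87.0, "refusal": 98.0, "leakage": 0.0}
leakage_spent = {"task_accuracy": 90.0, "refusal": 98.0, "leakage": 3.0}
budgets = {"task_accuracy": 5.0}   # up to 5 points of accuracy may be spent

print(regressions(parent, accuracy_spent, budgets))
print(regressions(parent, leakage_spent, budgets))
```

Both slim builds move one category by three points, but only the first stays inside its budget, which is the distinction a single aggregate retention number would hide.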

Travels with the artifact

The report is attached to the registry entry and signed. When the build moves to an edge fleet, the evidence moves with it.
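One way evidence can travel with an artifact is to bind the report to the artifact's digest and sign the pair. The sketch below is an assumption about the general shape, not LeanLogix's signing scheme, which this page does not describe: it uses an HMAC with a shared key for brevity, where a production registry would more likely use asymmetric signatures. The function and field names are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(body: dict) -> bytes:
    """Canonical JSON encoding so signer and verifier hash identical bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_report(report: dict, artifact_bytes: bytes, key: bytes) -> dict:
    """Bind a report to one artifact by digest, then sign the pair."""
    body = dict(report, artifact_sha256=hashlib.sha256(artifact_bytes).hexdigest())
    signature = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_report(signed: dict, artifact_bytes: bytes, key: bytes) -> bool:
    """True only if the signature is valid and the digest matches this artifact."""
    expected = hmac.new(key, _payload(signed["body"]), hashlib.sha256).hexdigest()
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    digest_ok = signed["body"].get("artifact_sha256") == digest
    return hmac.compare_digest(expected, signed["signature"]) and digest_ok


key = b"demo-key"                     # placeholder key for the example
signed = sign_report({"retention": 0.97}, b"model-weights", key)
print(verify_report(signed, b"model-weights", key))
print(verify_report(signed, b"other-weights", key))
```

Because the digest sits inside the signed body, a report copied onto a different artifact fails verification even though its signature is intact.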

Beyond quantization

Pruning — drop the weights that don't earn their place

Quantization shrinks the precision of every weight. Pruning removes weights entirely. LeanLogix supports one-shot structured and unstructured pruning that runs without a full retrain.

Wanda

Prunes by the product of weight magnitude and input activation norm — no gradient updates, no retraining. A fast first pass to thin a model before quantizing it further.
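The scoring rule is small enough to sketch in plain Python. This is a toy illustration of the Wanda criterion on nested lists, assuming scores are compared within each output row; it is not LeanLogix's implementation, and the function name and example values are made up.

```python
import math


def wanda_prune(weights, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weights:     list of rows, one per output unit, each of length n_in
    activations: calibration inputs, a list of samples of length n_in
    sparsity:    fraction of weights to remove per row (0.0 to 1.0)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_drop = int(n_in * sparsity)
    pruned = []
    for row in weights:
        # Score = |weight| * norm of the input feature it multiplies.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=scores.__getitem__)[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


weights = [[0.1, -2.0, 0.5, 1.0]]
activations = [[10.0, 0.1, 1.0, 1.0], [10.0, 0.1, 1.0, 1.0]]
print(wanda_prune(weights, activations, sparsity=0.5))
```

In this example the largest weight by magnitude (-2.0) is removed because its input feature is almost always near zero, while the small weight 0.1 survives because its input is large. Plain magnitude pruning would have made the opposite choice.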

SparseGPT

Solves a layer-wise reconstruction problem to remove weights while keeping the layer's output close to the original. Higher sparsity at lower quality cost than naive magnitude pruning.
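The layer-wise reconstruction problem can be written down compactly. The notation here is introduced for illustration: for a layer with original weights $W_\ell$ and calibration inputs $X_\ell$, the method looks for a binary mask $M_\ell$ at the target sparsity and updated remaining weights $\widehat{W}_\ell$ that keep the pruned layer's output close to the original.

```latex
\min_{M_\ell,\ \widehat{W}_\ell}\;
\bigl\lVert\, W_\ell X_\ell \;-\; \bigl(M_\ell \odot \widehat{W}_\ell\bigr) X_\ell \,\bigr\rVert_2^2
```

Naive magnitude pruning corresponds to choosing $M_\ell$ from $\lvert W_\ell \rvert$ alone and leaving the surviving weights unchanged; adjusting the survivors to compensate for the removed ones is what lowers the quality cost at a given sparsity.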

Beyond quantization

Distillation — a leaner student from a stronger teacher

When a slim build can't hold quality at the size you need, distillation trains a smaller student to imitate a larger teacher's behavior, often recovering accuracy that quantization alone would lose.

On-policy distillation

The student generates, the teacher scores, and the student learns from the teacher's judgment on its own outputs — closing the gap between a small model and a large one on the tasks that matter.
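The training signal in that loop can be sketched with toy numbers. The objective below, a per-token reverse-KL estimate on tokens the student sampled, is one common choice and an assumption here; the page does not say which objective LeanLogix uses. The function names and log-probabilities are illustrative.

```python
def per_token_gap(student_logprobs, teacher_logprobs):
    """Per-token reverse-KL estimate on student-sampled tokens.

    Both lists hold log-probabilities of the SAME tokens, sampled by the
    student: one list scored by the student, the other by the teacher.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both models must score the same sampled tokens")
    return [s - t for s, t in zip(student_logprobs, teacher_logprobs)]


def distillation_loss(student_logprobs, teacher_logprobs):
    """Mean per-token gap; minimizing it pulls the student toward the teacher."""
    gaps = per_token_gap(student_logprobs, teacher_logprobs)
    return sum(gaps) / len(gaps)


# Toy example: the teacher agrees on the first token and finds the second unlikely.
student = [-0.5, -1.0]
teacher = [-0.5, -3.0]
print(per_token_gap(student, teacher))
print(distillation_loss(student, teacher))
```

The per-token view is what makes the signal dense: the second token carries the entire penalty, so the student learns which of its own choices the teacher would not have made.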

When to reach for it

Distillation costs a training run, so it is the move when quantization and pruning leave too much on the table — a regulated edge target that still needs the larger model's refusal discipline, for instance.

The whole tradeoff in one place · illustrative

Before and after — SprintLoop-7B v3 → Edge

The shape of the accounting. Roughly a quarter of the size, faster on CPU, a few points of quality spent, leakage held at zero — the actual figures fill in when the build is produced and the eval re-runs.

Size on disk (fp16 parent): ≈ 14 GB
Size on disk (Q4_K_M edge build): ~4 GB (target)
Serving target (parent): GPU
Serving target (edge build): CPU / Apple Silicon
Quality retention: ~97% (representative target)
Behavioral eval: re-runs on the build
PHI / PII leakage: target 0 (unchanged)
Registry status: candidate · planned

Note

Parent fp16 footprint is the standard 7B half-precision size; the on-disk edge figure is the representative Q4_K_M target, not a measured artifact yet. The point of the report is to make the quality you spend and the things you keep — leakage, refusal behavior — explicit before anyone deploys, with the signed numbers attached to the artifact once the build runs.

Slim a model and read the receipt

Open the studio, quantize a build, and watch the quality-retention report attach itself to the artifact before it moves anywhere.

Open the studio · How releases are signed