Platform · Slim & Quantize
A 13 GB model and a 4 GB model that behaves the same
Quantization is where a trained model becomes deployable — on a laptop, a CPU, an edge box. LeanLogix compresses to GGUF, AWQ, GPTQ, or FP8 and ships a quality-retention report with every artifact, so the lean build is a measured tradeoff, not a guess.
Why slim at all
The cheapest model is the one you can actually run
A 32B adapter scores well, but it needs a GPU to serve. Most production paths (edge devices, air-gapped sites, cost-sensitive batch) want a 7B that fits in about 4 GB and responds at interactive speed on commodity hardware. Slimming is how you get there without throwing away the behavior you trained.
Run where the data is
On-device and CPU inference keep regulated data inside the boundary. A GGUF Q4_K_M build runs on a clinician's laptop or an on-prem box — no model weights leaving the network, no per-token cloud bill.
Cut cost and latency
4-bit weight-only formats cut weight memory to roughly a quarter of fp16 and, because token generation is usually limited by memory bandwidth, tend to lift throughput as well. The same governed model serves more requests per GPU, or moves off the GPU entirely.
Keep the evidence intact
Every slim build inherits its parent's lineage and re-runs the eval suite. The compressed artifact is a first-class registry entry — not an untracked export someone copied to a USB stick.
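The footprint arithmetic behind these cards can be sketched in a few lines. This is a back-of-the-envelope estimate, not LeanLogix output: the helper name is hypothetical, the 7e9 parameter count is a nominal assumption, and 4.85 effective bits per weight is an assumed illustrative value for a 4-bit format that also stores per-block scales.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB.

    Ignores the KV cache, activations, and file metadata.
    """
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7e9  # assumed nominal parameter count for a "7B" model

fp16 = weight_footprint_gib(N_PARAMS, 16)
int4 = weight_footprint_gib(N_PARAMS, 4)        # idealized 4-bit, no scale overhead
q4_eff = weight_footprint_gib(N_PARAMS, 4.85)   # assumed effective bits incl. scales

print(f"fp16:        {fp16:.1f} GiB")
print(f"ideal 4-bit: {int4:.1f} GiB ({int4 / fp16:.0%} of fp16)")
print(f"~4.85 bpw:   {q4_eff:.1f} GiB ({q4_eff / fp16:.0%} of fp16)")
```

Under these assumptions the fp16 weights come to about 13 GiB and an ideal 4-bit copy to exactly a quarter of that; scale metadata pushes real 4-bit files somewhat above the ideal.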
The four formats
Pick the format for the hardware, not the hype
There is no universal best quantization. The right format is a function of where the model runs and how much accuracy you can spend. LeanLogix exposes all four and tells you the tradeoff up front.
Retention figures are representative ranges; the report attached to each build carries the measured number for that artifact.
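To make "how much accuracy you can spend" concrete, here is a minimal sketch of a generic block-wise round-to-nearest quantizer and the reconstruction error it leaves behind. It is an assumption-laden illustration, not how GGUF K-quants, AWQ, GPTQ, or FP8 work internally: the function names, the block size of 32, and the random Gaussian "weights" are all hypothetical.

```python
import random


def quantize_block(block, bits=4):
    """Symmetric round-to-nearest quantization of one block with a shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit codes
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def roundtrip(weights, block_size=32, bits=4):
    """Quantize and immediately dequantize, block by block."""
    out = []
    for i in range(0, len(weights), block_size):
        codes, scale = quantize_block(weights[i:i + block_size], bits)
        out.extend(dequantize_block(codes, scale))
    return out


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # stand-in weights
for bits in (8, 4, 3):
    restored = roundtrip(weights, bits=bits)
    rmse = (sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)) ** 0.5
    print(f"{bits}-bit RMSE: {rmse:.6f}")
```

Fewer bits mean a coarser grid inside each block and a larger reconstruction error, which is the quantity a retention figure ultimately reflects at the behavioral level.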
Worked example · illustrative
SprintLoop-7B · Edge — a Q4_K_M build, end to end
An illustrative walkthrough of slimming the production SprintLoop-7B v3 adapter for on-device and CPU serving. The numbers below are representative targets — the signed retention report attaches to the artifact when the build is produced and the eval suite re-runs against it.
# attaches when sprintloop-7b-edge is built
parent: "sprintloop-7b v3"
method: "GGUF Q4_K_M"
size_gb: "~4 (target)"
retention: "~0.97 (target)" # vs fp16 parent
eval_rerun: "18-probe behavioral (on build)"
leakage: "target 0"
signed: false # set true when the report is produced
Every build ships with one
The quality-retention report
A slim build with no retention number is a liability. LeanLogix re-runs the parent's eval suite against the compressed artifact and writes the delta into a signed report — so the person who deploys it knows exactly what they traded for the smaller footprint.
Same probes, recompressed
The quantized model faces the identical behavioral, safety, and leakage probes the parent passed. Retention is measured against that baseline, not a generic benchmark.
Delta, not a vibe
The report records the per-category change. A 3-point drop on task accuracy with leakage held at zero is a different decision than the reverse — and the report makes that decision legible.
Travels with the artifact
The report is attached to the registry entry and signed. When the build moves to an edge fleet, the evidence moves with it.
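The per-category accounting described above can be sketched as follows. This is a hypothetical illustration rather than the LeanLogix report schema: the class and function names are invented, retention is assumed to mean the slim score as a fraction of the parent score, and the scores are made-up placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CategoryDelta:
    category: str
    parent: float
    slim: float

    @property
    def delta(self) -> float:
        """Signed change from parent to slim build."""
        return self.slim - self.parent

    @property
    def retention(self) -> Optional[float]:
        """Assumed definition: slim / parent; undefined when the parent score is 0."""
        return self.slim / self.parent if self.parent else None


def compare(parent_scores: dict, slim_scores: dict) -> list:
    """Pair up per-category scores from the same probe suite."""
    missing = set(parent_scores) - set(slim_scores)
    if missing:
        raise ValueError(f"slim build was not evaluated on: {sorted(missing)}")
    return [CategoryDelta(c, parent_scores[c], slim_scores[c])
            for c in sorted(parent_scores)]


# Placeholder numbers for illustration only.
parent = {"task_accuracy": 0.90, "refusal": 0.98, "leakage_count": 0}
slim = {"task_accuracy": 0.87, "refusal": 0.98, "leakage_count": 0}

for row in compare(parent, slim):
    ret = "n/a" if row.retention is None else f"{row.retention:.3f}"
    print(f"{row.category}: delta={row.delta:+.2f} retention={ret}")
```

Count-style metrics such as leakage, where the target is zero, are better read through the delta than through a ratio, which is why the sketch leaves retention undefined for them.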
Beyond quantization
Pruning — drop the weights that don't earn their place
Quantization shrinks the precision of every weight. Pruning removes weights entirely. LeanLogix supports one-shot structured and unstructured pruning that runs without a full retrain.
Wanda
Prunes by the product of weight magnitude and input activation norm — no gradient updates, no retraining. A fast first pass to thin a model before quantizing it further.
SparseGPT
Solves a layer-wise reconstruction problem to remove weights while keeping the layer's output close to the original. Higher sparsity at lower quality cost than naive magnitude pruning.
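The Wanda criterion described above fits in a short sketch: score each weight by its magnitude times the L2 norm of the matching input feature over a calibration batch, then drop the lowest-scoring weights within each output row. The function name, the tiny matrices, and the 50% sparsity target are hypothetical; a real run would use calibration activations captured from the model.

```python
import math


def wanda_prune(weight, activations, sparsity=0.5):
    """Prune one linear layer without gradients or retraining.

    weight:      rows are output neurons, columns are input features.
    activations: calibration inputs, one row per token, one column per feature.
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration tokens.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_drop = int(n_in * sparsity)
    pruned = []
    for row in weight:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Zero the lowest-scoring inputs within this output row.
        drop = set(sorted(range(n_in), key=scores.__getitem__)[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


weight = [[0.9, -0.1, 0.4, 0.05],
          [0.2, 0.8, -0.3, 0.6]]
activations = [[0.1, 3.0, 0.5, 2.0],
               [0.2, 4.0, 0.5, 1.0]]

pruned = wanda_prune(weight, activations, sparsity=0.5)
for row in pruned:
    print(row)
```

In the first row the largest weight (0.9) is removed because its input feature is nearly silent, while the small -0.1 survives because its input is strongly active. Plain magnitude pruning would have made the opposite choice.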
Beyond quantization
Distillation — a leaner student from a stronger teacher
When a slim build can't hold quality at the size you need, distillation trains a smaller student to imitate a larger teacher's behavior, often recovering accuracy that quantization alone would lose.
On-policy distillation
The student generates, the teacher scores, and the student learns from the teacher's judgment on its own outputs — closing the gap between a small model and a large one on the tasks that matter.
When to reach for it
Distillation costs a training run, so it is the move when quantization and pruning leave too much on the table — a regulated edge target that still needs the larger model's refusal discipline, for instance.
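The on-policy loop (student generates, teacher scores, student updates) can be shown with a toy in which both "models" are single-token categorical distributions. Everything here is an assumption for illustration: the reverse-KL objective, the score-function update, the three-token teacher, and the hyperparameters. A real run would operate on full sequences with a training framework.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) for two categorical distributions."""
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0)


def distill(teacher_probs, steps=3000, lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(teacher_probs)          # student starts uniform
    for _ in range(steps):
        p = softmax(logits)
        # 1. The student generates from its own distribution (on-policy).
        y = rng.choices(range(len(p)), weights=p)[0]
        # 2. The teacher scores the sampled output.
        gap = math.log(p[y]) - math.log(teacher_probs[y])
        # 3. Score-function gradient step on the sampled reverse KL.
        for k in range(len(logits)):
            grad_logp = (1.0 if k == y else 0.0) - p[k]
            logits[k] -= lr * gap * grad_logp
    return softmax(logits)


teacher = [0.7, 0.2, 0.1]                        # hypothetical toy teacher
before = reverse_kl(softmax([0.0, 0.0, 0.0]), teacher)
student = distill(teacher)
after = reverse_kl(student, teacher)
print(f"reverse KL before={before:.3f} after={after:.3f}")
```

Because the student only ever learns from samples it drew itself, the teacher's feedback lands on the outputs the student actually produces, which is the property the on-policy description above relies on.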
The whole tradeoff in one place · illustrative
Before and after — SprintLoop-7B v3 → Edge
The shape of the accounting: roughly a third of the fp16 size (about 13 GB down to about 4 GB), faster on CPU, a few points of quality spent, leakage held at zero. The actual figures fill in when the build is produced and the eval re-runs.
Note
The parent fp16 footprint is the standard 7B half-precision size (2 bytes per parameter, roughly 13 GB); the on-disk edge figure is the representative Q4_K_M target, not yet a measured artifact. The point of the report is to make explicit, before anyone deploys, both the quality you spend and the things you keep (leakage, refusal behavior), with the signed numbers attached to the artifact once the build runs.
Slim a model and read the receipt
Open the studio, quantize a build, and watch the quality-retention report attach itself to the artifact before it moves anywhere.