The Vocabulary, expanded
Glossary
The shorthand on the rest of the site, with depth. Every term here is a real ML concept dressed in distillation clothes, and the clothes were chosen carefully.
Spirit
= trained, bottled model artifact
The final output of a distillation run. A Spirit is a single self-contained file (PyTorch .pt, ONNX, or GGUF) that bundles three things:
- The trained weights (typically 5-50M parameters)
- The tokenizer (so inference works without external state)
- The Recipe that produced it (so it's reproducible and auditable)
A Spirit can be moved, copied, shared, forked. It runs anywhere. That portability is the whole point.
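In code terms, a minimal sketch of that bundle; the field names ("weights", "tokenizer", "recipe") are illustrative, not Distillarium's actual on-disk schema:

```python
# Sketch of a .pt Spirit bundle; hypothetical field names.
import torch

def bottle_spirit(model, tokenizer_state: dict, recipe_yaml: str, path: str) -> None:
    """Save weights + tokenizer + Recipe as one self-contained file."""
    torch.save(
        {
            "weights": model.state_dict(),  # the trained parameters
            "tokenizer": tokenizer_state,   # vocab state, so inference needs no external files
            "recipe": recipe_yaml,          # the full config that produced this run
        },
        path,
    )

def pour_spirit(path: str) -> dict:
    """Load a Spirit back; everything needed for inference is inside."""
    return torch.load(path, map_location="cpu")
```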
Mash
= synthetic training corpus generated by the teacher
The Mash is the pile of (input, target) examples the teacher LLM produces when prompted. For tool calling that's (utterance, available_tools, target_call) triples. For PII it's (text, gold_spans). For SQL it's (question, schema, target_sql).
Quality of the Mash dominates everything downstream. A diverse, well-distributed Mash trains a useful Spirit. A repetitive Mash trains a Spirit that memorizes. Temperature, prompting strategy, and category coverage all matter here.
Watch for: the Mash is the leakage surface. If your prompt to the teacher includes the target, you'll get a corpus where the answer is too easy. Generate the input and target in separate calls, or have the teacher commit to one before it sees the other.
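A minimal sketch of that two-step generation for tool calling; `call_teacher` stands in for whatever LLM API client you use, and the prompt wording is illustrative:

```python
# Leakage-safe Mash generation: input and target come from separate calls.
import json

def call_teacher(prompt: str) -> str:
    raise NotImplementedError("wire up your teacher API client here")

def generate_example(tools: list[dict]) -> dict:
    # Step 1: ask only for a user utterance. The target is NOT in this
    # prompt, so the teacher can't echo it back and make the example trivial.
    utterance = call_teacher(
        "Write one realistic user request that could be served by one of "
        f"these tools: {json.dumps([t['name'] for t in tools])}"
    )
    # Step 2: separately ask the teacher to commit to a target call.
    target_call = call_teacher(
        f"Given tools {json.dumps(tools)} and the request {utterance!r}, "
        "return the single JSON tool call that serves it."
    )
    return {"utterance": utterance, "available_tools": tools, "target_call": target_call}
```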
Recipe
= versioned YAML config
The Recipe is the single source of truth for a distillation run. It captures every knob: teacher, mash spec, student architecture, training hyperparameters, eval metrics, output formats. Two people with the same Recipe + the same teacher API key should get statistically equivalent Spirits.
Recipes are forkable. The community workflow is: someone publishes a Recipe → you fork it → tweak the catalog / params / teacher → distill your own variant → share back as a new Spirit.
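To make the shape concrete, a hypothetical Recipe; the real schema may name these knobs differently:

```yaml
# Hypothetical Recipe sketch; illustrative field names, not the shipped schema.
teacher:
  model: gemini-2.5-flash
  temperature: 0.9
mash:
  task: tool_calling
  n_examples: 5000
  categories: [calendar, email, search]
student:
  arch: tiny-transformer
  n_params: 12M
training:
  epochs: 8
  batch_size: 16
  lr: 3e-4
eval:
  metric: tool_name_accuracy
bottling:
  formats: [pt, onnx, gguf]
```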
The Still
= the training run
The actual gradient-descent loop on the student model. Inputs: the Hearts cut (the train split). Outputs: trained weights, loss curve, gradient norms. Heat is metaphor; in practice this is just AdamW on a GPU.
For Needle the Still runs 8 epochs at batch 16, lr 3e-4. On a single RTX 5090 that takes about 3 minutes.
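De-metaphorized, a sketch of one Still run. The epochs/batch/lr numbers match the Needle figures above; the dataset shape and loss wiring are assumptions:

```python
# The Still as plain AdamW on a GPU.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def run_still(model, hearts, device="cuda"):
    model.to(device).train()
    loader = DataLoader(hearts, batch_size=16, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_curve, grad_norms = [], []  # the Still's logged outputs
    for _ in range(8):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            opt.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            # max_norm=inf leaves gradients untouched; we only want the norm
            grad_norms.append(
                torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item()
            )
            opt.step()
            loss_curve.append(loss.item())
    return model, loss_curve, grad_norms
```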
Cuts
= train / eval / test splits
The Mash is divided into cuts before training. Standard splits:
- Hearts (90%) – the train set. The model trains on these.
- Heads (10%) – the held-out eval set. Never seen during training. This is what Tasting Notes are computed on.
- Tails – borderline or hard examples flagged for human review. (Not used in v0.1; planned for v0.3's active-learning loop.)
The terms come from real distilling: heads (the volatile first fraction, discarded), hearts (the core spirit, kept), tails (less volatile, sometimes recycled). The metaphor is sharper than train/val/test.
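The cut itself is trivial; a minimal sketch, assuming the Mash is a plain Python list of examples (Tails are a v0.3 concept, so nothing is flagged here):

```python
# Cutting the Mash into Hearts / Heads at the 90/10 split above.
import random

def cut_mash(mash: list, seed: int = 0) -> tuple[list, list]:
    rng = random.Random(seed)  # fixed seed keeps the cut reproducible
    shuffled = mash[:]
    rng.shuffle(shuffled)
    n_hearts = int(len(shuffled) * 0.9)
    return shuffled[:n_hearts], shuffled[n_hearts:]  # (Hearts, Heads)
```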
Proof
= held-out accuracy
The headline metric of a Spirit. We report it as a degree value (78°) because spirits-people say "80 proof whiskey," not "40% ABV whiskey." Same idea: a single number that means "concentration."
The specific metric depends on task type:
- Tool calling → tool-name accuracy on the held-out set
- Classification → macro-F1
- NER → span-F1
- Generation → exact-match or LLM-judged similarity
Higher proof = more concentrated learning. There's a cap (the teacher's own performance on the task). Beyond that, distillation hits a wall.
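For tool calling, the computation is a one-liner. The degrees = accuracy × 100 mapping below is inferred from the 78° example above, not confirmed anywhere:

```python
# Tool-name accuracy on the Heads cut, reported in degrees (assumed mapping).
def proof(predicted: list[str], gold: list[str]) -> float:
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) * 100  # e.g. 0.78 held-out accuracy -> 78°
```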
Tasting Notes
= auto-generated eval report
Every Spirit ships with structured Tasting Notes. Five sections:
- Headline proof – single number
- Strengths – what the model does well, with examples
- Weaknesses – what it gets wrong, with examples
- Loss curve – the training trajectory
- Sample predictions – 8-10 held-out cases with gold + predicted + verdict
The point: publish the failure cases. A Spirit that hides its weaknesses can't be trusted in production. Tasting Notes make it cheap to be honest.
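A hypothetical shape for those five sections as a data structure; the field names are illustrative, not the shipped schema:

```python
# Sketch of a Tasting Notes record mirroring the five sections above.
from dataclasses import dataclass

@dataclass
class SamplePrediction:
    input_text: str
    gold: str
    predicted: str
    verdict: str  # e.g. "correct" / "wrong tool" / "malformed"

@dataclass
class TastingNotes:
    proof: float                      # headline number, in degrees
    strengths: list[str]              # what the model does well, with examples
    weaknesses: list[str]             # failure cases, published on purpose
    loss_curve: list[float]           # the training trajectory
    samples: list[SamplePrediction]   # 8-10 held-out cases
```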
Aging in Casks
= continued training / RLHF / refresh
Spirits aren't static. As your task drifts (new tool added, new edge cases discovered), you re-age the Spirit: continue training on a new Mash slice. This is the same as "continued fine-tuning" or "incremental retraining" in ML lingo.
v0.3 will ship explicit aging support: version lineage tracked, deltas measured.
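A sketch of re-aging, reusing the hypothetical bottle_spirit / pour_spirit / run_still helpers from the Spirit and Still sketches above:

```python
# Re-age: load a bottled Spirit, continue training on a fresh Mash slice, re-bottle.
def re_age(spirit_path: str, model, new_hearts, out_path: str) -> None:
    bundle = pour_spirit(spirit_path)           # load the existing Spirit
    model.load_state_dict(bundle["weights"])    # resume from its weights
    model, _, _ = run_still(model, new_hearts)  # continued fine-tuning on the new slice
    bottle_spirit(model, bundle["tokenizer"], bundle["recipe"], out_path)
```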
Bottling
= export to deployable format
Converts the in-training PyTorch state into a target runtime format:
- .pt – PyTorch native. Easy continuation, Python-only.
- .onnx – cross-runtime (CPU/GPU/edge, Rust/JS/Go/Java).
- .gguf – quantized for llama.cpp, mobile, embedded.
- .wasm – runs in the browser. (v0.5+.)
Each format is a different tradeoff between portability, size, and accuracy. Quantizing to q4 GGUF gets you to ~25 MB but loses ~3-5 proof points on the eval.
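A sketch of the .onnx path using torch.onnx.export; the example input and axis names are placeholders, and .gguf / .wasm need separate tooling:

```python
# Bottling to ONNX: trace the model with one representative input.
import torch

def bottle_onnx(model, example_input: torch.Tensor, path: str) -> None:
    model.eval()
    torch.onnx.export(
        model,
        example_input,  # representative input used for tracing
        path,
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch"}},  # allow variable batch size
    )
```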
The Cellar
= your model library
A Cellar is just a directory of Spirits. Each one has Tasting Notes, a Recipe, and a download. Public Cellars (like the one at /cellar) are shared model showcases. Private Cellars are local-only.
Hosted private Cellars are a future feature, not v0.1.
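Because a Cellar is just a directory, scanning one is trivial; the one-.pt-per-Spirit layout here is an assumption:

```python
# List the Spirits in a local Cellar (assumed flat .pt layout).
from pathlib import Path

def list_cellar(cellar_dir: str) -> list[str]:
    return sorted(p.stem for p in Path(cellar_dir).glob("*.pt"))
```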
Teacher / Student
= the classic distillation pair
The teacher is a large, capable model (Gemini 2.5 Flash, Claude Sonnet, GPT-4o). The student is the small model you're training. The teacher emits training examples; the student learns to mimic them on the specific task.
Distillarium is "data distillation": the student learns from the teacher's outputs, not its logits. (Logit-based distillation is a different technique that requires direct model access. Both work; ours is simpler and works with any API teacher.)
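The difference is easiest to see in the loss terms. A sketch of both: data distillation trains on the tokens the teacher actually emitted, while Hinton-style logit distillation needs the teacher's raw logits, hence direct model access:

```python
# Data distillation vs. logit distillation, side by side.
import torch.nn.functional as F

def data_distill_loss(student_logits, teacher_token_ids):
    # Hard-label cross-entropy against the teacher's emitted tokens;
    # works with any API teacher.
    return F.cross_entropy(student_logits, teacher_token_ids)

def logit_distill_loss(student_logits, teacher_logits, T: float = 2.0):
    # KL divergence between temperature-softened distributions;
    # requires access to the teacher's logits.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```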