📜 The Vocabulary, expanded

Glossary

The shorthand on the rest of the site, with depth. Every term here is a real ML concept dressed in distillation clothes – and the clothes were chosen carefully.

๐Ÿพ

Spirit

= trained, bottled model artifact

The final output of a distillation run. A Spirit is a single self-contained file (PyTorch .pt, ONNX, or GGUF) that bundles three things:

  • The trained weights (typically 5-50M parameters)
  • The tokenizer (so inference works without external state)
  • The Recipe that produced it (so it's reproducible and auditable)

A Spirit can be moved, copied, shared, forked. It runs anywhere. That portability is the whole point.
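
For the file itself, here is a minimal sketch of how such a bundle could be written and read with plain torch.save / torch.load. The helper names and bundle keys (bottle_spirit, open_spirit, "weights", "tokenizer", "recipe") are illustrative, not a fixed Distillarium schema.

    # Sketch: one self-contained file holding weights + tokenizer + Recipe.
    import torch

    def bottle_spirit(model, tokenizer_state: dict, recipe: dict, path: str) -> None:
        torch.save(
            {
                "weights": model.state_dict(),   # trained parameters
                "tokenizer": tokenizer_state,    # vocab/merges, so inference needs no external state
                "recipe": recipe,                # the exact config that produced this Spirit
            },
            path,
        )

    def open_spirit(path: str) -> dict:
        return torch.load(path, map_location="cpu")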

🌾

Mash

= synthetic training corpus generated by the teacher

The Mash is the pile of (input, target) examples the teacher LLM produces when prompted. For tool calling that's (utterance, available_tools, target_call) triples. For PII it's (text, gold_spans). For SQL it's (question, schema, target_sql).

Quality of the Mash dominates everything downstream. A diverse, well-distributed Mash trains a useful Spirit. A repetitive Mash trains a Spirit that memorizes. Temperature, prompting strategy, and category coverage all matter here.

Watch for: the Mash is the leakage surface. If your prompt to the teacher includes the target, you'll get a corpus where the answer is too easy. Generate the input and target separately, or have the teacher commit before revealing.
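
A minimal sketch of that two-call pattern for a tool-calling Mash. call_teacher() is a placeholder for whatever teacher client you use, and the tool catalog is invented for the example.

    # Sketch: generate the input first, then the target in a separate call,
    # so the target never leaks into the prompt that produced the input.
    import json
    import random

    TOOLS = ["get_weather", "search_docs", "create_ticket"]   # illustrative catalog

    def call_teacher(prompt: str) -> str:
        raise NotImplementedError("plug in your teacher LLM client here")

    def generate_example() -> dict:
        tools = random.sample(TOOLS, k=2)
        utterance = call_teacher(
            f"Write one realistic user request answerable with one of: {tools}. "
            "Return only the request text."
        )
        target_call = call_teacher(
            f"Tools: {tools}\nUser: {utterance}\n"
            "Return the single best tool call as JSON."
        )
        return {
            "utterance": utterance,
            "available_tools": tools,
            "target_call": json.loads(target_call),
        }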

📜

Recipe

= versioned YAML config

The Recipe is the single source of truth for a distillation run. It captures every knob: teacher, mash spec, student architecture, training hyperparameters, eval metrics, output formats. Two people with the same Recipe + the same teacher API key should get statistically equivalent Spirits.
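
As a rough illustration of the shape, here is a toy Recipe parsed with PyYAML; the keys and values are hypothetical, not the actual schema.

    # Sketch: a Recipe is just versioned YAML; load it and read the knobs.
    import yaml  # requires PyYAML

    RECIPE_YAML = """
    version: 0.1
    teacher:
      model: gemini-2.5-flash
      temperature: 0.9
    mash:
      task: tool_calling
      n_examples: 5000
    student:
      params: 20M
    training:
      epochs: 8
      batch_size: 16
      lr: 3.0e-4
    export: [pt, onnx, gguf]
    """

    recipe = yaml.safe_load(RECIPE_YAML)
    assert recipe["training"]["lr"] == 3e-4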

Recipes are forkable. The community workflow is: someone publishes a Recipe → you fork it → tweak the catalog / params / teacher → distill your own variant → share back as a new Spirit.

🔥

The Still

= the training run

The actual gradient-descent loop on the student model. Inputs: the cuts (train split). Outputs: trained weights, loss curve, gradient norms. Heat is metaphor; in practice this is just AdamW on a GPU.

For Needle, the Still runs 8 epochs at batch size 16 with lr 3e-4. On a single RTX 5090 that takes about 3 minutes.
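
Stripped of the metaphor, the loop could look like this sketch. It assumes a Hugging Face-style student whose forward pass returns an object with a .loss, and a hearts_loader that yields batches of input_ids and labels.

    # Sketch of the Still: plain AdamW on the hearts split, logging loss and grad norms.
    import torch

    def run_still(student, hearts_loader, epochs=8, lr=3e-4, device="cuda"):
        student.to(device).train()
        opt = torch.optim.AdamW(student.parameters(), lr=lr)
        history = []
        for _ in range(epochs):
            for batch in hearts_loader:              # batch size 16 in the Needle Recipe
                input_ids = batch["input_ids"].to(device)
                labels = batch["labels"].to(device)
                loss = student(input_ids, labels=labels).loss
                opt.zero_grad()
                loss.backward()
                # clip and record the gradient norm
                grad_norm = torch.nn.utils.clip_grad_norm_(student.parameters(), 1.0)
                opt.step()
                history.append((loss.item(), float(grad_norm)))
        return history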

✂️

Cuts

= train / eval / test splits

The Mash is divided into cuts before training. Standard splits:

  • Hearts (90%) – the train set. The model trains on these.
  • Heads (10%) – the held-out eval set. Never seen during training. This is what Tasting Notes are computed on.
  • Tails – borderline or hard examples flagged for human review. (Not used in v0.1; planned for v0.3's active-learning loop.)

The terms come from real distilling: heads (the volatile first cut, set aside), hearts (the core of the spirit, the part you keep), tails (the weak end of the run, sometimes recycled into the next batch). The metaphor is sharper than train/val/test.
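
A minimal sketch of making the cut, with a fixed seed so the same Mash always yields the same hearts and heads (tails omitted, matching v0.1).

    # Sketch: shuffle once, peel off 10% as heads, train on the rest.
    import random

    def cut_mash(mash: list, heads_frac: float = 0.10, seed: int = 0):
        rng = random.Random(seed)       # fixed seed keeps the cut reproducible
        shuffled = mash[:]
        rng.shuffle(shuffled)
        n_heads = int(len(shuffled) * heads_frac)
        heads = shuffled[:n_heads]      # held out; never seen during training
        hearts = shuffled[n_heads:]     # what the Still trains on
        return hearts, heads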

📈

Proof

= held-out accuracy

The headline metric of a Spirit. We report it as a degree value (78°) because spirits-people say "80 proof whiskey" not "0.4 ABV whiskey." Same idea: a single number that means "concentration."

The specific metric depends on task type:

  • Tool calling → tool-name accuracy on the held-out set
  • Classification → macro-F1
  • NER → span-F1
  • Generation → exact-match or LLM-judged similarity

Higher proof = more concentrated learning. There's a cap (the teacher's own performance on the task). Beyond that, distillation hits a wall.
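
As a toy example, the tool-calling case is just held-out tool-name accuracy scaled to degrees; the other task types swap a different metric in behind the same number.

    # Sketch: proof = held-out metric expressed on a 0-100 degree scale.
    def proof_tool_calling(predicted: list[str], gold: list[str]) -> float:
        correct = sum(p == g for p, g in zip(predicted, gold))
        return 100.0 * correct / len(gold)

    score = proof_tool_calling(["get_weather", "search_docs"],
                               ["get_weather", "create_ticket"])
    print(f"{score:.0f}°")   # -> 50°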

📝

Tasting Notes

= auto-generated eval report

Every Spirit ships with structured Tasting Notes. Five sections:

  • Headline proof – single number
  • Strengths – what the model does well, with examples
  • Weaknesses – what it gets wrong, with examples
  • Loss curve – the training trajectory
  • Sample predictions – 8-10 held-out cases with gold + predicted + verdict

The point: publish the failure cases. A Spirit that hides its weaknesses can't be trusted in production. Tasting Notes make it cheap to be honest.
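
A sketch of that report as plain dataclasses; the field names are illustrative, not the actual output schema.

    # Sketch: the five sections of a Tasting Notes report as a data structure.
    from dataclasses import dataclass, field

    @dataclass
    class SamplePrediction:
        input_text: str
        gold: str
        predicted: str
        verdict: str                      # e.g. "correct" / "wrong tool" / "malformed"

    @dataclass
    class TastingNotes:
        proof: float                      # headline number, in degrees
        strengths: list[str] = field(default_factory=list)
        weaknesses: list[str] = field(default_factory=list)
        loss_curve: list[float] = field(default_factory=list)
        samples: list[SamplePrediction] = field(default_factory=list)   # 8-10 held-out cases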

🛢

Aging in Casks

= continued training / RLHF / refresh

Spirits aren't static. As your task drifts (new tool added, new edge cases discovered), you re-age the Spirit – continue training on a new Mash slice. This is the same as "continued fine-tuning" or "incremental retraining" in ML lingo.

v0.3 will ship explicit aging support – version lineage tracked, deltas measured.
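
A sketch of what re-aging could look like, reusing run_still from the Still entry above; the bundle keys and the lineage field are placeholders, not a shipped format.

    # Sketch: reload the bottled weights, refresh on a new Mash slice, bottle a new version.
    import torch

    def re_age(student, spirit_path: str, new_hearts_loader, out_path: str) -> None:
        bundle = torch.load(spirit_path, map_location="cpu")
        student.load_state_dict(bundle["weights"])        # start from the existing Spirit
        run_still(student, new_hearts_loader, epochs=2)   # short refresh, not a full re-run
        bundle["weights"] = student.state_dict()
        bundle["recipe"]["lineage"] = spirit_path         # remember where this version came from
        torch.save(bundle, out_path)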

๐Ÿพ

Bottling

= export to deployable format

Converts the in-training PyTorch state into a target runtime format:

  • .pt – PyTorch native. Easy continuation, Python-only.
  • .onnx – cross-runtime (CPU/GPU/edge, Rust/JS/Go/Java).
  • .gguf – quantized for llama.cpp, mobile, embedded.
  • .wasm – runs in the browser. (v0.5+.)

Each format is a different tradeoff between portability, size, and accuracy. Quantizing to q4 GGUF gets you to ~25 MB but loses ~3-5 proof points on the eval.
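
A sketch of the first two formats with stock PyTorch; GGUF conversion and quantization go through a separate toolchain (the llama.cpp converters, for example) and are not shown. It assumes a student whose forward takes a single input_ids tensor.

    # Sketch: bottle the same trained student as .pt and .onnx.
    import torch

    def bottle(student, example_input_ids: torch.Tensor, stem: str) -> None:
        student.eval()
        torch.save(student.state_dict(), f"{stem}.pt")    # PyTorch native
        torch.onnx.export(                                # cross-runtime export
            student,
            (example_input_ids,),
            f"{stem}.onnx",
            input_names=["input_ids"],
            output_names=["logits"],
            dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
        )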

๐Ÿ›

The Cellar

= your model library

A Cellar is just a directory of Spirits. Each one has Tasting Notes, a Recipe, and a download. Public Cellars (like the one at /cellar) are shared model showcases. Private Cellars are local-only.

Hosted private Cellars are a future feature, not v0.1.
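
A sketch of walking a local Cellar; the per-Spirit layout shown here (recipe.yaml, tasting_notes.md, bottled files in one folder) is illustrative.

    # Sketch: a Cellar is a directory with one subfolder per Spirit.
    from pathlib import Path

    def list_cellar(cellar_dir: str):
        for spirit_dir in sorted(Path(cellar_dir).iterdir()):
            if not spirit_dir.is_dir():
                continue
            yield {
                "name": spirit_dir.name,
                "recipe": spirit_dir / "recipe.yaml",
                "tasting_notes": spirit_dir / "tasting_notes.md",
                "bottles": sorted(spirit_dir.glob("*.pt")) + sorted(spirit_dir.glob("*.gguf")),
            }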

๐Ÿ‘จโ€๐Ÿซ

Teacher / Student

= the classic distillation pair

The teacher is a large, capable model (Gemini 2.5 Flash, Claude Sonnet, GPT-4o). The student is the small model you're training. The teacher emits training examples; the student learns to mimic them on the specific task.

Distillarium is "data distillation" – the student learns from the teacher's outputs, not its logits. (Logit-based distillation is a different technique that requires direct model access. Both work; ours is simpler and works with any API teacher.)
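
The difference shows up in the loss. A sketch of both flavors: hard targets built from the teacher's emitted tokens (the data-distillation route) versus temperature-softened KL divergence on the teacher's logits (logit distillation, which needs model access).

    # Sketch: data distillation vs. logit distillation, side by side.
    import torch.nn.functional as F

    def data_distillation_loss(student_logits, teacher_token_ids):
        # Plain cross-entropy against the tokens the teacher actually wrote.
        return F.cross_entropy(student_logits, teacher_token_ids)

    def logit_distillation_loss(student_logits, teacher_logits, T: float = 2.0):
        # KL divergence between temperature-softened distributions.
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)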