⚙ The Engineering
How it works
Distillation isn't magic — it's synthetic data + a small network + a careful eval loop. This page walks through every stage of what happens when you run `distillery distill`.
The five stages, end-to-end
```text
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Recipe  │ ──→ │ Teacher  │ ──→ │ Tokenize │ ──→ │  Still   │ ──→ │ Tasting  │ ──→ Spirit
│  (YAML)  │     │ → Mash   │     │ + Split  │     │ (8 ep.)  │     │  Notes   │      (.pt)
└──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘
     ↑                ↑                ↑                ↑                ↑
  you write      Gemini Flash      WordPiece         PyTorch          held-out
  it once        ~$0.30/1k         4k vocab          5090            100 ex.
     │                │                │                │                │
     └─── 30 sec ─────┴── 20-30 min ───┴──── 1 min ─────┴──── 3 min ─────┘
                 wall-clock for the reference Needle Spirit
```

Stage 1 · The Recipe
Every distillation run is fully defined by a single YAML file. No hidden state, no flags, no environment quirks.
```yaml
name: needle.tool-calling
version: 1

teacher:                   # what generates the training data
  provider: gemini
  model: gemini-2.5-flash
  temperature: 0.9         # high → diversity

mash:                      # the synthetic corpus
  total_examples: 1000
  examples_per_call: 10
  tools_per_call: { min: 3, max: 6 }

student:                   # the model being trained
  arch: attention-only-glu
  d_model: 384
  n_heads: 6
  n_layers: 8
  max_seq_len: 256

cuts: { train: 0.9, eval: 0.1 }

still:                     # the training loop
  epochs: 8
  batch_size: 16
  lr: 3.0e-4

tasting:                   # held-out evaluation
  metrics: [tool_name_accuracy, arg_key_f1, exact_call_accuracy]
  held_out: 100
```
The Recipe is the contract. Two people with the same Recipe + the same teacher API key get statistically equivalent Spirits. That's the reproducibility property.
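Because the Recipe is plain data, the reproducibility check can be mechanical: load it, validate it, fingerprint it. A minimal sketch, assuming PyYAML; `load_recipe` and the hash scheme are illustrative, not distillery's actual loader:

```python
# Sketch: load a Recipe and fingerprint it for reproducibility checks.
# Assumes PyYAML; function names here are illustrative.
import hashlib
import yaml

def load_recipe(path: str) -> dict:
    with open(path) as f:
        recipe = yaml.safe_load(f)
    # Fail fast on a malformed Recipe rather than mid-run.
    for section in ("teacher", "mash", "student", "still", "tasting"):
        assert section in recipe, f"Recipe missing '{section}' section"
    return recipe

def recipe_fingerprint(recipe: dict) -> str:
    # Canonical serialization -> stable hash: two runs with the same
    # fingerprint (and teacher API key) should yield equivalent Spirits.
    canonical = yaml.safe_dump(recipe, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

recipe = load_recipe("needle.yaml")
print(recipe_fingerprint(recipe))
```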
Stage 2 · The Mash — synthetic data from the teacher
This is the most important and most misunderstood step. We ask the teacher to generate examples of (input, output) pairs for our task.
For tool calling, each batch looks like:
```text
# Prompt sent to Gemini Flash
You are a data generator for training a small function-calling model.
Available tools: [send_message, set_timer, get_weather, ...]
Produce 10 training examples as JSON. Each:
  { "utterance": <user query>,
    "target_call": [{"name": ..., "args": ...}] }
Diverse phrasing. Vary contact names, times, etc.
```
The teacher returns realistic queries paired with the correct tool calls:
```json
{
  "utterance": "Hey can you log my weight? I'm at 75kg now.",
  "tools": [
    {"name": "log_health_metric", "params": "..."},
    {"name": "send_message", "params": "..."},
    "..."
  ],
  "target_call": [
    {"name": "log_health_metric", "args": {"metric": "weight", "value": 75}}
  ]
}
```

For 1000 examples we make ~100 batched calls (~10 examples each) at temperature 0.9 for diversity. Cost: about $0.30 in Gemini Flash. Time: 20-30 minutes.
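The generation loop itself is unremarkable. A sketch of its shape, assuming the `google-generativeai` client (`pip install google-generativeai`); the exact prompt template and retry policy are illustrative, not distillery's actual Mash code:

```python
# Sketch of the Mash generation loop; assumes the google-generativeai client.
import json
import google.generativeai as genai

genai.configure(api_key="...")  # your Gemini API key
model = genai.GenerativeModel("gemini-2.5-flash")

PROMPT = (
    "You are a data generator for training a small function-calling model.\n"
    "Available tools: [send_message, set_timer, get_weather, ...]\n"
    "Produce 10 training examples as JSON. ..."  # full template shown above
)

examples = []
while len(examples) < 1000:                      # mash: total_examples: 1000
    resp = model.generate_content(
        PROMPT,
        generation_config=genai.GenerationConfig(
            temperature=0.9,                     # high for phrasing diversity
            response_mime_type="application/json",  # force parseable output
        ),
    )
    try:
        examples.extend(json.loads(resp.text))
    except json.JSONDecodeError:
        continue  # a malformed batch costs a fraction of a cent; just retry
```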
⚠ The non-obvious detail. Each training row needs the available tools AS DISTINCT FROM the target call. If the student is shown only the target tool, it learns to predict the tool name from the input — but never learns to choose between alternatives. That's a label leak. We caught and fixed exactly this bug while building the reference Needle Spirit.
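Concretely, each row's tool catalog contains the gold tool plus sampled distractors, so the student has real alternatives to reject. A sketch of that row construction; the catalog contents and `make_row` are illustrative:

```python
# Sketch: build one training row whose tool catalog holds the gold tool
# plus sampled distractors, so the student must *choose* from alternatives
# instead of copying a leaked label. The catalog below is illustrative.
import random

ALL_TOOLS = ["send_message", "set_timer", "get_weather",
             "log_health_metric", "play_music", "create_event"]

def make_row(utterance, target_call, rng=random):
    gold = {call["name"] for call in target_call}
    n_tools = rng.randint(3, 6)          # tools_per_call: {min: 3, max: 6}
    distractors = rng.sample(
        [t for t in ALL_TOOLS if t not in gold], k=n_tools - len(gold))
    catalog = list(gold) + distractors
    rng.shuffle(catalog)                 # position must not leak the answer
    return {"utterance": utterance, "tools": catalog,
            "target_call": target_call}

row = make_row(
    "Hey can you log my weight? I'm at 75kg now.",
    [{"name": "log_health_metric",
      "args": {"metric": "weight", "value": 75}}])
```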
Stage 3 · Tokenize and cut
We train a tiny WordPiece tokenizer (4096-token vocabulary) on the full corpus — utterances, tool definitions, target calls. This keeps the student's embedding table small (4096 × 384 ≈ 1.6M params just for embeddings).
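Training such a tokenizer is a few lines with the Hugging Face `tokenizers` library. A sketch under that assumption; the corpus file name and special-token list are illustrative:

```python
# Sketch: train the 4k-vocab WordPiece tokenizer with Hugging Face
# `tokenizers`. The special tokens match the training-sequence format
# used in Stage 4; the corpus file name is illustrative.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=4096,
    special_tokens=["[UNK]", "[PAD]",
                    "[QUERY]", "[/QUERY]", "[CALL]", "[/CALL]"],
)
# Train on everything the student will ever see: utterances,
# serialized tool schemas, and target calls.
tokenizer.train(files=["mash_corpus.txt"], trainer=trainer)
tokenizer.save("spirit_tokenizer.json")
```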
The data is then split into cuts (see the split sketch after this list):
- Hearts (90%) — the train set. The model sees these during training.
- Heads (10%) — the held-out eval set. Never seen during training. Used to compute Tasting Notes.
- Tails — in the future, borderline / hard examples we flag for human review.
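A seeded split keeps the cuts identical across machines, which the reproducibility property depends on. A minimal sketch; the `cut` helper and seed value are illustrative:

```python
# Sketch: a deterministic 90/10 cut so two runs of the same Recipe
# hold out the same 100 examples for Tasting.
import random

def cut(examples, train_frac=0.9, seed=0):
    rng = random.Random(seed)        # seeded -> same split every run
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]  # hearts, heads

hearts, heads = cut(examples)        # 900 train, 100 held-out
```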
Stage 4 · The Still — the actual training
The student is an attention-only-GLU transformer: we keep the attention layers but replace the standard feedforward block (FFN) with a Gated Linear Unit (GLU). This shaves parameters and matches our empirical finding that attention does most of the heavy lifting on structured tasks like tool calling.
Student architecture (one block)
```text
input tokens                       schema tokens
       ↓                                 ↓
┌─────────────┐                   ┌─────────────┐
│ Embeddings  │                   │ Embeddings  │
│   + RoPE    │                   │             │
└──────┬──────┘                   └──────┬──────┘
       ↓                                 │
┌─────────────┐                          │
│  Self-Attn  │ ← LayerNorm + Residual   │
└──────┬──────┘                          │
       ↓                                 ↓
┌──────────────────────────────────────────┐
│        Cross-Attn (input ← schema)       │
│  — this is where the model "looks up"    │
│  which tool to call from the catalog     │
└────────────────────┬─────────────────────┘
                     ↓
          ┌─────────────────────┐
          │      GLU block      │ ← replaces FFN
          │ down(GELU(up)*gate) │   more param/FLOP-efficient
          └──────────┬──────────┘
                     ↓
                  output
```

Total: 8 layers × this block + token embeddings + final projection = 20.7M parameters.
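In PyTorch terms, one block is roughly the following. This is a hedged sketch of the diagram above, assuming pre-norm placement and `nn.MultiheadAttention`; RoPE and the causal mask are omitted for brevity, and module names are mine, not distillery's:

```python
# Sketch of one student block: self-attention over the query tokens,
# cross-attention into the tool-catalog (schema) tokens, then a GLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUBlock(nn.Module):
    """down(GELU(up(x)) * gate(x)): the FFN replacement."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)) * self.gate(x))

class StudentBlock(nn.Module):
    def __init__(self, d_model=384, n_heads=6):   # d_model: 384, n_heads: 6
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.glu = GLUBlock(d_model, 4 * d_model)

    def forward(self, x, schema):
        # Self-attention over the input sequence (LayerNorm + residual).
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: queries from the input, keys/values from the
        # tool catalog, where the model "looks up" which tool to call.
        h = self.norm2(x)
        x = x + self.cross_attn(h, schema, schema, need_weights=False)[0]
        # GLU block in place of the usual FFN.
        return x + self.glu(self.norm3(x))
```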
The training loop
```python
# pseudocode of one training step
for batch in dataloader:
    # Wrap the utterance and gold call in special tokens; the model is
    # trained with plain next-token prediction over this sequence.
    input_ids = tokenize("[QUERY] " + utterance + " [/QUERY] "
                         "[CALL] " + target_json + " [/CALL]")
    schema_ids = tokenize(serialize(available_tools))  # the tool catalog, NOT the target!

    logits = model(input_ids, schema_ids)
    loss = cross_entropy(logits[:-1], input_ids[1:])   # next-token prediction
    optimizer.zero_grad()   # don't let gradients accumulate across steps
    loss.backward()
    optimizer.step()
```
8 epochs at batch 16, lr 3e-4 with AdamW. On a single RTX 5090 this run takes ~3 minutes. Loss curve drops from 4.65 to 0.73.
Stage 5 · Tasting Notes
After training, we run inference on the 100 held-out examples (the Heads cut) and compute three metrics:
| Metric | What it measures | Needle (this Spirit) |
|---|---|---|
| Tool-name accuracy | Did the model pick the right tool out of the catalog? | 78% (random baseline ~25%) |
| Arg-key F1 | Of the argument names it produced, how many matched gold? | 0.73 (p=0.85, r=0.64) |
| Exact-call accuracy | Did it match the FULL gold call — tool + every arg key + every arg value? | 3% (value-level prediction is the weak spot) |
Plus a samples table — the first 8 held-out predictions with their gold values, so you can eyeball where it succeeds and fails. Every Spirit ships with its failure cases.
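All three metrics are simple comparisons over parsed (predicted, gold) call pairs. A sketch of how they might be computed, assuming one call per example for brevity; the function shapes are illustrative, not distillery's exact eval code:

```python
# Sketch: Tasting Notes metrics over (predicted, gold) call-dict pairs,
# each of the form {"name": ..., "args": {...}}.
def tool_name_accuracy(pairs):
    return sum(p["name"] == g["name"] for p, g in pairs) / len(pairs)

def arg_key_f1(pairs):
    tp = fp = fn = 0
    for p, g in pairs:
        pred_keys, gold_keys = set(p["args"]), set(g["args"])
        tp += len(pred_keys & gold_keys)
        fp += len(pred_keys - gold_keys)
        fn += len(gold_keys - pred_keys)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def exact_call_accuracy(pairs):
    # Tool name + every arg key + every arg value must match.
    return sum(p == g for p, g in pairs) / len(pairs)
```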
Stage 6 (optional) · Bottling for deployment
The trained Spirit is just a PyTorch .pt file (model weights + tokenizer + Recipe). For production deployment we re-pack it:
| Format | Size | Runs on | Use when |
|---|---|---|---|
| PyTorch .pt | 249 MB | Python anywhere | Default; training continuation |
| ONNX | ~100 MB | Most runtimes: CPU/GPU/edge | Cross-language inference (Rust, JS, Go, Java) |
| GGUF q4 | ~25 MB | llama.cpp, mobile, embedded | Resource-constrained deployment |
| WASM | ~50 MB | The browser | Run entirely client-side, no backend |
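For the ONNX row, bottling is the standard `torch.onnx.export` path. A sketch, assuming the model's forward takes `(input_ids, schema_ids)` as above; file and axis names are illustrative:

```python
# Sketch: bottle the Spirit as ONNX for cross-language inference.
import torch

model.eval()
dummy_input = torch.zeros(1, 256, dtype=torch.long)   # max_seq_len: 256
dummy_schema = torch.zeros(1, 256, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input, dummy_schema),
    "needle_spirit.onnx",
    input_names=["input_ids", "schema_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq"}, "schema_ids": {1: "seq"}},
)
```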
What's NOT solved yet (the honest part)
- Argument-value accuracy. 3% exact-call is rough — the WordPiece tokenizer splits JSON values awkwardly. Byte-level BPE is the v0.2 fix.
- Only one teacher backend. Gemini Flash is wired. Claude and OpenAI providers are stubs.
- Single-shot recipes. No iterative refinement yet — you can't say "regenerate just the failed examples and retrain on the union."
- No quantization-aware training. We bottle to q4 GGUF post-hoc, which loses accuracy. v0.4 will add quantization-aware training.
- Tasting Notes are statistical. They don't catch semantic failures (predicting "Twitter" instead of "Instagram"). LLM-as-judge eval is on the roadmap.