⚙ The Engineering
How it works
Distillation isn't magic — it's synthetic data + a small network + a careful eval loop. This page walks through every stage of what happens when you run `distillery distill`.
The five stages, end-to-end
```text
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Recipe  │ ──→ │ Teacher  │ ──→ │ Tokenize │ ──→ │  Still   │ ──→ │ Tasting  │ ──→ Spirit
│  (YAML)  │     │ → Mash   │     │ + Split  │     │ (8 ep.)  │     │  Notes   │      (.pt)
└──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘
     ↑                ↑                ↑                ↑                ↑
  you write      Gemini Flash      WordPiece         PyTorch          held-out
  it once        ~$0.30/1k         4k vocab          5090            100 ex.
     │                │                │                │                │
     └─── 30 sec ─────┴── 20-30 min ───┴──── 1 min ─────┴──── 3 min ─────┘
                 wall-clock for the reference Needle Spirit
```

Stage 1 · The Recipe
Every distillation run is fully defined by a single YAML file. No hidden state, no flags, no environment quirks.
```yaml
name: needle.tool-calling
version: 1

teacher:                   # what generates the training data
  provider: gemini
  model: gemini-2.5-flash
  temperature: 0.9         # high → diversity

mash:                      # the synthetic corpus
  total_examples: 1000
  examples_per_call: 10
  tools_per_call: { min: 3, max: 6 }

student:                   # the model being trained
  arch: attention-only-glu
  d_model: 384
  n_heads: 6
  n_layers: 8
  max_seq_len: 256

cuts: { train: 0.9, eval: 0.1 }

still:                     # the training loop
  epochs: 8
  batch_size: 16
  lr: 3.0e-4

tasting:                   # held-out evaluation
  metrics: [tool_name_accuracy, arg_key_f1, exact_call_accuracy]
  held_out: 100
```
The Recipe is the contract. Two people with the same Recipe + the same teacher API key get statistically equivalent Spirits. That's the reproducibility property.
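Because the Recipe is plain data, the reproducibility check can be mechanical: load it, validate it, fingerprint it. A minimal sketch, assuming PyYAML; `load_recipe` and the hash scheme are illustrative, not distillery's actual loader:

```python
# Sketch: load a Recipe and fingerprint it for reproducibility checks.
# Assumes PyYAML; function names here are illustrative.
import hashlib
import yaml

def load_recipe(path: str) -> dict:
    with open(path) as f:
        recipe = yaml.safe_load(f)
    # Fail fast on a malformed Recipe rather than mid-run.
    for section in ("teacher", "mash", "student", "still", "tasting"):
        assert section in recipe, f"Recipe missing '{section}' section"
    return recipe

def recipe_fingerprint(recipe: dict) -> str:
    # Canonical serialization -> stable hash: two runs with the same
    # fingerprint (and teacher API key) should yield equivalent Spirits.
    canonical = yaml.safe_dump(recipe, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

recipe = load_recipe("needle.yaml")
print(recipe_fingerprint(recipe))
```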
Stage 2 · The Mash — synthetic data from the teacher
This is the most important and most misunderstood step. We ask the teacher to generate examples of (input, output) pairs for our task.
For tool calling, each batch looks like:
```text
# Prompt sent to Gemini Flash
You are a data generator for training a small function-calling model.
Available tools: [send_message, set_timer, get_weather, ...]
Produce 10 training examples as JSON. Each:
  { "utterance": <user query>,
    "target_call": [{"name": ..., "args": ...}] }
Diverse phrasing. Vary contact names, times, etc.
```
The teacher returns realistic queries paired with the correct tool calls:
```json
{
  "utterance": "Hey can you log my weight? I'm at 75kg now.",
  "tools": [
    {"name": "log_health_metric", "params": "..."},
    {"name": "send_message", "params": "..."},
    "..."
  ],
  "target_call": [
    {"name": "log_health_metric", "args": {"metric": "weight", "value": 75}}
  ]
}
```

For 1000 examples we make ~100 batched calls (~10 examples each) at temperature 0.9 for diversity. Cost: about $0.30 in Gemini Flash. Time: 20-30 minutes.
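The generation loop itself is unremarkable. A sketch of its shape, assuming the `google-generativeai` client (`pip install google-generativeai`); the exact prompt template and retry policy are illustrative, not distillery's actual Mash code:

```python
# Sketch of the Mash generation loop; assumes the google-generativeai client.
import json
import google.generativeai as genai

genai.configure(api_key="...")  # your Gemini API key
model = genai.GenerativeModel("gemini-2.5-flash")

PROMPT = (
    "You are a data generator for training a small function-calling model.\n"
    "Available tools: [send_message, set_timer, get_weather, ...]\n"
    "Produce 10 training examples as JSON. ..."  # full template shown above
)

examples = []
while len(examples) < 1000:                      # mash: total_examples: 1000
    resp = model.generate_content(
        PROMPT,
        generation_config=genai.GenerationConfig(
            temperature=0.9,                     # high for phrasing diversity
            response_mime_type="application/json",  # force parseable output
        ),
    )
    try:
        examples.extend(json.loads(resp.text))
    except json.JSONDecodeError:
        continue  # a malformed batch costs a fraction of a cent; just retry
```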
⚠ The non-obvious detail. Each training row needs the available tools AS DISTINCT FROM the target call. If the student is shown only the target tool, it learns to predict the tool name from the input — but never learns to choose between alternatives. That's a label leak. We caught and fixed exactly this bug while building the reference Needle Spirit.
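Concretely, each row's tool catalog contains the gold tool plus sampled distractors, so the student has real alternatives to reject. A sketch of that row construction; the catalog contents and `make_row` are illustrative:

```python
# Sketch: build one training row whose tool catalog holds the gold tool
# plus sampled distractors, so the student must *choose* from alternatives
# instead of copying a leaked label. The catalog below is illustrative.
import random

ALL_TOOLS = ["send_message", "set_timer", "get_weather",
             "log_health_metric", "play_music", "create_event"]

def make_row(utterance, target_call, rng=random):
    gold = {call["name"] for call in target_call}
    n_tools = rng.randint(3, 6)          # tools_per_call: {min: 3, max: 6}
    distractors = rng.sample(
        [t for t in ALL_TOOLS if t not in gold], k=n_tools - len(gold))
    catalog = list(gold) + distractors
    rng.shuffle(catalog)                 # position must not leak the answer
    return {"utterance": utterance, "tools": catalog,
            "target_call": target_call}

row = make_row(
    "Hey can you log my weight? I'm at 75kg now.",
    [{"name": "log_health_metric",
      "args": {"metric": "weight", "value": 75}}])
```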
Stage 3 · Tokenize and cut
We train a tiny WordPiece tokenizer (4096-token vocabulary) on the full corpus — utterances, tool definitions, target calls. This keeps the student's embedding table small (4096 × 384 ≈ 1.6M params just for embeddings).
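Training such a tokenizer is a few lines with the Hugging Face `tokenizers` library. A sketch under that assumption; the corpus file name and special-token list are illustrative:

```python
# Sketch: train the 4k-vocab WordPiece tokenizer with Hugging Face
# `tokenizers`. The special tokens match the training-sequence format
# used in Stage 4; the corpus file name is illustrative.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=4096,
    special_tokens=["[UNK]", "[PAD]",
                    "[QUERY]", "[/QUERY]", "[CALL]", "[/CALL]"],
)
# Train on everything the student will ever see: utterances,
# serialized tool schemas, and target calls.
tokenizer.train(files=["mash_corpus.txt"], trainer=trainer)
tokenizer.save("spirit_tokenizer.json")
```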
The data is then split into cuts (see the split sketch after this list):
- Hearts (90%) — the train set. The model sees these during training.
- Heads (10%) — the held-out eval set. Never seen during training. Used to compute Tasting Notes.
- Tails — in the future, borderline / hard examples we flag for human review.
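A seeded split keeps the cuts identical across machines, which the reproducibility property depends on. A minimal sketch; the `cut` helper and seed value are illustrative:

```python
# Sketch: a deterministic 90/10 cut so two runs of the same Recipe
# hold out the same 100 examples for Tasting.
import random

def cut(examples, train_frac=0.9, seed=0):
    rng = random.Random(seed)        # seeded -> same split every run
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]  # hearts, heads

hearts, heads = cut(examples)        # 900 train, 100 held-out
```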
Stage 4 · The Still — the actual training
The student is an attention-only-GLU transformer: we keep the attention layers but replace the standard feedforward block (FFN) with a Gated Linear Unit (GLU). This shaves parameters and matches our empirical finding that attention does most of the heavy lifting on structured tasks like tool calling.
Student architecture (one block)
```text
input tokens                       schema tokens
       ↓                                 ↓
┌─────────────┐                   ┌─────────────┐
│ Embeddings  │                   │ Embeddings  │
│   + RoPE    │                   │             │
└──────┬──────┘                   └──────┬──────┘
       ↓                                 │
┌─────────────┐                          │
│  Self-Attn  │ ← LayerNorm + Residual   │
└──────┬──────┘                          │
       ↓                                 ↓
┌──────────────────────────────────────────┐
│        Cross-Attn (input ← schema)       │
│  — this is where the model "looks up"    │
│  which tool to call from the catalog     │
└────────────────────┬─────────────────────┘
                     ↓
          ┌─────────────────────┐
          │      GLU block      │ ← replaces FFN
          │ down(GELU(up)*gate) │   more param/FLOP-efficient
          └──────────┬──────────┘
                     ↓
                  output
```

Total: 8 layers × this block + token embeddings + final projection = 20.7M parameters.
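In PyTorch terms, one block is roughly the following. This is a hedged sketch of the diagram above, assuming pre-norm placement and `nn.MultiheadAttention`; RoPE and the causal mask are omitted for brevity, and module names are mine, not distillery's:

```python
# Sketch of one student block: self-attention over the query tokens,
# cross-attention into the tool-catalog (schema) tokens, then a GLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUBlock(nn.Module):
    """down(GELU(up(x)) * gate(x)): the FFN replacement."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)) * self.gate(x))

class StudentBlock(nn.Module):
    def __init__(self, d_model=384, n_heads=6):   # d_model: 384, n_heads: 6
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.glu = GLUBlock(d_model, 4 * d_model)

    def forward(self, x, schema):
        # Self-attention over the input sequence (LayerNorm + residual).
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: queries from the input, keys/values from the
        # tool catalog, where the model "looks up" which tool to call.
        h = self.norm2(x)
        x = x + self.cross_attn(h, schema, schema, need_weights=False)[0]
        # GLU block in place of the usual FFN.
        return x + self.glu(self.norm3(x))
```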
The training loop
```python
# pseudocode of one training step
for batch in dataloader:
    # Wrap the utterance and gold call in special tokens; the model is
    # trained with plain next-token prediction over this sequence.
    input_ids = tokenize("[QUERY] " + utterance + " [/QUERY] "
                         "[CALL] " + target_json + " [/CALL]")
    schema_ids = tokenize(serialize(available_tools))  # the tool catalog, NOT the target!

    logits = model(input_ids, schema_ids)
    loss = cross_entropy(logits[:-1], input_ids[1:])   # next-token prediction
    optimizer.zero_grad()   # don't let gradients accumulate across steps
    loss.backward()
    optimizer.step()
```
8 epochs at batch 16, lr 3e-4 with AdamW. On a single RTX 5090 this run takes ~3 minutes. Loss curve drops from 4.65 to 0.73.
Stage 5 · Tasting Notes
After training, we run inference on the 100 held-out examples (the Heads cut) and compute three metrics:
| Metric | What it measures | Needle (this Spirit) |
|---|---|---|
| Tool-name accuracy | Did the model pick the right tool out of the catalog? | 78% (random baseline ~25%) |
| Arg-key F1 | Of the argument names it produced, how many matched gold? | 0.73 (p=0.85, r=0.64) |
| Exact-call accuracy | Did it match the FULL gold call — tool + every arg key + every arg value? | 3% (value-level prediction is the weak spot) |
Plus a samples table — the first 8 held-out predictions with their gold values, so you can eyeball where it succeeds and fails. Every Spirit ships with its failure cases.
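All three metrics are simple comparisons over parsed (predicted, gold) call pairs. A sketch of how they might be computed, assuming one call per example for brevity; the function shapes are illustrative, not distillery's exact eval code:

```python
# Sketch: Tasting Notes metrics over (predicted, gold) call-dict pairs,
# each of the form {"name": ..., "args": {...}}.
def tool_name_accuracy(pairs):
    return sum(p["name"] == g["name"] for p, g in pairs) / len(pairs)

def arg_key_f1(pairs):
    tp = fp = fn = 0
    for p, g in pairs:
        pred_keys, gold_keys = set(p["args"]), set(g["args"])
        tp += len(pred_keys & gold_keys)
        fp += len(pred_keys - gold_keys)
        fn += len(gold_keys - pred_keys)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def exact_call_accuracy(pairs):
    # Tool name + every arg key + every arg value must match.
    return sum(p == g for p, g in pairs) / len(pairs)
```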
Stage 6 (optional) · Bottling for deployment
The trained Spirit is just a PyTorch .pt file (model weights + tokenizer + Recipe). For production deployment we re-pack it:
| Format | Size | Runs on | Use when |
|---|---|---|---|
| PyTorch .pt | 249 MB | Python anywhere | Default; training continuation |
| ONNX | ~100 MB | Most runtimes: CPU/GPU/edge | Cross-language inference (Rust, JS, Go, Java) |
| GGUF q4 | ~25 MB | llama.cpp, mobile, embedded | Resource-constrained deployment |
| WASM | ~50 MB | The browser | Run entirely client-side, no backend |
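For the ONNX row, bottling is the standard `torch.onnx.export` path. A sketch, assuming the model's forward takes `(input_ids, schema_ids)` as above; file and axis names are illustrative:

```python
# Sketch: bottle the Spirit as ONNX for cross-language inference.
import torch

model.eval()
dummy_input = torch.zeros(1, 256, dtype=torch.long)   # max_seq_len: 256
dummy_schema = torch.zeros(1, 256, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input, dummy_schema),
    "needle_spirit.onnx",
    input_names=["input_ids", "schema_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq"}, "schema_ids": {1: "seq"}},
)
```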
What's NOT solved yet (the honest part)
- Argument-value accuracy. 3% exact-call is rough — the WordPiece tokenizer splits JSON values awkwardly. Byte-level BPE is the v0.2 fix.
- Only one teacher backend. Gemini Flash is wired. Claude and OpenAI providers are stubs.
- Single-shot recipes. No iterative refinement yet — you can't say "regenerate just the failed examples and retrain on the union."
- No quantization-aware training. We bottle to q4 GGUF post-hoc, which loses accuracy. v0.4 will add quantization-aware training.
- Tasting Notes are statistical. They don't catch semantic failures (predicting "Twitter" instead of "Instagram"). LLM-as-judge eval is on the roadmap.