🧪 The Gallery

Use cases

Six concrete tasks people are distilling, plus the pattern that connects them. If your task fits the pattern, distillation will probably work. If it doesn't, don't force it.

The pattern

Every good distillation target shares the same three properties:

1. Narrow. The task is a single well-defined function: "given X, return Y." Not "be a helpful assistant."

2. Repeated. You'll run it thousands or millions of times. Each call is a chance to amortize the distillation cost.

3. Verifiable. You can write a held-out test set where each example has a clear correct answer. Otherwise Tasting Notes are meaningless.

Six examples

🛠 Tool calling for agents · ✓ shipped · 78° proof

Problem: Your agent makes a Gemini/Claude call for every tool-routing decision. At 10 turns per session × 1000 sessions/day × $0.001/call = $10/day per workflow. And the latency adds up.

Why distill: Tool calling is the canonical narrow-and-repeated task. The student just needs to look at the utterance + the catalog and emit a JSON call. A 20M model does this in 45 ms on CPU.

Reference Spirit: needle.tool-calling — 20.7M params, 15 tool categories, $0.30 to distill.
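
To make the contract concrete, here's a minimal sketch of the student's I/O shape. `route_tool_call` is a hypothetical stand-in for the distilled model, and the catalog and stub logic are ours, not needle.tool-calling's actual API:

```python
import json

# Illustrative catalog; a real one comes from your agent's tool registry.
TOOL_CATALOG = [
    {"name": "get_weather", "params": {"city": "str"}},
    {"name": "create_ticket", "params": {"title": "str", "body": "str"}},
]

def route_tool_call(utterance: str, catalog: list[dict]) -> dict:
    """Stand-in for the student: (utterance, catalog) -> one JSON tool call.

    A real Spirit runs a ~20M-param model here; this stub only shows the
    contract the student is trained to satisfy.
    """
    if "weather" in utterance.lower():
        return {"tool": "get_weather", "args": {"city": "Berlin"}}
    return {"tool": "create_ticket", "args": {"title": utterance, "body": ""}}

print(json.dumps(route_tool_call("what's the weather in Berlin?", TOOL_CATALOG)))
# {"tool": "get_weather", "args": {"city": "Berlin"}}
```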

🔒 PII redaction at the edge · ✓ shipped · 82° proof

Problem: You want to send user prompts to a frontier LLM, but compliance says "don't ship PII." Regex misses too much (slang names, obfuscated emails, context-dependent identifiers).

Why distill: Detection is local-only (the prompt never leaves the box). Sub-50ms means you can run it inline without adding perceptible latency.

Reference Spirit: pii-guard — 14M params, 0.82 F1.
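
A sketch of the inline wrapper, assuming the student returns character spans. `pii_student` is a hypothetical stand-in, not pii-guard's actual interface:

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int
    label: str  # e.g. "EMAIL", "NAME"

def pii_student(text: str) -> list[Span]:
    """Stand-in for the ~14M-param local detector; returns PII spans."""
    return [Span(12, 28, "EMAIL")]  # pretend the model flagged the email

def redact(text: str) -> str:
    # Replace spans right-to-left so earlier offsets stay valid.
    for s in sorted(pii_student(text), key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"<{s.label}>" + text[s.end:]
    return text

prompt = "Contact me: jane@example.com about the refund."
masked = redact(prompt)  # "Contact me: <EMAIL> about the refund."
# Send `masked` (never `prompt`) to the frontier LLM.
```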

🎯 Intent classifier / agent router · ✓ shipped · 74° proof

Problem: Your support chatbot needs to route incoming messages — "billing question? bug report? feature request?" — to the right team. You currently use a frontier model for this trivial decision.

Why distill: Intent classification is what BERT was good at five years ago. You don't need 1.5T parameters to tell billing apart from bugs. 8M will do.

Reference Spirit: routor — 8M params, multi-label macro-F1 0.74.
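
The whole router fits in a few lines once the classifier runs locally. A sketch, with `intent_student` and the route table as illustrative stand-ins:

```python
ROUTES = {"billing": "billing-team", "bug": "eng-triage", "feature": "product-inbox"}

def intent_student(message: str) -> dict[str, float]:
    """Stand-in for the ~8M-param multi-label classifier."""
    return {"billing": 0.91, "bug": 0.07, "feature": 0.12}

def route(message: str, threshold: float = 0.5) -> list[str]:
    scores = intent_student(message)
    hits = [label for label, p in scores.items() if p >= threshold]
    return [ROUTES[h] for h in hits] or ["human-review"]  # keep a fallback

print(route("why was I charged twice this month?"))  # ['billing-team']
```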

📊 Natural language → SQL · ○ coming v0.2

Problem: Your data product lets users ask questions in English. You can pipe them through GPT, but (a) your schema leaks to OpenAI, (b) the round-trip ruins interactivity, (c) the bill scales linearly with active users.

Why distill: A schema-aware student trained on your specific data model converges fast. Locally hosted = no schema leak. Latency-cheap = autocomplete-quality UX.

The recipe: data.sql-parse — teacher emits (question, schema_excerpt, target_sql) triples. Student is schema-conditioned via cross-attention (same as Needle).
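
A sketch of what one teacher-emitted triple might look like. The field names and the SQL dialect are assumptions, not the recipe's spec:

```python
# One training triple in the (question, schema_excerpt, target_sql) shape:
triple = {
    "question": "How many orders shipped last week?",
    "schema_excerpt": "orders(id, status, shipped_at)",
    "target_sql": (
        "SELECT COUNT(*) FROM orders "
        "WHERE status = 'shipped' "
        "AND shipped_at >= date('now', '-7 days')"
    ),
}
# The teacher emits thousands of these against YOUR schema; the student
# conditions on schema_excerpt and learns question -> target_sql.
```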

🧾 Receipt / form structurer · ○ coming v0.2

Problem: Expense apps need to turn OCR text → structured line items. Vendors charge per page, and users wait while their phone uploads the scan to your backend.

Why distill: On-device inference (via GGUF or browser-WASM) means scan-to-structured happens before the user looks up from their phone. Privacy + speed + zero per-call cost.

The recipe: data.receipt-ner — teacher generates synthetic receipt OCR text + the gold parse. Student learns the mapping at <25 MB quantized.
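
A sketch of the mapping the student learns, with an assumed output shape (the recipe may define different fields):

```python
ocr_text = """ACME MART
2x Coffee   7.00
Bagel       2.50
TOTAL       9.50"""

# Assumed target structure, for illustration only.
expected = {
    "vendor": "ACME MART",
    "items": [
        {"desc": "Coffee", "qty": 2, "price": 7.00},
        {"desc": "Bagel", "qty": 1, "price": 2.50},
    ],
    "total": 9.50,
}
# On-device, the <25 MB quantized student maps ocr_text -> expected
# before the scan ever leaves the phone.
```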

💻 Code-snippet explainer (IDE plugin) · ⚗ experimental

Problem: You want "hover for an English explanation" in your IDE plugin. Shipping every hover to an LLM is wasteful — most code is mundane and the explanations are repetitive.

Why distill: Inline IDE features need millisecond response. A 30M-param explainer fine-tuned on a teacher's outputs can answer "what does this function do?" without leaving the editor.

Status: No reference Spirit yet — but the pattern fits and the recipe is straightforward. Build it and share?
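
If you build it, the hover path is short. A sketch under the same assumptions, with `explain_student` as a hypothetical stand-in for the distilled model (caching matters because explanations are repetitive):

```python
from functools import lru_cache

def explain_student(snippet: str) -> str:
    """Stand-in for a local ~30M-param distilled explainer."""
    return "Swaps two variables without a temporary."

@lru_cache(maxsize=4096)
def explain(snippet: str) -> str:
    # Cache hit for repeated hovers; miss -> local model. Never a network call.
    return explain_student(snippet)

print(explain("a, b = b, a"))  # millisecond-class, fully offline
```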

Things you should NOT distill (be honest)

These tasks look distillable but burn the budget without good Tasting Notes:

✗ "A helpful assistant"

Too open-ended. The teacher's surface area is unbounded; you'll generate a corpus that mostly samples one slice. The student will look fine on canned questions and bomb on real ones.

✗ Creative writing

Output diversity is part of the deliverable. A student that memorizes 1000 teacher outputs produces 1000 templates. Use the teacher.

✗ Tasks you don't have a held-out eval for

Without a way to compute proof, you have no idea if the Spirit is good or just confident. Distill after you've shipped a manual eval set.

✗ Multi-step reasoning

Chain-of-thought is a property of scale. A 20M model can't do 6-step deductions reliably. Use the teacher; distill the final step if that's narrow and repeatable.

How to know if YOUR task fits

Answer these four questions honestly:

  1. Can you describe the task as a single function signature? If "given X, return Y" works, ✓.
  2. Will you call it more than a few thousand times? Setup is ~$0.30 + 30 min of GPU. You need volume to recoup.
  3. Do you have ~100 hand-labeled examples for the eval cut? Without these, you can't measure proof. (You don't need 1000 training examples — the teacher generates those.)
  4. Does the teacher itself do this task well? A student can match the teacher; it can't exceed it. If GPT/Claude/Gemini struggles, distillation won't help.

4 out of 4 yes → distill. 3 out of 4 → maybe, with caveats. ≤2 → don't.
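
The same rubric as a trivial scoring sketch; the thresholds map 1:1 to the verdicts above:

```python
def distillation_fit(narrow_signature: bool, high_volume: bool,
                     has_eval_set: bool, teacher_succeeds: bool) -> str:
    """Score the four questions; each True is one 'yes'."""
    score = sum([narrow_signature, high_volume, has_eval_set, teacher_succeeds])
    if score == 4:
        return "distill"
    if score == 3:
        return "maybe, with caveats"
    return "don't"

print(distillation_fit(True, True, True, False))  # maybe, with caveats
```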