The Case
Your agent pipeline is 6 frontier LLM calls. Most of them don't need to be.
Modern AI products chain together half a dozen LLM calls per user action: intent, routing, redaction, classification, generation, validation. The frontier model is great at one of those. The other five are narrow tasks dressed up as general intelligence. That's the distillation opportunity.
The pipeline replacement story
Here's a typical customer-support pipeline. Every box is a separate call to Claude / GPT / Gemini.
Before: every step is a frontier call

```
user msg ──▶ ┌──────────────────────┐
             │ intent classifier    │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ PII redactor         │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ urgency scorer       │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ agent router         │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ response generator   │ ◀─ $0.020 / 1500 ms (Claude/GPT) ← actually hard
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ brand/tone check     │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
                    response
─────────────────────────────────────────
$0.045 / 3,500 ms / 6 vendor dependencies
```

After: distill the narrow tasks, keep frontier for the hard one
```
user msg ──▶ ┌──────────────────────┐
             │ 🍾 Spirit: intent    │ ◀─ $0 / 30 ms (8M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: PII       │ ◀─ $0 / 30 ms (14M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: urgency   │ ◀─ $0 / 30 ms (8M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: router    │ ◀─ $0 / 50 ms (20M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ Frontier LLM         │ ◀─ $0.020 / 1500 ms (keep ONE call for the hard step)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: tone      │ ◀─ $0 / 30 ms (10M, CPU, local)
             └──────────┬───────────┘
                        ▼
                    response
─────────────────────────────────────────
$0.020 / 1,670 ms / 1 vendor dependency
```

The deltas, at scale
| | Before (all frontier) | After (compound) | Δ |
|---|---|---|---|
| Cost per conversation | $0.045 | $0.020 | −56% |
| End-to-end latency | 3,500 ms | 1,670 ms | −52% |
| Vendor dependencies | 6 different LLM endpoints | 1 | −83% |
| Rate-limit surfaces | 6 separate quotas | 1 | −83% |
| Compliance audit scope | 6 vendor relationships | 1 vendor + your weights | huge |
| At 1M conversations/day | $45,000/day | $20,000/day | $9M/yr saved |
| At 10M conversations/day | $450,000/day | $200,000/day | $90M/yr saved |
Cost numbers use Claude Sonnet 4.6 pricing as of 2026-05. Latency is end-to-end including network. Numbers shift with model choice, but the shape of the win holds.
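In code, the "after" pipeline is a handful of local calls wrapped around one API call. Here's a minimal sketch, assuming each Spirit is exported as a small Hugging Face classifier on local disk; the model paths, label names, prompt, and fallback policy are all illustrative, not a fixed interface:

```python
# Compound pipeline sketch. Only the frontier call crosses the network;
# every path, label, and the escalation policy below are hypothetical.
import anthropic
from transformers import pipeline

CPU = -1  # transformers device index for CPU
intent  = pipeline("text-classification",  model="./spirits/intent-8m",  device=CPU)
pii     = pipeline("token-classification", model="./spirits/pii-14m",    device=CPU)
urgency = pipeline("text-classification",  model="./spirits/urgency-8m", device=CPU)
router  = pipeline("text-classification",  model="./spirits/router-20m", device=CPU)
tone    = pipeline("text-classification",  model="./spirits/tone-10m",   device=CPU)
client  = anthropic.Anthropic()

def redact(text: str, entities: list[dict]) -> str:
    # Replace each span the PII Spirit flags, right to left so offsets hold.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity']}]" + text[e["end"]:]
    return text

def handle(msg: str) -> str:
    meta = {
        "intent":  intent(msg)[0]["label"],    # $0, ~30 ms each, no network
        "urgency": urgency(msg)[0]["label"],
        "route":   router(msg)[0]["label"],
    }
    safe = redact(msg, pii(msg))
    draft = client.messages.create(            # the ONE frontier call
        model="claude-sonnet-4-5",             # illustrative model id
        max_tokens=512,
        messages=[{"role": "user", "content":
                   f"Support context: {meta}\n\nCustomer: {safe}\n\nDraft a reply."}],
    ).content[0].text
    if tone(draft)[0]["label"] != "on_brand":  # cheap local gate on the output
        draft = "[escalate to human review]"   # placeholder policy
    return draft
```

Note what the redaction step buys you: only the already-redacted text ever crosses the network, which is the same property that shrinks the compliance audit scope in the table above.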
Why this matters (the three walls)
Every team building with LLMs eventually hits the same three walls. The pipeline replacement above is what knocks each one down.
Cost
Frontier calls are $0.001–$0.10 each. Multiply by every interaction, every retry, every tool turn. The bill is unbounded, and most of it pays for capability you don't use.
A distilled 20M-param Spirit costs $0.00 per call. Forever.
Latency
Even the fastest API roundtrip is 200-800 ms. For autocomplete, voice, and real-time agents, that's the floor you can't get below.
A Spirit on the same machine answers in 30-80 ms. No network at all.
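You can measure that floor yourself: time one local call against one API roundtrip. A minimal sketch, with a stand-in path for a distilled Spirit:

```python
# Time a local CPU classifier; the model path is a hypothetical Spirit export.
import time
from transformers import pipeline

clf = pipeline("text-classification", model="./spirits/intent-8m", device=-1)  # CPU

t0 = time.perf_counter()
clf("my card was charged twice, please refund the duplicate")
print(f"local inference: {(time.perf_counter() - t0) * 1000:.0f} ms")  # tens of ms, zero network
```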
Lock-in
Your product's reliability = the reliability of someone else's API. Outages, rate limits, model deprecations, ToS changes: all theirs to decide.
A bottled Spirit runs forever, exactly as it was the day you bottled it.
The three options, compared honestly
Distillation isn't always the answer. Here's where each one wins:
| | Frontier API | LoRA / Full FT | Distillation |
|---|---|---|---|
| Best for | General reasoning, novel tasks, prototyping | Capability extension on a generalist | Single, repeatable task in production |
| Final model size | ~1T+ (theirs) | 7B+ (same as base) | 5-50M |
| Inference target | Cloud API | GPU | CPU, edge, browser |
| Cost per call | $0.001–$0.10 | ~$0.0001 (your GPU) | $0 |
| Latency | 200-800 ms | 50-200 ms | 30-80 ms |
| Vendor lock-in | Total | Base model only | None: you own the weights |
| Setup cost | ~$0 | $$$$ (GPU + data) | ~$0.30 (teacher) + ~30 min GPU |
| Quality ceiling | Teacher's full capability | Base + your delta | ≤ Teacher's capability on the narrow task |
The honest tradeoff: distillation gives up generality for size, speed, and ownership. If your task is narrow and you do it a lot, that's a trade you'll happily make.
How is 1.5T → 20M even possible?
A frontier LLM is a Swiss Army knife with a thousand blades. Your tool-calling endpoint uses one of them.
Distillation works because your task doesn't need the other 999 blades. The teacher generates synthetic examples that pin down exactly the surface area you care about, and the student learns just that surface.
```
TEACHER (1.5T params)                STUDENT (20M params)
┌───────────────────────────┐        ┌───────────────┐
│ reasoning · code · poem   │        │               │
│ translation · math        │        │  tool-call    │
│ rag · summarize · chat    │ ─────▶ │  parsing      │
│ ▶ tool-call parsing       │        │               │
│ audio · vision · …        │        │               │
└───────────────────────────┘        └───────────────┘
  "knows everything"               "knows the part you need"

              ~72,000× smaller
              ~10,000× cheaper at inference
```

The compression ratio is a function of how narrow your task is. Wide tasks (general assistant): hard to distill, large student needed. Narrow tasks (intent classifier, NER, function-call parser): drop 4-5 orders of magnitude with minor quality loss.
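Mechanically, the loop is short: the teacher labels (or generates) examples that pin down your task's surface, and a tiny student is fine-tuned on just those. A minimal sketch, assuming an Anthropic teacher and an off-the-shelf tiny encoder standing in for the student; the label set, prompt, and example messages are illustrative, not The Distillery's actual recipe:

```python
# Distillation sketch: teacher-generated data -> tiny student fine-tune.
import anthropic
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["billing", "shipping", "returns", "other"]  # hypothetical intent set
client = anthropic.Anthropic()

def teacher_label(text: str) -> str:
    # One cheap teacher call per example pins down the task surface.
    msg = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=10,     # illustrative model id
        messages=[{"role": "user", "content":
                   f"Classify this support message as one of {LABELS}. "
                   f"Reply with the label only.\n\n{text}"}])
    return msg.content[0].text.strip()

# 1. Teacher labels your raw logs into a synthetic training set.
raw_messages = ["where is my package?", "I was double charged last month"]  # stand-in for your logs
rows = [{"text": t, "label": LABELS.index(teacher_label(t))} for t in raw_messages]
ds = Dataset.from_list(rows).train_test_split(test_size=0.1)

# 2. The student learns just that surface. bert-tiny (~4M params) stands in
#    for a purpose-built 8-20M Spirit.
tok = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=len(LABELS))
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                          max_length=64), batched=True)

Trainer(model=model,
        args=TrainingArguments(output_dir="student", num_train_epochs=3,
                               per_device_train_batch_size=32),
        train_dataset=ds["train"], eval_dataset=ds["test"]).train()
```

The teacher-labeling pass is the "~$0.30 (teacher)" line in the comparison table above; the fine-tune is the "~30 min GPU" half of the setup cost.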
When to distill, and when not to
✅ Good fit
- The task is narrow and repeatable (function calling, classification, NER, routing, redaction)
- You make many calls per user session
- Latency matters (autocomplete, voice, on-device)
- You need offline / edge / browser inference
- Privacy / compliance forbids shipping data to a vendor
- You want a model you can audit, fork, and version
❌ Bad fit
- The task is open-ended ("be a helpful assistant")
- Each call is unique (one-off creative writing)
- You're still figuring out what the task IS; distill after you've validated
- You don't have a held-out eval set to measure proof against (see the sketch after this list)
- The teacher itself is unreliable on this task; a student can't beat a bad teacher
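The last two points are one habit: hold out a slice of teacher-labeled data the student never trains on, and measure agreement before shipping. A minimal sketch, reusing the names from the distillation example above:

```python
# Student-vs-teacher agreement on held-out data. `model`, `tok`, and `ds`
# come from the distillation sketch; ds["test"] was never trained on.
import torch

holdout = ds["test"]  # teacher-labeled rows the student has never seen

def student_label(text: str) -> int:
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        return model(**inputs).logits.argmax(-1).item()

agree = sum(student_label(r["text"]) == r["label"] for r in holdout) / len(holdout)
print(f"student matches teacher on {agree:.1%} of held-out examples")
# If the teacher itself fails your spot checks here, stop: distilling a bad
# teacher just makes the failure cheaper.
```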
The bet
The trajectory of frontier models is bigger, slower, more expensive, more political. The trajectory of efficient inference is smaller, faster, edge-runnable, owned.
Most production AI workloads are narrow tasks dressed up as general intelligence. Once you have a teacher that can do the task at all, the engineering problem is to compress it to where it costs nothing and runs everywhere.
That's the bet behind The Distillery. Not that frontier models stop mattering; they remain the source. But that the delivery vehicle for production AI is increasingly a tiny student trained on a tiny pile of teacher-generated examples, and that production pipelines stop being "frontier LLM all the way down" and start being "a constellation of cheap Spirits with one frontier call where the reasoning actually lives."
💼 For enterprise teams
Bleeding $$$ on frontier calls for narrow tasks?
If your stack looks like the "before" diagram above and your monthly LLM bill has a comma in it, we can help you replace it with a constellation of distilled Spirits, end to end.
Discovery
We audit your existing pipeline, identify the steps where distillation will work (and where it won't), and produce a ranked savings forecast.
2 weeks · fixed fee
Build
We distill each narrow task into a production-ready Spirit. Includes Tasting Notes, deployment package, integration code, and a 30-day maintenance handover.
$50k–200k · 6-12 weeks per pipeline
Run
Monthly retainer: monitor drift, re-age Spirits on new data, ship updates, on-call for incidents. We act as your distillation team-of-one.
$5k/month · cancellable anytime
No sales team. No SDR funnel. You email me, I read it. – Andrew