The Case
Your agent pipeline is 6 frontier LLM calls. Most of them don't need to be.
Modern AI products chain together half a dozen LLM calls per user action: intent, routing, redaction, classification, generation, validation. The frontier model is great at one of those. The other five are narrow tasks dressed up as general intelligence. That's the distillation opportunity.
The pipeline replacement story
Here's a typical customer-support pipeline. Every box is a separate call to Claude / GPT / Gemini.
Before: every step is a frontier call

```
user msg ──▶ ┌──────────────────────┐
             │ intent classifier    │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ PII redactor         │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ urgency scorer       │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ agent router         │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ response generator   │ ◀─ $0.020 / 1500 ms (Claude/GPT) ← actually hard
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ brand/tone check     │ ◀─ $0.005 / 400 ms (Claude/GPT)
             └──────────┬───────────┘
                        ▼
                    response
─────────────────────────────────────────
$0.045 / 3,500 ms / 6 vendor dependencies
```

After: distill the narrow tasks, keep frontier for the hard one
```
user msg ──▶ ┌──────────────────────┐
             │ 🍾 Spirit: intent    │ ◀─ $0 / 30 ms (8M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: PII       │ ◀─ $0 / 30 ms (14M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: urgency   │ ◀─ $0 / 30 ms (8M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: router    │ ◀─ $0 / 50 ms (20M, CPU, local)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ Frontier LLM         │ ◀─ $0.020 / 1500 ms (keep ONE call for the hard step)
             └──────────┬───────────┘
                        ▼
             ┌──────────────────────┐
             │ 🍾 Spirit: tone      │ ◀─ $0 / 30 ms (10M, CPU, local)
             └──────────┬───────────┘
                        ▼
                    response
─────────────────────────────────────────
$0.020 / 1,670 ms / 1 vendor dependency
```

The deltas, at scale
| | Before (all frontier) | After (compound) | Δ |
|---|---|---|---|
| Cost per conversation | $0.045 | $0.020 | −56% |
| End-to-end latency | 3,500 ms | 1,670 ms | −52% |
| Vendor dependencies | 6 different LLM endpoints | 1 | −83% |
| Rate-limit surfaces | 6 separate quotas | 1 | −83% |
| Compliance audit scope | 6 vendor relationships | 1 vendor + your weights | huge |
| At 1M conversations/day | $45,000/day | $20,000/day | $9M/yr saved |
| At 10M conversations/day | $450,000/day | $200,000/day | $90M/yr saved |
Cost numbers use Claude Sonnet 4.6 pricing as of 2026-05. Latency is end-to-end including network. Numbers shift with model choice, but the shape of the win holds.
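In code, the "after" pipeline is a handful of local calls wrapped around one API call. Here's a minimal sketch, assuming each Spirit is exported as a small Hugging Face classifier on local disk; the model paths, label names, prompt, and fallback policy are all illustrative, not a fixed interface:

```python
# Compound pipeline sketch. Only the frontier call crosses the network;
# every path, label, and the escalation policy below are hypothetical.
import anthropic
from transformers import pipeline

CPU = -1  # transformers device index for CPU
intent  = pipeline("text-classification",  model="./spirits/intent-8m",  device=CPU)
pii     = pipeline("token-classification", model="./spirits/pii-14m",    device=CPU)
urgency = pipeline("text-classification",  model="./spirits/urgency-8m", device=CPU)
router  = pipeline("text-classification",  model="./spirits/router-20m", device=CPU)
tone    = pipeline("text-classification",  model="./spirits/tone-10m",   device=CPU)
client  = anthropic.Anthropic()

def redact(text: str, entities: list[dict]) -> str:
    # Replace each span the PII Spirit flags, right to left so offsets hold.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity']}]" + text[e["end"]:]
    return text

def handle(msg: str) -> str:
    meta = {
        "intent":  intent(msg)[0]["label"],    # $0, ~30 ms each, no network
        "urgency": urgency(msg)[0]["label"],
        "route":   router(msg)[0]["label"],
    }
    safe = redact(msg, pii(msg))
    draft = client.messages.create(            # the ONE frontier call
        model="claude-sonnet-4-5",             # illustrative model id
        max_tokens=512,
        messages=[{"role": "user", "content":
                   f"Support context: {meta}\n\nCustomer: {safe}\n\nDraft a reply."}],
    ).content[0].text
    if tone(draft)[0]["label"] != "on_brand":  # cheap local gate on the output
        draft = "[escalate to human review]"   # placeholder policy
    return draft
```

Note what the redaction step buys you: only the already-redacted text ever crosses the network, which is the same property that shrinks the compliance audit scope in the table above.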
Why this matters (the three walls)
Every team building with LLMs eventually hits the same three walls. The pipeline replacement above is what knocks each one down.
Cost
Frontier calls are $0.001–$0.10 each. Multiply by every interaction, every retry, every tool turn. The bill is unbounded, and most of it pays for capability you don't use.
A distilled 20M-param Spirit costs $0.00 per call. Forever.
Latency
Even the fastest API roundtrip is 200-800 ms. For autocomplete, voice, and real-time agents, that's the floor you can't get below.
A Spirit on the same machine answers in 30-80 ms. No network at all.
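You can measure that floor yourself: time one local call against one API roundtrip. A minimal sketch, with a stand-in path for a distilled Spirit:

```python
# Time a local CPU classifier; the model path is a hypothetical Spirit export.
import time
from transformers import pipeline

clf = pipeline("text-classification", model="./spirits/intent-8m", device=-1)  # CPU

t0 = time.perf_counter()
clf("my card was charged twice, please refund the duplicate")
print(f"local inference: {(time.perf_counter() - t0) * 1000:.0f} ms")  # tens of ms, zero network
```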
Lock-in
Your product's reliability = the reliability of someone else's API. Outages, rate limits, model deprecations, ToS changes: all theirs to decide.
A bottled Spirit runs forever, exactly as it was the day you bottled it.
The three options, compared honestly
Distillation isn't always the answer. Here's where each one wins:
| | Frontier API | LoRA / Full FT | Distillation |
|---|---|---|---|
| Best for | General reasoning, novel tasks, prototyping | Capability extension on a generalist | Single, repeatable task in production |
| Final model size | ~1T+ (theirs) | 7B+ (same as base) | 5-50M |
| Inference target | Cloud API | GPU | CPU, edge, browser |
| Cost per call | $0.001–$0.10 | ~$0.0001 (your GPU) | $0 |
| Latency | 200-800 ms | 50-200 ms | 30-80 ms |
| Vendor lock-in | Total | Base model only | None: you own the weights |
| Setup cost | ~$0 | $$$$ (GPU + data) | ~$0.30 (teacher) + ~30 min GPU |
| Quality ceiling | Teacher's full capability | Base + your delta | ≤ Teacher's capability on the narrow task |
The honest tradeoff: distillation gives up generality for size, speed, and ownership. If your task is narrow and you do it a lot, that's a trade you'll happily make.
How is 1.5T → 20M even possible?
A frontier LLM is a Swiss Army knife with a thousand blades. Your tool-calling endpoint uses one of them.
Distillation works because your task doesn't need the other 999 blades. The teacher generates synthetic examples that pin down exactly the surface area you care about, and the student learns just that surface.
```
TEACHER (1.5T params)                STUDENT (20M params)
┌───────────────────────────┐        ┌───────────────┐
│ reasoning · code · poem   │        │               │
│ translation · math        │        │  tool-call    │
│ rag · summarize · chat    │ ─────▶ │  parsing      │
│ ▶ tool-call parsing       │        │               │
│ audio · vision · …        │        │               │
└───────────────────────────┘        └───────────────┘
  "knows everything"               "knows the part you need"

              ~72,000× smaller
              ~10,000× cheaper at inference
```

The compression ratio is a function of how narrow your task is. Wide tasks (general assistant): hard to distill, large student needed. Narrow tasks (intent classifier, NER, function-call parser): drop 4-5 orders of magnitude with minor quality loss.
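Mechanically, the loop is short: the teacher labels (or generates) examples that pin down your task's surface, and a tiny student is fine-tuned on just those. A minimal sketch, assuming an Anthropic teacher and an off-the-shelf tiny encoder standing in for the student; the label set, prompt, and example messages are illustrative, not The Distillery's actual recipe:

```python
# Distillation sketch: teacher-generated data -> tiny student fine-tune.
import anthropic
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["billing", "shipping", "returns", "other"]  # hypothetical intent set
client = anthropic.Anthropic()

def teacher_label(text: str) -> str:
    # One cheap teacher call per example pins down the task surface.
    msg = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=10,     # illustrative model id
        messages=[{"role": "user", "content":
                   f"Classify this support message as one of {LABELS}. "
                   f"Reply with the label only.\n\n{text}"}])
    return msg.content[0].text.strip()

# 1. Teacher labels your raw logs into a synthetic training set.
raw_messages = ["where is my package?", "I was double charged last month"]  # stand-in for your logs
rows = [{"text": t, "label": LABELS.index(teacher_label(t))} for t in raw_messages]
ds = Dataset.from_list(rows).train_test_split(test_size=0.1)

# 2. The student learns just that surface. bert-tiny (~4M params) stands in
#    for a purpose-built 8-20M Spirit.
tok = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=len(LABELS))
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                          max_length=64), batched=True)

Trainer(model=model,
        args=TrainingArguments(output_dir="student", num_train_epochs=3,
                               per_device_train_batch_size=32),
        train_dataset=ds["train"], eval_dataset=ds["test"]).train()
```

The teacher-labeling pass is the "~$0.30 (teacher)" line in the comparison table above; the fine-tune is the "~30 min GPU" half of the setup cost.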
When to distill, and when not to
✅ Good fit
- The task is narrow and repeatable (function calling, classification, NER, routing, redaction)
- You make many calls per user session
- Latency matters (autocomplete, voice, on-device)
- You need offline / edge / browser inference
- Privacy / compliance forbids shipping data to a vendor
- You want a model you can audit, fork, and version
❌ Bad fit
- The task is open-ended ("be a helpful assistant")
- Each call is unique (one-off creative writing)
- You're still figuring out what the task IS; distill after you've validated
- You don't have a held-out eval set to measure proof against (see the sketch after this list)
- The teacher itself is unreliable on this task; a student can't beat a bad teacher
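The last two points are one habit: hold out a slice of teacher-labeled data the student never trains on, and measure agreement before shipping. A minimal sketch, reusing the names from the distillation example above:

```python
# Student-vs-teacher agreement on held-out data. `model`, `tok`, and `ds`
# come from the distillation sketch; ds["test"] was never trained on.
import torch

holdout = ds["test"]  # teacher-labeled rows the student has never seen

def student_label(text: str) -> int:
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        return model(**inputs).logits.argmax(-1).item()

agree = sum(student_label(r["text"]) == r["label"] for r in holdout) / len(holdout)
print(f"student matches teacher on {agree:.1%} of held-out examples")
# If the teacher itself fails your spot checks here, stop: distilling a bad
# teacher just makes the failure cheaper.
```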
The bet
The trajectory of frontier models is bigger, slower, more expensive, more political. The trajectory of efficient inference is smaller, faster, edge-runnable, owned.
Most production AI workloads are narrow tasks dressed up as general intelligence. Once you have a teacher that can do the task at all, the engineering problem is to compress it to where it costs nothing and runs everywhere.
That's the bet behind The Distillery. Not that frontier models stop mattering; they remain the source. But that the delivery vehicle for production AI is increasingly a tiny student trained on a tiny pile of teacher-generated examples, and that production pipelines stop being "frontier LLM all the way down" and start being "a constellation of cheap Spirits with one frontier call where the reasoning actually lives."
💼 For enterprise teams
Bleeding $$$ on frontier calls for narrow tasks?
If your stack looks like the "before" diagram above and your monthly LLM bill has a comma in it, we can help you replace it with a constellation of distilled Spirits, end to end.
Discovery
We audit your existing pipeline, identify the steps where distillation will work (and where it won't), and produce a ranked savings forecast.
2 weeks · fixed fee
Build
We distill each narrow task into a production-ready Spirit. Includes Tasting Notes, deployment package, integration code, and a 30-day maintenance handover.
$50k–200k · 6-12 weeks per pipeline
Run
Monthly retainer: monitor drift, re-age Spirits on new data, ship updates, on-call for incidents. We act as your distillation team-of-one.
$5k/month · cancellable anytime
No sales team. No SDR funnel. You email me, I read it. – Andrew