📜 The Case

Your agent pipeline is 6 frontier LLM calls. Most of them don't need to be.

Modern AI products chain together half a dozen LLM calls per user action: intent, routing, redaction, classification, generation, validation. The frontier model is great at one of those. The other five are narrow tasks dressed up as general intelligence. That's the distillation opportunity.

The pipeline replacement story

Here's a typical customer-support pipeline. Every box is a separate call to Claude / GPT / Gemini.

Before: every step is a frontier call

user msg ──→ ┌────────────────────┐
             │  intent classifier │ ─→ $0.005 / 400 ms   (Claude/GPT)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │  PII redactor      │ ─→ $0.005 / 400 ms   (Claude/GPT)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │  urgency scorer    │ ─→ $0.005 / 400 ms   (Claude/GPT)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │  agent router      │ ─→ $0.005 / 400 ms   (Claude/GPT)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │ response generator │ ─→ $0.020 / 1500 ms  (Claude/GPT)  ← actually hard
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │ brand/tone check   │ ─→ $0.005 / 400 ms   (Claude/GPT)
             └──────────┬─────────┘
                        ↓
                   response
            ━━━━━━━━━━━━━━━━━━━━━━━━━━
            $0.045  /  3,500 ms  /  6 vendor dependencies

After: distill the narrow tasks, keep frontier for the hard one

user msg ──→ ┌────────────────────┐
             │ 🍾 Spirit: intent   │ ─→ $0 / 30 ms        (8M, CPU, local)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │ 🍾 Spirit: PII      │ ─→ $0 / 30 ms        (14M, CPU, local)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │ 🍾 Spirit: urgency  │ ─→ $0 / 30 ms        (8M, CPU, local)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │ 🍾 Spirit: router   │ ─→ $0 / 50 ms        (20M, CPU, local)
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │  Frontier LLM      │ ─→ $0.020 / 1500 ms   ← keep ONE call for the hard step
             └──────────┬─────────┘
                        ↓
             ┌────────────────────┐
             │ 🍾 Spirit: tone     │ ─→ $0 / 30 ms        (10M, CPU, local)
             └──────────┬─────────┘
                        ↓
                   response
            ━━━━━━━━━━━━━━━━━━━━━━━━━━
            $0.020  /  1,670 ms  /  1 vendor dependency
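
In code, the compound pipeline is ordinary orchestration: every narrow step is a local call to a small model, and only the generation step leaves the machine. Here's a minimal sketch of that shape; the Spirit wrapper, the model file names, and frontier_generate are hypothetical placeholders standing in for whatever local runtime and frontier SDK you actually use.

```python
from dataclasses import dataclass

@dataclass
class Spirit:
    """Hypothetical wrapper around a small distilled model running locally on CPU
    (in practice: an ONNX Runtime / llama.cpp-style session loaded from disk)."""
    path: str

    def __call__(self, text: str) -> str:
        ...  # load once, run the local model, return its output

# One Spirit per narrow step; the file names are illustrative.
intent  = Spirit("spirits/intent-8m.onnx")
pii     = Spirit("spirits/pii-14m.onnx")
urgency = Spirit("spirits/urgency-8m.onnx")
router  = Spirit("spirits/router-20m.onnx")
tone    = Spirit("spirits/tone-10m.onnx")

def frontier_generate(prompt: str) -> str:
    """The ONE remaining frontier call, via whichever vendor SDK you keep."""
    ...

def handle(user_msg: str) -> str:
    label    = intent(user_msg)                            # ~30 ms, local, $0
    clean    = pii(user_msg)                               # ~30 ms, local, $0
    priority = urgency(clean)                              # ~30 ms, local, $0
    queue    = router(f"{label} | {priority} | {clean}")   # ~50 ms, local, $0
    draft    = frontier_generate(                          # ~1,500 ms, the hard step
        f"Queue: {queue}\nUrgency: {priority}\nCustomer message:\n{clean}"
    )
    return tone(draft)                                     # ~30 ms, local, $0
```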

The deltas, at scale

| | Before (all frontier) | After (compound) | Δ |
|---|---|---|---|
| Cost per conversation | $0.045 | $0.020 | −56% |
| End-to-end latency | 3,500 ms | 1,670 ms | −52% |
| Vendor dependencies | 6 different LLM endpoints | 1 | −83% |
| Rate-limit surfaces | 6 separate quotas | 1 | −83% |
| Compliance audit scope | 6 vendor relationships | 1 vendor + your weights | huge |
| At 1M conversations/day | $45,000/day | $20,000/day | $9M/yr saved |
| At 10M conversations/day | $450,000/day | $200,000/day | $90M/yr saved |

Cost numbers use Claude Sonnet 4.6 pricing as of 2026-05. Latency is end-to-end including network. Numbers shift with model choice, but the shape of the win holds.
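
If you want to check the arithmetic, the deltas fall straight out of the per-step prices in the diagrams:

```python
# Per-conversation cost, using the per-step prices from the diagrams above.
narrow_steps = 5 * 0.005            # intent, PII, urgency, router, tone on a frontier API
generation   = 0.020                # the one genuinely hard step
before = narrow_steps + generation  # $0.045 per conversation
after  = generation                 # $0.020 once the narrow steps run locally

print(f"savings per conversation: {1 - after / before:.0%}")      # 56%
daily = (before - after) * 1_000_000                               # $25,000/day at 1M convs/day
print(f"at 1M conversations/day:  ${daily * 365 / 1e6:.1f}M/yr")   # ~$9.1M/yr
```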

Why this matters (the three walls)

Every team building with LLMs eventually hits the same three walls. The pipeline replacement above is what knocks each one down.

Wall 1

Cost

Frontier calls are $0.001 – $0.10 each. Multiply by every interaction, every retry, every tool turn. The bill is unbounded, and most of it pays for capability you don't use.

A distilled 20M-param Spirit costs $0.00 per call. Forever.

Wall 2

Latency

Even the fastest API roundtrip is 200-800 ms. For autocomplete, voice, and real-time agents, that's the floor you can't get below.

A Spirit on the same machine answers in 30-80 ms. No network at all.

Wall 3

Lock-in

Your product's reliability = the reliability of someone else's API. Outages, rate limits, model deprecations, ToS changes: all theirs to decide.

A bottled Spirit runs forever, exactly as it was the day you bottled it.

The three options, compared honestly

Distillation isn't always the answer. Here's where each one wins:

| | Frontier API | LoRA / Full FT | Distillation |
|---|---|---|---|
| Best for | General reasoning, novel tasks, prototyping | Capability extension on a generalist | Single, repeatable task in production |
| Final model size | ~1T+ (theirs) | 7B+ (same as base) | 5-50M |
| Inference target | Cloud API | GPU | CPU, edge, browser |
| Cost per call | $0.001 – $0.10 | ~$0.0001 (your GPU) | $0 |
| Latency | 200-800 ms | 50-200 ms | 30-80 ms |
| Vendor lock-in | Total | Base model only | None (you own the weights) |
| Setup cost | ~$0 | $$$$ (GPU + data) | ~$0.30 (teacher) + ~30 min GPU |
| Quality ceiling | Teacher's full capability | Base + your delta | ≤ teacher's capability on the narrow task |

The honest tradeoff: distillation gives up generality for size, speed, and ownership. If your task is narrow and you do it a lot, that's a trade you'll happily make.

How is 1.5T → 20M even possible?

A frontier LLM is a Swiss Army knife with a thousand blades. Your tool-calling endpoint uses one of them.

Distillation works because your task doesn't need the other 999 blades. The teacher generates synthetic examples that pin down exactly the surface area you care about, and the student learns just that surface.

TEACHER (1.5T params)                  STUDENT (20M params)
┌──────────────────────────┐           ┌──────────────┐
│  reasoning · code · poem │           │              │
│  translation · math      │           │  tool-call   │
│  rag · summarize · chat  │  ──→──→   │  parsing     │
│  ▶ tool-call parsing ◀   │           │              │
│  audio · vision · …      │           │              │
└──────────────────────────┘           └──────────────┘
   "knows everything"                    "knows the part you need"
                                          ~72,000× smaller
                                          ~10,000× cheaper at inference

The compression ratio is a function of how narrow your task is. Wide tasks (general assistant): hard to distill, large student needed. Narrow tasks (intent classifier, NER, function-call parser): drop 4-5 orders of magnitude with minor quality loss.
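
Mechanically, "pinning down the surface area" is a data-generation loop: sample inputs from the distribution your product actually sees, have the teacher label them, and fine-tune the small student on those pairs. A rough sketch is below; call_teacher, sample_input, and train_student are stand-ins for your frontier SDK, your input sampling, and an ordinary supervised fine-tuning run, not a specific library API.

```python
import json
import random

SEED_MESSAGES = [
    "cancel my order #4512",
    "I was double charged on the 3rd",
    "how do I change my shipping address?",
]

def call_teacher(prompt: str) -> str:
    """One frontier API call, via whichever vendor SDK you already use."""
    raise NotImplementedError

def sample_input() -> str:
    """In practice: real (redacted) logs, padded out with teacher-written paraphrases."""
    return random.choice(SEED_MESSAGES)

def generate_dataset(n: int = 5_000, path: str = "distill.jsonl") -> list[dict]:
    """Teacher-labelled pairs covering only the narrow slice the student must learn."""
    examples = []
    for _ in range(n):
        user_msg = sample_input()
        target = call_teacher(
            "Return the support tool call for this message as JSON:\n" + user_msg
        )
        examples.append({"input": user_msg, "output": target})
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return examples

def train_student(examples: list[dict]) -> None:
    """Ordinary supervised fine-tuning of a 5-50M parameter model on the pairs above."""
    raise NotImplementedError
```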

When to distill โ€” and when not to

✓ Good fit

  • The task is narrow and repeatable (function calling, classification, NER, routing, redaction)
  • You make many calls per user session
  • Latency matters (autocomplete, voice, on-device)
  • You need offline / edge / browser inference
  • Privacy / compliance forbids shipping data to a vendor
  • You want a model you can audit, fork, and version

✗ Bad fit

  • The task is open-ended ("be a helpful assistant")
  • Each call is unique (one-off creative writing)
  • You're still figuring out what the task IS; distill after you've validated
  • You don't have a held-out eval set to measure proof against (see the sketch after this list)
  • The teacher itself is unreliable on this task; a student can't beat a bad teacher
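
On those last two points: "proof" here just means agreement with the teacher on a held-out slice the student never trained on. A minimal sketch, assuming student and teacher are callables you supply:

```python
def agreement(student, teacher, holdout: list[str]) -> float:
    """Fraction of held-out inputs where the student reproduces the teacher's output.
    Exact match is the right bar for structured outputs (labels, tool calls);
    use a softer metric for free text."""
    hits = sum(student(x) == teacher(x) for x in holdout)
    return hits / len(holdout)

# e.g. only bottle the Spirit if agreement(student, teacher, holdout) >= 0.98
```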

The bet

The trajectory of frontier models is bigger, slower, more expensive, more political. The trajectory of efficient inference is smaller, faster, edge-runnable, owned.

Most production AI workloads are narrow tasks dressed up as general intelligence. Once you have a teacher that can do the task at all, the engineering problem is to compress it to where it costs nothing and runs everywhere.

That's the bet behind The Distillery. Not that frontier models stop mattering; they remain the source. But that the delivery vehicle for production AI is increasingly a tiny student trained on a tiny pile of teacher-generated examples, and that production pipelines stop being "frontier LLM all the way down" and start being "a constellation of cheap Spirits with one frontier call where the reasoning actually lives."

💼 For enterprise teams

Bleeding $$$ on frontier calls for narrow tasks?

If your stack looks like the "before" diagram above and your monthly LLM bill has a comma in it, we can help you replace it with a constellation of distilled Spirits, end to end.

Discovery

We audit your existing pipeline, identify the steps where distillation will work (and where it won't), and produce a ranked savings forecast.

2 weeks · fixed fee

Run

Monthly retainer: monitor drift, re-age Spirits on new data, ship updates, on-call for incidents. We act as your distillation team-of-one.

$5k/month · cancellable anytime

📨 Talk to us about your pipeline

No sales team. No SDR funnel. You email me, I read it. – Andrew