VINTAGE 2026.05.13 · BATCH #001

NEEDLE

⚙ Tool Calling Spirit

An attention-only transformer, distilled from Gemini 2.5 Flash, that maps natural language to structured function calls across 15 tool categories. Designed for edge inference: a roughly 250 MB checkpoint that runs on CPU at a median latency of 45 ms per call, with no API dependency at inference.

★ 78 PROOF · PASSED CUTS · 20.7M PARAMS · CPU-RUNNABLE

Tool-name accuracy

78%

on held-out cuts

Arg-key F1

0.729

p=0.85 · r=0.64

Exact-call accuracy

3%

value-level prediction is our weak spot (fix planned for v0.2)

Final loss

0.732

8 epochs

Compression

72,000×

~1.5T → 20.7M
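
The headline figures above can be sanity-checked with a little arithmetic. The sketch below recomputes F1 from the rounded precision and recall shown on the card, and the compression ratio from the two stated parameter counts. It gives roughly 0.730 for F1; the small gap to the reported 0.729 is presumably rounding in the displayed p and r, which is an assumption rather than something the card states.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values exactly as displayed on the card.
arg_key_f1 = f1_score(0.85, 0.64)

# Parameter counts as stated on the card: ~1.5T teacher, 20.7M student.
teacher_params = 1.5e12
student_params = 20.7e6
compression = teacher_params / student_params

print(f"Arg-key F1 from rounded p/r: {arg_key_f1:.3f}")
print(f"Compression ratio: {compression:,.0f}x")
```

The ratio comes out a little above 72,000, which matches the rounded figure on the card.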

Tasting Notes

✓ Strengths

Strong tool-name selection on imperative queries (Set a timer..., Send a message to...)
High precision on argument keys (0.85) — when it produces an arg, it's usually a real one
Holds JSON structure consistently — never produces malformed nesting
Gracefully handles 3-6 tools in the context window (the typical user scenario)
Inference cost: ~$0/call, runs CPU-only, 45 ms median latency
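
For readers who want to reproduce a median-latency figure on their own hardware, a minimal timing harness might look like the sketch below. The `fake_predict` function is a hypothetical stand-in for the real model call, and the warm-up count is an arbitrary choice; the card does not describe how the 45 ms figure was measured.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(
    predict: Callable[[str], object],
    utterances: List[str],
    warmup: int = 3,
) -> float:
    """Median wall-clock latency of predict() in milliseconds."""
    # Warm-up calls are discarded so one-time setup cost does not skew the median.
    for text in utterances[:warmup]:
        predict(text)

    samples = []
    for text in utterances:
        start = time.perf_counter()
        predict(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_predict(text: str) -> dict:
    # Hypothetical placeholder: swap in the real model's inference call here.
    return {"name": "get_weather", "arguments": {"location": "Lisbon"}}


# Utterances taken from the sample predictions on this card.
queries = [
    "What's the weather in Lisbon tomorrow?",
    "Text my wife, 'Don't forget the milk!'",
] * 10

result_ms = median_latency_ms(fake_predict, queries)
print(f"median latency: {result_ms:.3f} ms")
```

The median is less sensitive to occasional slow calls than the mean, so a 45 ms median does not guarantee that every call finishes in under 50 ms.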

✕ Weaknesses

Confuses adjacent social platforms (predicts platform=Twitter for an Instagram prompt)
Arg-key recall (0.64) — sometimes drops optional but useful args like unit
Exact-call accuracy only 3% — value-level prediction is the soft spot
WordPiece tokenizer introduces whitespace noise in decoded JSON
Chained 2-tool calls (~20% of training data) are not covered by the eval set
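
One way to blunt the whitespace noise is to parse first and normalize afterwards. The sketch below assumes the noise shows up as stray spaces at the edges of strings and around underscores in identifiers, and that a decoded call is a JSON object with `name` and `arguments` fields. The card shows neither the actual noise pattern nor the output schema, so the sample string, the schema, and the helper names are all hypothetical.

```python
import json
import re


def _identifier(text: str) -> str:
    # Assumption: tool names and argument keys never contain whitespace.
    return re.sub(r"\s+", "", text)


def _value(value):
    # Free-text values keep their inner spaces; only the edges are trimmed.
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, dict):
        return {_identifier(k): _value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_value(v) for v in value]
    return value


def normalize_call(raw: str) -> dict:
    """Parse a decoded call and strip whitespace noise from it."""
    call = {_identifier(k): v for k, v in json.loads(raw).items()}
    return {
        "name": _identifier(call["name"]),
        "arguments": _value(call.get("arguments", {})),
    }


# Hypothetical example of tokenizer-induced spacing.
noisy = '{ " name " : " get _ weather " , " arguments " : { " location " : " Lisbon " } }'
cleaned = normalize_call(noisy)
print(cleaned)
```

Whitespace between JSON tokens is harmless to `json.loads`, so only the spaces that end up inside strings need handling.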

SAMPLE PREDICTIONS

Utterance	Gold	Predicted	Verdict
Hey, can you log my weight? I'm at 75 kilograms now.	`log_health_metric(metric=weight, value=75)`	`log_health_metric(metric=weight, value=75)`	✓ exact
Post on my Instagram story: 'Beach blast!'	`post_social(platform=Instagram, ...)`	`post_social(platform=Twitter, ...)`	✕ wrong value
Text my wife, 'Don't forget the milk!'	`send_message(contact=wife, message=...)`	`send_message(contact=wife, message=...)`	✓ exact
What's the weather in Lisbon tomorrow?	`get_weather(location=Lisbon, day=tomorrow)`	`get_weather(location=Lisbon)`	✕ missing arg
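
The verdicts in the table can be related to the three metrics with a small scoring sketch. It parses the `name(key=value, ...)` notation used above and scores one prediction against its gold call. The per-example definitions here are an assumption, since the card does not say how the eval aggregates them. Only rows whose arguments are fully written out are used; rows with elided arguments are skipped.

```python
import re
from typing import Dict, Tuple

Call = Tuple[str, Dict[str, str]]


def parse_call(text: str) -> Call:
    """Parse 'name(key=value, key=value)' into (name, {key: value})."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if match is None:
        raise ValueError(f"not a call: {text!r}")
    name, body = match.groups()
    args: Dict[str, str] = {}
    # Naive comma split: fine for these samples, not for values containing commas.
    for part in filter(None, (p.strip() for p in body.split(","))):
        key, _, value = part.partition("=")
        args[key.strip()] = value.strip()
    return name, args


def score(gold: Call, pred: Call) -> dict:
    """Tool-name match, arg-key precision/recall, and exact-call match."""
    gold_keys, pred_keys = set(gold[1]), set(pred[1])
    hits = len(gold_keys & pred_keys)
    return {
        "tool_name_ok": gold[0] == pred[0],
        "key_precision": hits / len(pred_keys) if pred_keys else 0.0,
        "key_recall": hits / len(gold_keys) if gold_keys else 0.0,
        "exact": gold == pred,
    }


weather = score(
    parse_call("get_weather(location=Lisbon, day=tomorrow)"),
    parse_call("get_weather(location=Lisbon)"),
)
weight = score(
    parse_call("log_health_metric(metric=weight, value=75)"),
    parse_call("log_health_metric(metric=weight, value=75)"),
)
print(weather)
print(weight)
```

The weather row shows the pattern behind the card's numbers: the tool name and every predicted key are correct, but a dropped argument halves key recall and rules out an exact match.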

Distillation Run

8 epochs over distilled examples. Final loss: 0.732.

Provenance

Source paper: Needle — Distilling Gemini Tool Calling into a 26M Model

Teacher: gemini-2.5-flash (~1.5T params)

Recipe: needle.tool-calling-v1.yaml

Distilled: 2026-05-13 by The Distillery v0.1 engine

Author: @cactus

Download Bottle

Self-contained — model weights, tokenizer, and recipe in a single artifact. No API key needed at inference.

PYTORCH

249 MB

For training continuation, HuggingFace, custom Python.

ONNX

100 MB (coming)

CPU/GPU inference, edge, browser via onnxruntime-web.

GGUF

25 MB (coming)

Quantized for llama.cpp, mobile, embedded targets.