Fine-tuned Qwen3-0.6B model outperforms 120B teacher on structured function calling

What this is
Distil Labs released a complete pipeline that fine-tunes a small 0.6B parameter Qwen3 model to outperform a 120B parameter teacher model on structured function calling tasks. The pipeline extracts production traces, generates synthetic training data, and trains a specialist model that's 200x smaller than the teacher.
Performance results
- Teacher (GPT-OSS-120B): 50.0% tool call equivalence
- Base Qwen3-0.6B (no fine-tuning): 10.3% tool call equivalence
- Fine-tuned Qwen3-0.6B: 79.5% tool call equivalence
The task is IoT smart home function calling: routing natural language commands like "turn on the kitchen lights" or "make me a coffee at 7am" to the correct function with the right parameters. Scoring requires an exact structured match on the function call; there is no fuzzy or partial credit.
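To make the metric concrete, here is a minimal sketch of an exact structured match in Python. The source does not spell out how "tool call equivalence" is computed, so the comparison rule (same function name, same arguments after JSON parsing, key order ignored) is an assumption, and the function name `set_light` and its parameters are hypothetical.

```python
import json


def parse_call(raw):
    """Parse a model response as a JSON tool call; return None if it is malformed."""
    try:
        call = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


def calls_equivalent(predicted_raw, expected):
    """Assumed rule: function name and arguments must match exactly."""
    call = parse_call(predicted_raw)
    if call is None:
        return False  # verbose or off-format responses count as misses
    same_name = call["name"] == expected["name"]
    same_args = call.get("arguments", {}) == expected.get("arguments", {})
    return same_name and same_args


def equivalence_rate(predictions, references):
    """Fraction of references whose prediction is an exact structured match."""
    if not references:
        return 0.0
    matches = sum(calls_equivalent(p, r) for p, r in zip(predictions, references))
    return matches / len(references)


# Hypothetical reference call for "turn on the kitchen lights".
reference = {"name": "set_light", "arguments": {"room": "kitchen", "state": "on"}}
predictions = [
    '{"name": "set_light", "arguments": {"state": "on", "room": "kitchen"}}',
    'Sure! I will call set_light for the kitchen.',
]
print(equivalence_rate(predictions, [reference, reference]))
```

Under this rule a response that wraps the call in extra prose scores zero, which is consistent with a general-purpose model losing points for verbose or slightly off-format output.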
Why the small model wins
The 120B teacher is a general-purpose model that has never seen these specific function schemas or user phrasing patterns. It often produces verbose or slightly off-format responses. The 0.6B student is a specialist trained exclusively on this task, so it nails the exact output format consistently.
Pipeline architecture
The three-stage pipeline:
- Data extraction: dlt extracts production traces from databases, APIs, cloud storage, or log aggregators and writes them to Hugging Face as clean Parquet datasets
- Automatic curation: An LLM judge scores and filters traces to select high-quality seed examples (no manual annotation required)
- Synthetic data generation and training: Distil Labs uses the traces as domain context, generates ~10,000 synthetic training examples with a large teacher, validates and filters them, then fine-tunes the student model
The key insight: rather than training on the raw traces directly, the pipeline uses them as context, so the synthetic data generator produces examples that match the real vocabulary, function schemas, and phrasing patterns of actual users.
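The idea of using traces as context rather than as training rows can be sketched as follows. This is an illustration only, not the Distil Labs API or prompt: the `judge_score` and `utterance` field names, the 0.8 score cutoff, and the prompt wording are all assumptions.

```python
import json


def select_seeds(scored_traces, min_score=0.8, limit=75):
    """Keep the highest-scoring traces, standing in for the LLM-judge curation step."""
    kept = [t for t in scored_traces if t["judge_score"] >= min_score]
    kept.sort(key=lambda t: t["judge_score"], reverse=True)
    return kept[:limit]


def build_generation_prompt(seeds, schemas, n_examples=5):
    """Put real utterances and function schemas into the teacher's prompt as context."""
    lines = [
        "You generate training examples for a smart home function-calling model.",
        "Available functions (JSON schemas):",
        json.dumps(schemas, indent=2),
        "Real user requests, for vocabulary and phrasing only:",
    ]
    lines += [f"- {seed['utterance']}" for seed in seeds]
    lines.append(
        f"Write {n_examples} new, varied requests with the correct function call for each."
    )
    return "\n".join(lines)


# Hypothetical traces that a judge model has already scored.
traces = [
    {"utterance": "turn on the kitchen lights", "judge_score": 0.95},
    {"utterance": "uh never mind", "judge_score": 0.20},
    {"utterance": "make me a coffee at 7am", "judge_score": 0.90},
]
schemas = [{"name": "set_light", "parameters": {"room": "string", "state": "string"}}]
print(build_generation_prompt(select_seeds(traces), schemas, n_examples=3))
```

The low-scoring trace never reaches the prompt, and the teacher sees only real phrasing plus the schemas, so its generated examples stay close to what users actually say.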
Dataset and practical details
- Used the Amazon MASSIVE dataset (16k+ utterances, 60 intents) as a stand-in for production traffic
- Filtered to IoT scenario with 9 smart home functions
- ~75 labeled seed examples were enough (automatic curation, zero manual annotation)
- Training completed in under 12 hours
- Model inference: under 50ms locally vs. 400-700ms for cloud API calls
- Model available in safetensors and GGUF formats on Hugging Face
Production considerations
The model scores 79.5% exact match, meaning roughly 1 in 5 queries may need a fallback. For production use, you'd want a confidence threshold that routes low-confidence predictions to a larger model.
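A minimal sketch of such a router is below. The source does not say how confidence should be measured, so the choice here (geometric mean of token probabilities, computed from token log-probabilities) is an assumption, as are the 0.9 threshold and the callable interfaces for both models.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of a generated sequence (assumed metric)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(query, small_model, large_model, threshold=0.9):
    """Use the small model's call when it is confident, else fall back.

    small_model(query) is assumed to return (call, token_logprobs);
    large_model(query) is assumed to return a call.
    """
    call, token_logprobs = small_model(query)
    if sequence_confidence(token_logprobs) >= threshold:
        return call, "small"
    return large_model(query), "large"


# Stub models for illustration; real ones would wrap local inference and an API client.
def confident_small(query):
    return {"name": "set_light", "arguments": {"room": "kitchen"}}, [-0.01, -0.02]


def unsure_small(query):
    return {"name": "set_light", "arguments": {}}, [-1.2, -0.9]


def large(query):
    return {"name": "make_coffee", "arguments": {"time": "07:00"}}


print(route("turn on the kitchen lights", confident_small, large)[1])
print(route("make me a coffee at 7am", unsure_small, large)[1])
```

The threshold would need tuning on held-out data, trading fallback volume (and the added cloud latency) against the error rate of calls accepted from the small model.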
📖 Read the full source: r/LocalLLaMA