Fine-tuned Qwen3 Small Models Outperform Frontier LLMs on Specific Tasks at Lower Cost

A systematic comparison of small distilled Qwen3 models against frontier API models shows that fine-tuned small language models can outperform larger, more expensive models on specific structured tasks.
Benchmark Results
The study compared Qwen3 models (0.6B to 8B parameters) against frontier APIs including GPT-5 nano/mini/5.2, Gemini 2.5 Flash Lite/Flash, Claude Haiku 4.5/Sonnet 4.6/Opus 4.6, and Grok 4.1 Fast/Grok 4 across 9 datasets. All distilled models were trained using open-weight teachers only, with as few as 50 examples. Inference was run on vLLM on a single H100.
Key Performance Findings
- Smart Home function calling: Qwen3-0.6B achieved 98.7% accuracy vs. Gemini Flash at 92.0%
- Text2SQL: distilled Qwen3-4B reached 98.0% vs. Claude Haiku at 98.7% and GPT-5 nano at 96.0%
- Cost comparison: Text2SQL cost per million requests was ~$3 for Qwen3-4B vs. $378 for Claude Haiku and $24 for GPT-5 nano
- Classification tasks: Distilled models performed within 0–1.5 percentage points of the best frontier option on Banking77, E-commerce, and TREC datasets
- Frontier advantage: on HotpotQA (open-ended reasoning plus world knowledge), the distilled model reached 92.0% vs. Claude Haiku's 98.0%
Performance Metrics
For Text2SQL with Qwen3-4B on H100:
- 222 RPS sustained
- p50: 390ms | p95: 640ms | p99: 870ms
- 7.6 GiB VRAM (BF16, no quantization)
- FP8 gave +15% throughput, −44% VRAM, no measurable accuracy loss in brief experiments
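The throughput figure above ties directly to the ~$3 per million requests quoted for Text2SQL. The sketch below redoes that arithmetic using the $2.40/hr H100 price from the article's methodology. The FP8 lines are an extrapolation and not a result reported in the source: they assume the +15% throughput and −44% VRAM apply directly to the BF16 numbers.

```python
H100_USD_PER_HOUR = 2.40   # H100 price used in the article's cost calculation
SUSTAINED_RPS = 222        # Qwen3-4B on Text2SQL, BF16
BF16_VRAM_GIB = 7.6


def cost_per_million_requests(usd_per_hour: float, rps: float) -> float:
    """GPU cost per one million requests at a sustained request rate."""
    requests_per_hour = rps * 3600
    return usd_per_hour / requests_per_hour * 1_000_000


bf16_cost = cost_per_million_requests(H100_USD_PER_HOUR, SUSTAINED_RPS)

# Assumption: the reported FP8 deltas scale the BF16 figures linearly.
fp8_cost = cost_per_million_requests(H100_USD_PER_HOUR, SUSTAINED_RPS * 1.15)
fp8_vram_gib = BF16_VRAM_GIB * (1 - 0.44)

print(f"BF16: ${bf16_cost:.2f} per 1M requests, {BF16_VRAM_GIB} GiB")
print(f"FP8 (estimated): ${fp8_cost:.2f} per 1M requests, {fp8_vram_gib:.2f} GiB")
```

This only covers GPU rental at full utilization; idle time, networking, and engineering overhead would raise the real figure.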
Methodology
- Same test sets, prompts, and evaluation criteria for all models
- Frontier models were run 3× per dataset (reporting mean ± std); distilled models were run at temperature=0
- Evaluation: exact-match for classification, tool_call_equivalence (JSON comparison with default parameter normalization) for function calling, Claude Sonnet 4.6 as LLM-judge for generation tasks
- Cost calculation: frontier = measured token usage × published pricing (Feb 2026); distilled = H100 at $2.40/hr ÷ sustained RPS
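To illustrate what "JSON comparison with default parameter normalization" might look like, here is a minimal sketch. It is not the benchmark's actual `tool_call_equivalence` implementation. The call format (`name` plus `arguments`), the `TOOL_DEFAULTS` table, and the `set_light` tool are all hypothetical, chosen only to show the idea: fill in omitted default arguments before comparing, so a call that leaves out a default counts as equal to one that spells it out.

```python
import json

# Hypothetical per-tool defaults; not taken from the benchmark's schemas.
TOOL_DEFAULTS = {
    "set_light": {"brightness": 100, "color": "white"},
}


def normalize(call: dict, defaults: dict) -> dict:
    """Fill in omitted arguments with the tool's default values."""
    args = {**defaults.get(call["name"], {}), **call.get("arguments", {})}
    return {"name": call["name"], "arguments": args}


def tool_calls_equivalent(predicted: str, expected: str,
                          defaults: dict = TOOL_DEFAULTS) -> bool:
    """Compare two JSON-encoded tool calls after default normalization."""
    try:
        pred, gold = json.loads(predicted), json.loads(expected)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    for call in (pred, gold):
        if not isinstance(call, dict) or "name" not in call:
            return False
    return normalize(pred, defaults) == normalize(gold, defaults)


gold = '{"name": "set_light", "arguments": {"room": "kitchen", "brightness": 100, "color": "white"}}'
pred = '{"name": "set_light", "arguments": {"room": "kitchen"}}'
print(tool_calls_equivalent(pred, gold))
```

Comparing parsed dictionaries instead of raw strings also makes the check insensitive to key order and whitespace.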
Practical Recommendations
- Use distilled models when: You have structured tasks, well-defined schemas, high volume, or data sovereignty needs
- Use frontier APIs when: You need broad world knowledge, freeform generation, or volume is low enough that cost doesn't matter
- Hybrid approach: Route between the two based on task requirements
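The recommendations above can be read as a simple routing rule. The sketch below is one possible encoding, not something described in the source: the `Task` fields, the `route` function, and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task descriptors, loosely mirroring the criteria above.
    structured: bool               # well-defined schema or label set
    needs_world_knowledge: bool    # open-ended reasoning over broad knowledge
    data_must_stay_local: bool = False


def route(task: Task) -> str:
    """Return which backend to use: "distilled" or "frontier"."""
    if task.data_must_stay_local:
        return "distilled"   # data sovereignty rules out external APIs
    if task.needs_world_knowledge or not task.structured:
        return "frontier"    # broad knowledge or freeform generation
    return "distilled"       # structured, schema-bound work


print(route(Task(structured=True, needs_world_knowledge=False)))
print(route(Task(structured=False, needs_world_knowledge=True)))
```

A production router would likely also weigh request volume and per-task accuracy measurements, which this sketch leaves out.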
Availability
All code, models, data, and evaluation scripts are open source on GitHub: https://github.com/distil-labs/inference-efficiency-benchmarks/
Full analysis with charts available on the blog: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay
📖 Read the full source: r/LocalLLaMA