Distilled Qwen3 Models Beat Frontier LLMs at 10x Lower Cost

Benchmark Results: Distilled vs. Frontier Models

Researchers compared small distilled models against frontier LLMs on 9 datasets spanning classification, function calling, QA, and open-book QA. All distilled models come from the Qwen3 family (0.6B to 8B parameters) and were trained with as few as 50 examples using open-weight teacher models; no frontier API outputs were used for training.

Key Performance Findings

Distilled models match or beat the best mid-tier frontier model (<$1/MTok input) on 6 of 9 tasks and effectively tie on a 7th
Text2SQL: distilled Qwen3-4B reaches 98.0%, vs 98.7% for Claude Haiku and 96.0% for GPT-5 nano, at a cost of $3 per million requests vs $378 and $24 respectively
Smart Home (function calling): Qwen3-0.6B scores 98.7% vs Gemini Flash's 92.0%
HotpotQA: Distilled models score 92.0% vs Haiku's 98.0%; open-ended reasoning that depends on world knowledge remains frontier territory
Classification tasks (Banking77, E-commerce, TREC): Distilled models are within 0-1.5 percentage points of the best frontier option

Inference Performance

Models were served with vLLM on a single H100. The Text2SQL 4B model delivered:

222 RPS sustained
p50: 390ms, p95: 640ms, p99: 870ms
7.6 GiB VRAM (BF16, no quantization)
FP8 quantization gave +15% throughput and -44% memory, with no accuracy loss observed in brief experiments
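
To make the FP8 deltas concrete, the sketch below applies the reported percentages to the BF16 baseline figures listed above. It assumes the +15% and -44% were measured against that same Text2SQL 4B baseline, which the summary does not state explicitly, so the results are projections and not measured values. The helper name apply_relative_change is made up for illustration.

```python
# BF16 baseline figures reported for the Text2SQL 4B model.
BF16_RPS = 222.0
BF16_VRAM_GIB = 7.6


def apply_relative_change(baseline: float, percent: float) -> float:
    """Return the baseline scaled by a signed percentage change."""
    return baseline * (1 + percent / 100)


# Assumption: the FP8 deltas are relative to the BF16 figures above.
fp8_rps = apply_relative_change(BF16_RPS, 15)
fp8_vram_gib = apply_relative_change(BF16_VRAM_GIB, -44)

print(f"Projected FP8 throughput: {fp8_rps:.1f} RPS")
print(f"Projected FP8 memory: {fp8_vram_gib:.2f} GiB")
```

Under that assumption the projection comes to roughly 255 RPS and about 4.3 GiB.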

Methodology

Same test sets, same prompts, same eval criteria across all models
Frontier models were run 3x per dataset (mean ± std reported); distilled models were run at temp=0
Eval: exact-match for classification, tool_call_equivalence (JSON comparison with default param normalization) for function calling, Claude Sonnet 4.6 as LLM-as-a-judge for generation
Cost: frontier = measured API token usage × published pricing (Feb 2026). Distilled = H100 rental at $2.40/hr ÷ requests served per hour at the measured sustained RPS
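
The cost method can be checked with a few lines of arithmetic. The sketch below is an illustration of the formulas as described, not the benchmark's own script. The function names are hypothetical, and the frontier-side function takes token counts and prices as inputs because the per-request token usage is not given in this summary.

```python
SECONDS_PER_HOUR = 3600


def distilled_cost_per_million_requests(gpu_usd_per_hour: float,
                                        sustained_rps: float) -> float:
    """GPU rental cost divided by requests served per hour, scaled to 1M requests."""
    requests_per_hour = sustained_rps * SECONDS_PER_HOUR
    return gpu_usd_per_hour / requests_per_hour * 1_000_000


def frontier_cost_per_million_requests(input_tokens: float,
                                       output_tokens: float,
                                       usd_per_mtok_in: float,
                                       usd_per_mtok_out: float) -> float:
    """Measured tokens per request times published per-MTok pricing, scaled to 1M requests."""
    usd_per_request = (input_tokens * usd_per_mtok_in
                       + output_tokens * usd_per_mtok_out) / 1_000_000
    return usd_per_request * 1_000_000


# Figures from the article: H100 at $2.40/hr, Text2SQL 4B at 222 RPS sustained.
text2sql_cost = distilled_cost_per_million_requests(2.40, 222)
print(f"${text2sql_cost:.2f} per million requests")
```

With the reported inputs this comes to about $3.00 per million requests, which agrees with the Text2SQL figure quoted in the findings.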

Practical Recommendations

Distill: structured tasks, well-defined schemas, high volume, data sovereignty requirements
Frontier API: broad world knowledge, freeform generation, low volume
Best setup: route between both
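
One way to picture the routing recommendation is a small decision function over task properties. This is a minimal sketch built only from the criteria listed above. The class names, field names, and the volume threshold are hypothetical and would need tuning for a real deployment.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    DISTILLED = "distilled"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class TaskProfile:
    has_fixed_schema: bool          # structured output, well-defined schema
    needs_world_knowledge: bool     # broad knowledge or freeform generation
    requests_per_day: int
    data_must_stay_on_prem: bool = False  # data sovereignty requirement


# Hypothetical cutoff for "high volume"; not taken from the benchmark.
HIGH_VOLUME_THRESHOLD = 100_000


def choose_backend(task: TaskProfile) -> Backend:
    """Pick a backend using the article's rules of thumb."""
    # Sovereignty requirements rule out an external API regardless of task type.
    if task.data_must_stay_on_prem:
        return Backend.DISTILLED
    # Open-ended or knowledge-heavy work stays on the frontier model.
    if task.needs_world_knowledge or not task.has_fixed_schema:
        return Backend.FRONTIER
    # Structured tasks favor the distilled model once volume is high enough.
    if task.requests_per_day >= HIGH_VOLUME_THRESHOLD:
        return Backend.DISTILLED
    return Backend.FRONTIER


sql_task = TaskProfile(has_fixed_schema=True, needs_world_knowledge=False,
                       requests_per_day=2_000_000)
research_task = TaskProfile(has_fixed_schema=False, needs_world_knowledge=True,
                            requests_per_day=500)
print(choose_backend(sql_task).value, choose_backend(research_task).value)
```

The ordering of the checks is a design choice: hard constraints such as data sovereignty come first, then capability, and cost-driven volume last.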

Available Resources

All code, models, data, and eval scripts are open source at https://github.com/distil-labs/inference-efficiency-benchmarks/

Full blog post with charts and per-dataset breakdowns: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay

📖 Read the full source: r/LocalLLaMA

