Qwen3 0.6B-8B Models Beat GPT-5, Gemini on 6/9 Tasks for ~$3 per Million Requests

A systematic comparison of small distilled Qwen3 models against frontier API models shows that fine-tuned small language models can outperform larger, more expensive models on specific structured tasks.

Benchmark Results

The study compared distilled Qwen3 models (0.6B to 8B parameters) against frontier APIs across 9 datasets. The frontier models were GPT-5 nano, GPT-5 mini, and GPT-5.2; Gemini 2.5 Flash Lite and Flash; Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6; and Grok 4.1 Fast and Grok 4. All distilled models were trained using open-weight teachers only, with as few as 50 examples. Distilled-model inference ran on vLLM on a single H100.

Key Performance Findings

Smart Home function calling: Qwen3-0.6B reached 98.7% accuracy vs. 92.0% for Gemini Flash
Text2SQL accuracy: distilled Qwen3-4B scored 98.0% vs. 98.7% for Claude Haiku and 96.0% for GPT-5 nano
Text2SQL cost per million requests: ~$3 for Qwen3-4B vs. $378 for Claude Haiku and $24 for GPT-5 nano
Classification tasks: distilled models scored within 0–1.5 percentage points of the best frontier option on the Banking77, E-commerce, and TREC datasets
Frontier advantage: on HotpotQA (open-ended reasoning plus world knowledge), the distilled model scored 92.0% vs. 98.0% for Claude Haiku

Performance Metrics

For Text2SQL with Qwen3-4B on H100:

Throughput: 222 RPS sustained
Latency: p50 390 ms | p95 640 ms | p99 870 ms
Memory: 7.6 GiB VRAM (BF16, no quantization)
FP8: +15% throughput, −44% VRAM, no measurable accuracy loss in brief experiments
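
The ~$3 per million requests figure follows from the sustained throughput above and the $2.40/hr H100 rate used in the cost methodology. The sketch below reproduces that arithmetic and applies the reported FP8 deltas to the BF16 baseline. The function and variable names are illustrative, and the FP8 values are projections derived here, not measurements from the study.

```python
H100_USD_PER_HOUR = 2.40  # hourly GPU rate stated in the cost methodology


def cost_per_million_requests(sustained_rps: float,
                              usd_per_hour: float = H100_USD_PER_HOUR) -> float:
    """Hardware cost in USD to serve one million requests at a sustained rate."""
    requests_per_hour = sustained_rps * 3600
    return usd_per_hour / requests_per_hour * 1_000_000


# BF16 baseline reported for Qwen3-4B on Text2SQL.
bf16_rps = 222.0
bf16_vram_gib = 7.6

# Reported FP8 deltas (+15% throughput, -44% VRAM) applied to the baseline.
# These are derived projections, not figures from the study.
fp8_rps = bf16_rps * 1.15
fp8_vram_gib = bf16_vram_gib * (1 - 0.44)

print(f"BF16: {cost_per_million_requests(bf16_rps):.2f} USD per 1M requests")
print(f"FP8 (projected): {fp8_rps:.0f} RPS, {fp8_vram_gib:.1f} GiB, "
      f"{cost_per_million_requests(fp8_rps):.2f} USD per 1M requests")
```

This assumes the GPU is kept saturated at the sustained rate; at lower utilization the per-request cost rises proportionally.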

Methodology

Same test sets, prompts, and evaluation criteria for all models
Frontier models were run 3× per dataset (reported as mean ± std); distilled models were run at temperature=0
Evaluation: exact-match for classification; tool_call_equivalence (JSON comparison with default parameter normalization) for function calling; Claude Sonnet 4.6 as LLM-judge for generation tasks
Cost calculation: frontier = measured token usage × published pricing (Feb 2026); distilled = H100 cost at $2.40/hr ÷ requests served per hour at the sustained RPS
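
To make the function-calling metric concrete, here is a minimal sketch of what a JSON comparison with default parameter normalization could look like. It is one plausible reading of the description above, not the benchmark's actual implementation: the `DEFAULTS` table, the `set_light` function, and the `name`/`arguments` call layout are all hypothetical.

```python
import json
from typing import Any

# Hypothetical per-function default arguments, for illustration only.
DEFAULTS: dict[str, dict[str, Any]] = {
    "set_light": {"brightness": 100},
}


def normalize(call: dict[str, Any]) -> dict[str, Any]:
    """Fill in omitted defaults so implicit and explicit defaults compare equal."""
    name = call["name"]
    args = {**DEFAULTS.get(name, {}), **call.get("arguments", {})}
    return {"name": name, "arguments": args}


def tool_calls_equivalent(predicted: str, reference: str) -> bool:
    """Compare two JSON-encoded tool calls after default normalization."""
    try:
        return normalize(json.loads(predicted)) == normalize(json.loads(reference))
    except (json.JSONDecodeError, KeyError, TypeError):
        # Unparseable or malformed output counts as a miss.
        return False


implicit = '{"name": "set_light", "arguments": {"room": "kitchen"}}'
explicit = '{"name": "set_light", "arguments": {"room": "kitchen", "brightness": 100}}'
print(tool_calls_equivalent(implicit, explicit))
```

Comparing parsed dictionaries rather than raw strings also makes the check insensitive to key order and whitespace.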

Practical Recommendations

Use distilled models when you have structured tasks, well-defined schemas, high volume, or data sovereignty needs
Use frontier APIs when you need broad world knowledge or freeform generation, or when volume is low enough that cost doesn't matter
Hybrid approach: Route between the two based on task requirements
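
The recommendations above can be sketched as a simple routing rule. The `Task` fields, the volume threshold, and the precedence order (data sovereignty first, then knowledge needs, then volume) are assumptions made for illustration; the source does not specify how a router should be built.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "high volume"; not a figure from the source.
HIGH_VOLUME_REQUESTS_PER_DAY = 100_000


@dataclass
class Task:
    structured_output: bool       # well-defined schema (SQL, tool call, label)
    needs_world_knowledge: bool   # open-ended reasoning over broad knowledge
    freeform_generation: bool     # unconstrained text output
    data_must_stay_local: bool    # data sovereignty requirement
    requests_per_day: int


def choose_backend(task: Task) -> str:
    """Return "distilled" or "frontier" for a task (assumed precedence order)."""
    if task.data_must_stay_local:
        # Treated as a hard constraint: data cannot go to an external API.
        return "distilled"
    if task.needs_world_knowledge or task.freeform_generation:
        return "frontier"
    if task.structured_output and task.requests_per_day >= HIGH_VOLUME_REQUESTS_PER_DAY:
        return "distilled"
    # Low volume: per-request cost matters little, so skip the training effort.
    return "frontier"


text2sql = Task(structured_output=True, needs_world_knowledge=False,
                freeform_generation=False, data_must_stay_local=False,
                requests_per_day=1_000_000)
print(choose_backend(text2sql))
```

In practice the threshold would come from comparing API spend against GPU and fine-tuning costs for the workload in question.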

Availability

All code, models, data, and evaluation scripts are open source on GitHub: https://github.com/distil-labs/inference-efficiency-benchmarks/

Full analysis with charts available on the blog: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay

📖 Read the full source: r/LocalLLaMA
