Benchmarks Show Distilled Models Match Frontier LLMs on Structured Tasks at 10x Lower Cost

Benchmark Results: Distilled vs. Frontier Models
Researchers compared small distilled models against frontier LLMs across 9 datasets covering classification, function calling, QA, and open-book QA. All distilled models come from the Qwen3 family (0.6B to 8B parameters) and were trained from as few as 50 examples using open-weight teacher models; no frontier API outputs were used for training.
Key Performance Findings
- Distilled models match or beat the best mid-tier frontier model (<$1/MTok input) on 6 of 9 tasks and effectively tie on a 7th
- Text2SQL: distilled Qwen3-4B hits 98.0% vs 98.7% for Claude Haiku and 96.0% for GPT-5 nano, at $3 per million requests vs $378 and $24 respectively
- Smart Home (function calling): Qwen3-0.6B scores 98.7% vs Gemini Flash's 92.0%
- HotpotQA: Distilled models score 92.0% vs Haiku's 98.0%; open-ended reasoning with world knowledge remains frontier territory
- Classification tasks (Banking77, E-commerce, TREC): Distilled models are within 0-1.5 percentage points of the best frontier option
Inference Performance
Models were served via vLLM on a single H100. The Text2SQL Qwen3-4B model performed as follows:
- 222 RPS sustained
- p50: 390ms, p95: 640ms, p99: 870ms
- 7.6 GiB VRAM (BF16, no quantization)
- FP8 gave +15% throughput, -44% memory, no accuracy loss in brief experiments
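The sustained throughput figure is what drives the per-request cost of the distilled model. As a rough sketch of that arithmetic, assuming the $2.40/hr H100 rate quoted in the methodology and a GPU kept fully busy (the function name is illustrative, not taken from the benchmark code):

```python
def cost_per_million_requests(gpu_usd_per_hour: float, sustained_rps: float) -> float:
    """Amortized GPU cost, in USD, of serving one million requests."""
    if sustained_rps <= 0:
        raise ValueError("sustained_rps must be positive")
    requests_per_hour = sustained_rps * 3600
    return gpu_usd_per_hour / requests_per_hour * 1_000_000


# Figures reported for the Text2SQL 4B model: 222 RPS on an H100 at $2.40/hr.
text2sql_cost = cost_per_million_requests(gpu_usd_per_hour=2.40, sustained_rps=222)
print(f"${text2sql_cost:.2f} per million requests")  # $3.00 per million requests
```

This reproduces the roughly $3 per million requests cited for Text2SQL. The number only holds at sustained load; an under-utilized GPU raises the effective cost proportionally.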
Methodology
- Same test sets, same prompts, same eval criteria across all models
- Frontier models were run 3x per dataset (mean ± std reported); distilled models were run at temperature 0
- Eval: exact-match for classification, tool_call_equivalence (JSON comparison with default param normalization) for function calling, Claude Sonnet 4.6 as LLM-as-a-judge for generation
- Cost: frontier = measured API token usage × published pricing (Feb 2026). Distilled = H100 at $2.40/hr ÷ measured sustained RPS
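The function-calling metric is described only as a JSON comparison with default parameter normalization. One plausible reading is sketched below; it is not the benchmark's actual implementation (that lives in the linked repository). The call shape (`name` plus `arguments`), the `DEFAULTS` table, and both function names are assumptions made for illustration.

```python
import json


def normalize_call(call: dict, defaults: dict) -> dict:
    """Return the call with omitted parameters filled in from the tool's defaults."""
    name = call["name"]
    args = call.get("arguments", {})
    if isinstance(args, str):  # some APIs return arguments as a JSON string
        args = json.loads(args)
    # Explicit arguments override defaults; missing ones fall back to them.
    merged = {**defaults.get(name, {}), **args}
    return {"name": name, "arguments": merged}


def tool_calls_equivalent(predicted: dict, expected: dict, defaults: dict) -> bool:
    """Compare two tool calls after normalization; key order does not matter."""
    return normalize_call(predicted, defaults) == normalize_call(expected, defaults)


# Hypothetical smart-home tool whose brightness defaults to 100.
DEFAULTS = {"set_light": {"brightness": 100}}
predicted = {"name": "set_light", "arguments": {"room": "kitchen"}}
expected = {"name": "set_light", "arguments": '{"brightness": 100, "room": "kitchen"}'}
print(tool_calls_equivalent(predicted, expected, DEFAULTS))  # True
```

Normalizing before comparing avoids penalizing a model for omitting a parameter whose default matches the reference, which a plain string comparison would count as an error.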
Practical Recommendations
- Distill: structured tasks, well-defined schemas, high volume, data sovereignty requirements
- Frontier API: broad world knowledge, freeform generation, low volume
- Best setup: route between both
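The recommendations above can be read as a simple routing rule. The sketch below is one way to encode them; the `TaskProfile` fields, the `choose_backend` name, and the 10,000 requests/day threshold are all hypothetical and would need tuning against real traffic and pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    has_fixed_schema: bool        # structured output with a well-defined schema
    needs_world_knowledge: bool   # broad knowledge or freeform generation
    requests_per_day: int
    data_must_stay_local: bool = False  # data sovereignty requirement


def choose_backend(task: TaskProfile, min_daily_volume: int = 10_000) -> str:
    """Return "distilled" or "frontier" following the article's recommendations."""
    if task.data_must_stay_local:
        return "distilled"  # sovereignty rules out an external API
    if task.needs_world_knowledge or not task.has_fixed_schema:
        return "frontier"   # open-ended tasks remain frontier territory
    if task.requests_per_day < min_daily_volume:
        return "frontier"   # low volume rarely justifies a dedicated GPU
    return "distilled"


text2sql = TaskProfile(has_fixed_schema=True, needs_world_knowledge=False, requests_per_day=500_000)
multi_hop_qa = TaskProfile(has_fixed_schema=False, needs_world_knowledge=True, requests_per_day=500_000)
print(choose_backend(text2sql))      # distilled
print(choose_backend(multi_hop_qa))  # frontier
```

In practice the routing decision is usually made per task type at design time rather than per request, so a static table keyed by task can replace the function.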
Available Resources
All code, models, data, and eval scripts are open source at https://github.com/distil-labs/inference-efficiency-benchmarks/
Full blog post with charts and per-dataset breakdowns: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay
📖 Read the full source: r/LocalLLaMA