Benchmarks Show Distilled Models Match Frontier LLMs on Structured Tasks at 10x Lower Cost

Benchmark Results: Distilled vs. Frontier Models
Researchers compared small distilled models against frontier LLMs across 9 datasets covering classification, function calling, QA, and open-book QA. All distilled models come from the Qwen3 family (0.6B to 8B) and were trained with as few as 50 examples, using open-weight teacher models and no frontier API outputs.
Key Performance Findings
- Distilled models match or beat the best mid-tier frontier model (<$1/MTok input) on 6 of 9 tasks and effectively tie on a seventh
- Text2SQL: distilled Qwen3-4B hits 98.0% at $3 per million requests, vs. Claude Haiku at 98.7% ($378 per million requests) and GPT-5 nano at 96.0% ($24 per million requests)
- Smart Home (function calling): Qwen3-0.6B scores 98.7% vs Gemini Flash's 92.0%
- HotpotQA: Distilled models score 92.0% vs Haiku's 98.0% - open-ended reasoning with world knowledge remains frontier territory
- Classification tasks (Banking77, E-commerce, TREC): Distilled models are within 0-1.5 percentage points of the best frontier option
Inference Performance
Models were served via vLLM on a single H100. Measured performance for the distilled Text2SQL Qwen3-4B model:
- 222 RPS sustained
- p50: 390ms, p95: 640ms, p99: 870ms
- 7.6 GiB VRAM (BF16, no quantization)
- FP8 gave +15% throughput, -44% memory, no accuracy loss in brief experiments
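The sustained throughput figure is what drives the distilled cost number. As a minimal sketch of that arithmetic, assuming the H100 price of $2.40/hr stated under Methodology and full utilization at the measured 222 RPS (the function name is illustrative, not taken from the benchmark repo):

```python
def cost_per_million_requests(gpu_dollars_per_hour: float, sustained_rps: float) -> float:
    """Dollar cost of serving one million requests on a fully utilized GPU."""
    requests_per_hour = sustained_rps * 3600
    return gpu_dollars_per_hour / requests_per_hour * 1_000_000


# Figures from the post: H100 at $2.40/hr, 222 RPS sustained (BF16).
text2sql_cost = cost_per_million_requests(2.40, 222)
print(f"${text2sql_cost:.2f} per million requests")
```

This reproduces the roughly $3 per million requests quoted for Text2SQL. The estimate only holds when the GPU is kept busy; at lower utilization the effective per-request cost rises proportionally.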
Methodology
- Same test sets, same prompts, same eval criteria across all models
- Frontier models were run 3x per dataset (mean ± std reported); distilled models were run at temperature 0
- Eval: exact-match for classification, tool_call_equivalence (JSON comparison with default param normalization) for function calling, Claude Sonnet 4.6 as LLM-as-a-judge for generation
- Cost: frontier = measured API token usage × published pricing (Feb 2026). Distilled = H100 at $2.40/hr ÷ measured sustained RPS
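The function-calling metric compares tool calls as JSON after normalizing default parameters. The following is a rough sketch of what such a check could look like, not the repo's actual implementation: the call shape (`name` plus `arguments`), the `DEFAULTS` table, and the `set_light` function are all assumptions made for illustration.

```python
import json

# Hypothetical per-function default values; a real evaluator would
# presumably derive these from the tool schema.
DEFAULTS = {"set_light": {"brightness": 100}}


def normalize_call(call: dict, defaults: dict) -> dict:
    """Return the call with omitted parameters filled in from defaults."""
    args = call.get("arguments", {})
    if isinstance(args, str):  # some APIs return arguments as a JSON string
        args = json.loads(args)
    merged = {**defaults.get(call["name"], {}), **args}
    return {"name": call["name"], "arguments": merged}


def tool_calls_equivalent(predicted: dict, reference: dict, defaults: dict) -> bool:
    """Two calls match if name and normalized arguments are identical."""
    return normalize_call(predicted, defaults) == normalize_call(reference, defaults)


predicted = {"name": "set_light", "arguments": {"room": "kitchen"}}
reference = {"name": "set_light", "arguments": '{"brightness": 100, "room": "kitchen"}'}
print(tool_calls_equivalent(predicted, reference, DEFAULTS))
```

Under this scheme a model is not penalized for omitting a parameter whose default matches the reference value, and key order in the JSON does not matter.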
Practical Recommendations
- Distill: structured tasks, well-defined schemas, high volume, data sovereignty requirements
- Frontier API: broad world knowledge, freeform generation, low volume
- Best setup: route between both
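The post does not describe how routing would be implemented. As one hedged illustration, a router could dispatch on a known task label, sending structured tasks to the distilled model and everything else to a frontier API. The task labels, the `Request` type, and the backend names below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical labels for tasks where the benchmarks favor distilled models.
DISTILLED_TASKS = {"classification", "function_calling", "text2sql"}


@dataclass
class Request:
    task: str
    prompt: str


def route(request: Request) -> str:
    """Pick a backend name: 'distilled' for structured tasks, else 'frontier'."""
    return "distilled" if request.task in DISTILLED_TASKS else "frontier"


print(route(Request("text2sql", "List all customers who ordered in 2025")))
print(route(Request("open_ended_qa", "Which director was born first?")))
```

A static lookup like this only works when the task type is known at request time; anything that needs to infer the task from the prompt would require a classifier in front of it.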
Available Resources
All code, models, data, and eval scripts are open source at https://github.com/distil-labs/inference-efficiency-benchmarks/
Full blog post with charts and per-dataset breakdowns: https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay
📖 Read the full source: r/LocalLLaMA