Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications

Phone-to-home chat benchmark results
A recent benchmark evaluated 8 local LLMs for phone-to-home chat applications, where inference runs on a home computer. The test comprised 640 evaluations (8 models × 8 datasets × 10 samples) on a Mac mini M4 Pro (24 GB).
Fitness formula and weighting
The composite fitness formula weighted three factors: 50% chat UX, 30% speed, and 20% shortform quality. This weighting prioritizes user experience for mobile applications where latency matters most.
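The weighting above can be sketched as a simple weighted sum. This is a minimal illustration, not the benchmark author's code: the function name `composite_fitness`, the dictionary keys, and the assumption that each sub-score is already normalized to a 0-100 scale are all hypothetical. Only the three weights come from the article.

```python
# Weights as stated in the article: 50% chat UX, 30% speed, 20% shortform quality.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {"chat_ux": 0.5, "speed": 0.3, "shortform_quality": 0.2}


def composite_fitness(scores: dict[str, float],
                      weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted sum of sub-scores.

    Assumption: every sub-score is already normalized to 0-100, so the
    result is also on a 0-100 scale. The article does not say how the
    original benchmark normalized its inputs.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Made-up sub-scores, for illustration only (not benchmark data).
example = {"chat_ux": 90.0, "speed": 80.0, "shortform_quality": 70.0}
fitness = composite_fitness(example)
```

With these made-up inputs the sum is 0.5 × 90 + 0.3 × 80 + 0.2 × 70 = 83. Because chat UX carries half the weight, a 10-point change in it moves the composite by 5 points, versus 2 points for the same change in shortform quality.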
Key findings
- Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model tested
- It achieved the lowest TTFT (11.2s), highest throughput (89.3 tok/s), and coolest thermals (45°C)
- GPT-OSS:20B, one of the larger models, passed 70% of tasks but ranked 6th because of its 25.4s mean TTFT
- Thermal performance varied significantly: Qwen3:14B peaked at 83°C, DeepSeek-R1:14B at 81°C
- Magistral:24B was excluded from the final ranking after triggering timeout loops and reaching a GPU temperature of 97°C
Why smaller models performed better
The benchmark suggests that for phone chat applications, a faster time to first token (TTFT) and lower thermal load matter more than raw accuracy. A model that scores 77.5% accuracy but takes 25s to produce its first token loses to one that scores 72.5% but responds in 11s. The thermal gap also matters for the reliability and longevity of personal hardware.
Independent analysis
An independent analysis using Claude on the same 640-evaluation dataset weighted reliability and TTFT more aggressively and reached a slightly different top-4 order, confirming that KPI weighting is a choice rather than ground truth.
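To illustrate how the choice of weights alone can reorder a ranking, here is a small sketch. Everything in it is hypothetical: the model names `model_a` and `model_b`, their sub-scores, and the alternative `SPEED_HEAVY_WEIGHTS` are invented for the example and are not taken from the benchmark or from the independent analysis (which reweighted reliability and TTFT, dimensions not modeled here).

```python
# Hypothetical sub-scores on a 0-100 scale; NOT benchmark data.
HYPOTHETICAL_SCORES = {
    "model_a": {"chat_ux": 95.0, "speed": 60.0, "shortform_quality": 70.0},
    "model_b": {"chat_ux": 70.0, "speed": 90.0, "shortform_quality": 75.0},
}

# The article's chat-oriented weighting.
CHAT_WEIGHTS = {"chat_ux": 0.5, "speed": 0.3, "shortform_quality": 0.2}
# An invented alternative that leans harder on speed.
SPEED_HEAVY_WEIGHTS = {"chat_ux": 0.2, "speed": 0.6, "shortform_quality": 0.2}


def rank(models: dict[str, dict[str, float]],
         weights: dict[str, float]) -> list[str]:
    """Return model names sorted from best to worst weighted score."""
    def fitness(name: str) -> float:
        return sum(weights[key] * models[name][key] for key in weights)
    return sorted(models, key=fitness, reverse=True)


chat_order = rank(HYPOTHETICAL_SCORES, CHAT_WEIGHTS)          # model_a first
speed_order = rank(HYPOTHETICAL_SCORES, SPEED_HEAVY_WEIGHTS)  # model_b first
```

The same two sets of sub-scores produce opposite orders under the two weightings, which is the sense in which a KPI weighting is a choice rather than ground truth.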
Use case considerations
The author notes that for different use cases like coding or long-form writing, the weighting formula would flip entirely, prioritizing quality over speed and chat UX.
📖 Read the full source: r/LocalLLaMA