Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications

Phone-to-home chat benchmark results
A recent benchmark evaluated 8 local LLMs for phone-to-home chat applications, where the phone acts as the client and inference runs on a home computer. The test comprised 640 evaluations (8 models × 8 datasets × 10 samples) on a Mac mini M4 Pro with 24 GB of memory.
Fitness formula and weighting
The composite fitness formula weighted three factors: 50% chat UX, 30% speed, and 20% shortform quality. This weighting prioritizes user experience for mobile applications where latency matters most.
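As a rough illustration, the weighting can be written as a plain weighted sum. This is a minimal sketch, not the author's actual code: the function name `composite_fitness`, the `WEIGHTS` dictionary, and the assumption that each sub-score is already normalized to a 0-100 scale are all hypothetical, since the source gives only the three weights.

```python
# Weights as stated in the benchmark summary (50% / 30% / 20%).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {"chat_ux": 0.50, "speed": 0.30, "shortform": 0.20}


def composite_fitness(chat_ux: float, speed: float, shortform: float) -> float:
    """Return the weighted sum of three sub-scores.

    Assumption: every sub-score is pre-normalized to the range 0-100.
    The source does not describe how the raw measurements were normalized.
    """
    scores = {"chat_ux": chat_ux, "speed": speed, "shortform": shortform}
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up sub-scores, used only to show the arithmetic.
    print(composite_fitness(chat_ux=80, speed=90, shortform=70))
```

With these made-up inputs the result is 0.5 × 80 + 0.3 × 90 + 0.2 × 70 = 81, up to floating-point rounding. Because chat UX carries half the weight, a 10-point gain there moves the composite by 5 points, while the same gain in shortform quality moves it by only 2.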
Key findings
- Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model tested
- It achieved the lowest time to first token (TTFT, 11.2s), the highest throughput (89.3 tok/s), and the coolest thermals (45°C)
- Larger models like GPT-OSS:20B passed 70% of tasks but ranked 6th due to 25.4s mean TTFT
- Thermal performance varied significantly: Qwen3:14B peaked at 83°C, DeepSeek-R1:14B at 81°C
- Magistral:24B was excluded from final ranking after triggering timeout loops and reaching 97°C GPU temperature
Why smaller models performed better
The benchmark suggests that, under this weighting, a short time to first token (TTFT) and a low thermal load matter more for phone chat applications than raw accuracy. A model that scores 77.5% accuracy but takes 25s to produce its first token loses to one that scores 72.5% but responds in 11s. The thermal gap also matters for the reliability and longevity of personal hardware.
Independent analysis
An independent analysis using Claude on the same 640-evaluation dataset weighted reliability and TTFT more heavily and reached a slightly different top-4 order, which illustrates that KPI weighting is a choice rather than ground truth.
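The sensitivity of a ranking to its weights can be sketched in a few lines. Everything here is hypothetical: the model names, the sub-scores, and both weight sets are invented for illustration and are not taken from the benchmark or from the independent analysis.

```python
def rank(models: dict, weights: dict) -> list:
    """Return model names sorted by weighted score, best first."""
    def score(sub_scores: dict) -> float:
        return sum(weights[key] * sub_scores[key] for key in weights)

    return sorted(models, key=lambda name: score(models[name]), reverse=True)


# Hypothetical sub-scores on a 0-100 scale (NOT benchmark data).
models = {
    "model_a": {"quality": 90, "speed": 60},
    "model_b": {"quality": 70, "speed": 85},
}

# Two hypothetical weightings of the same measurements.
quality_heavy = {"quality": 0.7, "speed": 0.3}
speed_heavy = {"quality": 0.3, "speed": 0.7}

if __name__ == "__main__":
    print(rank(models, quality_heavy))
    print(rank(models, speed_heavy))
```

The same two sets of measurements produce opposite orderings: the quality-heavy weights put `model_a` first, while the speed-heavy weights put `model_b` first. Neither ordering is wrong; each answers a different question about what matters.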
Use case considerations
The author notes that for different use cases like coding or long-form writing, the weighting formula would flip entirely, prioritizing quality over speed and chat UX.
📖 Read the full source: r/LocalLLaMA