Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications

✍️ OpenClawRadar | 📅 Published: April 20, 2026

Phone-to-home chat benchmark results

A recent benchmark evaluated 8 local LLMs for phone-to-home chat applications, where inference runs on a home computer. The test comprised 640 evaluations (8 models × 8 datasets × 10 samples) on a Mac mini M4 Pro with 24 GB of memory.

Fitness formula and weighting

The composite fitness formula weighted three factors: 50% chat UX, 30% speed, and 20% shortform quality. This weighting prioritizes user experience for mobile applications where latency matters most.
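
As a rough illustration of the weighting described above, the composite can be sketched as a weighted sum. The weights are those reported in the benchmark; the 0-100 scale for each sub-score, the key names, and the example values are assumptions made for illustration, since the source does not say how each factor was normalized.

```python
# Weights as reported in the article. The key names are illustrative.
WEIGHTS = {"chat_ux": 0.50, "speed": 0.30, "shortform_quality": 0.20}


def composite_fitness(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of sub-scores, each ASSUMED to be on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical sub-scores for illustration only, not benchmark data.
example = {"chat_ux": 90.0, "speed": 80.0, "shortform_quality": 70.0}
print(composite_fitness(example))  # 0.5*90 + 0.3*80 + 0.2*70
```

Because chat UX carries half the weight, a 10-point gain there moves the composite by 5 points, while the same gain in shortform quality moves it by only 2.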

Key findings

  • Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model tested
  • It achieved the lowest TTFT (11.2s), highest throughput (89.3 tok/s), and coolest thermals (45°C)
  • Larger models like GPT-OSS:20B passed 70% of tasks but ranked 6th due to 25.4s mean TTFT
  • Thermal performance varied significantly: Qwen3:14B peaked at 83°C, DeepSeek-R1:14B at 81°C
  • Magistral:24B was excluded from final ranking after triggering timeout loops and reaching 97°C GPU temperature

Why smaller models performed better

The benchmark suggests that for phone chat applications, a fast time to first token (TTFT) and low thermal load matter more than raw accuracy. A model that scores 77.5% accuracy but takes 25 s to produce its first token loses to one that scores 72.5% but starts responding in 11 s. The thermal gap also matters for the reliability and longevity of personal hardware.
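
The trade-off between accuracy and first-token latency can be made concrete with a small sketch. The accuracy and TTFT figures come from the comparison above; everything else is an assumption made for illustration: the linear mapping from TTFT to a 0-100 score, the 30 s ceiling, and the 30% latency weight. The benchmark's actual normalization is not given in the source.

```python
def latency_score(ttft_s: float, worst_ttft_s: float = 30.0) -> float:
    """Map TTFT to 0-100 (ASSUMED linear): 0 s -> 100, worst_ttft_s or more -> 0."""
    return max(0.0, 100.0 * (1.0 - ttft_s / worst_ttft_s))


def blended(accuracy_pct: float, ttft_s: float, latency_weight: float) -> float:
    """Blend accuracy and latency score; latency_weight is a hypothetical knob."""
    return (1.0 - latency_weight) * accuracy_pct + latency_weight * latency_score(ttft_s)


# Figures from the article's comparison; the 0.3 weight is an assumption.
slower_but_more_accurate = blended(77.5, 25.0, latency_weight=0.3)
faster_but_less_accurate = blended(72.5, 11.0, latency_weight=0.3)
print(faster_but_less_accurate > slower_but_more_accurate)
```

Under these assumptions the faster model comes out ahead, and setting `latency_weight` to 0 reverses the result, since only accuracy then counts.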

Independent analysis

An independent analysis using Claude on the same 640-evaluation dataset weighted reliability and TTFT more aggressively and reached a slightly different top-4 order, confirming that KPI weighting is a choice rather than ground truth.
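
A minimal sketch of why the ordering depends on the weights: the same sub-scores are ranked under two weightings. The first uses the 50/30/20 split reported for the benchmark; the second is a hypothetical speed-heavy alternative and is not the weighting used in the independent analysis. The model names and sub-scores are invented for illustration and are not benchmark data.

```python
def rank(models: dict, weights: dict) -> list:
    """Return model names sorted by weighted score, best first."""
    def score(name: str) -> float:
        return sum(weights[k] * models[name][k] for k in weights)
    return sorted(models, key=score, reverse=True)


# Hypothetical models and sub-scores (0-100), for illustration only.
models = {
    "a": {"chat_ux": 90.0, "speed": 60.0, "shortform_quality": 80.0},
    "b": {"chat_ux": 75.0, "speed": 95.0, "shortform_quality": 70.0},
    "c": {"chat_ux": 85.0, "speed": 70.0, "shortform_quality": 90.0},
}

reported = {"chat_ux": 0.5, "speed": 0.3, "shortform_quality": 0.2}
speed_heavy = {"chat_ux": 0.3, "speed": 0.6, "shortform_quality": 0.1}  # assumption

print(rank(models, reported))     # ['c', 'b', 'a']
print(rank(models, speed_heavy))  # ['b', 'c', 'a']
```

Nothing about the underlying measurements changes between the two calls; only the weights do, which is enough to swap the top two positions.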

Use case considerations

The author notes that for different use cases like coding or long-form writing, the weighting formula would flip entirely, prioritizing quality over speed and chat UX.

📖 Read the full source: r/LocalLLaMA
