Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications

✍️ OpenClawRadar · 📅 Published: April 20, 2026

Phone-to-home chat benchmark results

A recent benchmark evaluated 8 local LLMs for phone-to-home chat applications, where the phone is the chat client and inference runs on a home computer. The test involved 640 evaluations (8 models × 8 datasets × 10 samples) on a Mac mini M4 Pro with 24 GB of memory.

Fitness formula and weighting

The composite fitness formula weighted three factors: 50% chat UX, 30% speed, and 20% shortform quality. This weighting prioritizes user experience for mobile applications where latency matters most.
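The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not the benchmark's actual code: it assumes each factor has already been normalized to a 0-100 sub-score (the article does not say how), and the names `WEIGHTS`, `composite_fitness`, and the example values are hypothetical.

```python
# Weights as stated in the article: 50% chat UX, 30% speed, 20% shortform quality.
WEIGHTS = {"chat_ux": 0.5, "speed": 0.3, "shortform": 0.2}


def composite_fitness(scores, weights=WEIGHTS):
    """Weighted sum of sub-scores, each assumed to be on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical sub-scores for illustration only (not benchmark data).
example = {"chat_ux": 90.0, "speed": 80.0, "shortform": 70.0}
print(composite_fitness(example))
```

Because chat UX carries half the weight, a 10-point change in that sub-score moves the composite by 5 points, while the same change in shortform quality moves it by only 2.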

Key findings

Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model tested
It achieved the lowest TTFT (11.2s), highest throughput (89.3 tok/s), and coolest thermals (45°C)
Larger models like GPT-OSS:20B passed 70% of tasks but ranked 6th due to 25.4s mean TTFT
Thermal performance varied significantly: Qwen3:14B peaked at 83°C, DeepSeek-R1:14B at 81°C
Magistral:24B was excluded from final ranking after triggering timeout loops and reaching 97°C GPU temperature

Why smaller models performed better

The benchmark revealed that for phone chat applications, a fast time to first token (TTFT) and a low thermal load matter more than raw accuracy. A model that scores 77.5% accuracy but takes 25s to produce its first token loses to one that scores 72.5% but responds in 11s. The thermal gap also matters for the reliability and longevity of personal hardware.

Independent analysis

An independent analysis using Claude on the same 640-evaluation dataset weighted reliability and TTFT more aggressively and reached a slightly different top-4 order, confirming that KPI weighting is a choice rather than ground truth.

Use case considerations

The author notes that for different use cases like coding or long-form writing, the weighting formula would flip entirely, prioritizing quality over speed and chat UX.
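To make the point about weighting concrete, here is a small sketch showing how the same sub-scores can rank differently under two weightings. Everything in it is an assumption for illustration: the model names, the sub-scores, and the `quality_weights` values are hypothetical and are not taken from the benchmark. Only `chat_weights` mirrors the 50/30/20 split described in the article.

```python
def rank(models, weights):
    """Return model names sorted by weighted score, best first."""
    def score(subscores):
        return sum(weights[k] * subscores[k] for k in weights)
    return sorted(models, key=lambda name: score(models[name]), reverse=True)


# Hypothetical sub-scores on a 0-100 scale (not benchmark data).
models = {
    "fast_small": {"chat_ux": 80, "speed": 95, "shortform": 60},
    "slow_large": {"chat_ux": 85, "speed": 40, "shortform": 95},
}

# Chat-oriented weighting from the article.
chat_weights = {"chat_ux": 0.5, "speed": 0.3, "shortform": 0.2}
# Assumed quality-first weighting, e.g. for coding or long-form writing.
quality_weights = {"chat_ux": 0.2, "speed": 0.1, "shortform": 0.7}

print(rank(models, chat_weights))
print(rank(models, quality_weights))
```

With these made-up numbers the faster model leads under the chat weighting and the slower, higher-quality model leads under the quality-first weighting, which is the sense in which the ranking depends on the chosen KPIs.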

📖 Read the full source: r/LocalLLaMA
