Open-weight models under 100GB can't beat Claude Haiku on coding benchmarks

✍️ OpenClawRadar | 📅 Published: February 26, 2026

A recent analysis posted to r/LocalLLaMA finds a significant performance gap between open-weight language models and Anthropic's Claude Haiku on coding benchmarks. The comparison fixes the quantization, context length, and KV cache format, then ranks models by the memory they need to run locally.

Benchmark methodology

The evaluation compared models on two coding benchmarks: LiveBench (January 2026) and Arena Code/WebDev. The baseline was Claude Haiku 4.5 with thinking enabled. Models were plotted by the memory they require for local deployment.

Technical specifications

Quantization: Q4_K_M
Context length: 32K
KV cache: q8_0
VRAM estimation: Calculated using the author's custom calculator
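
The author's calculator is not described in the post, so the following is only a rough sketch of how a memory figure could be derived from the settings above. It assumes a standard transformer with grouped-query attention, an average of about 4.85 bits per weight for Q4_K_M (a commonly cited approximation), 8.5 bits per element for a q8_0 KV cache (32 int8 values plus one fp16 scale per block), 32K meaning 32,768 tokens, and 1 GB meaning 10^9 bytes. The function names and the example model shape are hypothetical, and compute buffers and runtime overhead are ignored.

```python
# Rough memory estimate for local deployment. This is NOT the author's
# calculator; every constant below is an assumption made for illustration.

Q4_K_M_BITS_PER_WEIGHT = 4.85  # assumed average for Q4_K_M quantized weights
Q8_0_BITS_PER_ELEMENT = 8.5    # 32 int8 values + one fp16 scale per block
CONTEXT_32K = 32 * 1024        # assumes "32K" means 32,768 tokens
BYTES_PER_GB = 1e9             # assumes decimal gigabytes


def weights_gb(n_params_billion, bits_per_weight=Q4_K_M_BITS_PER_WEIGHT):
    """Approximate size of the quantized weights in GB."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / BYTES_PER_GB


def kv_cache_gb(n_layers, n_kv_heads, head_dim,
                context_len=CONTEXT_32K, bits=Q8_0_BITS_PER_ELEMENT):
    """Approximate KV cache size in GB for plain grouped-query attention."""
    # Factor of 2: one key tensor and one value tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bits / 8 / BYTES_PER_GB


def estimate_memory_gb(n_params_billion, n_layers, n_kv_heads, head_dim,
                       context_len=CONTEXT_32K):
    """Weights plus KV cache; ignores compute buffers and runtime overhead."""
    return (weights_gb(n_params_billion)
            + kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len))


if __name__ == "__main__":
    # Hypothetical 100B-parameter model shape, not taken from the analysis.
    total = estimate_memory_gb(100, n_layers=64, n_kv_heads=8, head_dim=128)
    print(f"Estimated memory: {total:.1f} GB")
```

Under these assumptions the weights dominate the total at a 32K context, which is why total parameter count largely determines which side of a memory threshold a model lands on. Architectures that compress the KV cache differently would need a different cache formula.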

Key findings

No open-weight model under 100GB of memory comes close to Claude Haiku's performance on either benchmark. The nearest competitor is MiniMax M2.5, which requires approximately 136GB of memory and roughly matches Haiku's performance on both benchmarks.

The analysis highlights the current gap between proprietary and open-weight models in the under-100GB category for coding tasks. The author expresses frustration with this gap and calls for smaller models that at least match Haiku's capabilities.

📖 Read the full source: r/LocalLLaMA
