Bonsai 1.7B Ternary Model Hits 442 T/s on M4 Max with Autonomously Tuned Metal Kernels

Bonsai 1.7B — a ternary model from PrismML — has been optimized for Apple Silicon using autonomously tuned Metal kernels. The work was performed by ata, an autonomous engineering agent from Agents2Agents, which ran an agentic evolution search for 6 hours to produce custom GPU kernels.
Benchmark Results
Measured against the upstream llama.cpp at the same Bonsai/Q2_0 commit on an M4 Max (same model file, same llama-bench -p 512 -n 128 -r 10 -fa 1 -ngl 99 config):
- Decode (tg128): 311.66 → 442.42 t/s (+42.0%)
- Prefill (pp512): 4250.32 → 4622.63 t/s (+8.8%)
For context, the Bonsai 8B whitepaper reports MLX-upstream Q2_0 decode at 235 t/s on Apple Silicon. This build achieves 442 t/s on the 1.7B variant via custom Metal kernels (different framework, smaller model — directionally indicative of headroom in the stack).
What's Included
The build is a drop-in optimized inference package for M-series Macs (arm64 only). Inside the 358 MB tar.xz:
chat.sh— interactive REPLcomplete.sh— non-interactive completionbench.sh— reproduce the benchmarksserver.sh— OpenAI-compatible HTTP API on :8080Bonsai-1.7B-Q2_0.gguf— the model file (442 MB)
Quick Start
tar -xJf bonsai-1.7b-ternary-M4Max.tar.xz
cd bonsai-1.7b-ternary-M4Max
./chat.shTechnical Details
Every Metal kernel was authored and tuned by ata without human intervention. The work focused on custom GPU kernels at the matvec / FFN / KV-cache layer, shape-specialized for the Bonsai 1.7B Q2_0 decode path. Numerical output matches the reference build (verified top-1 token match). Tested on M4 Max; proportional gains expected on M1+.
Caveats
- Apple Silicon only (arm64) — no Intel Mac or CPU-only builds.
- Numbers from M4 Max; M1/M2/M3 will be lower due to less memory bandwidth.
- Model is Q2_0 quantized — small accuracy delta vs F16.
📖 Read the full source: HN AI Agents
👀 See Also

Stop Letting AI Agents Design Your Architecture
AI agents like Claude are pathologically agreeable, producing plausible but context-free architectures. They can't say no, don't know your team's constraints, and turn senior engineers into ticket implementers.

Claude AI Spends 81 Minutes on 'Real Thinking' – User Report Spikes Around Major Updates
A user reports Claude AI spent 1 hour 21 minutes on a simple task, speculating that performance spikes happen briefly after major updates. Example: a research request scanned 5,113 sources in one session but later only 100-200 sources for similar queries.

Claude Code on the Web Partial Outage Reported
An automatic status update from r/ClaudeAI reports a partial outage for Claude Code on the web starting 2026-05-09T23:33:21.000Z. Check the official status page and community megathread for updates.

Claude.ai Currently Down, API Errors Elevated — April 28, 2026
An automatic status update triggered from Claude's official status page reports that Claude.ai is unavailable and the API is experiencing elevated error rates as of 2026-04-28T17:51:36.000Z.