Bonsai 1.7B Ternary Model Hits 442 T/s on M4 Max with Autonomously Tuned Metal Kernels

Bonsai 1.7B — a ternary model from PrismML — has been optimized for Apple Silicon using autonomously tuned Metal kernels. The work was performed by ata, an autonomous engineering agent from Agents2Agents, which ran an agentic evolution search for 6 hours to produce custom GPU kernels.
Benchmark Results
Measured against the upstream llama.cpp at the same Bonsai/Q2_0 commit on an M4 Max (same model file, same llama-bench -p 512 -n 128 -r 10 -fa 1 -ngl 99 config):
- Decode (tg128): 311.66 → 442.42 t/s (+42.0%)
- Prefill (pp512): 4250.32 → 4622.63 t/s (+8.8%)
For context, the Bonsai 8B whitepaper reports MLX-upstream Q2_0 decode at 235 t/s on Apple Silicon. This build achieves 442 t/s on the 1.7B variant via custom Metal kernels (different framework, smaller model — directionally indicative of headroom in the stack).
What's Included
The build is a drop-in optimized inference package for M-series Macs (arm64 only). Inside the 358 MB tar.xz:
chat.sh— interactive REPLcomplete.sh— non-interactive completionbench.sh— reproduce the benchmarksserver.sh— OpenAI-compatible HTTP API on :8080Bonsai-1.7B-Q2_0.gguf— the model file (442 MB)
Quick Start
tar -xJf bonsai-1.7b-ternary-M4Max.tar.xz
cd bonsai-1.7b-ternary-M4Max
./chat.shTechnical Details
Every Metal kernel was authored and tuned by ata without human intervention. The work focused on custom GPU kernels at the matvec / FFN / KV-cache layer, shape-specialized for the Bonsai 1.7B Q2_0 decode path. Numerical output matches the reference build (verified top-1 token match). Tested on M4 Max; proportional gains expected on M1+.
Caveats
- Apple Silicon only (arm64) — no Intel Mac or CPU-only builds.
- Numbers from M4 Max; M1/M2/M3 will be lower due to less memory bandwidth.
- Model is Q2_0 quantized — small accuracy delta vs F16.
📖 Read the full source: HN AI Agents
👀 See Also

Slurm Coding: The AI-Powered Development Pattern Where Time Disappears
A developer describes 'Slurm coding' as an intense development pattern enabled by AI coding tools, where small ideas rapidly escalate into complete systems through a feedback loop of quick implementation and dopamine hits.

Encyclopedia Britannica Files Lawsuit Against OpenAI Over AI Training Data
Encyclopedia Britannica has filed a lawsuit against OpenAI, alleging copyright infringement related to AI training data. The case was reported by Reuters on March 16, 2026, and has generated discussion on Hacker News.

Qwen3.5-27B 8-bit vs 16-bit Performance Comparison
A Reddit user tested Qwen3.5-27B with vLLM comparing bf16 weights and 16-bit KV cache against Qwen's fp8 quantization with 8-bit KV cache, finding practically identical results on the Aider benchmark using an RTX 6000 Pro.

Claude Code v2.1.98 adds Vertex AI wizard, security fixes, and subprocess sandboxing
Claude Code v2.1.98 introduces an interactive Google Vertex AI setup wizard, adds subprocess sandboxing with PID namespace isolation on Linux, and fixes multiple security vulnerabilities including Bash permission bypasses and arbitrary code execution risks.