PrismML's Bonsai 1-bit Qwen models tested: 107 t/s generation on 8GB VRAM

Bonsai models: 1-bit Qwen quantization from PrismML
PrismML has released Bonsai, a set of 1-bit quantized versions of Qwen3 models (8B, 4B, and 1.7B parameters). These models use extreme quantization to dramatically reduce memory requirements while maintaining usable performance for certain tasks.
Performance benchmarks from testing
Testing on an RTX 4060 with 8GB VRAM showed:
- 107 tokens/second generation speed
- >1114 tokens/second prompt processing
- Significantly lower RAM usage compared to Q4 quantized models
For comparison, Qwen 3.5 4B Q4 achieved 56 t/s using the same prompts on the same hardware.
Practical implications
The reduced memory footprint enables running 8B parameter models on 8GB VRAM systems. Smaller models can be used with longer context windows due to the memory savings.
Quality assessment
Initial testing focused on text summarization, where the model performed well. The tester noted they didn't evaluate coding or tool-using capabilities.
Technical limitations
The current implementation has CPU inference issues. When tested on a GPU-less mini PC:
- The llama.cpp fork compiles successfully
- The model loads but hangs during prompt processing
- Analysis suggests no CPU implementation exists - it likely dequantizes to FP32 and attempts regular inference, which would be extremely slow on CPU
Technical potential
1-bit models could reduce not just bandwidth and memory requirements, but also compute requirements. Matrix multiplication on 1-bit matrices could use XOR operations, which are much faster than floating-point operations. Even with scaling to FP16 after XOR operations, significant compute savings should be possible, potentially benefiting CPU-only inference and edge computing scenarios.
Setup details
The tester downloaded:
- The 8B Bonsai model
- PrismML's llama.cpp fork
- Tested on Windows with CUDA
📖 Read the full source: r/LocalLLaMA
👀 See Also

Anthropic Refuses Pentagon Safety Removal Demands, Loses Federal Contracts
Anthropic refused Pentagon demands to remove safety guardrails from Claude for military applications, leading to a $200M contract cancellation and a presidential order banning federal agency use of their technology.

Claude Code source code reportedly leaked, revealing agent architecture details
The source code for Claude Code, Anthropic's AI coding agent, appears to have been leaked, containing the full repository with system prompts, agent loop implementation, and tool calling infrastructure.
Google DeepMind's AI Pointer: Reimagining the Mouse for Gemini Interactions
Google DeepMind introduces an AI-powered mouse pointer that uses Gemini to understand context, enabling commands like pointing at an image and saying 'Show me directions,' integrated into Chrome and Googlebook.

Claude Code IDE Extension Fails to Load on Windows – Status Update
An official status update reports that the Claude Code IDE extension is unable to load on Windows as of 2026-05-08T22:32:19Z. Track progress and resolution via the status page.