PrismML's Bonsai 1-bit Qwen models tested: 107 t/s generation on 8GB VRAM

Bonsai models: 1-bit Qwen quantization from PrismML
PrismML has released Bonsai, a set of 1-bit quantized versions of Qwen3 models (8B, 4B, and 1.7B parameters). These models use extreme quantization to dramatically reduce memory requirements while maintaining usable performance for certain tasks.
Performance benchmarks from testing
Testing on an RTX 4060 with 8GB VRAM showed:
- 107 tokens/second generation speed
- >1114 tokens/second prompt processing
- Significantly lower RAM usage compared to Q4 quantized models
For comparison, Qwen 3.5 4B at Q4 reached 56 t/s with the same prompts on the same hardware, roughly half the Bonsai generation speed.
Practical implications
The reduced memory footprint enables running 8B parameter models on 8GB VRAM systems. Smaller models can be used with longer context windows due to the memory savings.
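As a rough illustration of where the savings come from, the sketch below estimates weight storage at 1 bit versus 4 bits per weight. It assumes idealized bit widths and counts only the weights; real quantization formats also store per-block scales, and the KV cache and activations need memory on top, so actual figures will be somewhat higher. The function name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts for the three Bonsai sizes mentioned above.
for name, n_params in [("8B", 8e9), ("4B", 4e9), ("1.7B", 1.7e9)]:
    one_bit = weight_memory_gib(n_params, 1.0)
    four_bit = weight_memory_gib(n_params, 4.0)
    print(f"{name}: ~{one_bit:.2f} GiB at 1 bit vs ~{four_bit:.2f} GiB at 4 bits")
```

Under these assumptions, an 8B model's weights take under 1 GiB at 1 bit, which leaves most of an 8GB card free for context.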
Quality assessment
Initial testing focused on text summarization, where the model performed well. The tester noted that they did not evaluate coding or tool use.
Technical limitations
The current implementation has CPU inference issues. When tested on a GPU-less mini PC:
- The llama.cpp fork compiles successfully
- The model loads but hangs during prompt processing
- The tester suspects there is no dedicated CPU implementation: the fork likely dequantizes the weights to FP32 and then attempts regular inference, which would be extremely slow on CPU
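To make the suspected slow path concrete, here is a minimal sketch of what "dequantize, then do regular inference" could look like. The bit layout (8 weights per byte, set bit means +scale, clear bit means -scale) and the names `dequantize_1bit` and `dot_dequantized` are assumptions for illustration only, not the format or code used by PrismML's fork.

```python
def dequantize_1bit(packed: bytes, scale: float, n: int) -> list:
    """Expand n packed sign bits into floats (hypothetical layout: bit set -> +scale)."""
    out = []
    for i in range(n):
        bit = (packed[i // 8] >> (i % 8)) & 1
        out.append(scale if bit else -scale)
    return out


def dot_dequantized(packed: bytes, scale: float, activations: list) -> float:
    """Dot product done the slow way: expand every weight to a float first."""
    weights = dequantize_1bit(packed, scale, len(activations))
    return sum(w * a for w, a in zip(weights, activations))


# One row of four weights, packed into the low bits of a single byte.
row = bytes([0b00000101])
print(dequantize_1bit(row, 0.5, 4))
print(dot_dequantized(row, 0.5, [1.0, 2.0, 3.0, 4.0]))
```

In a path like this, every 1-bit weight is blown up to a full float before any arithmetic happens, so the memory traffic and multiply-add count match an unquantized model and none of the 1-bit savings carry over to compute.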
Technical potential
1-bit models could reduce not just bandwidth and memory requirements, but also compute requirements. A dot product over 1-bit values can be computed with XOR and bit-count (popcount) operations, which are much cheaper than floating-point multiply-adds. Even with scaling to FP16 after the XOR step, significant compute savings should be possible, potentially benefiting CPU-only inference and edge computing scenarios.
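The XOR idea can be sketched as follows, assuming both operands are vectors of +1/-1 values packed one sign bit per element. This is a generic binary dot product, not a description of how Bonsai or the llama.cpp fork computes anything; in particular it assumes binary activations, which the source does not claim. The names `pack_signs` and `binary_dot` are illustrative.

```python
def pack_signs(values) -> int:
    """Pack +1/-1 values into an int; bit i is set when values[i] is negative."""
    bits = 0
    for i, v in enumerate(values):
        if v < 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n +1/-1 vectors from their packed sign bits.

    XOR marks positions where the signs differ (product -1); every other
    position contributes +1, so the sum is n - 2 * mismatches.
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount
    return n - 2 * mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
assert binary_dot(pack_signs(a), pack_signs(b), len(a)) == reference

# A single floating-point multiply per dot product applies the scales afterwards.
scale_a, scale_b = 0.5, 0.25  # made-up per-vector scales
print(scale_a * scale_b * binary_dot(pack_signs(a), pack_signs(b), len(a)))
```

On real hardware the XOR and popcount run over whole machine words, so 64 or more element products are handled per instruction, which is where the potential compute savings would come from.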
Setup details
The test setup consisted of:
- The 8B Bonsai model
- PrismML's llama.cpp fork
- Windows with CUDA
📖 Read the full source: r/LocalLLaMA