PrismML's Bonsai 1-bit Qwen models tested: 107 t/s generation on 8GB VRAM

✍️ OpenClawRadar · 📅 Published: April 5, 2026

Bonsai models: 1-bit Qwen quantization from PrismML

PrismML has released Bonsai, a set of 1-bit quantized versions of Qwen3 models (8B, 4B, and 1.7B parameters). These models use extreme quantization to dramatically reduce memory requirements while maintaining usable performance for certain tasks.

Performance benchmarks from testing

Testing on an RTX 4060 with 8GB VRAM showed:

  • 107 tokens/second generation speed
  • >1114 tokens/second prompt processing
  • Significantly lower RAM usage compared to Q4 quantized models

For comparison, Qwen 3.5 4B at Q4 quantization reached 56 t/s with the same prompts on the same hardware, roughly half of Bonsai's 107 t/s generation speed.

Practical implications

The reduced memory footprint enables running 8B parameter models on 8GB VRAM systems. Smaller models can be used with longer context windows due to the memory savings.
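As a rough illustration of where the savings come from, the sketch below estimates weight storage for an 8B-parameter model at different bit widths. It is a back-of-the-envelope calculation, not a measurement of Bonsai: the function name is hypothetical, and the estimate ignores quantization scales, any layers kept at higher precision, the KV cache, and activations.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Real quantized files add per-block scales and may keep some tensors
    at higher precision, so actual sizes will be larger.
    """
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    for bits in (16, 4, 1):
        size = weight_memory_gib(8e9, bits)
        print(f"8B params at {bits:>2} bits/weight: {size:.2f} GiB")
```

Under these assumptions the weights shrink from about 3.7 GiB at 4 bits to under 1 GiB at 1 bit, which leaves more of an 8GB card free for the KV cache and therefore for longer contexts.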

Quality assessment

Initial testing focused on text summarization, where the model performed well. The tester did not evaluate coding or tool use.

Technical limitations

The current implementation has CPU inference issues. When tested on a GPU-less mini PC:

  • The llama.cpp fork compiles successfully
  • The model loads but hangs during prompt processing
  • The tester's analysis suggests no CPU implementation exists for the 1-bit format: the fork likely dequantizes the weights to FP32 and attempts regular inference, which would be extremely slow on CPU
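To make the suspected fallback concrete, here is a minimal sketch of what "dequantize, then run regular inference" could look like for packed 1-bit weights. The layout is an assumption for illustration only (eight sign bits per byte, least significant bit first, one shared scale); it is not PrismML's actual file format, and `dequantize_1bit` is a hypothetical helper.

```python
def dequantize_1bit(packed: bytes, n_weights: int, scale: float) -> list[float]:
    """Expand packed sign bits into floats.

    Assumed layout (hypothetical): bit i of the stream is stored in
    byte i // 8 at position i % 8; a set bit means +scale, a clear
    bit means -scale.
    """
    weights = []
    for i in range(n_weights):
        bit = (packed[i // 8] >> (i % 8)) & 1
        weights.append(scale if bit else -scale)
    return weights


if __name__ == "__main__":
    packed = bytes([0b00000101])  # bits 0 and 2 set, bits 1 and 3 clear
    print(dequantize_1bit(packed, 4, 0.5))  # [0.5, -0.5, 0.5, -0.5]
```

Every stored bit becomes a full 32-bit float in this path, so a CPU that takes it gives up the bandwidth advantage of the 1-bit format and still pays for ordinary floating-point matrix multiplication.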

Technical potential

1-bit models could reduce compute requirements as well as bandwidth and memory. Matrix multiplication on 1-bit matrices could use XOR operations, which are much cheaper than floating-point multiplications. Even with scaling to FP16 after the XOR operations, significant compute savings should be possible, potentially benefiting CPU-only inference and edge computing scenarios.
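The XOR idea can be sketched for the simplest case, where both vectors hold only +1 and -1 values. This is an illustration of the general technique, not of how Bonsai or the llama.cpp fork computes anything; the helper names are hypothetical, and the sketch assumes the activations are binarized too, which the source does not claim.

```python
def pack_signs(values: list[int]) -> int:
    """Pack a vector of +1/-1 values into an int: bit i is set where values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two packed +1/-1 vectors of length n.

    XOR sets a bit wherever the signs differ (product -1); every other
    position contributes +1, so the sum is n - 2 * (number of mismatches).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
    slow = sum(x * y for x, y in zip(a, b))
    print(fast, slow)  # 2 2
```

The integer result would then be multiplied by the floating-point scales once per output, which corresponds to the "scaling to FP16 after the XOR operations" step described above. On real hardware the bit count maps to a single popcount instruction over a 64-bit word, replacing 64 multiply-adds.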

Setup details

The tester's setup:

  • The 8B Bonsai model
  • PrismML's llama.cpp fork
  • Windows with CUDA

📖 Read the full source: r/LocalLLaMA
