PrismML's Bonsai 1-bit Qwen models tested: 107 t/s generation on 8GB VRAM

OpenClawRadar | Published: April 5, 2026

Bonsai models: 1-bit Qwen quantization from PrismML

PrismML has released Bonsai, a set of 1-bit quantized versions of Qwen3 models (8B, 4B, and 1.7B parameters). These models use extreme quantization to dramatically reduce memory requirements while maintaining usable performance for certain tasks.

Performance benchmarks from testing

Testing on an RTX 4060 with 8GB VRAM showed:

107 tokens/second generation speed
>1114 tokens/second prompt processing
Significantly lower RAM usage compared to Q4 quantized models

For comparison, Qwen 3.5 4B Q4 achieved 56 t/s using the same prompts on the same hardware.

Practical implications

The reduced memory footprint lets an 8B-parameter model run on a GPU with 8GB of VRAM. With the smaller models, the memory saved on weights can go toward longer context windows instead.
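As a rough illustration of where the savings come from, the sketch below estimates weight storage at nominal bit widths. It is back-of-the-envelope arithmetic, not a measurement: the function name is hypothetical, real quantization formats also store scale factors (so effective bits per weight are higher), and the KV cache and activations are not counted.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no per-block scales or other metadata.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Nominal widths for an 8B-parameter model: FP16, 4-bit, and 1-bit.
for label, bits in [("FP16", 16), ("4-bit", 4), ("1-bit", 1)]:
    print(f"{label:>5}: {weight_memory_gib(8e9, bits):.2f} GiB")
```

Under these assumptions the 1-bit weights take a quarter of the space of 4-bit weights, which is the headroom that can be spent on context instead.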

Quality assessment

Initial testing focused on text summarization, where the model performed well. The tester did not evaluate coding or tool-use capabilities.

Technical limitations

The current implementation has CPU inference issues. When tested on a GPU-less mini PC:

The llama.cpp fork compiles successfully
The model loads but hangs during prompt processing
Analysis suggests the fork has no dedicated CPU implementation for the 1-bit format; it likely dequantizes the weights to FP32 and attempts regular inference, which would be extremely slow on a CPU
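To make the suspected fallback concrete, here is a minimal sketch of "dequantize, then do ordinary floating-point math". It is not taken from PrismML's fork: the bit layout, the single scale factor, and all function names are assumptions chosen for illustration.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 weights into an int (bit i set means +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def dequantize(bits, n, scale):
    """Expand n packed bits back into floats: +scale or -scale each."""
    return [scale if (bits >> i) & 1 else -scale for i in range(n)]


def dot_dequantized(bits, n, scale, activations):
    """Hypothesized slow path: rebuild float weights, then multiply-add."""
    weights = dequantize(bits, n, scale)
    return sum(w * a for w, a in zip(weights, activations))


packed = pack_signs([1, -1, 1, 1])
result = dot_dequantized(packed, 4, 0.5, [1.0, 2.0, 3.0, 4.0])
```

On this path every 1-bit weight is blown back up to a full float before use, so the compute cost is that of an unquantized model plus the unpacking work, which would be consistent with the slowness described above.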

Technical potential

1-bit models could reduce not just bandwidth and memory requirements but also compute requirements. Matrix multiplication on 1-bit matrices could use bitwise XOR operations (typically paired with a bit count), which are much cheaper than floating-point multiply-adds. Even with scaling to FP16 after the XOR operations, significant compute savings should be possible, potentially benefiting CPU-only inference and edge computing scenarios.
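The XOR idea can be sketched for the simplest case, where both vectors hold only +1/-1 values packed one per bit. This is a generic illustration of the technique, not Bonsai's kernel: it assumes the activations are binarized too, which LLM inference usually does not do, and the function names are hypothetical.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into an int (bit i set means +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XOR marks the positions where the signs differ. Each differing
    position contributes -1 and each matching position contributes +1,
    so the dot product is n - 2 * (number of differing positions).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
slow = sum(x * y for x, y in zip(a, b))  # reference multiply-add version
```

A single XOR and bit count here stand in for eight multiply-adds; on real hardware the same trick processes 64 or more weights per instruction. When activations stay at higher precision, 1-bit weights instead turn each multiplication into an addition or subtraction, which still avoids floating-point multiplies.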

Setup details

The tester's setup:

The 8B Bonsai model
PrismML's llama.cpp fork
Windows with CUDA

Read the full source: r/LocalLLaMA
