NexQuant: Rust-native 3-bit KV-cache engine for edge deployment

✍️ OpenClawRadar | 📅 Published: April 2, 2026 | 🔗 Source

NexQuant is a Rust-native engine for running long-context models on consumer hardware that would otherwise run out of memory. It is positioned as a production-hardened successor to Tom Turney's TurboQuant+ research.

Key technical details

  • 3-5x Memory Reduction: 14B models now fit in 4GB of VRAM or unified memory
  • MSE-Only Stability: Replaces noisy QJL paths with stable MSE-only trajectory (27/27 logic tests passed)
  • Integrated Sparse-V: Sparsity is integrated into the real-time decode loop rather than just being a benchmark feature
  • Zero-Alloc Prefill: Written in 100% safe Rust, aiming for speed without the segfaults seen in the C++ prototype
  • Hardware Support: Native runtime dispatch for Metal, CUDA, and Vulkan, with CPU-AVX2/NEON backend support for older laptops and Raspberry Pi
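
To make the 3-bit and MSE terminology above concrete, here is a minimal sketch of quantizing one block of cache values to 3-bit codes and measuring the reconstruction error. It assumes a plain uniform quantizer with a per-block minimum and scale. The function names and the storage layout are hypothetical and are not taken from NexQuant, whose actual codebook is not described in the source.

```rust
/// Number of representable levels for a 3-bit code (2^3).
const LEVELS: u8 = 8;

/// Quantize a block of values to 3-bit codes (stored one per u8 here for clarity).
/// Returns (codes, min, scale). Hypothetical layout, not NexQuant's format.
pub fn quantize_3bit(block: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = block.iter().copied().fold(f32::INFINITY, f32::min);
    let max = block.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Avoid a zero scale when the block is constant or empty.
    let scale = if max > min { (max - min) / (LEVELS - 1) as f32 } else { 1.0 };
    let codes = block
        .iter()
        .map(|&x| (((x - min) / scale).round() as u8).min(LEVELS - 1))
        .collect();
    (codes, min, scale)
}

/// Reconstruct approximate values from 3-bit codes.
pub fn dequantize_3bit(codes: &[u8], min: f32, scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| min + c as f32 * scale).collect()
}

/// Mean squared error between the original and reconstructed values.
pub fn mse(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32
}
```

With this scheme each value is reconstructed to within half a quantization step (`scale / 2`). A real engine would also pack the codes into a dense bitstream rather than spending a full byte per code.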

Implementation specifics

The project uses Walsh-Hadamard Transforms and Rust GGUF parsing. It builds on Tom Turney's PolarQuant/TurboQuant+ breakthroughs that proved 3-bit KV-caches were mathematically possible. The development involved Claude (Anthropic) as a high-speed pair programmer.
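
For readers unfamiliar with the Walsh-Hadamard Transform mentioned above, the following is a minimal sketch of the standard in-place fast algorithm in safe Rust. It is a generic textbook version, not NexQuant's kernel. The function names are hypothetical, and how the project applies the transform before quantization is not described in the source.

```rust
/// In-place, unnormalized fast Walsh-Hadamard transform.
/// Runs in O(n log n) and performs no heap allocation.
/// Panics if the slice length is not a power of two.
pub fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Each chunk of 2*h elements is split into two halves that are
        // combined with the butterfly (a, b) -> (a + b, a - b).
        for block in data.chunks_mut(2 * h) {
            let (lo, hi) = block.split_at_mut(h);
            for (a, b) in lo.iter_mut().zip(hi.iter_mut()) {
                let (x, y) = (*a, *b);
                *a = x + y;
                *b = x - y;
            }
        }
        h *= 2;
    }
}

/// Orthonormal variant: scaling by 1/sqrt(n) makes the transform its own inverse.
pub fn fwht_orthonormal(data: &mut [f32]) {
    fwht(data);
    let norm = 1.0 / (data.len() as f32).sqrt();
    for x in data.iter_mut() {
        *x *= norm;
    }
}
```

Applying `fwht` twice returns the input scaled by its length. That is why the orthonormal variant can serve as both the forward and the inverse step around a quantizer.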

The goal is to ensure that as models scale, the ability to run them remains local and decentralized. The team is specifically seeking feedback on Vulkan SPIR-V kernels.

📖 Read the full source: r/LocalLLaMA
