Google Research introduces TurboQuant for AI model compression

What TurboQuant does
TurboQuant is a set of quantization algorithms that compress the data used by large language models and vector search engines. It specifically addresses bottlenecks in the key-value (KV) cache - a high-speed storage system that stores frequently used information under simple labels for instant retrieval.
How it works
TurboQuant is reported to achieve a large reduction in model size with no accuracy loss through two key steps:
- High-quality compression (PolarQuant method): Starts by randomly rotating the data vectors to simplify their geometry, then applies a standard quantizer to each coordinate of the vector individually. This stage spends most of the bit budget on capturing the main concept and strength of the original vector.
- Eliminating hidden errors: Spends the small remaining budget (just 1 bit) on applying the QJL algorithm to the small error left over from the first stage. QJL acts as a mathematical error-checker that removes bias from the estimate, leading to more accurate attention scores.
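The two steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the published implementation: the uniform quantizer stands in for whatever "standard quantizer" TurboQuant actually uses, the residual norm is kept as one extra scalar, and every name (`random_rotation`, `uniform_quantize`, the 3-bit budget, the clipping limit) is hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def random_rotation(d, rng):
    """Random orthonormal matrix: Gram-Schmidt applied to Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            c = dot(v, r)
            v = [x - c * y for x, y in zip(v, r)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-9:
            rows.append([x / norm for x in v])
    return rows

def uniform_quantize(values, bits, limit):
    """Snap each coordinate to the centre of one of 2**bits bins on [-limit, limit]."""
    levels = 2 ** bits
    step = 2.0 * limit / levels
    out = []
    for value in values:
        clipped = min(max(value, -limit), limit)
        index = min(int((clipped + limit) / step), levels - 1)
        out.append(-limit + (index + 0.5) * step)
    return out

rng = random.Random(0)
d = 32
key = unit_vector(d, rng)
query = unit_vector(d, rng)

# Shared random rotation; it preserves inner products.
rotation = random_rotation(d, rng)
rotated_key = matvec(rotation, key)
rotated_query = matvec(rotation, query)

# Stage 1: per-coordinate quantization of the rotated key (assumed 3 bits).
coarse = uniform_quantize(rotated_key, bits=3, limit=3.0 / math.sqrt(d))
residual = [x - y for x, y in zip(rotated_key, coarse)]

# Stage 2: one sign bit per coordinate of a Gaussian projection of the residual.
projection = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
signs = [1 if s >= 0 else -1 for s in matvec(projection, residual)]
residual_norm = math.sqrt(dot(residual, residual))  # assumed extra scalar

# Query side: full-precision query against the compressed key.
correction = (math.sqrt(math.pi / 2) * residual_norm / d
              * dot(matvec(projection, rotated_query), signs))
estimate = dot(rotated_query, coarse) + correction
exact = dot(query, key)
print(exact, dot(rotated_query, coarse), estimate)
```

The first stage alone gives a deterministic but slightly off inner product; the sign-bit correction adds an estimate of the part the first stage missed, and its error averages out over the random projection.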
Key components
QJL (Quantized Johnson-Lindenstrauss): Uses the Johnson-Lindenstrauss Transform to shrink high-dimensional data while preserving distances between data points. It reduces each coordinate of the resulting vector to a single sign bit (+1 or -1) with zero memory overhead. To calculate attention scores accurately, it uses a special estimator that pairs high-precision queries with the low-precision stored data.
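A rough sketch of the sign-bit idea, assuming a Gaussian random projection and assuming the key's norm is stored as one scalar next to its sign bits. The function names and the projection size `m` are hypothetical. The estimator relies on the fact that, for a standard Gaussian row s, the expected value of (s·q)·sign(s·k) is sqrt(2/pi)·(q·k)/||k||.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_projection(m, d, rng):
    """m x d matrix with independent standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def quantize_key(projection, key):
    """Keep only the sign of each projected coordinate, plus the key norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in projection]
    return signs, math.sqrt(dot(key, key))

def estimate_inner_product(projection, query, signs, key_norm):
    """Unbiased estimate of query.key from a full-precision query and sign bits."""
    projected_query = [dot(row, query) for row in projection]
    scale = math.sqrt(math.pi / 2) * key_norm / len(projection)
    return scale * dot(projected_query, signs)

rng = random.Random(0)
d, m = 16, 2048
key = [rng.gauss(0.0, 1.0) for _ in range(d)]
query = [rng.gauss(0.0, 1.0) for _ in range(d)]

projection = make_projection(m, d, rng)
signs, key_norm = quantize_key(projection, key)

exact = dot(query, key)
estimate = estimate_inner_product(projection, query, signs, key_norm)
print(exact, estimate)
```

The query is never quantized here, which is the asymmetry the text describes: only the stored side is reduced to sign bits. A large `m` is used so the estimate is visibly close to the exact value; fewer sign bits give a noisier but still unbiased estimate.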
PolarQuant: Addresses memory overhead by converting vectors from Cartesian coordinates into polar coordinates. Instead of describing a vector by standard coordinates (X, Y, Z), as in "Go 3 blocks East, 4 blocks North," it describes it by a distance and an angle, as in "Go 5 blocks total at a 37-degree angle."
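The street-directions analogy can be checked with a few lines of arithmetic. The sketch below assumes the angle is a compass bearing measured clockwise from North, which is the convention under which "3 East, 4 North" comes out near 37 degrees; `to_polar` and `to_cartesian` are illustrative helper names.

```python
import math

def to_polar(east, north):
    """Return (distance, bearing in degrees clockwise from North)."""
    radius = math.hypot(east, north)
    bearing = math.degrees(math.atan2(east, north))
    return radius, bearing

def to_cartesian(radius, bearing_deg):
    """Inverse of to_polar: return (east, north)."""
    theta = math.radians(bearing_deg)
    return radius * math.sin(theta), radius * math.cos(theta)

radius, bearing = to_polar(3.0, 4.0)
print(radius, round(bearing, 1))  # 5.0 36.9
print(to_cartesian(radius, bearing))
```

Measured counterclockwise from East instead, the same vector would sit at roughly 53.1 degrees, so the convention matters when reading the example.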
Technical context
Traditional vector quantization typically introduces a memory overhead of 1-2 extra bits per number, because quantization constants must be stored for every small block of data. TurboQuant is designed to remove this overhead. In testing, the techniques showed promise for reducing key-value cache bottlenecks without sacrificing AI model performance.
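To see where a figure like 1-2 extra bits per number can come from, consider the arithmetic below. The block sizes and constant widths are assumptions chosen for illustration (a 16-bit scale plus a 16-bit offset per block), not values taken from the source.

```python
def overhead_bits_per_value(block_size, constant_bits):
    """Extra bits per stored number from per-block quantization constants."""
    return constant_bits / block_size

def effective_bits(payload_bits, block_size, constant_bits):
    """Total bits per number: quantized payload plus amortized constants."""
    return payload_bits + overhead_bits_per_value(block_size, constant_bits)

# Assumed: 32 bits of constants (16-bit scale + 16-bit offset) per block.
print(overhead_bits_per_value(32, 32))  # 1.0 extra bit with 32-value blocks
print(overhead_bits_per_value(16, 32))  # 2.0 extra bits with 16-value blocks
print(effective_bits(4, 32, 32))        # 5.0 bits for a nominal 4-bit format
```

Under these assumptions, a nominal 4-bit format really costs 5 bits per number, a 25% increase, which is why removing the per-block constants matters at low bit widths.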
TurboQuant will be presented at ICLR 2026, while PolarQuant will be presented at AISTATS 2026.
📖 Read the full source: HN AI Agents