Codebook Lossless LLM Compression: 10-25% RAM Reduction with Bitwise Packing

A developer has published proof-of-concept code for lossless LLM compression that cuts memory usage by 10-25% by replacing each weight with an index into a codebook of the values that actually occur, then bit-packing those indices. The technique trades inference speed for a smaller memory footprint, which can let larger models fit on hardware with limited VRAM.
How It Works
The developer started by asking how many unique values actually exist in LLM layers. Analysis showed that although fp16 stores each weight in 16 bits, most models contain few enough distinct values that about 12-13 bits are enough to index them. Packing those shorter indices into blocks shrinks the weights without losing precision.
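The idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the repository's code: the function names (`fp16_bits`, `build_codebook`, `pack_indices`, `unpack_indices`) are hypothetical, the toy weight list is made up, and the sketch packs one whole list into a single bitstream rather than into the fixed-size blocks the project describes.

```python
import struct

def fp16_bits(x):
    """Return the raw 16-bit pattern of x stored as IEEE 754 half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def build_codebook(raw_values):
    """Map each weight to an index into the sorted list of distinct bit patterns."""
    codebook = sorted(set(raw_values))
    lookup = {value: i for i, value in enumerate(codebook)}
    return codebook, [lookup[value] for value in raw_values]

def bits_needed(n_unique):
    """Smallest index width (in bits) that can address n_unique codebook entries."""
    return max(1, (n_unique - 1).bit_length())

def pack_indices(indices, width):
    """Concatenate fixed-width indices into one little-endian byte string."""
    acc = 0
    for pos, idx in enumerate(indices):
        acc |= idx << (pos * width)
    return acc.to_bytes((len(indices) * width + 7) // 8, "little")

def unpack_indices(packed, width, count):
    """Inverse of pack_indices: recover `count` indices of `width` bits each."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << width) - 1
    return [(acc >> (pos * width)) & mask for pos in range(count)]

# Toy "layer" with only four distinct values (illustrative data, not real weights).
weights = [0.5, -0.25, 0.5, 1.0, -0.25, 0.5, 0.0, 1.0]
raw = [fp16_bits(w) for w in weights]

codebook, indices = build_codebook(raw)
width = bits_needed(len(codebook))
packed = pack_indices(indices, width)

# Decoding goes index -> codebook entry, so the original bit patterns come back exactly.
restored = [codebook[i] for i in unpack_indices(packed, width, len(raw))]
assert restored == raw
```

Because decoding is a plain table lookup on the original bit patterns, the round trip is exact. The extra unpack-and-lookup step on every weight access is also a plausible reason for the slowdown, though the source does not break down where the time goes.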
Performance Characteristics
- RAM reduction: 10-25%+ across tested models
- Speed impact: Inference speed approximately halved in example tests
- Test hardware: NVIDIA P2200 (5GB) and CPU, with updates being developed for AMD MI50 (32GB)
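As a back-of-the-envelope check on the reported range, the saving for one layer can be estimated from its index width plus the cost of storing the codebook itself. The function below is an assumption-laden sketch: `estimated_saving` is a hypothetical helper, the 4096 x 4096 layer shape is chosen only for illustration, and it ignores block headers or padding that a real implementation may add.

```python
def estimated_saving(n_weights, n_unique, raw_bits=16):
    """Fraction of memory saved by storing packed indices plus a codebook.

    Assumes every weight is replaced by a fixed-width index and the codebook
    holds each distinct value once at full precision.
    """
    width = max(1, (n_unique - 1).bit_length())
    packed_bits = n_weights * width + n_unique * raw_bits
    return 1 - packed_bits / (n_weights * raw_bits)

layer = 4096 * 4096  # hypothetical layer size

saving_12_bit = estimated_saving(layer, 4096)  # 12-bit indices: roughly 25%
saving_13_bit = estimated_saving(layer, 8192)  # 13-bit indices: roughly 19%
```

Under these assumptions, 12-13 bit indices give a saving of about 19-25% per layer, which sits at the upper end of the reported 10-25%. Layers with more distinct values, or tensors left uncompressed, would pull a whole-model figure lower.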
Implementation Details
The developer worked on this project for several weeks using AI coding assistants including Claude, Qwen, and Gemini. The repository includes both lossless and lossy/balanced versions, though the lossy version hasn't been extensively tested yet.
The developer suggests this compression approach might also serve as a way to measure a model's "compactness", meaning how efficiently it uses its parameter space.
Code Availability
The proof-of-concept code is available on GitHub: https://github.com/bigattichouse/Codebook-Quantization
📖 Read the full source: r/LocalLLaMA