Monarch v3: NES-Inspired KV Paging for 78% Faster LLM Inference

What Monarch v3 Does
Monarch v3 is an open-source implementation of NES-inspired memory paging for transformer inference. It targets the linear growth of the KV cache with sequence length: by 4K tokens, most of the KV cache sits unused while still consuming VRAM at full precision.
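To make the linear growth concrete, here is a minimal sketch of the standard KV cache size calculation. The model shape below is hypothetical and chosen only for illustration; it is not taken from the Monarch source.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (fp16 by default)."""
    # The factor 2 covers the key tensor plus the value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=24, n_kv_heads=8, head_dim=64)

for seq_len in (1024, 4096, 16384):
    mib = kv_cache_bytes(seq_len, **shape) / 2**20
    print(f"{seq_len:>6} tokens -> {mib:.0f} MiB")
```

Every term except `seq_len` is fixed by the model, so quadrupling the context quadruples the cache.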
How It Works
The system splits KV cache into two regions:
- Hot region: Recent tokens kept at full precision
- Cold region: Older tokens compressed to ~20 bytes each (vs 64-128 bytes hot)
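As a rough illustration of how a cold entry can shrink to about 20 bytes, the sketch below applies plain symmetric 4-bit quantization to a 32-element fp16 vector (64 bytes): 16 bytes of packed values plus a 4-byte fp32 scale. This is a generic stand-in, not the project's actual compression scheme, and the 32-element layout is an assumption; the source does not state how its byte counts are derived.

```python
import numpy as np


def quantize_int4(vec):
    """Symmetric 4-bit quantization: returns (packed uint8 array, fp32 scale)."""
    vec = np.asarray(vec, dtype=np.float32)  # length must be even
    max_abs = float(np.max(np.abs(vec)))
    scale = np.float32(max_abs / 7.0 if max_abs > 0 else 1.0)
    q = np.clip(np.rint(vec / scale), -8, 7).astype(np.int8)
    nibbles = (q & 0x0F).astype(np.uint8)          # two's-complement low nibble
    packed = nibbles[0::2] | (nibbles[1::2] << 4)  # two values per byte
    return packed, scale


def dequantize_int4(packed, scale):
    """Inverse of quantize_int4, up to rounding error."""
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2] = (packed & 0x0F).astype(np.int8)
    q[1::2] = (packed >> 4).astype(np.int8)
    q = np.where(q > 7, q - 16, q)  # sign-extend the 4-bit values
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
key = rng.standard_normal(32).astype(np.float16)  # hot entry: 32 * 2 = 64 bytes

packed, scale = quantize_int4(key)
cold_bytes = packed.nbytes + scale.nbytes         # 16 packed bytes + 4-byte scale
restored = dequantize_int4(packed, scale)
max_err = float(np.max(np.abs(key.astype(np.float32) - restored)))

print(f"hot: {key.nbytes} bytes, cold: {cold_bytes} bytes, max error: {max_err:.4f}")
```

The rounding error per element is bounded by half the scale, which is why schemes of this kind usually add some form of residual correction on top.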
Four components work together:
- TurboQuant Compression: Quantizes KV to 4-bit integers with polar encoding and residual correction, achieving ~97% size reduction with ~0.3% perplexity loss
- Sliding Window Eviction: The most recent N tokens stay hot by default; older tokens are compressed into cold storage
- Attention-Weighted Promotion: High-attention tokens move back to hot, with a sticky mechanism to prevent thrashing
- Page Swaps: Small batches of cold tokens are materialized on access, with a local decode loop replacing the batch matmul
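The eviction and promotion policies above can be sketched as a toy, single-head cache. This is a sketch under stated assumptions, not Monarch's implementation: the class `PagedKeyCache`, its parameters, and the use of raw dot-product scores against a threshold are all hypothetical, and cold-side compression is omitted so that only the paging logic is visible.

```python
import numpy as np


class PagedKeyCache:
    """Toy single-head key cache with a hot window and a cold store."""

    def __init__(self, window_size, promotion_threshold, sticky_steps):
        self.window_size = window_size
        self.promotion_threshold = promotion_threshold
        self.sticky_steps = sticky_steps
        self.hot = {}     # position -> key vector, scored every step
        self.cold = {}    # position -> key vector (compression omitted here)
        self.sticky = {}  # promoted position -> steps it remains pinned in hot

    def step(self, pos, query, key):
        """Process one decode step; return positions promoted from cold."""
        # Age the pins so promoted tokens eventually become evictable again.
        for p in list(self.sticky):
            self.sticky[p] -= 1
            if self.sticky[p] <= 0:
                del self.sticky[p]

        self.hot[pos] = key
        best_hot = max(float(query @ k) for k in self.hot.values())

        promoted = []
        # Touch cold storage only when nothing in the hot region scores well.
        if best_hot < self.promotion_threshold:
            for p in sorted(self.cold):
                if float(query @ self.cold[p]) >= self.promotion_threshold:
                    self.hot[p] = self.cold.pop(p)
                    self.sticky[p] = self.sticky_steps
                    promoted.append(p)

        # Sliding window: evict the oldest unpinned tokens beyond the window.
        evictable = sorted(p for p in self.hot if p not in self.sticky)
        while len(evictable) > self.window_size:
            oldest = evictable.pop(0)
            self.cold[oldest] = self.hot.pop(oldest)
        return promoted


cache = PagedKeyCache(window_size=2, promotion_threshold=0.5, sticky_steps=2)
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# One (query, key) pair per step; the query at step 3 matches only token 0.
steps = [(e1, e0), (e1, e1), (e1, e1), (e0, e1), (e1, e1), (e1, e1)]
for pos, (query, key) in enumerate(steps):
    promoted = cache.step(pos, query, key)
    print(pos, "hot:", sorted(cache.hot), "cold:", sorted(cache.cold),
          "promoted:", promoted)
```

In this run token 0 is evicted at step 2, promoted back at step 3, and stays pinned for two steps before it becomes evictable again. Without the pin it would be the oldest entry in the window and would return to cold in the same step it was promoted.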
Benchmark Results
Setup: TinyLlama-1.1B fp16, 50 generated tokens
- Standard: 17.01 tok/s, 2112 MB VRAM
- Monarch-v3: 30.42 tok/s, 2131 MB VRAM, 512 hot tokens, 1024 cold tokens
- Gain: +78.7% throughput, +0.9% VRAM
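The gains can be re-derived from the two result lines with a short calculation. Recomputing from the rounded figures gives roughly +78.8% throughput, a hair above the reported +78.7%; the difference presumably comes from the original being computed on unrounded measurements.

```python
def pct_change(baseline, value):
    """Relative change of value over baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


throughput_gain = pct_change(17.01, 30.42)  # tok/s: standard vs Monarch-v3
vram_overhead = pct_change(2112, 2131)      # MB: standard vs Monarch-v3

print(f"throughput: {throughput_gain:+.1f}%")
print(f"VRAM:       {vram_overhead:+.1f}%")
```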
Simplified Decode Loop
for step in 1..100:
    q = project_query(next_token)
    # Compute attention: hot only (fast)
    scores_hot = q @ kv_hot.T
    # Fall back to cold pages when no hot token scores above the threshold (rare)
    if max(scores_hot) < threshold:
        kv_cold_promoted = decompress(cold_pages)
        scores_cold = q @ kv_cold_promoted.T
        # Move to hot for next step
    # Aggregate, softmax, apply attn ...
    # Evict old tokens from hot → cold
    if len(kv_hot) > window_size:
        evict_oldest_to_cold()
Current Status
- Implementation: Works with Hugging Face Transformers through a custom cache backend
- License: Apache 2.0
- Paper: Full technical spec available
- Next: CUDA kernel fusion for cold decompression planned
How to Try It
git clone https://github.com/JohannaWeb/Monarch.git
cd Monarch
pip install -r requirements.txt
python train_tinyllama_fp16.py
python src/benchmark_monarch.py \
    --model models/tinyllama_fp16 \
    --mode both \
    --max-new-tokens 100 \
    --promotion-threshold 0.15 \
    --sticky-threshold 3 \
    --json
Limitations
The approach assumes that recency predicts attention (recent tokens receive the most attention). That holds for most tasks but may break down on retrieval-heavy workloads, where the relevant tokens can sit far back in the context. Attention extraction is available for base models but not for chat variants; for those, the fallback is window-only paging.
📖 Read the full source: r/LocalLLaMA