KV Cache Memory Optimization: From 300 KiB to 68.6 KiB

KV Cache Memory Costs Across Model Architectures

A recent analysis of KV cache architecture evolution reveals significant improvements in memory efficiency across transformer models. The progression shows how different attention mechanisms have reduced the GPU memory required for maintaining conversation context during inference.

Specific Architecture Comparisons

GPT-2 (2019): 300 KiB/token. Uses multi-head attention where every head maintains its own keys and values with no sharing. A 4,000-token conversation requires approximately 1.2 GB of GPU memory just for the cache, separate from model weights.
Llama 3 (2024): 128 KiB/token. Implements grouped-query attention (GQA), where multiple query heads share the same KV pairs. This is less than half of GPT-2's cost, and the design rests on the observation that many heads were learning redundant representations.
DeepSeek V3 (2024): 68.6 KiB/token. Uses multi-head latent attention (MLA), which compresses KV pairs into a lower-dimensional latent space for storage and decompresses them when attention is computed. The model has 671B parameters, of which 37B are active per token via mixture-of-experts (MoE). DeepSeek V2's ablation studies, which V3's architecture builds on, showed the compressed representation matched or slightly beat standard MHA on several benchmarks.
Gemma 3 (2025): Uses GQA plus sliding-window attention, interleaving local and global attention layers at a 5:1 ratio, where local layers attend to only the most recent 1,024 tokens. Shows almost no perplexity loss from the aggressive filtering.
Mamba/SSMs (2023): No KV cache at all. Uses a fixed-size hidden state that is updated per token. The model decides what to compress in real time rather than storing everything and attending to it later.
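
The per-token figures above can be reproduced with simple arithmetic. The sketch below is illustrative only. The source does not say which model sizes it measured, so the layer counts, head counts, and dimensions here are assumptions (GPT-2 XL, Llama 3 8B, and DeepSeek V3's published latent sizes), as is 16-bit cache precision. The function names are hypothetical and do not come from any library.

```python
FP16_BYTES = 2  # assumption: the cache is stored in 16-bit precision


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=FP16_BYTES):
    """MHA or GQA: one key and one value vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def mla_bytes_per_token(n_layers, latent_dim, rope_dim, dtype_bytes=FP16_BYTES):
    """MLA: one compressed latent plus a small positional key per layer."""
    return n_layers * (latent_dim + rope_dim) * dtype_bytes


def kib(n_bytes):
    return n_bytes / 1024


def sliding_window_fraction(context_len, window=1024, local_per_global=5):
    """Cached tokens in a group of local + global layers, relative to
    full attention in every layer. context_len must be positive."""
    local = local_per_global * min(context_len, window)
    return (local + context_len) / ((local_per_global + 1) * context_len)


# Assumed configurations (not stated in the source).
gpt2 = kv_bytes_per_token(n_layers=48, n_kv_heads=25, head_dim=64)    # GPT-2 XL
llama3 = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # Llama 3 8B
deepseek_v3 = mla_bytes_per_token(n_layers=61, latent_dim=512, rope_dim=64)

for name, size in [("GPT-2", gpt2), ("Llama 3", llama3), ("DeepSeek V3", deepseek_v3)]:
    print(f"{name}: {kib(size):.1f} KiB/token")

# A 4,000-token conversation at GPT-2's per-token cost, in decimal GB.
print(f"GPT-2 at 4,000 tokens: {4000 * gpt2 / 1e9:.2f} GB")

# Share of a full cache kept by a 5:1 local-to-global pattern at 32K context.
print(f"Sliding-window share at 32,768 tokens: {sliding_window_fraction(32768):.1%}")
```

Under these assumptions the three per-token results line up with the 300, 128, and 68.6 KiB figures quoted above. The sliding-window function only counts cached tokens per layer group, so it shows the relative saving of Gemma 3's layout without assuming that model's exact dimensions.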

Architectural Gaps and Practical Implications

The analysis highlights a gap between working memory and permanent knowledge in current architectures. KV cache persists for seconds to minutes (reported cache lifetimes are 5-10 minutes, varying by provider and load), then disappears. Between the temporary cache and permanent weights, there's no native medium-term memory or architectural slot for information like "I talked to this user last Tuesday."

Current solutions like RAG, file systems, vector DBs, and system prompts carrying curated context are described as "bridges over an architectural void" - lookup systems bolted onto models with no internal medium-term storage.

The compaction problem exemplifies this limitation. When context grows too large, models summarize their own history, clear the cache, and continue from the summary. The summary can lose precision (a publishing policy with six rules becomes "something about editorial guidelines"), and the model then operates confidently on degraded context without knowing what was lost.
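
A minimal sketch of the compaction loop described above, to make the failure mode concrete. Everything here is hypothetical: `compact`, `count_tokens`, and the `summarize` callback are illustrative names, whitespace word count stands in for a real tokenizer, and in practice the summary would come from the model itself rather than a stub.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude proxy for a tokenizer: count whitespace-separated words.
    return len(text.split())


def compact(
    history: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> List[str]:
    """Replace older messages with one summary once the budget is exceeded."""
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)  # still fits, nothing to compact
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    # Whatever summarize() drops is gone for good: the model continues
    # from the summary and cannot see what was removed.
    return [summarize(old)] + recent


if __name__ == "__main__":
    history = [
        "Policy rule 1: cite every statistic.",
        "Policy rule 2: no anonymous quotes.",
        "Draft the intro paragraph.",
        "Now revise the headline.",
    ]
    # A deliberately lossy stub standing in for a model-written summary.
    lossy = lambda old: "Earlier: something about editorial guidelines."
    for message in compact(history, budget=10, summarize=lossy):
        print(message)
```

The stub summarizer shows the precision loss directly: both policy rules collapse into one vague line, and nothing in the returned history records that detail was discarded.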

Cursor's learned compaction approach uses RL to train models to self-summarize well, rather than simply prompting them to compress, but the evidence is limited to one coding benchmark. Code provides clean reward signals (tests pass or fail). That is not true of tasks such as compacting editorial notes, strategic planning, or conversations in which critical details won't be needed for many messages.

📖 Read the full source: r/LocalLLaMA
