KV Cache Architecture Evolution: From GPT-2 to Mamba

KV Cache Memory Costs Across Model Architectures
An analysis of how KV cache designs have evolved shows per-token memory cost falling sharply across successive model generations. Each change to the attention mechanism has cut the GPU memory needed to hold conversation context during inference.
Specific Architecture Comparisons
- GPT-2 (2019): 300 KiB/token. Uses multi-head attention, where every head keeps its own keys and values with no sharing. A 4,000-token conversation needs approximately 1.2 GB of GPU memory for the cache alone, on top of the model weights.
- Llama 3 (2024): 128 KiB/token. Implements grouped-query attention, where several query heads share each set of keys and values. That is less than half of GPT-2's per-token cost, and it rests on the insight that many heads were learning redundant representations.
- DeepSeek V3 (2024): 68.6 KiB/token. Uses multi-head latent attention, which compresses keys and values into a lower-dimensional latent vector, caches only that vector, and reconstructs keys and values from it when attention is computed. The model has 671B parameters, of which 37B are active per token via MoE. DeepSeek V2's ablation studies, which V3's architecture builds on, showed the compressed representation matched or slightly beat standard MHA on several benchmarks.
- Gemma 3 (2025): Uses GQA plus sliding-window attention with a 5:1 ratio of local to global layers, where each local layer attends to only the most recent 1,024 tokens. Reported results show almost no perplexity loss from this aggressive filtering.
- Mamba/SSMs (2023): No KV cache at all. Uses a fixed-size hidden state that is updated per token. The model decides what to compress in real time rather than storing everything and attending to it later.
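The per-token figures in the list above can be reproduced with simple arithmetic. The sketch below is illustrative only: the text does not say which model sizes or precision the numbers refer to, so it assumes GPT-2 XL (48 layers, 25 heads of dimension 64), Llama 3 8B (32 layers, 8 KV heads of dimension 128), DeepSeek V3 (61 layers, a 512-dimensional latent plus a 64-dimensional RoPE key per layer), and 2 bytes per cached value. The function names are hypothetical helpers, not part of any library.

```python
KIB = 1024


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes cached per token for MHA or GQA: one key and one value
    vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def mla_bytes_per_token(layers: int, latent_dim: int, rope_dim: int,
                        dtype_bytes: int = 2) -> int:
    """Bytes cached per token for multi-head latent attention: one
    compressed latent plus one small positional key per layer."""
    return layers * (latent_dim + rope_dim) * dtype_bytes


def sliding_window_cache_fraction(context_len: int, window: int = 1024,
                                  local_per_global: int = 5) -> float:
    """Fraction of a full cache that remains when local layers keep only
    the last `window` tokens and global layers keep everything."""
    local_tokens = min(context_len, window)
    kept = local_per_global * local_tokens + context_len
    full = (local_per_global + 1) * context_len
    return kept / full


# Assumed configurations (not stated in the text), 16-bit cache entries.
gpt2_xl = kv_bytes_per_token(layers=48, kv_heads=25, head_dim=64)
llama3_8b = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
deepseek_v3 = mla_bytes_per_token(layers=61, latent_dim=512, rope_dim=64)

print(gpt2_xl / KIB)         # 300.0
print(llama3_8b / KIB)       # 128.0
print(deepseek_v3 / KIB)     # 68.625
print(4000 * gpt2_xl / 1e9)  # 1.2288 (GB for a 4,000-token conversation)

# Gemma 3 style 5:1 pattern at a 32K context (context length is an example).
print(sliding_window_cache_fraction(32_768))
```

Under these assumptions the arithmetic matches the quoted 300, 128, and 68.6 KiB/token, which suggests the figures describe those model sizes at 16-bit precision. For the sliding-window case, the saving grows with context length, because only the global layers keep paying for every token.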
Architectural Gaps and Practical Implications
The analysis highlights a gap between working memory and permanent knowledge in current architectures. The KV cache persists for seconds to minutes (reported cache lifetimes are 5-10 minutes, varying by provider and load), then disappears. Between the temporary cache and the permanent weights there is no native medium-term memory, no architectural slot for information like "I talked to this user last Tuesday."
Current solutions like RAG, file systems, vector DBs, and system prompts carrying curated context are described as "bridges over an architectural void" - lookup systems bolted onto models with no internal medium-term storage.
The compaction problem exemplifies this limitation. When the context grows too large, the model summarizes its own history, clears the cache, and continues from the summary. Precision can be lost along the way (a publishing policy with six rules becomes "something about editorial guidelines"), and the model then operates confidently on degraded context without knowing what was lost.
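A minimal sketch of the compaction loop described above, under stated assumptions: `count_tokens` is a crude whitespace stand-in for a real tokenizer, `lossy_summary` is a stub in place of a model-generated summary, the `keep_recent` parameter is an addition for illustration, and the six rules are invented example data. None of these names come from a real library.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compact_if_needed(history: List[str], budget: int,
                      summarize: Callable[[List[str]], str],
                      keep_recent: int = 0) -> List[str]:
    """If the history exceeds the token budget, replace all but the last
    `keep_recent` messages with a single summary message."""
    total = sum(count_tokens(message) for message in history)
    split = len(history) - keep_recent
    if total <= budget or split <= 0:
        return history
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


def lossy_summary(messages: List[str]) -> str:
    """Stub summarizer (hypothetical); a real system would ask the model."""
    return f"[summary of {len(messages)} earlier messages]"


history = [
    "Publishing policy: rule 1 cite sources; rule 2 no embargoed data; "
    "rule 3 two reviewers; rule 4 legal sign-off; "
    "rule 5 corrections within 24h; rule 6 archive drafts.",
    "Draft the intro for the Q3 report.",
    "Tighten the second paragraph.",
]

compacted = compact_if_needed(history, budget=20,
                              summarize=lossy_summary, keep_recent=1)
print(compacted)
# The individual rules are gone, and nothing in `compacted` records that.
print("legal sign-off" in " ".join(compacted))  # False
```

The point of the sketch is the failure mode rather than the mechanism: after compaction, the continuing context contains no marker of which details were dropped, so downstream steps cannot tell that the policy has been reduced to a placeholder.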
Cursor's learned compaction approach uses RL to train models to self-summarize well, rather than simply prompting them to compress, but the evidence is limited to one coding benchmark. Code provides a clean reward signal (tests pass or fail). Compacting editorial notes, strategic planning, or conversations in which critical details will not be needed for many messages offers no such signal.
📖 Read the full source: r/LocalLLaMA