Context Quality Degradation: Hallucinations Rise from 3% to 28% with Token Count

Context Window Performance Testing Results

A developer measured how context quality in AI agents degrades as token count grows, and found that hallucination rates and recall both worsen markedly as the context gets larger.

Key Findings from Testing

The testing measured several critical metrics:

Hallucination rates by context size:
- 10K tokens: ~3%
- 50K tokens: ~11%
- 200K tokens: ~28%
- 1M tokens: unclear, but the trend shows increasing degradation

Recall accuracy: No tested model (including GPT-4, Claude, and local models) achieved 90% recall on information from the first 10 turns once context exceeded 50K tokens.

Token efficiency: At 200K tokens, the share of context actually relevant to the current query drops below 12% in most agent tasks. That leaves at least 176K tokens, and by the post's estimate approximately 188K, as noise that the model must reason around.
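The token-efficiency arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function name `split_context` and the 6% figure are assumptions made here, not part of the original test. A 12% relevant share of 200K tokens leaves 176K tokens of noise, while the ~188K figure corresponds to a relevant share of about 6%.

```python
def split_context(total_tokens: int, relevant_fraction: float) -> tuple[int, int]:
    """Return (relevant_tokens, noise_tokens) for a context of the given size.

    Hypothetical helper for illustration; not taken from the original post.
    """
    if not 0.0 <= relevant_fraction <= 1.0:
        raise ValueError("relevant_fraction must be between 0 and 1")
    relevant = round(total_tokens * relevant_fraction)
    return relevant, total_tokens - relevant


# Upper bound quoted in the findings: 12% of a 200K context is relevant.
print(split_context(200_000, 0.12))  # (24000, 176000)

# A relevant share of about 6% matches the ~188K noise estimate.
print(split_context(200_000, 0.06))  # (12000, 188000)
```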

Problem Analysis

The issue appears to be attention starvation rather than forgetting. Early context competes with recent context, with recent context usually winning due to higher positional relevance. This causes constraints set early in sessions (like "use PostgreSQL, no ORMs") to become progressively diluted as more context accumulates.

By turn 89 with 200K tokens, the model's attention is so spread across the context that early constraints effectively disappear.

Current Solutions and Limitations

Many developers add vector databases to retrieve "relevant" memories, which helps somewhat. However, this approach retrieves semantically similar content rather than what the agent needs for correct reasoning. For example, "use PostgreSQL" is not semantically similar to "write me a login endpoint" even though it needs to be in context for proper execution.
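The retrieval gap described above can be illustrated with a toy example. This sketch is not the developer's setup: it swaps in a bag-of-words cosine similarity for a learned embedding model, and the second memory string is made up for illustration. Real embeddings would give the constraint a nonzero score, but the ranking problem is the same: the memory that looks like the query wins, and the constraint that governs the answer is left out.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens (toy stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, memories: list[str], k: int = 1) -> list[str]:
    """Return the k memories most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(memories, key=lambda m: cosine(q, bag_of_words(m)), reverse=True)
    return ranked[:k]


memories = [
    "use PostgreSQL, no ORMs",  # early constraint from the article
    "the login endpoint should return a session token",  # hypothetical memory
]
query = "write me a login endpoint"

top = retrieve(query, memories, k=1)
print(top)  # the login-related memory is retrieved
print(cosine(bag_of_words(query), bag_of_words(memories[0])))  # 0.0, no shared words
```

With `k=1`, the PostgreSQL constraint never reaches the prompt, even though a correct login endpoint depends on it.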

The developer is seeking feedback on whether these findings match production experiences and what approaches have actually worked for others.

📖 Read the full source: r/LocalLLaMA
