ETH Zurich Study: Excessive Context Reduces AI Coding Agent Performance

A recent study from ETH Zurich provides concrete evidence that more context does not necessarily improve the performance of AI coding agents. The researchers tested four coding agents on 138 real GitHub tasks and reported quantitative results.
Key Findings
The study found that LLM-generated context files reduced task success rates by 2-3% while increasing inference costs by 20%. Even human-written context files improved success by only about 4%, and they still significantly increased costs.
The Core Problem
Researchers found that agents treated every instruction in a context file as something they had to carry out. In one experiment, when the researchers stripped repositories down to only the generated context file, performance improved again. This suggests that agents struggle to separate essential instructions from irrelevant historical information.
Practical Recommendations
The study recommends including only information that the agent genuinely cannot discover on its own, and keeping context minimal. This is particularly relevant for communication data such as email threads, which look like context but are often interpreted as instructions when they are really historical noise.
Context API Solution
To address this issue, researchers developed a context API (iGPT) that focuses on email processing. The API:
- Reconstructs email threads into conversation graphs before context reaches the model
- Deduplicates quoted text
- Detects who said what and when
- Returns structured JSON instead of raw text
This approach ensures agents receive filtered context rather than entire conversation histories, improving their ability to focus on relevant information.
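The steps listed above can be illustrated with a minimal Python sketch. This is not iGPT's actual API or implementation. The input shape (`id`, `in_reply_to`, `sender`, `date`, `body`), the function names `strip_quoted` and `build_thread`, and the simple rule for detecting quoted text are all assumptions made for illustration. Dates are assumed to be ISO 8601 strings in the same time zone so that they sort correctly as text.

```python
import json
import re

# Assumption: attribution lines look like "On <date> <sender> wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:$")


def strip_quoted(body: str) -> str:
    """Drop '>'-quoted lines and attribution lines, keeping only new text."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">") or ATTRIBUTION.match(stripped):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


def build_thread(messages: list[dict]) -> dict:
    """Turn flat messages into a reply graph recording who said what, and when."""
    ordered = sorted(messages, key=lambda m: m["date"])
    nodes = {
        m["id"]: {
            "id": m["id"],
            "from": m["sender"],
            "date": m["date"],
            "text": strip_quoted(m["body"]),
            "replies": [],
        }
        for m in ordered
    }
    roots = []
    for m in ordered:
        node = nodes[m["id"]]
        parent = nodes.get(m.get("in_reply_to"))
        # Messages with no known parent (or a self-reference) start a thread.
        if parent is None or parent is node:
            roots.append(node)
        else:
            parent["replies"].append(node)
    return {"threads": roots}


if __name__ == "__main__":
    # Hypothetical two-message thread; the reply quotes the original.
    raw = [
        {"id": "a", "in_reply_to": None, "sender": "alice@example.com",
         "date": "2024-05-01T09:00:00Z", "body": "Can we ship Friday?"},
        {"id": "b", "in_reply_to": "a", "sender": "bob@example.com",
         "date": "2024-05-01T10:30:00Z",
         "body": "Yes, Friday works.\n\n"
                 "On Wed, May 1, 2024 alice@example.com wrote:\n"
                 "> Can we ship Friday?"},
    ]
    print(json.dumps(build_thread(raw), indent=2))
```

In this sketch the quoted copy of the first message is removed from the reply, so each statement appears once, attached to its author and timestamp. A production system would need more robust quote detection than a leading `>` check.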
📖 Read the full source: r/LocalLLaMA