Analysis of 413K AI Agent Runs Reveals What Makes Them Succeed

A new analysis of 413,278 AI software engineering agent runs from the CoderForge-Preview dataset reveals what separates successful from failing runs. The study examined 17 billion tokens of behavioral data, comparing passing versus failing runs on identical problems.
Key Findings from the Data
The analysis shows that common human software engineering practices can actually reduce AI agent performance. Here are the specific patterns that emerged:
- Stop telling agents to "look around first": Forcing agents to grep or view files before editing reduces effectiveness. Unlike humans with limited working memory, agents already have the codebase in their context window. Early turns spent searching and exploring indicate the agent is flailing rather than learning.
- Test-driven approaches are mandatory: The single biggest predictor of successful runs is the fraction of early bash commands dedicated exclusively to running tests. Agents should not edit blindly; system prompts should enforce running the test suite immediately.
- Keep agents on a tight leash: If an agent tries to edit 3 or more files in the first 30% of its run, success rates drop significantly. Scattering edits across multiple files indicates confusion. Force agents to fix one thing at a time.
- Perseverance is an illusion: If an agent runs the exact same bash command twice early in the run, it's stuck in a loop rather than "thinking hard" or "trying again." Break the loop or restart the run.
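To make these patterns concrete, here is a minimal sketch of how the three measurable signals (early test fraction, files edited early, repeated commands) might be computed from a single run. The `Step` record, the `TEST_PREFIXES` list, and the `early_signals` function are hypothetical names chosen for illustration; they are not the dataset's schema or the analysis's actual code, and the prefix match is only a crude stand-in for "dedicated exclusively to running tests".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One agent action. The field names are assumptions, not a real trace schema."""
    kind: str            # "bash" or "edit" (hypothetical labels)
    command: str = ""    # bash command text when kind == "bash"
    path: str = ""       # edited file when kind == "edit"


# Assumed heuristic: a bash command counts as a test run if it starts with one of these.
TEST_PREFIXES = ("pytest", "python -m pytest", "tox", "npm test", "go test", "cargo test")


def early_signals(steps: List[Step], early_fraction: float = 0.3) -> Dict[str, object]:
    """Compute the three early-run signals over the first `early_fraction` of steps."""
    cutoff = max(1, int(len(steps) * early_fraction)) if steps else 0
    early = steps[:cutoff]

    bash = [s.command.strip() for s in early if s.kind == "bash"]
    test_runs = [c for c in bash if c.startswith(TEST_PREFIXES)]
    edited_files = {s.path for s in early if s.kind == "edit"}

    return {
        # Share of early bash commands that look like test runs.
        "test_fraction": len(test_runs) / len(bash) if bash else 0.0,
        # Distinct files edited early; the article flags 3 or more as a warning sign.
        "files_edited": len(edited_files),
        # True if any exact bash command appears more than once early on.
        "repeated_command": len(bash) != len(set(bash)),
    }
```

A scaffold could call `early_signals` on a partial trace and decide whether to intervene or restart; the thresholds to act on would still need to be taken from the analysis itself.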
Practical Implementation Changes
The analysis recommends specific changes to agent scaffolding:
- Stop using prompts like: "Explore the codebase, read the relevant files, and figure out the bug."
- Instead, use: "Run the test suite immediately to verify the baseline. Make targeted changes to a maximum of 1 or 2 files. Rerun tests."
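One way a scaffold might back the recommended prompt with runtime checks is sketched below. `RunGuard` and its methods are hypothetical and not part of any agent framework. The sketch also makes one assumption the source does not state: since the prompt asks the agent to rerun tests, a command only counts as a loop if it repeats with no edit in between.

```python
SYSTEM_PROMPT = (
    "Run the test suite immediately to verify the baseline. "
    "Make targeted changes to a maximum of 1 or 2 files. Rerun tests."
)


class RunGuard:
    """Hypothetical guard enforcing a file-edit budget and flagging command loops."""

    def __init__(self, max_files: int = 2) -> None:
        self.max_files = max_files
        self.edited = set()
        self.seen_commands = set()

    def allow_edit(self, path: str) -> bool:
        """Permit edits to already-touched files, or new files while under budget."""
        if path in self.edited or len(self.edited) < self.max_files:
            self.edited.add(path)
            # Assumption: after an edit, rerunning an earlier command is legitimate.
            self.seen_commands.clear()
            return True
        return False

    def is_repeat(self, command: str) -> bool:
        """Return True if this command already ran since the last accepted edit."""
        normalized = " ".join(command.split())  # collapse whitespace differences
        if normalized in self.seen_commands:
            return True
        self.seen_commands.add(normalized)
        return False
```

When `allow_edit` returns `False` or `is_repeat` returns `True`, the scaffold could reject the action, inject a corrective message, or restart the run, in line with the article's "break the loop or restart" advice.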
The key insight is to stop projecting human limitations onto LLMs. Let them use their massive context windows and force them to prove their work with tests.
📖 Read the full source: r/LocalLLaMA