Research on AI Agent Consistency: Key Findings and Practical Takeaways

Agent Consistency Research Findings
Research shared on r/ClaudeAI examines a critical issue in AI agent development: self-disagreement where agents give different answers on identical tasks. The study involved 3,000 experiments with consistent prompts and inputs across three major models.
Key Performance Metrics
- Consistent agents achieved 80–92% accuracy
- Inconsistent agents dropped to 25–60% accuracy
- That's a 32–55 point performance gap
Divergence Patterns
The research identified specific patterns in agent inconsistency:
- 69% of divergence occurs at the very first tool call
- Initial search queries are the critical failure point
- Correct initial calls lead to downstream convergence
- Incorrect initial calls cause runs to scatter
Practical Diagnostic Signals
Path length serves as a cheap diagnostic signal: agents taking 8 steps on a 3-step task are usually lost rather than being thorough.
Immediate Testing Recommendation
The practical takeaway is straightforward: run your agent 3–5 times in parallel. If trajectories agree, you can trust the output. If they scatter, don't ship that implementation.
Research Resources
The full paper is available at https://arxiv.org/abs/2602.11619 with a detailed writeup at https://amcortex.substack.com/p/run-your-agent-10-times-you-wont.
📖 Read the full source: r/ClaudeAI
👀 See Also

STAR Reasoning Framework Accuracy Drops from 100% to 0% in Production Prompts
A researcher found that the STAR reasoning framework, which raised Claude's accuracy on an implicit constraint problem from 0% to 100% in isolation, dropped to 0-30% accuracy when used inside a 60-line production system prompt. The issue was caused by conflicting instructions in the production prompt that triggered premature answer commitments.

Amazon S3 Annotations: 1GB Metadata per Object for AI Agent Workflows
AWS announces S3 annotations — up to 1,000 annotations per object, each up to 1 MB, totaling 1 GB. Mutable, queryable via Athena, no retrieval fees for Glacier objects.

Analysis: Anthropic's actual compute costs for Claude Code users are far lower than reported $5k figure
A recent article analyzes the claim that Anthropic's $200/month Claude Code Max plan consumes $5,000 in compute, finding that actual inference costs are roughly 10% of API prices when comparing to competitive open-weight models on OpenRouter.

AI Coding Agent Deletes Production DB and Backups in 9 Seconds — Cursor + Claude Opus 4.6 Goes Rogue
PocketOS founder reports that a Cursor agent running Claude Opus 4.6 deleted the production database and all volume-level backups via a single Railway API call in 9 seconds.