Benchmark shows context engine reduces AI coding agent costs by 3x on SWE-bench

A developer benchmarked four AI coding agents on SWE-bench Verified using the same Claude Opus 4.5 model, with context management as the only variable. Resolution rates landed within three points of each other (70% to 73%), while cost per task ranged from $0.67 to $1.98.
Benchmark setup
The test used a 100-task stratified subset of SWE-bench Verified with all 12 repositories represented proportionally. All agents ran Claude Opus 4.5 with the same $3/task budget and 250-turn limit. The only difference was the context layer in front of the model.
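The post does not say how the subset was drawn. As a rough sketch of what "stratified, with repositories represented proportionally" can mean, the following allocates slots to each repository in proportion to its size and keeps at least one task per repository. The function name `stratified_subset`, the repository names, and the task ids are all hypothetical and are not taken from the benchmark harness.

```python
import random

def stratified_subset(tasks_by_repo, total, seed=0):
    """Pick `total` task ids, giving each repo a share proportional to its size."""
    if any(len(ids) == 0 for ids in tasks_by_repo.values()):
        raise ValueError("every repo needs at least one task")
    population = sum(len(ids) for ids in tasks_by_repo.values())
    if not len(tasks_by_repo) <= total <= population:
        raise ValueError("total must lie between the repo count and the population size")

    # Ideal fractional share per repo, floored, with a minimum of one task each.
    shares = {repo: total * len(ids) / population for repo, ids in tasks_by_repo.items()}
    quota = {repo: max(1, int(share)) for repo, share in shares.items()}

    # Hand out leftover slots to the repos with the largest fractional remainder.
    order = sorted(shares, key=lambda r: shares[r] - int(shares[r]), reverse=True)
    i = 0
    while sum(quota.values()) < total:
        repo = order[i % len(order)]
        if quota[repo] < len(tasks_by_repo[repo]):
            quota[repo] += 1
        i += 1

    # The one-task minimum can overshoot; take the excess back from the largest quotas.
    while sum(quota.values()) > total:
        biggest = max(quota, key=quota.get)
        quota[biggest] -= 1

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return {repo: rng.sample(ids, quota[repo]) for repo, ids in tasks_by_repo.items()}

# Hypothetical pool: three repos holding 60, 30, and 10 tasks.
demo_pool = {
    "repo_a": [f"a-{n}" for n in range(60)],
    "repo_b": [f"b-{n}" for n in range(30)],
    "repo_c": [f"c-{n}" for n in range(10)],
}
demo_subset = stratified_subset(demo_pool, total=10)
```

Fixing the seed matters for a benchmark like this one: every agent has to face the same tasks for the cost comparison to be meaningful.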
Results
- Context engine + Claude Code: 73.0% Pass@1, $0.67/task
- Live-SWE-Agent: 72.0% Pass@1, $0.86/task
- OpenHands: 70.0% Pass@1, $1.77/task
- Sonar Foundation: 70.0% Pass@1, $1.98/task
The most expensive setup (Sonar Foundation, $1.98/task) costs roughly 3x as much per task as the cheapest ($0.67/task) while resolving fewer tasks. Eight tasks were solved only by the setup with the context layer: bugs that the model couldn't fix without seeing the right code.
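The 3x figure can be checked directly from the numbers in the results list. The short script below does that arithmetic and also derives a cost per resolved task (cost per task divided by Pass@1). That second metric is an illustration added here, not one the post reports.

```python
# Figures as reported in the results list: (Pass@1, average cost per task in USD).
results = {
    "Context engine + Claude Code": (0.73, 0.67),
    "Live-SWE-Agent": (0.72, 0.86),
    "OpenHands": (0.70, 1.77),
    "Sonar Foundation": (0.70, 1.98),
}

cheapest = min(cost for _, cost in results.values())

def cost_ratio(name):
    """Cost per task relative to the cheapest setup."""
    return results[name][1] / cheapest

def cost_per_resolved(name):
    """Average spend per task that was actually resolved (derived metric)."""
    pass_rate, cost = results[name]
    return cost / pass_rate

for name in results:
    print(f"{name}: {cost_ratio(name):.2f}x cheapest, "
          f"${cost_per_resolved(name):.2f} per resolved task")
```

On these figures, Sonar Foundation comes out at about 2.96x the per-task cost of the context-engine setup, which the text rounds to 3x. The gap widens slightly once the lower resolution rate is factored in.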
Limitations
On matplotlib (rendering-heavy, visual output code), the context engine scored 43% while Sonar Foundation hit 86%. Graph-based context is less effective when relevant code doesn't follow dependency chains.
How the context layer works
Instead of letting Claude read entire files, the context engine pre-indexes the codebase into a dependency graph using tree-sitter + SQLite (30 languages supported) and returns a ranked context capsule: full source for the functions that matter, and skeletonized signatures for everything connected to them. The agent starts every task already knowing what's relevant.
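To make the capsule idea concrete, here is a minimal sketch using Python's built-in `sqlite3`. It is not the tool's actual schema or code: the table layout, the `build_index` and `context_capsule` names, and the hand-written symbol list (standing in for what a tree-sitter pass would extract) are all assumptions. It also skips ranking and only looks one hop out in the graph.

```python
import sqlite3

def build_index(symbols, edges):
    """Store symbols and call edges in an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE symbols (name TEXT PRIMARY KEY, signature TEXT, source TEXT)")
    db.execute("CREATE TABLE edges (caller TEXT, callee TEXT)")
    db.executemany("INSERT INTO symbols VALUES (?, ?, ?)", symbols)
    db.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    return db

def context_capsule(db, focus):
    """Full source for the `focus` symbols, signatures only for direct neighbours."""
    marks = ",".join("?" * len(focus))  # one placeholder per focus symbol
    full = db.execute(
        f"SELECT name, source FROM symbols WHERE name IN ({marks}) ORDER BY name",
        focus,
    ).fetchall()
    # Neighbours are symbols that call, or are called by, a focus symbol.
    skeleton = db.execute(
        f"""SELECT DISTINCT s.name, s.signature
            FROM symbols s JOIN edges e
              ON (e.callee = s.name AND e.caller IN ({marks}))
              OR (e.caller = s.name AND e.callee IN ({marks}))
            WHERE s.name NOT IN ({marks})
            ORDER BY s.name""",
        focus * 3,
    ).fetchall()
    return {"full": dict(full), "skeleton": dict(skeleton)}

# Hypothetical four-function codebase.
demo_symbols = [
    ("main", "def main()", "def main():\n    render(parse('a b'))"),
    ("parse", "def parse(text)", "def parse(text):\n    return tokenize(text)"),
    ("render", "def render(tree)", "def render(tree):\n    return str(tree)"),
    ("tokenize", "def tokenize(text)", "def tokenize(text):\n    return text.split()"),
]
demo_edges = [("main", "parse"), ("main", "render"), ("parse", "tokenize")]

demo_db = build_index(demo_symbols, demo_edges)
capsule = context_capsule(demo_db, ["parse"])
```

Asking for `parse` returns its full body plus one-line signatures for `main` and `tokenize`, and leaves `render` out because no edge connects it to `parse`. That omission is the token saving, and it is also why code that is relevant but not linked by a dependency edge can be missed.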
It also keeps a memory of what the agent has observed, exposed via MCP, that persists across sessions. When code changes, previous observations are automatically flagged as stale, so the agent doesn't re-explore code it has already covered and doesn't rely on outdated notes.
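The post does not describe how staleness is detected. One common approach, shown here purely as an assumption, is to store a content hash next to each observation and compare it with the current file contents on recall. The `SessionMemory` class and its methods are hypothetical, and the sketch keeps notes in memory rather than persisting them or speaking MCP.

```python
import hashlib

def fingerprint(source):
    """Content hash used to detect that a file changed after it was observed."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

class SessionMemory:
    def __init__(self):
        self._notes = []  # list of (path, fingerprint at observation time, note)

    def observe(self, path, source, note):
        """Record a note about `path` as it looked when the agent read it."""
        self._notes.append((path, fingerprint(source), note))

    def recall(self, current_sources):
        """Return (note, is_stale) pairs given the current {path: source} mapping."""
        recalled = []
        for path, digest, note in self._notes:
            current = current_sources.get(path)
            # A deleted file or a changed hash both make the note stale.
            stale = current is None or fingerprint(current) != digest
            recalled.append((note, stale))
        return recalled

memory = SessionMemory()
memory.observe("parser.py", "def parse(text): ...", "parse() delegates to tokenize()")
memory.observe("render.py", "def render(tree): ...", "render() has no side effects")

# Later session: parser.py was edited, render.py was not.
later = {
    "parser.py": "def parse(text, strict=False): ...",
    "render.py": "def render(tree): ...",
}
recalled = memory.recall(later)
```

Flagging a note as stale rather than deleting it lets the agent decide whether to re-read the file or still use the note as a hint.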
The system is 100% local with no cloud, no account, and no code leaving your machine. It works with Claude Code and 11 other agents via MCP.
Open source availability
The benchmark harness, all evaluation logs, per-instance results, and comparison scripts are available on GitHub at github.com/Vexp-ai/vexp-swe-bench. The tool itself is available at vexp.dev with a free tier, VS Code extension, or CLI. Full benchmark results with charts are at vexp.dev/benchmark.
📖 Read the full source: r/ClaudeAI