AI Code Review Benchmark: Claude, Gemini, Codex, Qwen, and MiniMax Compared

AI Code Review Performance Comparison
A recent experiment benchmarked five flagship AI models for code review using 15 pull requests from Milvus, an open-source vector database. Each PR contained known bugs that surfaced in production after merging, providing a realistic test set.
Models and Setup
The models tested were:
- Claude Opus 4.6
- Gemini 3 Pro
- GPT-5.2-Codex
- Qwen-3.5-Plus
- MiniMax-M2.5
The benchmark used Magpie, an open-source tool that prepares context by pulling in surrounding code, call chains, and related modules before feeding it to the model.
Bug Difficulty Levels
Bugs were categorized by difficulty:
- L1: Visible from the diff alone (all models caught these, so they were excluded from scoring)
- L2 (10 cases): Requires understanding surrounding code (interface changes, concurrency races)
- L3 (5 cases): Requires system-level understanding (cross-module inconsistencies, upgrade compatibility)
Results by Model
Two evaluation modes were used:
- Raw: Model sees only PR diff and content
- R1: Magpie provides surrounding context
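The difference between the two modes can be sketched as a prompt-building step. This is a minimal illustration only: the `PullRequest` type, the `context_snippets` field, and `build_prompt` are hypothetical names, not Magpie's actual interface, and the real prompts used in the benchmark are not given in the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PullRequest:
    title: str
    description: str
    diff: str
    # Hypothetical: snippets a context tool might gather
    # (surrounding code, call chains, related modules).
    context_snippets: List[str] = field(default_factory=list)


def build_prompt(pr: PullRequest, mode: str) -> str:
    """Assemble the review prompt for 'raw' or 'r1' mode (assumed names)."""
    parts = [
        f"Title: {pr.title}",
        f"Description: {pr.description}",
        "Diff:",
        pr.diff,
    ]
    if mode == "r1":
        # Context-assisted mode: append the gathered context after the diff.
        parts.append("Surrounding context:")
        parts.extend(pr.context_snippets)
    elif mode != "raw":
        raise ValueError(f"unknown mode: {mode}")
    return "\n".join(parts)
```

In raw mode the model receives only what the PR itself contains; in R1 mode the same prompt is extended with whatever the context tool collected.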
Overall detection rates (L2 + L3 only):
- Claude: 53% raw, 47% with context
- Gemini: 13% raw, 33% with context
- Codex: 33% raw, 27% with context
- MiniMax: 27% raw, 33% with context
- Qwen: 33% raw, 40% with context
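With only 15 scored cases (10 L2 + 5 L3), each percentage corresponds to a small whole number of bugs. The sketch below converts the reported rates back into implied case counts and shows how many cases each model gained or lost with context. The counts are derived by rounding and are an inference from the percentages above, not figures stated in the source.

```python
TOTAL_CASES = 15  # 10 L2 cases + 5 L3 cases

# (raw %, with-context %) as reported in the list above.
RATES = {
    "Claude": (53, 47),
    "Gemini": (13, 33),
    "Codex": (33, 27),
    "MiniMax": (27, 33),
    "Qwen": (33, 40),
}


def implied_count(percent, total=TOTAL_CASES):
    """Nearest whole number of detected cases for a reported percentage."""
    return round(percent * total / 100)


def summarize(rates=RATES):
    """Map each model to its implied raw/context counts and the difference."""
    rows = {}
    for model, (raw_pct, ctx_pct) in rates.items():
        raw_n = implied_count(raw_pct)
        ctx_n = implied_count(ctx_pct)
        rows[model] = {"raw": raw_n, "context": ctx_n, "delta": ctx_n - raw_n}
    return rows


if __name__ == "__main__":
    for model, row in summarize().items():
        print(f"{model:8s} raw {row['raw']:2d}/15  "
              f"context {row['context']:2d}/15  delta {row['delta']:+d}")
```

Read this way, most of the raw-versus-context differences amount to a single case in either direction, which is worth keeping in mind when comparing models.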
Key Findings
Claude led raw review with 53% detection and a perfect 5/5 on L3 bugs. With added context its rate fell to 47%; the suggested explanation is that Claude already organizes its own context well, so the extra material did not help.
Gemini performed poorly in raw mode (13%) but rose to 33% with context, the largest gain of any model, suggesting it needs context provided upfront.
Qwen reached 40% with context, the best result among the models that improved with context (Claude still scored higher at 47%), and had the highest L2 bug detection (5/10).
Adversarial Debate Results
When models debated each other for five rounds, bug detection jumped from 53% (best single model) to 80%. The hardest L3 bugs reached 100% detection in debate mode.
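The source does not describe the debate protocol beyond the five rounds, so the following is only one plausible shape for such a loop. It assumes each reviewer sees its peers' findings from the previous round and may adopt them, and that a finding is kept if a majority of reviewers hold it after the last round. The `Reviewer` signature, `make_stub`, and the majority rule are all hypothetical; real reviewers would be model calls rather than the stubs used here.

```python
from collections import Counter
from typing import Callable, Dict, Set

# Hypothetical signature: (diff, peer findings from last round) -> own findings.
Reviewer = Callable[[str, Set[str]], Set[str]]


def debate(diff: str, reviewers: Dict[str, Reviewer], rounds: int = 5) -> Set[str]:
    """Run a fixed number of rounds and keep findings held by a majority."""
    findings: Dict[str, Set[str]] = {name: set() for name in reviewers}
    for _ in range(rounds):
        updated = {}
        for name, review in reviewers.items():
            # Each reviewer sees what the others reported in the previous round.
            peers = set().union(
                *(f for other, f in findings.items() if other != name)
            )
            updated[name] = review(diff, peers)
        findings = updated
    votes = Counter(f for held in findings.values() for f in held)
    quorum = len(reviewers) // 2 + 1  # assumed rule: simple majority
    return {f for f, n in votes.items() if n >= quorum}


def make_stub(own: Set[str], accepts: Set[str]) -> Reviewer:
    """Stand-in for a model: reports its own findings plus peer findings it accepts."""
    def review(diff: str, peers: Set[str]) -> Set[str]:
        return own | (peers & accepts)
    return review


reviewers = {
    "a": make_stub({"race"}, {"compat"}),
    "b": make_stub({"compat"}, {"race"}),
    "c": make_stub(set(), {"race"}),
}
```

With these stubs, a single round leaves each finding with only one supporter, while later rounds let reviewers adopt each other's findings until both reach a majority. That mirrors the complementary-strengths effect the benchmark describes, without claiming to reproduce its method.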
The experiment reveals that different models have complementary strengths: Claude's thoroughness, Gemini's design-focused analysis when given context, Codex's concrete actionable feedback, and Qwen's strong context-assisted performance.
📖 Read the full source: HN AI Agents