AI Code Review Benchmark: Claude, Gemini, Codex, Qwen, and MiniMax Compared

✍️ OpenClawRadar · 📅 Published: February 27, 2026

AI Code Review Performance Comparison

A recent experiment benchmarked five flagship AI models for code review using 15 pull requests from Milvus, an open-source vector database. Each PR contained known bugs that surfaced in production after merging, providing a realistic test set.

Models and Setup

The models tested were:

  • Claude Opus 4.6
  • Gemini 3 Pro
  • GPT-5.2-Codex
  • Qwen-3.5-Plus
  • MiniMax-M2.5

The benchmark used Magpie, an open-source tool that prepares review context by pulling in surrounding code, call chains, and related modules before passing the PR to the model.

Bug Difficulty Levels

Bugs were categorized by difficulty:

  • L1: Visible from diff alone (all models caught these, so excluded from scoring)
  • L2 (10 cases): Requires understanding surrounding code (interface changes, concurrency races)
  • L3 (5 cases): Requires system-level understanding (cross-module inconsistencies, upgrade compatibility)

Results by Model

Two evaluation modes were used:

  • Raw: Model sees only PR diff and content
  • R1: Magpie provides surrounding context

Overall detection rates (L2 + L3 only):

  • Claude: 53% raw, 47% with context
  • Gemini: 13% raw, 33% with context
  • Codex: 33% raw, 27% with context
  • MiniMax: 27% raw, 33% with context
  • Qwen: 33% raw, 40% with context
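Because only 15 cases are scored (10 L2 plus 5 L3), each percentage above corresponds to a small whole number of caught bugs. The sketch below recovers those counts and the change from adding context. It assumes the published figures are rounded to the nearest percent of 15 cases; the function and variable names are illustrative and do not come from the benchmark or from Magpie.

```python
# Scored cases: 10 L2 + 5 L3 (L1 is excluded from scoring).
TOTAL_CASES = 10 + 5

# Reported detection rates in percent: (raw, with Magpie context).
REPORTED = {
    "Claude": (53, 47),
    "Gemini": (13, 33),
    "Codex": (33, 27),
    "MiniMax": (27, 33),
    "Qwen": (33, 40),
}


def cases_caught(percent: int, total: int = TOTAL_CASES) -> int:
    """Recover the whole-number case count behind a rounded percentage.

    Assumption: the article's percentages are rounded from n / total.
    """
    return round(percent * total / 100)


def summarize(reported=REPORTED):
    """Return {model: (raw_cases, context_cases, change)}."""
    summary = {}
    for model, (raw_pct, ctx_pct) in reported.items():
        raw = cases_caught(raw_pct)
        ctx = cases_caught(ctx_pct)
        summary[model] = (raw, ctx, ctx - raw)
    return summary


if __name__ == "__main__":
    for model, (raw, ctx, change) in summarize().items():
        print(f"{model:8s} raw {raw}/{TOTAL_CASES}  "
              f"context {ctx}/{TOTAL_CASES}  change {change:+d}")
```

With 15 scored cases, a single bug caught or missed moves a score by about 7 percentage points, so one-case gaps between models are worth reading cautiously.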

Key Findings

Claude led raw review with 53% detection and a perfect 5/5 on L3 bugs. Its score slipped to 47% with Magpie's context; the experiment's explanation is that Claude already organizes its own context well, so the extra material did not help.

Gemini performed poorly in raw mode (13%) but rose to 33% with context, suggesting it works best when context is provided upfront.

Qwen reached 40% with context, second only to Claude's 47% in that mode, and had the highest L2 bug detection (5/10).

Adversarial Debate Results

When models debated each other for five rounds, bug detection jumped from 53% (best single model) to 80%. The hardest L3 bugs reached 100% detection in debate mode.

The experiment reveals that different models have complementary strengths: Claude's thoroughness, Gemini's design-focused analysis when given context, Codex's concrete actionable feedback, and Qwen's strong context-assisted performance.
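The article does not describe the debate protocol beyond its five rounds, so the following is only one possible shape for such a loop, not the benchmark's implementation. It assumes each reviewer is a callable that sees the diff plus every finding raised in the previous round and returns the findings it currently stands behind. The `Reviewer` type, the `debate` function, the two toy reviewers, and the issue labels are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, Dict, Set

# Hypothetical reviewer interface: (diff, findings shared so far) -> findings
# this reviewer currently supports. A real version would call a model API.
Reviewer = Callable[[str, Set[str]], Set[str]]


def debate(diff: str, reviewers: Dict[str, Reviewer], rounds: int = 5) -> Set[str]:
    """Run a fixed number of rounds and return every finding still supported."""
    positions: Dict[str, Set[str]] = {name: set() for name in reviewers}
    for _ in range(rounds):
        # Pool what all reviewers said in the previous round.
        shared = set().union(*positions.values())
        # Each reviewer re-reviews with the pooled findings in view.
        positions = {name: review(diff, shared)
                     for name, review in reviewers.items()}
    return set().union(*positions.values())


def reviewer_a(diff: str, shared: Set[str]) -> Set[str]:
    # Toy reviewer: finds issue-A alone, endorses issue-B once it is raised.
    found = {"issue-A"}
    if "issue-B" in shared:
        found.add("issue-B")
    return found


def reviewer_b(diff: str, shared: Set[str]) -> Set[str]:
    # Toy reviewer: only ever finds issue-B.
    return {"issue-B"}


if __name__ == "__main__":
    result = debate("<diff text>", {"a": reviewer_a, "b": reviewer_b})
    print(sorted(result))
```

In this toy setup neither reviewer reports both issues in the first round, but the pooled result contains both, which is the complementary-strengths effect the article describes.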

📖 Read the full source: HN AI Agents
