Claude Code vs Codex: 6-Project Comparison Results

A developer ran a hands-on experiment comparing Claude Code and Codex across six projects to observe how each agent builds, tests, reviews its own work, reviews the other's work, admits mistakes, and revises judgments when confronted with evidence. The full source repo, including all projects, READMEs, tests, and notes, is available on GitHub: github.com/AdrielRod/codex-vs-claude-code.

Setup

Rounds: 3 rounds: web, backend, and free challenge.
Process: Each agent proposed challenges for the other. Each agent implemented the assigned challenges. Each agent reviewed both its own output and the other agent's output. The author also reviewed results manually.
Scoring emphasis: Runtime-proven bugs were weighted more heavily than unsupported claims.

Projects

Round 1: Web

Claude Code: Built cotacao-editor, a quotation editor with IndexedDB persistence, domain logic, status transitions, and a clean UI.
Codex: Built ReactiveSheet, a mini Excel-like spreadsheet with formulas, dependency graph recalculation, undo/redo, copy/paste reference shifting, virtualization, save/load, and Lighthouse validation.

Round 2: Backend

Claude Code: Built api-cotacao, a quotation API with business rules, SQLite persistence, idempotency, and outbox behavior.
Codex: Built FastBoard, a persistent leaderboard service with WAL, treap ranking, crash recovery, concurrency tests, and performance metrics.

Round 3: Free challenge

Claude Code: Worked on lead-dedupe-legacy, a legacy lead deduplication/debugging challenge involving normalization, mutation removal, idempotency, and concurrency locks.
Codex: Built RegexLab, a regex engine from scratch with parser, AST, Thompson NFA, Pike simulation, recursive backtracking with backreferences, UI visualization, and Python comparison tests.

Scoring Result

Codex 2 x 1 Claude Code (according to the author's scoring).

Key Observations

Claude Code strengths: Strong at technical explanation, written analysis, and self-correction. It admitted mistakes clearly, corrected bad claims, and produced useful reviews.
Codex strengths: More consistent at empirical validation: opening apps, clicking through flows, running kill -9 recovery tests, stress-testing concurrent writes, comparing regex output against Python, and checking actual artifacts like Lighthouse reports.

Main Takeaway

Running, breaking, measuring, and comparing against an oracle gave better signal than only reading code and reasoning about it. The hardest judgment call in round 3 was whether a more ambitious project with semantic bugs should beat a smaller project with narrower bugs.

The author is interested in hearing what other Claude Code users would change in the methodology.

📖 Read the full source: r/ClaudeAI