Claude Code vs Codex: 6-Project Practical Experiment Breakdown
A developer ran a hands-on experiment comparing Claude Code and Codex across six projects to observe how each agent builds, tests, reviews its own work, reviews the other's work, admits mistakes, and revises judgments when confronted with evidence. The full source repo, including all projects, READMEs, tests, and notes, is available on GitHub: github.com/AdrielRod/codex-vs-claude-code.
Setup
- Rounds: Three rounds: web, backend, and a free challenge.
- Process: Each agent proposed challenges for the other. Each agent implemented the assigned challenges. Each agent reviewed both its own output and the other agent's output. The author also reviewed results manually.
- Scoring emphasis: Runtime-proven bugs were weighted more heavily than unsupported claims.
Projects
Round 1: Web
- Claude Code: Built cotacao-editor, a quotation editor with IndexedDB persistence, domain logic, status transitions, and a clean UI.
- Codex: Built ReactiveSheet, a mini Excel-like spreadsheet with formulas, dependency graph recalculation, undo/redo, copy/paste reference shifting, virtualization, save/load, and Lighthouse validation.
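To make "dependency graph recalculation" concrete, here is a minimal sketch of the general technique. It is not ReactiveSheet's code: the `Sheet` class, its method names, and the choice to model formulas as Python callables are all assumptions made for illustration. When a cell changes, the sketch collects every downstream cell and re-evaluates them in topological order, so each formula runs only after its inputs are fresh.

```python
from collections import defaultdict, deque


class Sheet:
    """Toy spreadsheet (hypothetical): cells hold constants or formulas."""

    def __init__(self):
        self.values = {}                     # cell -> current value
        self.formulas = {}                   # cell -> (func, [input cells])
        self.dependents = defaultdict(set)   # cell -> cells that read it

    def set_value(self, cell, value):
        self._clear_formula(cell)
        self.values[cell] = value
        self._recalculate(cell)

    def set_formula(self, cell, func, inputs):
        self._clear_formula(cell)
        self.formulas[cell] = (func, list(inputs))
        for src in inputs:
            self.dependents[src].add(cell)
        self._evaluate(cell)
        self._recalculate(cell)

    def _clear_formula(self, cell):
        if cell in self.formulas:
            _, inputs = self.formulas.pop(cell)
            for src in inputs:
                self.dependents[src].discard(cell)

    def _evaluate(self, cell):
        func, inputs = self.formulas[cell]
        # Missing inputs default to 0, a simplification for this sketch.
        self.values[cell] = func(*(self.values.get(i, 0) for i in inputs))

    def _recalculate(self, changed):
        # 1. Collect every cell downstream of the change.
        affected, queue = set(), deque([changed])
        while queue:
            for dep in self.dependents[queue.popleft()]:
                if dep not in affected:
                    affected.add(dep)
                    queue.append(dep)
        # 2. Kahn's algorithm restricted to the affected cells.
        indegree = {c: len(set(self.formulas[c][1]) & affected) for c in affected}
        ready = deque(c for c, d in indegree.items() if d == 0)
        done = 0
        while ready:
            cell = ready.popleft()
            self._evaluate(cell)
            done += 1
            for dep in self.dependents[cell]:
                if dep in affected:
                    indegree[dep] -= 1
                    if indegree[dep] == 0:
                        ready.append(dep)
        # Cells that never became ready sit on a cycle.
        if done != len(affected):
            raise ValueError("circular reference detected")
```

For example, with `A3 = A1 + A2` and `A4 = A3 * 10`, changing `A1` re-evaluates `A3` first and `A4` second. A real spreadsheet would also need formula parsing and reference shifting, which this sketch leaves out.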
Round 2: Backend
- Claude Code: Built api-cotacao, a quotation API with business rules, SQLite persistence, idempotency, and outbox behavior.
- Codex: Built FastBoard, a persistent leaderboard service with WAL, treap ranking, crash recovery, concurrency tests, and performance metrics.
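As an illustration of the idempotency behavior mentioned for api-cotacao, the following is a minimal sketch using Python's standard `sqlite3` module. It is an assumption about how such a rule could be enforced, not the project's implementation: the table layout and the `open_db` and `create_quote` helpers are hypothetical. The idea is that a repeated request carrying the same idempotency key returns the stored response instead of creating a second quote.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open a database with a hypothetical quotes schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS quotes ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " customer TEXT NOT NULL,"
        " total_cents INTEGER NOT NULL)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS idempotency_keys ("
        " key TEXT PRIMARY KEY,"
        " response TEXT NOT NULL)"
    )
    return db


def create_quote(db, idempotency_key, customer, total_cents):
    """Create a quote once per key; replays return the stored response."""
    with db:  # the two INSERTs below commit together or roll back together
        row = db.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?",
            (idempotency_key,),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        cur = db.execute(
            "INSERT INTO quotes (customer, total_cents) VALUES (?, ?)",
            (customer, total_cents),
        )
        response = {
            "id": cur.lastrowid,
            "customer": customer,
            "total_cents": total_cents,
        }
        # The PRIMARY KEY on `key` rejects a concurrent duplicate, which
        # rolls back the quote insert above as well.
        db.execute(
            "INSERT INTO idempotency_keys (key, response) VALUES (?, ?)",
            (idempotency_key, json.dumps(response)),
        )
        return response
```

Calling `create_quote(db, "k1", "acme", 1000)` twice leaves a single row in `quotes` and returns the same response both times. Keeping the key and the quote in the same transaction is the design choice that matters here: a crash between the two writes cannot leave a quote without its key.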
Round 3: Free challenge
- Claude Code: Worked on lead-dedupe-legacy, a legacy lead deduplication/debugging challenge involving normalization, mutation removal, idempotency, and concurrency locks.
- Codex: Built RegexLab, a regex engine from scratch with parser, AST, Thompson NFA, Pike simulation, recursive backtracking with backreferences, UI visualization, and Python comparison tests.
Scoring Result
Codex 2, Claude Code 1 (according to the author's scoring).
Key Observations
- Claude Code strengths: Strong at technical explanation, written analysis, and self-correction. It admitted mistakes clearly, corrected bad claims, and produced useful reviews.
- Codex strengths: More consistent at empirical validation: opening apps, clicking through flows, running kill -9 recovery tests, stress-testing concurrent writes, comparing regex output against Python, and checking actual artifacts like Lighthouse reports.
Main Takeaway
Running, breaking, measuring, and comparing against an oracle gave better signal than only reading code and reasoning about it. The hardest judgment call in round 3 was whether a more ambitious project with semantic bugs should beat a smaller project with narrower bugs.
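The "compare against an oracle" idea can be sketched as a small differential test. This is not the experiment's test suite: `toy_fullmatch` is a hypothetical stand-in for an engine under test (it supports only literals, `.`, and postfix `*`), and `find_disagreements` is an illustrative helper. Python's `re` module plays the oracle, and every short string over a small alphabet is tried against each pattern.

```python
import itertools
import re


def toy_fullmatch(pattern, text):
    """Stand-in engine under test: literals, '.', and postfix '*' only."""
    if not pattern:
        return not text
    first = bool(text) and pattern[0] in (".", text[0])
    if len(pattern) >= 2 and pattern[1] == "*":
        # Either skip the starred atom, or consume one char and retry.
        return toy_fullmatch(pattern[2:], text) or (
            first and toy_fullmatch(pattern, text[1:])
        )
    return first and toy_fullmatch(pattern[1:], text[1:])


def find_disagreements(engine, patterns, alphabet="ab", max_len=4):
    """Return (pattern, text) pairs where engine and Python's re differ."""
    texts = [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
    ]
    mismatches = []
    for pattern in patterns:
        oracle = re.compile(pattern)
        for text in texts:
            expected = oracle.fullmatch(text) is not None
            if engine(pattern, text) != expected:
                mismatches.append((pattern, text))
    return mismatches
```

An empty result from `find_disagreements(toy_fullmatch, ["a*b", "a.b", ".*"])` means the engine agreed with the oracle on every generated input. Any pair it does return is a reproducible counterexample, which is the kind of runtime-proven evidence the scoring favored.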
The author is interested in hearing what other Claude Code users would change in the methodology.
📖 Read the full source: r/ClaudeAI