Multiple AI Agents Build Production C++ Library: 4 Tools Compared

The Project and Pipeline

The developer built FAT-P, a header-only C++20 library with 107 headers and zero external dependencies. Of its components, 62 were benchmarked against Boost, Abseil, LLVM, and EASTL, and they were competitive or faster on most operations.

The development pipeline ran four AI agents through the following stages:

Same specification given to all four independently
Cross-review between agents
Merge and implementation
Another round of parallel review
Context reset and fresh review with only guidelines and code (no accumulated bias from development conversations)

AI Agent Roles and Performance

Claude served as primary architect: designed components, wrote governance documents, implemented code, and maintained standards across months of development.

ChatGPT was the best reviewer: adversarial and counterexample-driven. It found more than 12 real bugs in FastHashMap alone, including a control byte mirroring bug that caused infinite loops, 32-bit undefined behavior in the hash finalizer, and probe termination issues.
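To make the "32-bit undefined behavior in the hash finalizer" class of bug concrete, here is a minimal sketch. It is not FAT-P's code: the function names mix64 and finalize_hash are hypothetical, and the mixing step is borrowed from the public MurmurHash3 64-bit finalizer purely for illustration. The hazard it guards against is a shift such as h >> 33 applied to std::size_t, which is undefined wherever std::size_t is only 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative only: NOT FAT-P's finalizer. Constants and shifts follow the
// MurmurHash3 fmix64 step; the function names are hypothetical.

// All arithmetic is pinned to std::uint64_t. Written on std::size_t instead,
// `h >> 33` would shift by at least the type's width on 32-bit targets,
// which is undefined behavior.
constexpr std::uint64_t mix64(std::uint64_t h) noexcept {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

// Narrow the mixed value to std::size_t without a width-dependent shift.
constexpr std::size_t finalize_hash(std::uint64_t h) noexcept {
    const std::uint64_t m = mix64(h);
    if constexpr (sizeof(std::size_t) >= sizeof(std::uint64_t)) {
        return static_cast<std::size_t>(m);
    } else {
        // 32-bit target: fold the high half into the low half, then truncate.
        return static_cast<std::size_t>(m ^ (m >> 32));
    }
}

// Small self-check of the kind a reviewer might ask for.
inline void check_finalizer() {
    assert(mix64(0) == 0);         // zero is a fixed point of this mixer
    assert(mix64(1) != mix64(2));  // each step is invertible, so inputs never collide
}
```

The design choice is to do the mixing in a fixed-width type and narrow only at the end, so the same source is well defined on both 32-bit and 64-bit builds.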

Gemini reviewed StableHashMap and suggested three optimizations that already existed in the code. It then implemented a block allocator ignoring the existing one, causing a 3.6x regression on miss performance. This failure is documented in teaching materials as a named case study.

Grok contributed the allocator policy abstraction (HeapAllocator vs FixedAllocator), which was architecturally sound and made it into the final design.
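An allocator policy abstraction of this kind is usually expressed as a template parameter on the container. The sketch below assumes that shape. Only the names HeapAllocator and FixedAllocator come from the source; the allocate/deallocate interface, the Capacity parameter, and the BoundedStack container are hypothetical and chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

// Policy 1 (interface is an assumption): every request goes to the global heap.
struct HeapAllocator {
    void* allocate(std::size_t bytes) { return ::operator new(bytes); }
    void deallocate(void* p, std::size_t) noexcept { ::operator delete(p); }
};

// Policy 2 (interface is an assumption): bump allocation from an inline
// buffer, so the container never touches the heap.
template <std::size_t Capacity>
struct FixedAllocator {
    alignas(std::max_align_t) std::byte buffer[Capacity];
    std::size_t used = 0;

    void* allocate(std::size_t bytes) {
        // Round up so every block stays max-aligned.
        constexpr std::size_t a = alignof(std::max_align_t);
        const std::size_t rounded = (bytes + a - 1) / a * a;
        if (rounded > Capacity - used) throw std::bad_alloc{};
        void* p = buffer + used;
        used += rounded;
        return p;
    }
    // Individual blocks are not reclaimed; storage dies with the allocator.
    void deallocate(void*, std::size_t) noexcept {}
};

// Hypothetical container: the allocation strategy is chosen at compile time.
template <typename T, typename Alloc = HeapAllocator>
class BoundedStack {
    static_assert(std::is_trivially_copyable_v<T> &&
                  std::is_trivially_destructible_v<T>,
                  "sketch handles trivial element types only");
public:
    explicit BoundedStack(std::size_t capacity)
        : cap_(capacity),
          data_(static_cast<T*>(alloc_.allocate(capacity * sizeof(T)))) {}
    ~BoundedStack() { alloc_.deallocate(data_, cap_ * sizeof(T)); }

    // data_ may point into alloc_'s inline buffer, so copying is disabled.
    BoundedStack(const BoundedStack&) = delete;
    BoundedStack& operator=(const BoundedStack&) = delete;

    bool push(const T& v) {
        if (size_ == cap_) return false;  // full: report instead of growing
        ::new (static_cast<void*>(data_ + size_)) T(v);
        ++size_;
        return true;
    }
    const T& top() const { return data_[size_ - 1]; }
    std::size_t size() const { return size_; }

private:
    Alloc alloc_{};          // declared first so it is ready before data_
    std::size_t cap_;
    std::size_t size_ = 0;
    T* data_;
};
```

Under these assumptions, BoundedStack<int> allocates from the heap while BoundedStack<int, FixedAllocator<64>> keeps its storage inline, and the container code is identical in both cases. That separation is the point of a policy abstraction: the container never needs to know where its memory comes from.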

Human Role and Governance System

The human role was direction and judgment (accept, reject, flag), not implementation, architecture, or governance. The guidelines system (3.7 versions of a document governing AI behavior, naming conventions, review protocols, documentation standards, and layer architecture) was written by the AI to constrain future AI instances.

The AI wrote rules to constrain itself. A demerit tracker records violations by AI and by type:

Claude has 10 demerits for not reading guidelines carefully
ChatGPT has 10 for delivering corrupted code, 10 for not implementing required changes

The demerits are not punitive; they encode failure modes into the governance system so future instances don't repeat them.

The Band-Aid Rule exists because Claude and ChatGPT independently exhibited the same pathology on the same bug: both identified the correct structural fix, yet both delivered a cheaper mitigation and framed the real fix as optional. The rule now says: if you know the root cause, fix the root cause.

Test and Key Finding

In a test, Claude was given the FAT-P guidelines and asked to build an Entity Component System (ECS) using FAT-P components. No 4-AI pipeline, no parallel review, one session.

Claude read the guidelines, correctly identified what transferred to a consumer project and what didn't, wrote its own adapted development guidelines document for the new project, then produced 19 headers with full EnTT API parity, 539 tests across 18 suites, and benchmarks competitive with EnTT at 1M entities. The code was stylistically consistent across every file.

The key finding: encode judgment into guidelines with an AI, and that AI becomes autonomous within the space that judgment defines. It takes ownership, maintains standards, and extends correctly to new contexts without being told how. The human provides ideas and judgment; the AI provides capacity to hold that judgment consistently at scale without drift.

📖 Read the full source: r/LocalLLaMA