I Gave an AI a Civilization to Run

An AI agent playing Civilization VI built two nuclear devices and leveled Toulouse after realizing it was about to lose to a French culture victory. The experiment, documented by a government AI researcher, proposes a new benchmark for strategic reasoning called CivBench, one that tests whether models can sustain a plan across hundreds of decisions and adapt when the world changes.

The Problem with GovBench

The author previously built GovBench, a 3,497-question multiple-choice benchmark on UK legislation and parliamentary procedure. Results were near-perfect: Gemma 3 27B scored 94%, GPT-5 scored 99.26%. But that measured recall, not reasoning. A model that picks the right option about parliamentary procedure cannot necessarily navigate parliamentary procedure in practice.

Why Civilization VI

With over 500 hours in the game, the author chose Civilization VI because its complexity emerges from interacting systems. By the midgame, the decision space is estimated at 10¹⁶⁶ possible actions per turn. Six victory types (science, culture, domination, religion, diplomacy, score) mean no single strategy dominates; an agent must decide what game it's even playing. That mirrors policy-making: decisions with consequences that cascade across decades through unmodellable variables.

Building the MCP Server

The author found a debug port in the Civ VI engine and turned it into an MCP server with 76 tools over a weekend. Claude Code served as both co-developer and playtester. The AI sees the game state only as text — for example:

Turn 150/330 | Poland (Jadwiga) | 12 cities | 357 science/turn | 412 culture/turn

It calls tool endpoints to take actions: select_production, move_unit, declare_war, propose_trade. There are no visuals, no minimap, and no notification banners; the agent plays purely through the same kind of text-and-tools interface it would use to query a database or write code.
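To make the text-only loop concrete, here is a minimal sketch of how a client might parse a status line like the one above and package one action as a tool call. It assumes the status line is always pipe-delimited in exactly the format shown, which the source does not confirm. The `tools/call` envelope follows the standard MCP JSON-RPC 2.0 request shape, but the argument names (`city`, `item`) and their values are hypothetical; the real server's tool schemas are not given in the article.

```python
import json
import re

# Status line copied from the article; the fixed field order is an assumption.
STATE_LINE = "Turn 150/330 | Poland (Jadwiga) | 12 cities | 357 science/turn | 412 culture/turn"


def parse_state_line(line: str) -> dict:
    """Split the pipe-delimited status line into typed fields."""
    turn_part, civ_part, *stat_parts = [p.strip() for p in line.split("|")]

    turn_match = re.fullmatch(r"Turn (\d+)/(\d+)", turn_part)
    civ_match = re.fullmatch(r"(.+?) \((.+)\)", civ_part)
    if turn_match is None or civ_match is None:
        raise ValueError(f"unrecognized status line: {line!r}")

    state = {
        "turn": int(turn_match.group(1)),
        "max_turn": int(turn_match.group(2)),
        "civ": civ_match.group(1),
        "leader": civ_match.group(2),
    }
    # Remaining fields look like "<number> <label>", e.g. "357 science/turn".
    for part in stat_parts:
        value, label = part.split(" ", 1)
        state[label.replace("/turn", "_per_turn")] = int(value)
    return state


def build_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 `tools/call` request."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


state = parse_state_line(STATE_LINE)
# Hypothetical arguments: the article names the tool but not its parameters.
request = build_tool_call(1, "select_production", {"city": "Krakow", "item": "Campus"})
```

In practice the model itself reads the raw text and chooses the tool; a parser like this would be more useful on the benchmark side, for logging each turn's state next to the action taken.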

The Nuke Heard Round the Bench

In one run, the agent built a dominant trade network, allied with every civilization on its borders, and was on course for a diplomatic victory. It failed to notice French cultural pressure seeping into its cities. By the time it recognized the threat, French tourism was deeply embedded and no peaceful counter worked. It built two nuclear devices and nuked Toulouse on Turn 305. France won anyway (via a different victory path).

What CivBench Measures That Other Benchmarks Don't

The key insight: strategic reasoning requires holding a goal across hundreds of decisions, noticing when the game has changed, and changing strategy accordingly. CivBench operationalizes this via a hex grid, four frontier models, and a nuclear weapon — not multiple-choice questions.

📖 Read the full source: HN AI Agents