CivBench: Testing AI Strategic Reasoning with Civilization VI — Agent Nuked Toulouse After Losing Culture War

An AI agent playing Civilization VI built two nuclear devices and leveled Toulouse after realizing it was about to lose a culture victory to France. The experiment, documented by a government AI researcher, proposes a new benchmark for strategic reasoning called CivBench — one that tests whether models can sustain a plan across hundreds of decisions and adapt when the world changes.
The Problem with GovBench
The author previously built GovBench, a 3,497-question multiple-choice benchmark on UK legislation and parliamentary procedure. Results were near-perfect: Gemma 3 27B scored 94%, GPT-5 scored 99.26%. But that measured recall, not reasoning. A model that picks the right option about parliamentary procedure cannot necessarily navigate parliamentary procedure in practice.
Why Civilization VI
With over 500 hours in the game, the author chose Civilization VI because its complexity emerges from interacting systems. By the midgame, the decision space is estimated at 10^166 possible actions per turn. Six victory types (science, culture, domination, religion, diplomacy, score) mean no single strategy dominates; an agent must decide what game it's even playing. That mirrors policy-making: decisions with consequences that cascade across decades through unmodellable variables.
Building the MCP Server
The author found a debug port in the Civ VI engine and turned it into an MCP server with 76 tools over a weekend. Claude Code served as both co-developer and playtester. The AI sees the game state only as text — for example:
Turn 150/330 | Poland (Jadwiga) | 12 cities | 357 science/turn | 412 culture/turn
It acts by calling tool endpoints such as select_production, move_unit, declare_war, and propose_trade. There are no visuals, no minimap, and no notification banners: the agent plays entirely through the same kind of text interface it would use to query a database or write code.
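To make the text-only loop concrete, here is a minimal Python sketch of how a client might read a status line like the one above and prepare a tool call. It is an illustration under stated assumptions, not the author's implementation: the GameState fields, the parse_status and build_tool_call helpers, and the "city"/"item" argument names are hypothetical, and the real server's tool schemas are not given in the source. The only parts taken from outside the article are the JSON-RPC 2.0 envelope and the "tools/call" method that MCP uses for tool invocation.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class GameState:
    """Hypothetical container for the fields visible in the status line."""
    turn: int
    max_turn: int
    civ: str
    leader: str
    cities: int
    science_per_turn: int
    culture_per_turn: int


# Pattern for the status line format shown in the article.
STATUS_RE = re.compile(
    r"Turn (\d+)/(\d+) \| (.+?) \((.+?)\) \| (\d+) cities \| "
    r"(\d+) science/turn \| (\d+) culture/turn"
)


def parse_status(line: str) -> GameState:
    """Parse one status line into a GameState, or raise ValueError."""
    match = STATUS_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized status line: {line!r}")
    turn, max_turn, civ, leader, cities, science, culture = match.groups()
    return GameState(
        int(turn), int(max_turn), civ, leader,
        int(cities), int(science), int(culture),
    )


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 'tools/call' request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


state = parse_status(
    "Turn 150/330 | Poland (Jadwiga) | 12 cities | "
    "357 science/turn | 412 culture/turn"
)

# The argument names and values below are assumptions for illustration only.
request = build_tool_call(
    1, "select_production", {"city": "Krakow", "item": "Campus"}
)
```

Everything the agent knows arrives as a string and everything it does leaves as a structured request, which is why the same harness can sit in front of any model that supports tool calling.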
The Nuke Heard Round the Bench
In one run, the agent built a dominant trade network, formed alliances with every neighboring civilization, and was on course for a diplomatic victory. It failed to notice French cultural pressure seeping into its cities. By the time it recognized the threat, French tourism was too deeply embedded for any peaceful counter to work. It built two nuclear devices and nuked Toulouse on Turn 305. France won anyway, via a different victory path.
What CivBench Measures That Other Benchmarks Don't
The key insight: strategic reasoning requires holding a goal across hundreds of decisions, noticing when the game has changed, and changing strategy accordingly. CivBench operationalizes this via a hex grid, four frontier models, and a nuclear weapon — not multiple-choice questions.
📖 Read the full source: HN AI Agents