Claude Code for Game AI: Autoevolve Self-Play Evolution

Competition Results and Approach

A developer used Claude Code as their entire development team for the Game AI Cup, a competitive programming contest where participants write bots for a 2D physics-based game. The Claude-generated bot placed 6th out of 83 participants across three rounds.

The approach was inspired by Karpathy's autoresearch concept, where an LLM agent iterates on code overnight. The developer built a small framework called autoevolve that adapts this idea to self-play domains: instead of optimizing a single metric, it has versions of the bot compete against each other head-to-head.

The Evolution Loop

The workflow followed this loop:

1. Claude Code reads the current bot.
2. It analyzes why the bot lost specific matches.
3. It proposes a targeted change.
4. The new version is benchmarked against previous versions.
5. The new version is kept or discarded.
6. The process repeats.

The developer ran approximately 130 iterations over several weeks across three competition rounds.
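
The loop above can be sketched roughly as follows. This is a minimal illustration, not autoevolve's actual code: the names Version, play_match, win_rate, and evolve are hypothetical, the propose callback stands in for the step where Claude Code reads the bot and proposes a change, and the match simulator is a toy model in which a bot is reduced to a single strength number. The keep threshold of 0.55 and the 50 games per opponent are assumptions chosen only for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Version:
    name: str
    strength: float  # hypothetical stand-in for the real bot code


def play_match(a: Version, b: Version, rng: random.Random) -> bool:
    """Toy match model (an assumption): the stronger bot wins more often."""
    p_a_wins = a.strength / (a.strength + b.strength)
    return rng.random() < p_a_wins


def win_rate(candidate: Version, opponents: List[Version],
             games_per_opponent: int, rng: random.Random) -> float:
    """Head-to-head win rate of the candidate against every kept version."""
    wins = 0
    total = 0
    for opponent in opponents:
        for _ in range(games_per_opponent):
            wins += int(play_match(candidate, opponent, rng))
            total += 1
    return wins / total


def evolve(initial: Version,
           propose: Callable[[Version, int], Version],
           iterations: int,
           games_per_opponent: int = 50,
           keep_threshold: float = 0.55,
           seed: int = 0) -> List[Version]:
    """Propose a change, benchmark it against kept versions, keep or discard."""
    rng = random.Random(seed)
    kept = [initial]
    for i in range(iterations):
        # In the real workflow, this is where the LLM agent edits the bot.
        candidate = propose(kept[-1], i)
        rate = win_rate(candidate, kept, games_per_opponent, rng)
        if rate >= keep_threshold:
            kept.append(candidate)  # keep the new version
        # otherwise the candidate is discarded and the loop repeats
    return kept


if __name__ == "__main__":
    # Hypothetical proposer: each candidate is slightly stronger or weaker.
    noise = random.Random(1)
    history = evolve(
        Version("v0", 1.0),
        lambda parent, i: Version(f"v{i + 1}", parent.strength * noise.uniform(0.7, 1.5)),
        iterations=20,
    )
    print([v.name for v in history])
```

Benchmarking against all kept versions rather than only the latest one is a design choice in this sketch; it guards against a candidate that beats its parent but loses to older versions.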

Key Findings from the Experiment

Structural changes outperformed parameter tweaks: Every breakthrough involved adding new capabilities like model predictive control, a goalkeeper role, or energy-aware planning. Dozens of threshold and weight adjustments produced flat or negative results. Progress was faster when the developer guided Claude toward "add a new behavior" instead of "tune this number."

Emergent behaviors were readable in code: After Claude corrected an energy cost function, the optimizer started using wall bounces to reverse direction, because bouncing off a wall gives a free direction change without spending energy. This behavior was never explicitly programmed but is fully readable in the code, unlike a neural network approach, which would create a black box.

Bug fixes compound in isolation: Mixing bug fixes with strategy changes introduced noise. One version containing only two correctness fixes beat all top contenders, but another version that bundled the same fixes with a strategy change showed no improvement.

The changelog was essential: Each version included Claude's proposal, expected outcome, actual result, and lessons learned. This allowed the developer to tell Claude "this approach failed three times, stop trying it" and avoid repeating failed experiments.
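
A changelog like this can be modeled as a small record type. The sketch below is an assumption about one way to structure it, not the developer's actual format: the four text fields come from the description above, while the version name, the short approach tag, the kept flag, and the helper approaches_to_avoid are hypothetical additions that make repeated failures easy to count.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ChangelogEntry:
    version: str           # hypothetical version label, e.g. "v42"
    approach: str          # hypothetical short tag, e.g. "tune-threshold"
    proposal: str          # what Claude proposed
    expected_outcome: str  # what the change was expected to achieve
    actual_result: str     # what the benchmark showed
    lessons_learned: str   # takeaway for future iterations
    kept: bool             # hypothetical flag: was the version kept?


def approaches_to_avoid(entries: List[ChangelogEntry],
                        max_failures: int = 3) -> List[str]:
    """Return approach tags that were discarded at least max_failures times."""
    failures = Counter(e.approach for e in entries if not e.kept)
    return sorted(tag for tag, count in failures.items() if count >= max_failures)


def avoid_note(entries: List[ChangelogEntry], max_failures: int = 3) -> str:
    """Build a short instruction that could be included in the next prompt."""
    tags = approaches_to_avoid(entries, max_failures)
    if not tags:
        return ""
    return ("These approaches failed at least "
            f"{max_failures} times, stop trying them: " + ", ".join(tags))
```

Feeding the output of avoid_note back into the next prompt is one way to automate the "this approach failed three times, stop trying it" instruction.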

Broader Applications

The developer found the awesome-autoresearch list, which shows similar "LLM iterates on code overnight" patterns applied elsewhere: Shopify's CEO achieved 53% faster template rendering with 93 automated commits, someone raised CUDA kernel throughput from 18 to 187 TFLOPS, and the Vesuvius Challenge used the pattern for deciphering ancient scrolls.