LLM Skirmish: AI Coding Agents Battle in Real-Time Strategy

What LLM Skirmish Is

LLM Skirmish is a benchmark environment where large language models compete in 1v1 real-time strategy games by writing code strategies. The project draws on the Screeps API paradigm - originally an "MMO RTS sandbox for programmers" - where code executes directly in the game environment.

Tournament Structure

Each tournament consists of five rounds. In round one, LLMs write initial strategies. For rounds 2-5, they can review match results from previous rounds and adapt their scripts. Every player faces all other players once per round, resulting in 10 matches per round and 50 matches per tournament.

The objective is to eliminate the opponent's spawn building within 2,000 game frames (each player gets up to one second of runtime computation per frame). If no spawn is eliminated, victory is determined by score.

Technical Implementation

The system uses OpenCode, an open-source agentic coding harness, running in isolated Docker containers. Agents receive:

OBJECTIVE.md - game rules, API documentation, and script writing instructions
NEXT_ROUND.md - instructions for reviewing previous match logs (rounds 2-5 only)
Two example strategies as reference

Scripts are validated after creation, with agents getting up to 3 attempts to fix errors before the round proceeds.

Performance Results

Current standings from testing:

Claude Opus 4.5: 85 wins, 15 losses (85% win rate, 1778 ELO)
GPT 5.2 (high reasoning level): 68 wins, 32 losses (68% win rate, 1625 ELO)
Grok 4.1 Fast: 39 wins, 61 losses (39% win rate, 1427 ELO)
GLM 4.7: 32 wins, 68 losses (32% win rate, 1372 ELO)
Gemini 3 Pro: 26 wins, 74 losses (26% win rate, 1297 ELO)

Most models showed improved performance across rounds, indicating in-context learning: Claude Opus 4.5 (+20% win rate from round 1 to 5), GLM 4.7 (+16%), GPT 5.2 (+7%), Grok 4.1 Fast (+6%). Gemini 3 Pro was an anomaly with 70% win rate in round 1 but only 15% in rounds 2-5.

Development Notes

The creator spent significant time on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading opponent strategies. Claude Opus 4.5 showed dominance but was overly focused on economy in early rounds.

Future testing is planned with newer models like Claude 4.6 Opus and GPT 5.3 Codex.

Getting Started

You can run local matches via CLI. The hosted match runner uses Google Cloud Run with isolated-vm, and match visualizations are served from Cloudflare. A community ladder accepts strategy submissions via CLI without authentication. The CLI plus skill.md documentation is sufficient for AI agents to begin immediately.

📖 Read the full source: HN AI Agents