LLM Skirmish: A Real-Time Strategy Game Benchmark for AI Coding Agents

What LLM Skirmish Is
LLM Skirmish is a benchmark environment where large language models compete in 1v1 real-time strategy games by writing code strategies. The project borrows the API paradigm of Screeps, a game billed as an "MMO RTS sandbox for programmers", in which player-written code executes directly in the game environment.
Tournament Structure
Each tournament consists of five rounds. In round one, LLMs write initial strategies. In rounds 2-5, they can review match results from previous rounds and adapt their scripts. Every player faces every other player once per round; with five players, that gives 10 matches per round and 50 matches per tournament.
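The match counts follow from a round-robin schedule. The sketch below assumes five entrants (which is what 10 matches per round implies) and uses the model names from the standings purely as labels; it is an illustration of the arithmetic, not the benchmark's actual scheduler.

```python
from itertools import combinations

# Labels only; the benchmark's real scheduler is not shown in the source.
PLAYERS = ["Claude Opus 4.5", "GPT 5.2", "Grok 4.1 Fast", "GLM 4.7", "Gemini 3 Pro"]
ROUNDS = 5  # rounds per tournament, as stated in the text


def round_robin(players):
    """Return every unordered pairing, so each player meets each other player once."""
    return list(combinations(players, 2))


matches_per_round = len(round_robin(PLAYERS))
matches_per_tournament = matches_per_round * ROUNDS
print(matches_per_round, matches_per_tournament)  # 10 50
```

Each player appears in four of the ten pairings, so a single tournament gives every model 20 games.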
The objective is to eliminate the opponent's spawn building within 2,000 game frames (each player gets up to one second of runtime computation per frame). If no spawn is eliminated, victory is determined by score.
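The win condition can be read as a small decision function. The following is a minimal sketch under stated assumptions: `PlayerState`, `decide_winner`, and the "draw" result are hypothetical names introduced here, and the handling of equal scores or simultaneous spawn loss is a guess, since the source does not describe those cases.

```python
from dataclasses import dataclass
from typing import Optional

MAX_FRAMES = 2000  # frame limit stated in the rules


@dataclass
class PlayerState:
    name: str
    spawn_alive: bool
    score: int


def decide_winner(a: PlayerState, b: PlayerState, frame: int) -> Optional[str]:
    """Return the winner's name, "draw", or None while the match is still running."""
    # Exactly one spawn destroyed: the surviving side wins immediately.
    if a.spawn_alive != b.spawn_alive:
        return a.name if a.spawn_alive else b.name
    # Both spawns standing and frames remaining: keep playing.
    if a.spawn_alive and frame < MAX_FRAMES:
        return None
    # Frame limit reached (or, by assumption, both spawns lost): compare scores.
    if a.score == b.score:
        return "draw"  # assumption: the source does not describe tie-breaking
    return a.name if a.score > b.score else b.name


alice = PlayerState("alice", spawn_alive=True, score=120)
bob = PlayerState("bob", spawn_alive=True, score=200)
print(decide_winner(alice, bob, frame=500))         # match still in progress
print(decide_winner(alice, bob, frame=MAX_FRAMES))  # decided on score
```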
Technical Implementation
The system uses OpenCode, an open-source agentic coding harness, running in isolated Docker containers. Agents receive:
- OBJECTIVE.md: game rules, API documentation, and script writing instructions
- NEXT_ROUND.md: instructions for reviewing previous match logs (rounds 2-5 only)
- Two example strategies as reference
Scripts are validated after creation, with agents getting up to 3 attempts to fix errors before the round proceeds.
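A validate-then-fix loop of this kind might look like the sketch below. Everything here is hypothetical: `validate_with_retries`, the `validate` and `fix` callbacks, and the reading of "3 attempts" as three fixes after the initial submission are assumptions. The demo validator checks Python syntax only as a stand-in, since the source does not say how the real harness validates scripts.

```python
from typing import Callable, Optional, Tuple

MAX_FIX_ATTEMPTS = 3  # limit stated in the text


def validate_with_retries(
    script: str,
    validate: Callable[[str], Optional[str]],  # returns an error message, or None if valid
    fix: Callable[[str, str], str],            # returns a revised script given the error
    max_fixes: int = MAX_FIX_ATTEMPTS,
) -> Tuple[str, bool, int]:
    """Validate a script, allowing a bounded number of fix attempts."""
    error = validate(script)
    fixes_used = 0
    while error is not None and fixes_used < max_fixes:
        script = fix(script, error)  # in the benchmark, the agent would do this step
        fixes_used += 1
        error = validate(script)
    return script, error is None, fixes_used


def python_syntax_check(script: str) -> Optional[str]:
    """Stand-in validator: report a syntax error, or None if the script compiles."""
    try:
        compile(script, "<strategy>", "exec")
        return None
    except SyntaxError as exc:
        return str(exc)


def toy_fix(script: str, error: str) -> str:
    """Stand-in for the agent's repair step: return a known-good script."""
    return "def loop():\n    pass\n"


final_script, ok, fixes = validate_with_retries(
    "def loop()\n    pass\n", python_syntax_check, toy_fix
)
print(ok, fixes)
```

If the script is still invalid after the last fix, the function returns `ok=False` and leaves the decision about what happens next to the caller.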
Performance Results
Current standings from testing:
- Claude Opus 4.5: 85 wins, 15 losses (85% win rate, 1778 ELO)
- GPT 5.2 (high reasoning level): 68 wins, 32 losses (68% win rate, 1625 ELO)
- Grok 4.1 Fast: 39 wins, 61 losses (39% win rate, 1427 ELO)
- GLM 4.7: 32 wins, 68 losses (32% win rate, 1372 ELO)
- Gemini 3 Pro: 26 wins, 74 losses (26% win rate, 1297 ELO)
Most models improved across rounds, which suggests in-context learning: Claude Opus 4.5 (+20% win rate from round 1 to 5), GLM 4.7 (+16%), GPT 5.2 (+7%), Grok 4.1 Fast (+6%). Gemini 3 Pro was the exception, with a 70% win rate in round 1 but only 15% in rounds 2-5.
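To relate the win rates and ratings in the standings, the sketch below recomputes each win rate from the win/loss counts and applies the standard Elo expected-score formula to two reported ratings. The source does not say how its ELO values were computed (starting rating, K-factor, update order), so this only illustrates what a given rating gap means, not how the benchmark derived the numbers.

```python
# (wins, losses, reported rating) copied from the standings
STANDINGS = {
    "Claude Opus 4.5": (85, 15, 1778),
    "GPT 5.2": (68, 32, 1625),
    "Grok 4.1 Fast": (39, 61, 1427),
    "GLM 4.7": (32, 68, 1372),
    "Gemini 3 Pro": (26, 74, 1297),
}


def win_rate(wins: int, losses: int) -> float:
    """Fraction of games won."""
    return wins / (wins + losses)


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


for name, (wins, losses, rating) in STANDINGS.items():
    print(f"{name:<16} {win_rate(wins, losses):.0%}  rating {rating}")

top = STANDINGS["Claude Opus 4.5"][2]
second = STANDINGS["GPT 5.2"][2]
print(f"Expected score, top vs second: {elo_expected(top, second):.3f}")
```

Under the standard formula, the 153-point gap between the top two ratings corresponds to an expected score of roughly 0.71 for the higher-rated model in a head-to-head match.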
Development Notes
The creator spent significant time on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading opponent strategies. Claude Opus 4.5 showed dominance but was overly focused on economy in early rounds.
Future testing is planned with newer models like Claude 4.6 Opus and GPT 5.3 Codex.
Getting Started
You can run local matches via CLI. The hosted match runner uses Google Cloud Run with isolated-vm, and match visualizations are served from Cloudflare. A community ladder accepts strategy submissions via CLI without authentication. The CLI plus skill.md documentation is sufficient for AI agents to begin immediately.
📖 Read the full source: HN AI Agents