Agent Plays Slay the Spire 2 with Qwen3.5-27B: 88% Success

A developer has built an agent that plays Slay the Spire 2 using local LLMs via KoboldCPP/Ollama. The game is exposed as a REST API through a community mod, and the agent sits in the middle: reads game state → calls LLM with tools → executes the action → repeat.

Setup and Performance

Setup uses Qwen3.5-27B (Q4_K_M) on RTX 4090 via KoboldCPP. Performance metrics: ~10 seconds per action, ~88% action success rate. Best result: beating the Act 1 boss. The project is available on GitHub at https://github.com/Alex5418/STS2-Agent.

What Works

State-based tool routing — Instead of exposing 20+ tools at once, only 1-3 tools relevant to the current game state are provided. Combat gets play_card, end_turn, use_potion. Map screen gets choose_map_node. This dramatically reduced hallucinated tool calls.
Single-tool mode — Small models can't predict how game state changes after an action (e.g., card indices shift after playing a card). So only the first tool call per response is executed, then game state is re-fetched and the model is asked again. Slower but much more reliable.
Text-based tool call parser (fallback) — KoboldCPP often outputs tool calls as text instead of structured JSON. A multi-pattern regex fallback catches formats like: json [{"name": "play_card", "arguments": {...}}], Made a function call ... to play_card with arguments = {...}, play_card({"card_index": 1, "target": "NIBBIT_0"}), and bare mentions of no-arg tools like end_turn. This recovers maybe 15-20% of actions that would otherwise be lost.
Energy guard — Client-side tracking of remaining energy. If the model tries to play a card it can't afford, the API call is blocked and the turn is auto-ended. This prevents the most common error loop (model retries the same unaffordable card 3+ times).
Smart-wait for enemy turns — During the enemy's turn, the game state says "Play Phase: False." Instead of wasting an LLM call on this, the agent polls every 1s until it's the player's turn again.

Open Problems

Model doesn't follow system prompt rules consistently — System prompt says things like "if enemy intent is Attack, play Defend cards FIRST." The model follows this maybe 30% of the time. The other 70% it just plays attacks regardless. Attempted solutions: stronger wording ("You MUST block first"), few-shot examples in the prompt, injecting computed hints ("WARNING: 15 incoming damage"). None are reliable. Question: Is there a better prompting strategy for getting small models to follow conditional rules? Or is this a fundamental limitation at 27B?
Tool calling reliability with KoboldCPP — Even with the text fallback parser, about 12% of responses produce no usable tool call. The model sometimes outputs empty <think></think> blocks followed by malformed JSON. The Ollama OpenAI compatibility layer also occasionally returns arguments as a string instead of a dict. Question: Has anyone found a model that's particularly reliable at tool calling at the 14-30B range? The developer has tried Phi-4 (14B) briefly but hasn't done a proper comparison. Considering Mistral-Small or Command-R.
Context window management — Each game state is ~800-1500 tokens as markdown. With system prompt (~500 tokens) and conversation history, context fills up fast. Currently keeps only the last 5 exchanges and resets history on state transitions (combat → map, etc.). But the model has no memory across fights — it can't learn from mistakes. Question: Would a rolling summary approach work? Like condensing the last combat into "You fought Jaw Worm. Took 15 damage because you didn't block turn 2. Won in 4 turns."
Better structured output from local models — The core problem is needing the model to output a JSON tool call, but what it really wants to do is think in natural language first. Qwen3.5 uses <think> blocks which are stripped out, but sometimes the thinking and the tool call get tangled together. Question: Would a two-stage approach work better? Stage 1: "Analyze the game state and decide what to do" (free text). Stage 2: "Now output exactly one tool call" (constrained). This doubles latency but might improve reliability. Has anyone tried this pattern?
A/B testing across models — The developer has a JSONL logging system that records actions for comparison.

📖 Read the full source: r/LocalLLaMA

Building a Slay the Spire 2 Agent with Local LLMs: Lessons and Open Problems

Setup and Performance

What Works

Open Problems

👀 See Also

Developer uses Claude Code to iterate spending chart from wireframe to production quality in one night

Steam Game Development with Claude Code: Technical Review Process and Code Restructuring

Daily Claude and ChatGPT Usage Split from a Developer's Experience

Non-Coder Builds AI Prompt Diagnostic Framework with Claude Over Many Sessions