Building a Slay the Spire 2 Agent with Local LLMs: Lessons and Open Problems

A developer has built an agent that plays Slay the Spire 2 using local LLMs via KoboldCPP/Ollama. The game is exposed as a REST API through a community mod, and the agent sits in the middle: reads game state → calls LLM with tools → executes the action → repeat.
Setup and Performance
Setup uses Qwen3.5-27B (Q4_K_M) on RTX 4090 via KoboldCPP. Performance metrics: ~10 seconds per action, ~88% action success rate. Best result: beating the Act 1 boss. The project is available on GitHub at https://github.com/Alex5418/STS2-Agent.
What Works
- State-based tool routing — Instead of exposing 20+ tools at once, only 1-3 tools relevant to the current game state are provided. Combat gets
play_card,end_turn,use_potion. Map screen getschoose_map_node. This dramatically reduced hallucinated tool calls. - Single-tool mode — Small models can't predict how game state changes after an action (e.g., card indices shift after playing a card). So only the first tool call per response is executed, then game state is re-fetched and the model is asked again. Slower but much more reliable.
- Text-based tool call parser (fallback) — KoboldCPP often outputs tool calls as text instead of structured JSON. A multi-pattern regex fallback catches formats like:
json [{"name": "play_card", "arguments": {...}}],Made a function call ... to play_card with arguments = {...},play_card({"card_index": 1, "target": "NIBBIT_0"}), and bare mentions of no-arg tools likeend_turn. This recovers maybe 15-20% of actions that would otherwise be lost. - Energy guard — Client-side tracking of remaining energy. If the model tries to play a card it can't afford, the API call is blocked and the turn is auto-ended. This prevents the most common error loop (model retries the same unaffordable card 3+ times).
- Smart-wait for enemy turns — During the enemy's turn, the game state says "Play Phase: False." Instead of wasting an LLM call on this, the agent polls every 1s until it's the player's turn again.
Open Problems
- Model doesn't follow system prompt rules consistently — System prompt says things like "if enemy intent is Attack, play Defend cards FIRST." The model follows this maybe 30% of the time. The other 70% it just plays attacks regardless. Attempted solutions: stronger wording ("You MUST block first"), few-shot examples in the prompt, injecting computed hints ("WARNING: 15 incoming damage"). None are reliable. Question: Is there a better prompting strategy for getting small models to follow conditional rules? Or is this a fundamental limitation at 27B?
- Tool calling reliability with KoboldCPP — Even with the text fallback parser, about 12% of responses produce no usable tool call. The model sometimes outputs empty
<think></think>blocks followed by malformed JSON. The Ollama OpenAI compatibility layer also occasionally returnsargumentsas a string instead of a dict. Question: Has anyone found a model that's particularly reliable at tool calling at the 14-30B range? The developer has tried Phi-4 (14B) briefly but hasn't done a proper comparison. Considering Mistral-Small or Command-R. - Context window management — Each game state is ~800-1500 tokens as markdown. With system prompt (~500 tokens) and conversation history, context fills up fast. Currently keeps only the last 5 exchanges and resets history on state transitions (combat → map, etc.). But the model has no memory across fights — it can't learn from mistakes. Question: Would a rolling summary approach work? Like condensing the last combat into "You fought Jaw Worm. Took 15 damage because you didn't block turn 2. Won in 4 turns."
- Better structured output from local models — The core problem is needing the model to output a JSON tool call, but what it really wants to do is think in natural language first. Qwen3.5 uses
<think>blocks which are stripped out, but sometimes the thinking and the tool call get tangled together. Question: Would a two-stage approach work better? Stage 1: "Analyze the game state and decide what to do" (free text). Stage 2: "Now output exactly one tool call" (constrained). This doubles latency but might improve reliability. Has anyone tried this pattern? - A/B testing across models — The developer has a JSONL logging system that records actions for comparison.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Opus Handles Frontend Cleanup by Delegating to Subagents from a Playbook
A user tuned one page, documented the fixes in an ADR playbook, then had Opus split the remaining 9 pages among 3 subagents, touching 41 files with near-perfect Lighthouse results.

OpenClaw AI agent helps team salvage demo day with rapid prototype
A development team used OpenClaw's AI agent to build a working demo website with mock data in 10 minutes after their product pivot threatened their demo day participation at South Park Commons.

Developer Uses Claude AI for C++ Game Development in Unreal Engine
A developer reports using Claude Opus for planning and Sonnet for implementation to build a cyberpunk city-builder game in C++ with Unreal Engine, replacing marketplace assets with AI-generated code for features like AI traffic control with distance-based ticking and frustum culling.

Building a 13-Agent Claude Team with Peer Review Workflow
A developer built a 13-agent Claude system where AI agents review each other's work, run on scheduled heartbeats, and track everything in a database for marketing automation.