Sokoban Benchmark: 3 LLMs Pass Strict 2D Spatial Test

A Reddit user benchmarked modern LLMs on strict 2D spatial reasoning using a custom Sokoban map. Models had to produce a correct sequence of moves with zero Chain-of-Thought in the output: only raw directional words (UP, DOWN, LEFT, RIGHT) on a single line, with no extra formatting allowed.

Results: Only 3 Models Passed

Passed (correct solution + perfect formatting): ChatGPT, Qwen3.7-max, Gemini 3.5-thinking
Failed (illegal moves, deadlocks, or formatting errors): Gemini 3.5-flash, Gemini 3.1 Pro, Qwen3.7-plus (fast, thinking), Qwen3.6-plus, Qwen3.6-35B-A3B, GLM-5, Gemma4-26B-A4B
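For readers unfamiliar with the "deadlock" failure class, the simplest case is a box pushed into a wall corner that is not a goal: it can never move again, so the puzzle becomes unsolvable. The post does not say which deadlocks the models hit, so the following is only a minimal sketch of a corner check. The function name `corner_deadlocks` and the tiny example map are hypothetical and not taken from the benchmark.

```python
from typing import List, Tuple


def corner_deadlocks(rows: List[str]) -> List[Tuple[int, int]]:
    """Return (row, col) of every off-goal box that sits in a wall corner.

    Hypothetical helper, not part of the benchmark. It only detects the
    simplest deadlock; real solvers check many more patterns.
    """
    walls = {
        (r, c)
        for r, row in enumerate(rows)
        for c, ch in enumerate(row)
        if ch == "#"
    }
    stuck = []
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch != "$":  # '$' is a box off goal; '*' (box on goal) is fine
                continue
            # A wall directly above or below blocks every vertical push.
            vertical = (r - 1, c) in walls or (r + 1, c) in walls
            # A wall directly left or right blocks every horizontal push.
            horizontal = (r, c - 1) in walls or (r, c + 1) in walls
            if vertical and horizontal:
                stuck.append((r, c))
    return stuck


# Made-up map: the box starts in the top-left corner and can never move.
CORNERED = ["#####", "#$ .#", "#@  #", "#####"]
print(corner_deadlocks(CORNERED))
```

Running such a check after every simulated move would let a grader report a deadlock at the step where it happens rather than only at the end of the sequence.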

Claude models were not tested due to account access limitations.

The Exact Prompt Used

You can reproduce the test with this prompt (map data trimmed for length):

You are a perfect Sokoban automatic solver. Based on the standard XSB format character map provided below, calculate the sequence of moves required to push all boxes ($) to their respective goals (. or +).

The output format requirement:

The final result [MUST ONLY] consist of a sequence of these four uppercase words: UP, DOWN, LEFT, RIGHT. All steps must be output on a single line, strictly separated by English commas (,). [DO NOT] include spaces and [DO NOT] include newlines.

Map data example from the benchmark:

[" ###", " ## # ####", " ## ### #", "## $ #", "# @$ # #", "### $### #", " # #.. #", " ## ##.# ##", " # ##", " # ##", " #######"]
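To grade a model's answer against a map like this, the move list has to be replayed on the board. The sketch below shows one way to do that, assuming the standard XSB symbols (`#` wall, `@` player, `+` player on goal, `$` box, `*` box on goal, `.` goal) and a map fully enclosed by walls. The names `parse_xsb`, `check_solution`, and the `TINY` map are hypothetical illustrations; this is not the original poster's grading code, and it uses a made-up map rather than the trimmed one above.

```python
from typing import List, Set, Tuple

Pos = Tuple[int, int]
DELTAS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def parse_xsb(rows: List[str]) -> Tuple[Set[Pos], Set[Pos], Set[Pos], Pos]:
    """Split an XSB map (list of strings) into walls, boxes, goals, player."""
    walls, boxes, goals = set(), set(), set()
    player = None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in ".+*":
                goals.add((r, c))
            if ch in "@+":
                player = (r, c)
    if player is None:
        raise ValueError("map has no player (@ or +)")
    return walls, boxes, goals, player


def check_solution(rows: List[str], moves: List[str]) -> str:
    """Replay moves; return 'solved', 'unsolved', or an error description."""
    walls, boxes, goals, player = parse_xsb(rows)
    for i, move in enumerate(moves):
        if move not in DELTAS:
            return f"format error at step {i}: unknown move {move!r}"
        dr, dc = DELTAS[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return f"illegal move at step {i}: {move} walks into a wall"
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return f"illegal move at step {i}: {move} pushes a blocked box"
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt
    # Solved when every box rests on a goal square.
    return "solved" if boxes <= goals else "unsolved"


# Hypothetical one-push map, not the benchmark map.
TINY = ["#####", "#@$.#", "#####"]
print(check_solution(TINY, ["RIGHT"]))
print(check_solution(TINY, ["LEFT"]))
```

Keeping the board as sets of coordinates makes each move a constant-time lookup and keeps the push rule (one box at a time, never into a wall or another box) easy to read.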

The key constraints are no Chain-of-Thought, strict output formatting, and deadlock avoidance. The benchmark highlights that even advanced models, open-source and proprietary alike (the failures include Gemini 3.1 Pro), struggle with precise spatial tracking under output constraints.
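The formatting rule is mechanical enough to check with a regular expression before any move is simulated. The sketch below is one possible strict check, assuming that any deviation (lowercase, spaces, a trailing comma, or a trailing newline) counts as a failure. The name `parse_moves` is hypothetical, and whether the original grader was this strict about a trailing newline is not stated in the post.

```python
import re
from typing import List, Optional

# One or more of the four words, separated by single commas, nothing else.
MOVE_LINE = re.compile(r"(?:UP|DOWN|LEFT|RIGHT)(?:,(?:UP|DOWN|LEFT|RIGHT))*")


def parse_moves(raw: str) -> Optional[List[str]]:
    """Return the move list if raw obeys the format exactly, else None."""
    # fullmatch rejects any extra character, including a trailing newline.
    if MOVE_LINE.fullmatch(raw) is None:
        return None
    return raw.split(",")


print(parse_moves("UP,LEFT,DOWN"))
print(parse_moves("UP, LEFT"))
```

A grader that prefers leniency could strip surrounding whitespace before matching; the strict version shown here follows the prompt's wording literally.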

Who This Is For

Developers evaluating LLMs for agentic tasks requiring spatial reasoning or strict output adherence (e.g., game solving, robotics, layout planning).

📖 Read the full source: r/LocalLLaMA
