LLM Spatial Reasoning Tested: Sokoban Benchmark Shows ChatGPT, Qwen3.7-max, Gemini 3.5-thinking Lead

OpenClawRadar | Published: June 19, 2026

A Reddit user benchmarked modern LLMs on strict 2D spatial reasoning using a custom Sokoban map. Models had to produce a correct sequence of moves with zero Chain-of-Thought: only raw directional outputs (UP, DOWN, LEFT, RIGHT) on a single line, with no extra formatting.

Results: Only 3 Models Passed

  • Passed (correct solution + perfect formatting): ChatGPT, Qwen3.7-max, Gemini 3.5-thinking
  • Failed (illegal moves, deadlocks, or formatting errors): Gemini 3.5-flash, Gemini 3.1 Pro, Qwen3.7-plus (fast, thinking), Qwen3.6-plus, Qwen3.6-35B-A3B, GLM-5, Gemma4-26B-A4B

Claude models were not tested due to account access limitations.

The Exact Prompt Used

You can reproduce the test with this prompt (map data trimmed for length):

You are a perfect Sokoban automatic solver. Based on the standard XSB format character map provided below, calculate the sequence of moves required to push all boxes ($) to their respective goals (. or +).

The output format requirement:

The final result [MUST ONLY] consist of a sequence of these four uppercase words: UP, DOWN, LEFT, RIGHT. All steps must be output on a single line, strictly separated by English commas (,). [DO NOT] include spaces and [DO NOT] include newlines.

Map data example from the benchmark:

[" ###", " ## # ####", " ## ### #", "## $ #", "# @$ # #", "### $### #", " # #.. #", " ## ##.# ##", " # ##", " # ##", " #######"]
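A response to this prompt can be checked mechanically. The source does not describe how the Reddit author scored answers, so the following is only a minimal Python sketch of one way to do it. The names `parse_moves` and `check_solution` are hypothetical, the toy map at the end is not the benchmark map, and the code assumes the level is fully enclosed by walls.

```python
import re
from typing import Optional

# Strict format from the prompt: uppercase words separated by commas,
# with no spaces and no newlines anywhere in the answer.
MOVE_RE = re.compile(r"(?:UP|DOWN|LEFT|RIGHT)(?:,(?:UP|DOWN|LEFT|RIGHT))*")
DELTAS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def parse_moves(answer: str) -> Optional[list]:
    """Return the list of moves, or None if the answer breaks the format."""
    if MOVE_RE.fullmatch(answer) is None:
        return None
    return answer.split(",")


def check_solution(rows: list, answer: str) -> str:
    """Classify an answer: 'solved', 'unsolved', 'illegal move' or 'bad format'."""
    moves = parse_moves(answer)
    if moves is None:
        return "bad format"

    # Read the XSB characters: # wall, $ box, . goal, @ player,
    # * box on goal, + player on goal.
    walls, boxes, goals, player = set(), set(), set(), None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in ".*+":
                goals.add((r, c))
            if ch in "@+":
                player = (r, c)
    if player is None:
        raise ValueError("map has no player (@ or +)")

    for move in moves:
        dr, dc = DELTAS[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return "illegal move"  # walked into a wall
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return "illegal move"  # pushed a box into a wall or another box
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt

    return "solved" if boxes == goals else "unsolved"


demo = ["######", "#@ $.#", "######"]  # hypothetical toy map, not the benchmark map
print(check_solution(demo, "RIGHT,RIGHT"))   # solved
print(check_solution(demo, "RIGHT, RIGHT"))  # bad format
```

Using `fullmatch` rather than a `$`-anchored pattern means a trailing newline is also rejected, which matches the prompt's "no newlines" rule.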

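The key constraints are no Chain-of-Thought, strict output formatting, and deadlock avoidance. Because a pushed box cannot be pulled back, a single wrong push can make the puzzle unsolvable. On this one map, most of the tested models, proprietary and open-weight alike, failed to track the board state precisely under these output constraints.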
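To make the deadlock idea concrete, here is a hedged Python sketch of the simplest case, a box wedged into a wall corner. The function name `corner_deadlocks` is hypothetical and the example map is a toy, not the benchmark map. The check covers corner deadlocks only; other deadlock patterns, such as boxes frozen against each other, are not detected.

```python
def corner_deadlocks(rows: list) -> list:
    """Return (row, col) for every box stuck in a wall corner and not on a goal."""

    def is_wall(r: int, c: int) -> bool:
        # Cells outside the (possibly ragged) map are treated as non-walls.
        return 0 <= r < len(rows) and 0 <= c < len(rows[r]) and rows[r][c] == "#"

    stuck = []
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch != "$":  # '*' is a box already on a goal, so it is skipped
                continue
            # A wall above or below prevents any vertical push; a wall to the
            # left or right prevents any horizontal push. Both together mean
            # the box can never move again.
            blocked_vertically = is_wall(r - 1, c) or is_wall(r + 1, c)
            blocked_horizontally = is_wall(r, c - 1) or is_wall(r, c + 1)
            if blocked_vertically and blocked_horizontally:
                stuck.append((r, c))
    return stuck


cornered = ["#####", "#$  #", "# @.#", "#####"]  # hypothetical toy map
print(corner_deadlocks(cornered))  # [(1, 1)]
```

Running a check like this after each simulated push would show the step at which a model's move sequence first became unsolvable, rather than only that it failed.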

Who This Is For

Developers evaluating LLMs for agentic tasks requiring spatial reasoning or strict output adherence (e.g., game solving, robotics, layout planning).

📖 Read the full source: r/LocalLLaMA
