LLM Spatial Reasoning Tested: Sokoban Benchmark Shows ChatGPT, Qwen3.7-max, Gemini 3.5-thinking Lead

A Reddit user benchmarked current LLMs on strict 2D spatial reasoning using a custom Sokoban map. Each model had to produce a correct move sequence with zero Chain-of-Thought: only the raw direction words (UP, DOWN, LEFT, RIGHT), comma-separated on a single line, with no extra formatting allowed.
Results: Only 3 Models Passed
- Passed (correct solution + perfect formatting): ChatGPT, Qwen3.7-max, Gemini 3.5-thinking
- Failed (illegal moves, deadlocks, or formatting errors): Gemini 3.5-flash, Gemini 3.1 Pro, Qwen3.7-plus (fast, thinking), Qwen3.6-plus, Qwen3.6-35B-A3B, GLM-5, Gemma4-26B-A4B
Claude models were not tested due to account access limitations.
The Exact Prompt Used
You can reproduce the test with this prompt (map data trimmed for length):
You are a perfect Sokoban automatic solver. Based on the standard XSB format character map provided below, calculate the sequence of moves required to push all boxes ($) to their respective goals (. or +).
The output format requirement:
The final result [MUST ONLY] consist of a sequence of these four uppercase words: UP, DOWN, LEFT, RIGHT. All steps must be output on a single line, strictly separated by English commas (,). [DO NOT] include spaces and [DO NOT] include newlines.
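The format rule above is strict enough to check mechanically. The following is a minimal sketch of such a check, not the grader used in the Reddit post; the names `MOVE_LINE` and `is_valid_move_line` are hypothetical, and it assumes any leading or trailing whitespace also counts as a formatting error.

```python
import re

# One or more direction words, separated by single commas, nothing else.
# Assumption: the whole response must match, so a stray space, a newline,
# or a trailing comma is treated as a formatting failure.
MOVE_LINE = re.compile(r"(?:UP|DOWN|LEFT|RIGHT)(?:,(?:UP|DOWN|LEFT|RIGHT))*")


def is_valid_move_line(text: str) -> bool:
    """Return True if text is exactly a comma-separated list of moves."""
    return MOVE_LINE.fullmatch(text) is not None
```

A response such as `UP,LEFT,LEFT` passes this check, while `UP, LEFT` (space after the comma) or a response wrapped in a code fence does not.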
Map data example from the benchmark:
[" ###", " ## # ####", " ## ### #", "## $ #", "# @$ # #", "### $### #", " # #.. #", " ## ##.# ##", " # ##", " # ##", " #######"]
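Before sending a map to a model, it can help to confirm that it is well-formed. The sketch below counts boxes, goals, and players using the standard XSB symbols (`$` box, `.` goal, `*` box on goal, `@` player, `+` player on goal). The function name `count_symbols` is hypothetical, and `ARTICLE_MAP` is the map exactly as printed above, so its spacing may not match the original level; the counts do not depend on spacing.

```python
def count_symbols(rows):
    """Return (boxes, goals, players) for a list of XSB map rows."""
    text = "".join(rows)
    boxes = text.count("$") + text.count("*")
    goals = text.count(".") + text.count("+") + text.count("*")
    players = text.count("@") + text.count("+")
    return boxes, goals, players


# The map as printed in this article (trimmed; whitespace may have been lost).
ARTICLE_MAP = [" ###", " ## # ####", " ## ### #", "## $ #", "# @$ # #",
               "### $### #", " # #.. #", " ## ##.# ##", " # ##", " # ##",
               " #######"]

boxes, goals, players = count_symbols(ARTICLE_MAP)
print(boxes, goals, players)
```

A solvable level needs exactly one player and as many goals as boxes, so a mismatch here usually points to a copy-paste problem rather than a model failure.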
The key constraints are no Chain-of-Thought, strict output formatting, and avoiding deadlocks (positions from which a box can no longer be pushed to any goal). The results suggest that many advanced models, including open-weight ones, struggle with precise spatial tracking under output constraints.
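To score a response, the move sequence has to be replayed against the map. The sketch below shows one way to do that under standard Sokoban rules; it is an illustration, not the benchmark author's code. The name `check_solution` and the tiny demo level are hypothetical, the level is assumed to be fully enclosed by walls, and walking into a wall is assumed to count as an illegal move. It does not detect deadlocks directly: a deadlocked run simply ends as "unsolved".

```python
# Row/column offsets for each direction word.
DIRS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def check_solution(rows, answer):
    """Replay a comma-separated move string on an XSB map and report the result."""
    walls, boxes, goals, player = set(), set(), set(), None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in ".+*":
                goals.add((r, c))
            if ch in "@+":
                player = (r, c)

    for i, word in enumerate(answer.split(",")):
        if word not in DIRS:
            return f"format error at step {i}"
        dr, dc = DIRS[word]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return f"illegal move at step {i}"
        if nxt in boxes:
            # A box can only be pushed into a free cell.
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return f"illegal move at step {i}"
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt

    return "solved" if boxes == goals else "unsolved"


# Hypothetical one-push level, not taken from the benchmark.
DEMO_LEVEL = ["#####",
              "#@$.#",
              "#####"]
print(check_solution(DEMO_LEVEL, "RIGHT"))
```

Returning a short status string keeps the three failure modes named in the results (illegal moves, unsolved or deadlocked boards, formatting errors) distinguishable when tallying models.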
Who This Is For
Developers evaluating LLMs for agentic tasks requiring spatial reasoning or strict output adherence (e.g., game solving, robotics, layout planning).
📖 Read the full source: r/LocalLLaMA