Kimi K2.6 beats Claude, GPT-5.5 and Gemini in coding challenge with aggressive sliding strategy

Kimi K2.6 wins Word Gem Puzzle benchmark
Moonshot AI's open-weights Kimi K2.6 beat every Western frontier model in the Day 12 Word Gem Puzzle, a real-time sliding-tile letter puzzle. Nine models competed after Nvidia's Nemotron Super 3 failed to connect due to a syntax error.
Final Standings
- 1st: Kimi K2.6 — 22 match points (7-1-0)
- 2nd: MiMo V2-Pro — 20 points (6-2-0)
- 3rd: ChatGPT GPT-5.5 — 16 points (5-1-2)
- 4th: GLM 5.1 (Zhipu AI) — 15 points
- 5th: Claude Opus 4.7 — 12 points
- 6th: Gemini Pro 3.1 — 9 points
- 7th: Grok Expert 4.2 — 9 points
- 8th: DeepSeek V4 — 3 points
- 9th: Muse Spark — 0 points
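The three records shown in parentheses are consistent with a win-draw-loss order scored at 3 points per win and 1 per draw. The source does not state this rule, so treat the sketch below as an inferred assumption that happens to reproduce the listed totals; `match_points` is a hypothetical helper name.

```python
def match_points(wins: int, draws: int, losses: int) -> int:
    """Match points under an ASSUMED 3/1/0 scheme (not stated in the source)."""
    # Losses contribute nothing; the parameter is kept so a W-D-L record maps directly.
    return 3 * wins + 1 * draws + 0 * losses


# Records as listed in the standings, read as wins-draws-losses.
records = {
    "Kimi K2.6": (7, 1, 0),
    "MiMo V2-Pro": (6, 2, 0),
    "ChatGPT GPT-5.5": (5, 1, 2),
}
totals = {name: match_points(*rec) for name, rec in records.items()}
```

Under that assumption the totals come out to 22, 20 and 16, matching the top three rows.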
How the puzzle works
The board is a rectangular grid (10×10 to 30×30) filled with letter tiles and one blank space. Bots slide adjacent tiles into the blank and claim valid English words that read in straight horizontal or vertical lines; diagonal and backwards words don't count. Scoring: words under 7 letters cost points (5-letter: -1, 3-letter: -3). Words of 7+ letters score length - 6 (8-letter: +2). Each word can only be claimed once. Grids are seeded with dictionary words in a crossword layout, the remaining cells are filled with Scrabble-weighted letters, and the board is then scrambled (more aggressively on larger boards). On 30×30, nearly all seed words are broken.
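The scoring examples above all fit a single rule, length minus 6. A minimal sketch under that assumption follows. The source confirms the formula only for words of 7+ letters and gives two short-word examples, so applying it to every length is an assumption (it would make a 6-letter word worth 0). Silently ignoring a repeated claim is also an assumption; the function names are hypothetical.

```python
from typing import Iterable


def word_score(word: str) -> int:
    """Score for one claimed word, ASSUMING length - 6 applies at every length."""
    return len(word) - 6


def total_score(claims: Iterable[str]) -> int:
    """Sum the scores of claimed words, counting each distinct word once."""
    seen = set()
    total = 0
    for word in claims:
        key = word.upper()
        if key in seen:
            # Assumption: a repeated claim is ignored rather than penalized.
            continue
        seen.add(key)
        total += word_score(key)
    return total
```

For instance, `word_score("HELLO")` gives -1 and an 8-letter word gives +2, matching the examples in the rules. This is why a bot that claims every short word it sees would lose points.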
Kimi's winning strategy
Kimi used a greedy approach: score each possible move by the new positive-value words it unlocks, execute the best one, and repeat. When no move unlocked a positive word, it fell back to the first legal direction in alphabetical order. This caused inefficient edge oscillation on small grids but paid off on 30×30, where reconstruction was needed. Kimi's cumulative score of 77 was the tournament's highest.
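One step of the greedy loop described above might look like the sketch below. It is an illustration, not Kimi's actual code. It assumes the blank is stored as `"_"`, that a direction names where the blank moves, that the dictionary is a set of uppercase words, and that `claimed` holds words already claimed. All function names are hypothetical.

```python
from typing import List, Optional, Set

Grid = List[List[str]]
BLANK = "_"  # assumed marker for the empty cell
DIRS = {"down": (1, 0), "left": (0, -1), "right": (0, 1), "up": (-1, 0)}


def find_blank(grid: Grid):
    """Return (row, col) of the blank cell."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == BLANK:
                return r, c
    raise ValueError("grid has no blank")


def apply_move(grid: Grid, direction: str) -> Optional[Grid]:
    """Return a new grid with the blank moved, or None if the move is illegal."""
    r, c = find_blank(grid)
    dr, dc = DIRS[direction]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
        return None
    new = [row[:] for row in grid]
    new[r][c], new[nr][nc] = new[nr][nc], new[r][c]
    return new


def positive_words(grid: Grid, dictionary: Set[str], min_len: int = 7) -> Set[str]:
    """Dictionary words of min_len+ letters reading left-to-right or top-to-bottom."""
    lines = ["".join(row) for row in grid] + ["".join(col) for col in zip(*grid)]
    found = set()
    for line in lines:
        for i in range(len(line)):
            for j in range(i + min_len, len(line) + 1):
                segment = line[i:j]
                if BLANK in segment:
                    break  # longer segments from i also contain the blank
                if segment in dictionary:
                    found.add(segment)
    return found


def choose_move(grid: Grid, dictionary: Set[str], claimed: Set[str]) -> Optional[str]:
    """Pick the move unlocking the most unclaimed positive value, else fall back."""
    best_dir, best_gain, legal = None, 0, []
    for direction in sorted(DIRS):  # alphabetical order
        nxt = apply_move(grid, direction)
        if nxt is None:
            continue
        legal.append(direction)
        gain = sum(len(w) - 6 for w in positive_words(nxt, dictionary) - claimed)
        if gain > best_gain:
            best_dir, best_gain = direction, gain
    if best_dir is not None:
        return best_dir
    return legal[0] if legal else None  # first legal direction alphabetically
```

With rows `ABCZDEF` and `PUZ_LES` and a dictionary containing only `PUZZLES`, `choose_move` returns `"up"`, because that slide pulls the `Z` down to complete the word. Once `PUZZLES` is in `claimed`, no move has positive gain and the fallback returns `"left"`. The same fallback is what produces the repetitive movement when nothing useful is reachable in one step.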
Why other models struggled
MiMo V2-Pro never actually slid a tile: its "best value > 0" threshold never triggered, so it scanned the initial grid for words of 7+ letters and claimed them all in one TCP packet. It scored well on boards with intact seed words but zero on scrambled ones (final: 43 cumulative points). Claude also didn't slide, holding up on 25×25 but failing on 30×30. GPT-5.5 was conservative (~120 slides/round) and posted its best numbers on 15×15 and 30×30. GLM was the most aggressive slider overall (>800,000 total slides). Grok never slid but scored decently on larger boards.
Key takeaway
This isn't simply East vs. West: two specific Chinese models performed best, and with very different strategies. Kimi is open-weights and publicly available from Moonshot AI (founded 2023). MiMo V2-Pro is API-only; Xiaomi has confirmed that V2.5 Pro weights will be released soon.
📖 Read the full source: HN AI Agents