Kimi K2.6 vs GPT-5.5, Claude, Gemini: Coding Challenge Results with 22 Match Points

Kimi K2.6 wins Word Gem Puzzle benchmark

Moonshot AI's open-weights Kimi K2.6 beat every Western frontier model in the Day 12 Word Gem Puzzle, a real-time sliding-tile letter puzzle. Nine models competed after Nvidia's Nemotron Super 3 failed to connect due to a syntax error.

Final Standings

1st: Kimi K2.6 — 22 match points (7-1-0)
2nd: MiMo V2-Pro — 20 points (6-2-0)
3rd: ChatGPT GPT-5.5 — 16 points (5-1-2)
4th: GLM 5.1 (Zhipu AI) — 15 points
5th: Claude Opus 4.7 — 12 points
6th: Gemini Pro 3.1 — 9 points
7th: Grok Expert 4.2 — 9 points
8th: DeepSeek V4 — 3 points
9th: Muse Spark — 0 points
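
The three records shown above are consistent with a scheme of 3 points for a win, 1 for a draw and 0 for a loss. The source does not state the point values, so the constants in this minimal sketch are an assumption inferred from those three lines, and the function name is illustrative.

```python
# Assumed point values (not stated in the source): win = 3, draw = 1, loss = 0.
WIN_POINTS, DRAW_POINTS = 3, 1


def match_points(wins, draws, losses):
    """Match points for a W-D-L record under the assumed 3/1/0 scheme."""
    return wins * WIN_POINTS + draws * DRAW_POINTS


# Only the records that the standings actually list.
records = {
    "Kimi K2.6": (7, 1, 0),
    "MiMo V2-Pro": (6, 2, 0),
    "ChatGPT GPT-5.5": (5, 1, 2),
}

for name, (w, d, l) in records.items():
    print(name, match_points(w, d, l), "points from", w + d + l, "matches")
```

Each listed record sums to eight matches, which fits a nine-model round robin in which every model plays every other model once.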

How the puzzle works

The board is a rectangular grid, from 10×10 up to 30×30, filled with letter tiles and one blank space. Bots slide an adjacent tile into the blank and claim valid English words that lie in straight horizontal or vertical lines; diagonal and backwards words don't count. Scoring depends on word length: words of 7 or more letters score length - 6 (an 8-letter word earns +2), while words under 7 letters cost points (a 5-letter word is -1, a 3-letter word is -3). Each word can be claimed only once. Grids are seeded with dictionary words in a crossword layout, the remaining cells are filled with Scrabble-weighted letters, and the board is then scrambled, more aggressively on larger boards. On 30×30, nearly all seed words are broken.
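
The scoring rule and the slide mechanic can be sketched as follows. This is an illustration under assumptions, not the tournament's code: the grid is a list of lists of single characters, the blank is a space character, and the direction names "U", "D", "L", "R" describe which way the tile travels. The sketch also assumes one formula, length - 6, covers every word length, which matches all three examples given; under that assumption a 6-letter word would score 0, which the source does not confirm.

```python
BLANK = " "  # hypothetical marker for the empty cell


def word_score(word):
    """Score = length - 6, assumed here to hold for every word length."""
    return len(word) - 6


def slide(grid, direction):
    """Return a new grid with one tile moved into the blank, or None if illegal.

    `direction` is the way the tile travels: "U", "D", "L" or "R".
    """
    rows, cols = len(grid), len(grid[0])
    br, bc = next(
        (r, c) for r in range(rows) for c in range(cols) if grid[r][c] == BLANK
    )
    # A tile moving up comes from the cell below the blank, and so on.
    dr, dc = {"U": (1, 0), "D": (-1, 0), "L": (0, 1), "R": (0, -1)}[direction]
    tr, tc = br + dr, bc + dc
    if not (0 <= tr < rows and 0 <= tc < cols):
        return None  # no tile on that side of the blank
    new = [list(row) for row in grid]
    new[br][bc], new[tr][tc] = new[tr][tc], BLANK
    return new


print(word_score("PUZZLING"))  # 8 letters
print(slide([["A", "B"], [BLANK, "C"]], "L"))
```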

Kimi's winning strategy

Kimi used a greedy approach: score each possible move by the new positive-value words it unlocks, execute the best one, and repeat. When no move unlocked a positive word, it fell back to the alphabetically first legal direction. That fallback caused inefficient oscillation along the edges of small grids, but the approach paid off on 30×30 boards, where words had to be reconstructed. Kimi's cumulative score of 77 was the tournament's highest.
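
A one-step greedy selector of the kind described might look like the sketch below. It is not Kimi's actual code. The helper names, the direction labels "D", "L", "R", "U", the use of a Python set as the dictionary, and the reading of "new" as "not yet in the claimed set" are all assumptions made for illustration.

```python
BLANK = " "  # hypothetical marker for the empty cell
# Direction the tile travels -> offset from the blank to the tile that moves.
MOVES = {"D": (-1, 0), "L": (0, 1), "R": (0, -1), "U": (1, 0)}


def slide(grid, direction):
    """Return a new grid after the move, or None if there is no tile to move."""
    rows, cols = len(grid), len(grid[0])
    br, bc = next(
        (r, c) for r in range(rows) for c in range(cols) if grid[r][c] == BLANK
    )
    tr, tc = br + MOVES[direction][0], bc + MOVES[direction][1]
    if not (0 <= tr < rows and 0 <= tc < cols):
        return None
    new = [list(row) for row in grid]
    new[br][bc], new[tr][tc] = new[tr][tc], BLANK
    return new


def positive_words(grid, dictionary):
    """Dictionary words of 7+ letters readable left-to-right or top-to-bottom."""
    lines = ["".join(row) for row in grid] + ["".join(col) for col in zip(*grid)]
    found = set()
    for line in lines:
        for start in range(len(line)):
            for end in range(start + 7, len(line) + 1):
                if line[start:end] in dictionary:
                    found.add(line[start:end])
    return found


def greedy_move(grid, dictionary, claimed):
    """Pick the move unlocking the most new positive value, else the first legal one."""
    best_dir, best_gain, legal = None, 0, []
    for direction in sorted(MOVES):  # alphabetical order
        nxt = slide(grid, direction)
        if nxt is None:
            continue
        legal.append(direction)
        gain = sum(len(w) - 6 for w in positive_words(nxt, dictionary) - claimed)
        if gain > best_gain:
            best_dir, best_gain = direction, gain
    if best_dir is None and legal:
        best_dir = legal[0]  # fallback: alphabetically first legal direction
    return best_dir


row = [list("P UZZLES")]  # sliding "P" to the right completes "PUZZLES"
print(greedy_move(row, {"PUZZLES"}, set()))
```

Because the fallback always picks the same direction when nothing scores, a selector like this can bounce back and forth against a board edge, which is the oscillation noted on small grids.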

Why other models struggled

MiMo V2-Pro never actually slid a tile: its "best value > 0" threshold never triggered, so it scanned the initial grid for words of 7 or more letters and claimed them all in a single TCP packet. It scored well on boards with intact seed words but zero on scrambled ones, finishing with 43 cumulative points. Claude also didn't slide; it held up on 25×25 but failed on 30×30. GPT-5.5 slid conservatively (about 120 slides per round) and posted its best scores on 15×15 and 30×30. GLM was the most aggressive slider, with more than 800,000 slides in total. Grok never slid either but scored decently on larger boards.
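
The no-slide strategy amounts to a single scan of the starting grid followed by one batched claim. The sketch below illustrates why it depends entirely on seed words surviving the scramble. The function names and the newline-separated payload are hypothetical; the source does not describe the real wire format.

```python
def scan_lines(grid):
    """Yield every row left-to-right and every column top-to-bottom as a string."""
    for row in grid:
        yield "".join(row)
    for col in zip(*grid):
        yield "".join(col)


def static_claims(grid, dictionary, min_len=7):
    """Words of at least `min_len` letters already present in the unmoved grid."""
    found = set()
    for line in scan_lines(grid):
        for start in range(len(line) - min_len + 1):
            for end in range(start + min_len, len(line) + 1):
                if line[start:end] in dictionary:
                    found.add(line[start:end])
    return sorted(found)


def batch_payload(words):
    """Join all claims into one message (hypothetical format: one word per line)."""
    return ("\n".join(words) + "\n").encode("ascii") if words else b""


dictionary = {"PUZZLES"}
intact = [list("PUZZLES"), list("ABCDEFG")]
scrambled = [list("PUZZLSE"), list("ABCDEFG")]

print(batch_payload(static_claims(intact, dictionary)))
print(batch_payload(static_claims(scrambled, dictionary)))
```

With the seed word intact the scan finds one claim; with two letters swapped it finds nothing, and without sliding there is no way to repair the word.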

Key takeaway

This isn't simply East vs. West: two specific Chinese models performed best, and they did so with very different strategies. Kimi is open-weights and publicly available from Moonshot AI (founded 2023). MiMo V2-Pro is API-only; Xiaomi has confirmed that V2.5 Pro weights will be released soon.

📖 Read the full source: HN AI Agents