Code Evolution Method Nearly Triples LLM Performance on ARC-AGI-2 Benchmark

Code Evolution Boosts LLM Reasoning on ARC-AGI-2
Researchers from Imbue have published results showing that code evolution can substantially improve LLM performance on the ARC-AGI-2 benchmark. Their method combines fitness-based sampling with code mutation driven by a base LLM, and it raised accuracy for all three base models tested, by as much as 2.8x.
Performance Results
The evolution method produces different improvements depending on the base model:
- Kimi K2.5 (open-weights): 2.8x performance gain, from 12.1% to 34.0% accuracy on the public evaluation set, at $2.67 per task. This is the highest-performing open-source/open-weights solution for ARC-AGI-2 currently available.
- Gemini 3 Flash: 1.8x performance gain, from 34.0% to 61.4% accuracy, at $2.42 per task.
- Gemini 3.1 Pro: Improved from 88.1% to 95.1% accuracy, at $8.71 per task. This result is competitive with the current state of the art (97.9% at $11.77/task by Confluence Lab).
All runs used the exact same evolution framework and prompts. The researchers note that scores on the public evaluation set used for these results are not directly comparable to the semi-private data set used for the official ARC-AGI-2 leaderboard.
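The multipliers quoted above follow directly from the reported accuracies. As a quick check, the short Python sketch below recomputes them; the dictionary layout and the function name `summarize` are illustrative and do not come from the source.

```python
# Figures as reported in the article:
# (baseline accuracy %, evolved accuracy %, cost per task in USD).
RESULTS = {
    "Kimi K2.5": (12.1, 34.0, 2.67),
    "Gemini 3 Flash": (34.0, 61.4, 2.42),
    "Gemini 3.1 Pro": (88.1, 95.1, 8.71),
}


def summarize(results):
    """Return the gain multiplier and absolute gain (percentage points) per model."""
    summary = {}
    for model, (baseline, evolved, _cost) in results.items():
        summary[model] = {
            "multiplier": round(evolved / baseline, 2),
            "points_gained": round(evolved - baseline, 1),
        }
    return summary


if __name__ == "__main__":
    for model, row in summarize(RESULTS).items():
        print(f"{model}: {row['multiplier']}x, +{row['points_gained']} points")
```

The relative gain shrinks as the baseline rises: a model that already scores 88.1% has far less headroom than one starting at 12.1%, so absolute points gained can be the more useful comparison.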
How Code Evolution Works
The method iteratively improves upon an initial solution using fitness-based sampling and code mutation. The mutation step is driven by an underlying base LLM but is agnostic to the specific model chosen. This approach can be applied across a wide range of reasoning and optimization tasks beyond ARC-AGI-2.
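To make the loop concrete, here is a minimal Python sketch of fitness-based sampling plus mutation. It is not Imbue's implementation. The population cap, the error-based fitness function, and the `toy_mutate` stand-in (which nudges an integer literal where the real method would ask an LLM to rewrite the code) are all assumptions chosen so the example runs without a model.

```python
import random
import re

# Hypothetical toy task: find an expression in x that maps each input to its output.
EXAMPLES = [(0, 2), (1, 5), (2, 8)]


def fitness(source):
    """Score a candidate expression in (0, 1]; 1.0 means every example matches."""
    total_error = 0
    for x, expected in EXAMPLES:
        predicted = eval(source, {"__builtins__": {}}, {"x": x})
        total_error += abs(predicted - expected)
    return 1.0 / (1.0 + total_error)


def toy_mutate(source, rng):
    """Stand-in for the LLM mutation step: nudge one integer literal by +/-1."""
    literals = list(re.finditer(r"\d+", source))
    target = rng.choice(literals)
    new_value = max(0, int(target.group()) + rng.choice([-1, 1]))
    return source[:target.start()] + str(new_value) + source[target.end():]


def evolve(initial, fitness, mutate, generations=20, population_cap=8, seed=0):
    """Iteratively improve `initial`; returns the best (source, score) pair found."""
    rng = random.Random(seed)
    population = [(initial, fitness(initial))]
    for _ in range(generations):
        # Fitness-based sampling: fitter candidates are more likely to be parents.
        weights = [score for _, score in population]
        parent, _ = rng.choices(population, weights=weights, k=1)[0]
        child = mutate(parent, rng)
        population.append((child, fitness(child)))
        # Keep only the fittest candidates so the pool stays bounded.
        population.sort(key=lambda pair: pair[1], reverse=True)
        del population[population_cap:]
        if population[0][1] >= 1.0:
            break
    return population[0]


if __name__ == "__main__":
    best_source, best_score = evolve("x * 1 + 0", fitness, toy_mutate, generations=200)
    print(best_source, best_score)
```

Because `evolve` only sees `mutate` as a callable, swapping in a different base model would mean replacing that one function, which is one way to read the claim that the approach is agnostic to the model chosen.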
For context, ARC-AGI (Abstraction and Reasoning Corpus) was proposed by François Chollet in 2019 as a way to measure "general fluid intelligence", meaning a system's ability to efficiently learn solutions to novel problems. Each task presents 2-5 input/output examples (rectangular grids of color values) and requires deducing the transformation rule in order to predict the outputs for the challenge inputs.
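The task format described above can be sketched in a few lines of Python. The grids, the hidden rule (a horizontal flip), and the helper names below are invented for illustration. The "train"/"test" layout mirrors the JSON structure of the public ARC datasets as I understand it, so treat it as an assumption rather than a specification.

```python
# A hypothetical ARC-style task: each grid is a list of rows of integer color codes.
TASK = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 5], [0, 0]]},
    ],
}


def flip_horizontal(grid):
    """Candidate transformation rule: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]


def train_accuracy(task, transform):
    """Fraction of demonstration pairs that `transform` reproduces exactly."""
    pairs = task["train"]
    solved = sum(1 for pair in pairs if transform(pair["input"]) == pair["output"])
    return solved / len(pairs)


if __name__ == "__main__":
    if train_accuracy(TASK, flip_horizontal) == 1.0:
        # The rule explains every example, so apply it to the challenge inputs.
        for challenge in TASK["test"]:
            print(flip_horizontal(challenge["input"]))
```

A score like `train_accuracy` is one plausible fitness signal for an evolutionary search over candidate programs, since it can be computed from the demonstration pairs alone without seeing the challenge outputs.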
📖 Read the full source: HN LLM Tools