Code Evolution Boosts LLM Accuracy up to 2.8x on ARC-AGI-2

Code Evolution Boosts LLM Reasoning on ARC-AGI-2

Researchers from Imbue have published results showing how code evolution can significantly improve LLM performance on the ARC-AGI-2 benchmark. Their method combines fitness-based sampling and code mutation driven by a base LLM, achieving substantial gains across different model types.

Performance Results

The evolution method produces different improvements depending on the base model:

Kimi K2.5 (open-weights): 2.8x performance gain, from 12.1% to 34.0% accuracy on the public evaluation set, at $2.67 per task. This is the highest-performing open-source/open-weights solution for ARC-AGI-2 currently available.
Gemini 3 Flash: 1.8x performance gain, from 34.0% to 61.4% accuracy, at $2.42 per task.
Gemini 3.1 Pro: accuracy gain from 88.1% to 95.1%, at $8.71 per task. This result is competitive with the current state of the art (97.9% at $11.77 per task by Confluence Lab).
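
The gain factors quoted above are the ratio of evolved accuracy to baseline accuracy. A quick check in Python, using only the figures from the list (the variable and function names are illustrative):

```python
# Accuracies in percent, copied from the results listed above: (baseline, evolved).
results = {
    "Kimi K2.5": (12.1, 34.0),
    "Gemini 3 Flash": (34.0, 61.4),
    "Gemini 3.1 Pro": (88.1, 95.1),
}


def gain(before: float, after: float) -> float:
    """Relative gain: evolved accuracy divided by baseline accuracy."""
    return after / before


# Round to one decimal place, matching the precision used in the article.
gains = {name: round(gain(b, a), 1) for name, (b, a) in results.items()}

for name, factor in gains.items():
    print(f"{name}: {factor}x")
```

By the same ratio, the Gemini 3.1 Pro improvement works out to roughly 1.1x, which is why the absolute accuracies are the more informative figures for that model.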

All runs used the exact same evolution framework and prompts. The researchers note that these scores, measured on the public evaluation set, are not directly comparable to scores on the semi-private set used for the official ARC-AGI-2 leaderboard.

How Code Evolution Works

The method iteratively improves upon an initial solution using fitness-based sampling and code mutation. The mutation step is driven by an underlying base LLM but is agnostic to the specific model chosen. This approach can be applied across a wide range of reasoning and optimization tasks beyond ARC-AGI-2.
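
The loop described above can be sketched as follows. This is a minimal illustration, not Imbue's implementation: the population cap, the proportional sampling rule, and every name here (evolve, toy_fitness, toy_mutate) are assumptions. In particular, toy_mutate is a random stand-in for the LLM-driven mutation step, and the toy task (learn "add 3") stands in for an ARC-AGI-2 puzzle.

```python
import random
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (program source, fitness score)


def evolve(
    initial: str,
    fitness: Callable[[str], float],
    mutate: Callable[[str], str],
    generations: int = 30,
    population_cap: int = 8,
    rng: Optional[random.Random] = None,
) -> Candidate:
    """Iteratively improve `initial`; returns the best (source, score) found."""
    rng = rng or random.Random(0)
    population: List[Candidate] = [(initial, fitness(initial))]
    for _ in range(generations):
        # Fitness-based sampling: fitter programs are more likely to be parents.
        weights = [score + 1e-6 for _code, score in population]
        parent_code, _parent_score = rng.choices(population, weights=weights, k=1)[0]
        # Mutation: in the article this step is delegated to a base LLM.
        child_code = mutate(parent_code)
        population.append((child_code, fitness(child_code)))
        # Keep only the fittest candidates.
        population.sort(key=lambda item: item[1], reverse=True)
        del population[population_cap:]
    return population[0]


# Toy stand-ins (hypothetical): the hidden rule is "add 3".
EXAMPLES = [(1, 4), (2, 5), (10, 13)]
TEMPLATE = "def transform(x):\n    return x + {k}"
_mutation_rng = random.Random(1)


def toy_fitness(code: str) -> float:
    """Score in (0, 1]; 1.0 means every example is reproduced exactly."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        error = sum(abs(namespace["transform"](x) - y) for x, y in EXAMPLES)
    except Exception:
        return 0.0
    return 1.0 / (1.0 + error)


def toy_mutate(code: str) -> str:
    """Random placeholder for an LLM edit: nudge the constant by one."""
    k = int(code.rsplit("+", 1)[1])
    return TEMPLATE.format(k=k + _mutation_rng.choice([-1, 1]))


initial = TEMPLATE.format(k=0)
best_code, best_score = evolve(initial, toy_fitness, toy_mutate)
print(best_score)
print(best_code)
```

Because the mutation function is passed in as a parameter, the loop itself does not depend on which model proposes the edits, which mirrors the model-agnostic property described in the text.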

For context, ARC-AGI (Abstraction and Reasoning Corpus) was proposed by François Chollet in 2019 as a way to measure "general fluid intelligence": a system's ability to efficiently learn solutions to novel problems. Each task presents 2-5 input/output examples (rectangular grids of color values) and requires deducing the transformation rule in order to predict the outputs for challenge inputs.
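
To make the task format concrete, here is a small sketch of how a task and a candidate solution could be represented. The grids, the flip_horizontal rule, and the train_accuracy helper are invented for illustration and are not taken from the benchmark; the only assumption carried over from the text is that a grid is a rectangle of integer color values.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # rectangular grid; each int is a color value


def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]


def train_accuracy(
    transform: Callable[[Grid], Grid],
    pairs: List[Tuple[Grid, Grid]],
) -> float:
    """Fraction of example pairs whose output the candidate reproduces exactly."""
    solved = sum(1 for grid_in, grid_out in pairs if transform(grid_in) == grid_out)
    return solved / len(pairs)


# Two made-up input/output examples that share one hidden rule.
train_pairs: List[Tuple[Grid, Grid]] = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]
challenge_input: Grid = [[5, 0, 0], [0, 6, 0]]

score = train_accuracy(flip_horizontal, train_pairs)
prediction = flip_horizontal(challenge_input)
print(score, prediction)
```

A score like this, computed on the given examples, is one plausible fitness signal for the evolution loop, since the challenge outputs themselves are not available to the solver.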

📖 Read the full source: HN LLM Tools
