Scaling Karpathy's Autoresearch with 16 GPUs: Results and Methods

By OpenClawRadar. Published: March 19, 2026.

What is Autoresearch?

Autoresearch is Andrej Karpathy's project where a coding agent autonomously improves a neural network training script. The agent edits train.py, runs a 5-minute training experiment on a GPU, checks validation loss, and loops - keeping changes that help, discarding those that don't. In Karpathy's first overnight run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard.

How Autoresearch Works

The project has three files:

  • prepare.py - Downloads data, trains a tokenizer, provides the dataloader and evaluation function. Read-only. The agent cannot touch it.
  • train.py - The GPT model, optimizer, and training loop. This is the only file the agent modifies.
  • program.md - Instructions for the agent: what it can change, how to evaluate results, when to keep vs. discard changes.

The constraint is a fixed 5-minute wall-clock training budget. The agent's job is to minimize val_bpb (validation bits per byte) within that window. Everything in train.py is fair game - architecture, hyperparameters, optimizer settings, batch size, model depth - as long as the code runs without crashing.
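The keep-or-discard loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `propose_edit` and `run_experiment` are hypothetical stand-ins for the agent editing train.py and for the 5-minute training run, and the toy config and loss function are invented for the example.

```python
import random
from typing import Callable

def hill_climb(
    baseline: dict,
    propose_edit: Callable[[dict, random.Random], dict],
    run_experiment: Callable[[dict], float],
    n_iterations: int,
    seed: int = 0,
) -> tuple[dict, float, int]:
    """Greedy keep/discard loop: keep a change only if val_bpb improves."""
    rng = random.Random(seed)
    best_config = dict(baseline)
    best_bpb = run_experiment(best_config)
    kept = 0
    for _ in range(n_iterations):
        candidate = propose_edit(best_config, rng)
        try:
            bpb = run_experiment(candidate)
        except RuntimeError:
            continue  # a crashing run counts as a discarded change
        if bpb < best_bpb:
            best_config, best_bpb = candidate, bpb
            kept += 1
    return best_config, best_bpb, kept


# Toy stand-ins (hypothetical): a "config" is two numbers and the
# "training run" is a cheap function with a known minimum.
def toy_propose(config: dict, rng: random.Random) -> dict:
    new = dict(config)
    key = rng.choice(sorted(new))
    new[key] = new[key] * rng.choice([0.5, 2.0])
    return new

def toy_run(config: dict) -> float:
    if config["depth"] > 64:
        raise RuntimeError("out of memory")
    return 1.0 + abs(config["depth"] - 8) / 100 + abs(config["lr"] - 0.02)

best_config, best_bpb, kept = hill_climb(
    {"depth": 4, "lr": 0.01}, toy_propose, toy_run, n_iterations=50
)
print(best_config, round(best_bpb, 4), kept)
```

In the real setup each call to `run_experiment` would cost about five minutes of GPU time, which is why the number of iterations per night is small.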

The Bottleneck: One GPU, One Experiment

Running experiments sequentially means the agent spends most of its time waiting. A typical cycle looks like:

  1. Agent edits train.py (~30 seconds)
  2. Training runs (~5 minutes)
  3. Agent reads the result, plans the next experiment (~30 seconds)

Step 2 dominates the cycle, and the agent sits idle throughout it when it could be preparing the next experiment, or the next ten. With sequential execution, every parameter combination tested costs another 5-minute wait.
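The arithmetic behind that cycle is simple enough to check directly. The sketch below uses only the approximate step durations listed above.

```python
EDIT_S = 30        # agent edits train.py
TRAIN_S = 5 * 60   # fixed 5-minute training budget
PLAN_S = 30        # agent reads the result and plans the next step

cycle_s = EDIT_S + TRAIN_S + PLAN_S
idle_fraction = TRAIN_S / cycle_s          # share of the cycle spent waiting
experiments_per_hour = 3600 / cycle_s

print(f"cycle: {cycle_s} s, idle: {idle_fraction:.0%}, "
      f"throughput: {experiments_per_hour:.0f} experiments/hour")
# prints: cycle: 360 s, idle: 83%, throughput: 10 experiments/hour
```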

Giving the Agent Cloud GPUs

The experiment used SkyPilot, an open-source tool that launches jobs across clouds and Kubernetes from a YAML file. SkyPilot includes a skill that teaches coding agents to use it. The agent reads the skill, then launches and manages GPU clusters on its own, with no manual cloud setup.

Each experiment is defined in a short YAML file (experiment.yaml) that specifies the GPU type, installs dependencies, runs train.py, and prints metrics to stdout. The agent checks results with the sky logs command.
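The article does not reproduce experiment.yaml, so the following is only a guess at its shape. The sketch renders a SkyPilot task file from Python using the standard task fields (`name`, `resources.accelerators`, `workdir`, `setup`, `run`); the helper name `render_experiment_yaml`, the experiment name, and the `pip install` step are assumptions made for illustration.

```python
from textwrap import dedent

def render_experiment_yaml(name: str, accelerator: str) -> str:
    """Render a SkyPilot task YAML for one training experiment (hypothetical layout)."""
    return dedent(f"""\
        name: {name}

        resources:
          accelerators: {accelerator}

        # Sync the local directory (including the edited train.py) to the cluster.
        workdir: .

        # Assumed dependency install step; the real project may differ.
        setup: |
          pip install -r requirements.txt

        # Metrics go to stdout so they can be read back from the job logs.
        run: |
          python train.py
        """)

yaml_text = render_experiment_yaml("exp-001", "H100:1")
print(yaml_text)
```

Saved as experiment.yaml, a file like this could be submitted with `sky launch -c exp-001 experiment.yaml`, and its output read back with `sky logs exp-001`.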

Results: ~910 Experiments, ~8 Hours, 16 GPUs

Claude Code used the SkyPilot skill to launch and manage experiments across 16 GPUs. Over ~8 hours it submitted ~910 experiments and drove val_bpb from 1.003 down to 0.974 - a 2.87% improvement over baseline.
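As a quick sanity check on these figures, the snippet below recomputes the relative improvement and the per-GPU throughput from the rounded numbers quoted above. The rounded endpoints give about 2.89%, slightly above the reported 2.87%, which presumably was computed from unrounded values. The "occupancy" figure is an illustrative upper-bound comparison that ignores setup and provisioning time.

```python
baseline_bpb = 1.003
final_bpb = 0.974
n_experiments = 910
hours = 8
n_gpus = 16
train_minutes = 5

relative_improvement = (baseline_bpb - final_bpb) / baseline_bpb
per_gpu_hour = n_experiments / (hours * n_gpus)
max_per_gpu_hour = 60 / train_minutes      # back-to-back 5-minute runs
occupancy = per_gpu_hour / max_per_gpu_hour

print(f"improvement: {relative_improvement:.2%}")
print(f"experiments per GPU-hour: {per_gpu_hour:.1f} of at most {max_per_gpu_hour:.0f}")
print(f"occupancy: {occupancy:.0%}")
# prints:
# improvement: 2.89%
# experiments per GPU-hour: 7.1 of at most 12
# occupancy: 59%
```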

How Parallelism Changed the Agent's Research Strategy

With one GPU, the agent is limited to greedy hill-climbing: try one change, check the result, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss.

For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one - one round instead of six.
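To make the point about interaction effects concrete, here is a toy comparison of one-at-a-time search against a 3 x 4 factorial wave. The loss surface `toy_val_bpb` is invented for the example (it assumes wider models prefer a lower learning rate), and the parameter values are hypothetical, not the ones the agent actually swept.

```python
import itertools
import math

WIDTHS = [256, 512, 768]
LRS = [0.01, 0.02, 0.04, 0.08]

def toy_val_bpb(width: int, lr: float) -> float:
    """Invented loss surface: wider models prefer a lower learning rate."""
    best_lr = 0.04 * 256 / width
    return 1.0 - 0.01 * (width / 256 - 1) + 0.02 * abs(math.log2(lr / best_lr))

baseline = (256, 0.04)

# One-at-a-time search: vary a single parameter around the baseline.
one_at_a_time = (
    [(w, baseline[1]) for w in WIDTHS] + [(baseline[0], lr) for lr in LRS]
)
greedy_best = min(one_at_a_time, key=lambda c: toy_val_bpb(*c))

# Full factorial wave: all 3 x 4 = 12 combinations, one per GPU.
grid = list(itertools.product(WIDTHS, LRS))
grid_best = min(grid, key=lambda c: toy_val_bpb(*c))

print(greedy_best, grid_best)
# prints: (256, 0.04) (768, 0.01)
```

On this surface every single-parameter change from the baseline looks worse, so greedy search stays put, while the grid finds that a wider model combined with a lower learning rate is better. The grid costs 12 runs, but with enough GPUs they all finish in one 5-minute wave.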

The agent also discovered that it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference between them: screen ideas on the cheaper H100s, then promote the winners to H200s for validation.
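A two-stage strategy of this kind might look like the sketch below. The function `screen_and_promote`, the `run_on` callback, the candidate names, and all scores are hypothetical; the article does not describe how the agent actually implemented the idea.

```python
from typing import Callable, Sequence

def screen_and_promote(
    candidates: Sequence[str],
    run_on: Callable[[str, str], float],
    top_k: int = 3,
) -> tuple[str, float]:
    """Screen every candidate on H100, then re-run the top_k on H200."""
    screened = sorted(candidates, key=lambda c: run_on(c, "H100"))
    finalists = screened[:top_k]
    validated = {c: run_on(c, "H200") for c in finalists}
    winner = min(validated, key=validated.get)
    return winner, validated[winner]


# Invented val_bpb scores keyed by (candidate, GPU type).
TOY_SCORES = {
    ("wider", "H100"): 0.990, ("wider", "H200"): 0.981,
    ("deeper", "H100"): 0.992, ("deeper", "H200"): 0.979,
    ("bigger-batch", "H100"): 0.998, ("bigger-batch", "H200"): 0.995,
    ("no-warmup", "H100"): 1.004, ("no-warmup", "H200"): 1.001,
}

winner, bpb = screen_and_promote(
    ["wider", "deeper", "bigger-batch", "no-warmup"],
    lambda name, gpu: TOY_SCORES[(name, gpu)],
    top_k=2,
)
print(winner, bpb)
# prints: deeper 0.979
```

In the toy data the ranking of the two finalists flips between GPU types, which is the reason to re-validate winners on the target hardware rather than trust the screening result.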

Performance Comparison

With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).
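The speedup figure can be turned into a parallel efficiency, using only the approximate numbers quoted above. A value below 100% is expected here, since not every experiment in a parallel wave is one a sequential agent would have needed to run.

```python
sequential_hours = 72   # simulated sequential baseline
parallel_hours = 8      # 16-GPU run
n_gpus = 16

speedup = sequential_hours / parallel_hours
efficiency = speedup / n_gpus   # fraction of ideal linear scaling

print(f"speedup: {speedup:.0f}x, parallel efficiency: {efficiency:.0%}")
# prints: speedup: 9x, parallel efficiency: 56%
```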

Experiment Phases

  • Phase 1: Hyperparameter sweeps (~experiments 0-200)
  • Phase 2: Architecture discovery (~experiments 200-420)
  • Phase 3: Fine-tuning the wider model (~experiments 420-560)
  • Phase 4: Optimizer tuning (~experiments 560-700)
  • Phase 5: Diminishing returns (~experiments 700-910)

The agent found that scaling model width mattered more than any single hyperparameter.

Full source: HN AI Agents
