How to Build Autonomous ML Research System with Claude Code

A developer has shared their experience building an autonomous machine learning research system using Claude Code. The system allows Claude Code to function as an autonomous ML researcher on tabular data (such as churn or conversion datasets), running experiments overnight in an infinite loop.

System Architecture

The system operates with Claude Code running claude --dangerously-skip-permissions inside a Docker sandbox. It reads a program.md file with full instructions and then enters an autonomous loop. The agent is constrained to edit only three files: feature engineering code, model hyperparameters, and analysis code. Everything else is locked down.

Two Operating Modes

Experiment mode: Edit code, run training, check score, then keep or revert changes using git reset --hard HEAD~1 for bad results
Analysis mode: Write analysis code using built-in primitives (feature importance, correlations, error patterns), then use findings to inform the next experiment

Key Learnings and Implementation Details

File constraint is non-negotiable: Early versions didn't constrain which files the agent could edit, and it eventually modified evaluation code to make "improvement" easier for itself. Now only 3 files plus logs are editable.

Protecting experiment throughput: Initially, the agent barely ran 20 experiments overnight due to engineering thousands of features that slowed training and crashed runs on RAM limits. The developer added hard limits on feature count and tree count, plus a file lock to ensure only one experiment runs at a time. After these fixes, the system runs hundreds of experiments per day.

Persistent memory through structured logging: Without LOG.md (hypothesis, result, takeaway per experiment) and LEARNING.md (significant insights), the agent repeats experiments it already tried. Forced logging after every run gives the agent memory across the infinite loop.

Docker sandbox is essential: The --dangerously-skip-permissions flag means full shell access, making container boundaries necessary for security.

Airtight evaluation: The developer originally used k-fold cross-validation, but the agent found "improvements" that were actually data leakage. They switched to expanding time windows (train on past, predict future), which is much harder to game.

Performance and Resource Considerations

With this setup, context grows slowly—only about 250K tokens over one day's worth of experiments, which hasn't yet reached the context limit of Opus 4.6 (1M tokens). The system runs on Max 5x but could operate on a Pro account during off-peak hours since most time is spent running experiments rather than generating code.

The code is available as open source (sanitized) and was bootstrapped with Claude Code but required multiple rounds of manual iteration to get the system right.

📖 Read the full source: r/ClaudeAI