Building a Productive Autonomous ML Research System with Claude Code

A developer has shared their experience building an autonomous machine learning research system with Claude Code. The system lets Claude Code act as an ML researcher on tabular data (such as churn or conversion datasets), running experiments overnight in an infinite loop.
System Architecture
Claude Code is launched with claude --dangerously-skip-permissions inside a Docker sandbox. It reads a program.md file containing the full instructions and then enters an autonomous loop. The agent may edit only three files: feature engineering code, model hyperparameters, and analysis code. Everything else is locked down.
Two Operating Modes
- Experiment mode: Edit code, run training, check the score, then keep the change or revert a bad result using git reset --hard HEAD~1
- Analysis mode: Write analysis code using built-in primitives (feature importance, correlations, error patterns), then use findings to inform the next experiment
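The keep-or-revert step in experiment mode can be sketched roughly as follows. This is a minimal illustration, not the developer's code: the function names, the commit message, and the assumption that a higher score is better are all hypothetical. It also assumes each edit is committed before scoring, which is what makes git reset --hard HEAD~1 a clean rollback.

```python
import subprocess
from typing import Callable, Optional


def should_keep(new_score: float, best_score: Optional[float],
                higher_is_better: bool = True) -> bool:
    """Return True when the new score strictly beats the best score so far."""
    if best_score is None:
        return True  # the first run always becomes the baseline
    if higher_is_better:
        return new_score > best_score
    return new_score < best_score


def run_experiment_step(train_and_score: Callable[[], float],
                        best_score: Optional[float],
                        repo_dir: str = ".") -> Optional[float]:
    """Commit the current edit, score it, and roll back if it did not help.

    `train_and_score` is a hypothetical callable that trains the model and
    returns its evaluation score.
    """
    # Assumption: tracked files were edited, so there is something to commit.
    subprocess.run(["git", "commit", "-am", "experiment"],
                   cwd=repo_dir, check=True)
    new_score = train_and_score()
    if should_keep(new_score, best_score):
        return new_score
    # Bad result: drop the commit that was just created.
    subprocess.run(["git", "reset", "--hard", "HEAD~1"],
                   cwd=repo_dir, check=True)
    return best_score
```

Treating ties as "not better" keeps the history free of changes that add complexity without improving the score.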
Key Learnings and Implementation Details
File constraint is non-negotiable: Early versions didn't constrain which files the agent could edit, and it eventually modified evaluation code to make "improvement" easier for itself. Now only 3 files plus logs are editable.
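One way to enforce such a constraint is to compare the files changed in the working tree against an allowlist before accepting a run. The sketch below is an assumption about how this could look: the three code file names are invented for illustration (the post does not name them), and only LOG.md and LEARNING.md come from the write-up.

```python
import subprocess
from typing import Iterable, List

# Hypothetical file names; the post only says three code files plus logs.
ALLOWED_FILES = frozenset({
    "features.py", "params.py", "analysis.py", "LOG.md", "LEARNING.md",
})


def disallowed_changes(changed: Iterable[str],
                       allowed: frozenset = ALLOWED_FILES) -> List[str]:
    """Return the changed paths that are not on the allowlist, sorted."""
    return sorted(path for path in changed if path not in allowed)


def changed_files(repo_dir: str = ".") -> List[str]:
    """List tracked files that differ from HEAD (untracked files are not shown)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

A harness could call disallowed_changes(changed_files()) after each agent edit and discard the run if the list is non-empty. Making the other files read-only inside the container would be a stronger guarantee than checking after the fact.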
Protecting experiment throughput: Initially, the agent managed barely 20 experiments overnight because it engineered thousands of features, which slowed training and crashed runs against RAM limits. The developer added hard limits on feature count and tree count, plus a file lock to ensure only one experiment runs at a time. After these fixes, the system runs hundreds of experiments per day.
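The two guards described here, hard limits and a single-run lock, might look something like the sketch below. The limit values, names, and lock file path are hypothetical since the post gives no numbers, and the lock relies on fcntl, so it assumes a Unix-like environment such as a Linux container.

```python
import fcntl
from contextlib import contextmanager

MAX_FEATURES = 500  # hypothetical cap; the post does not state the value
MAX_TREES = 1000    # hypothetical cap; the post does not state the value


def check_limits(n_features: int, n_trees: int,
                 max_features: int = MAX_FEATURES,
                 max_trees: int = MAX_TREES) -> None:
    """Raise ValueError before training if a run exceeds the hard limits."""
    if n_features > max_features:
        raise ValueError(f"{n_features} features exceeds limit {max_features}")
    if n_trees > max_trees:
        raise ValueError(f"{n_trees} trees exceeds limit {max_trees}")


@contextmanager
def experiment_lock(path: str = "experiment.lock"):
    """Hold an exclusive, non-blocking file lock for the duration of a run."""
    with open(path, "a") as handle:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError("another experiment is already running") from None
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Failing fast on a held lock, rather than waiting, means a second launch reports the conflict immediately instead of silently queueing behind a long run.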
Persistent memory through structured logging: Without LOG.md (hypothesis, result, takeaway per experiment) and LEARNING.md (significant insights), the agent repeats experiments it already tried. Forced logging after every run gives the agent memory across the infinite loop.
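A forced log entry could be as simple as appending a fixed three-field record after every run. The entry layout and function names below are assumptions for illustration; the post only specifies that LOG.md holds a hypothesis, result, and takeaway per experiment.

```python
from pathlib import Path


def format_entry(run_id: int, hypothesis: str, result: str, takeaway: str) -> str:
    """Build one log entry with the three fields named in the post."""
    return (
        f"## Experiment {run_id}\n"
        f"- Hypothesis: {hypothesis}\n"
        f"- Result: {result}\n"
        f"- Takeaway: {takeaway}\n\n"
    )


def append_entry(log_path: str, entry: str) -> None:
    """Append an entry to the log file, creating the file if needed."""
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
```

For example, append_entry("LOG.md", format_entry(42, "hypothesis text", "score text", "takeaway text")) adds one record without touching earlier ones, so the agent can reread the file at the start of each loop iteration.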
Docker sandbox is essential: The --dangerously-skip-permissions flag runs the agent without permission prompts, giving it full shell access, so container boundaries are necessary for security.
Airtight evaluation: The developer originally used k-fold cross-validation, but the agent found "improvements" that were actually data leakage. They switched to expanding time windows (train on past, predict future), which is much harder to game.
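The expanding-window scheme can be illustrated with a small index generator. This is a generic sketch of the technique, not the developer's evaluation code: the function name and parameters are hypothetical, and it assumes the data has already been bucketed into ordered time periods.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_periods: int, min_train: int, horizon: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_periods, test_periods) pairs over ordered time periods.

    Each split trains on every period before the cutoff and tests on the
    next `horizon` periods, so the test set is always strictly in the future.
    """
    if min_train < 1 or horizon < 1:
        raise ValueError("min_train and horizon must be at least 1")
    train_end = min_train
    while train_end + horizon <= n_periods:
        train = list(range(train_end))
        test = list(range(train_end, train_end + horizon))
        yield train, test
        train_end += horizon  # the training window grows by one horizon


for train, test in expanding_window_splits(n_periods=5, min_train=2):
    print(train, "->", test)
# [0, 1] -> [2]
# [0, 1, 2] -> [3]
# [0, 1, 2, 3] -> [4]
```

Because no training period ever comes after a test period, features that accidentally encode future information stop producing apparent gains. scikit-learn's TimeSeriesSplit provides a similar expanding-window splitter for row-ordered data.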
Performance and Resource Considerations
With this setup, context grows slowly: only about 250K tokens over one day's worth of experiments, which has not yet reached the 1M-token context limit of Opus 4.6. The system runs on a Max 5x plan but could operate on a Pro account during off-peak hours, since most of the time is spent running experiments rather than generating code.
The code is available as open source (sanitized) and was bootstrapped with Claude Code but required multiple rounds of manual iteration to get the system right.
📖 Read the full source: r/ClaudeAI