Autoresearch with Claude Code on Production Codebase: 60 Experiments, 3 Changes Kept

✍️ OpenClawRadar · 📅 Published: March 24, 2026

Autoresearch Experiment on Production Codebase

A developer tested Karpathy's autoresearch approach on a real production system using Claude Code, running 60 iterations across two rounds while away from the computer. The target was a hybrid search system built with Django, pgvector, and Cohere embeddings.
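The source does not show the loop itself. As a rough sketch of the keep-or-revert pattern such a run follows, assuming a single numeric score where higher is better: every name here (`run_experiments`, `ExperimentLog`, and the `apply_change`, `revert_change`, and `evaluate` callables) is hypothetical and stands in for whatever the agent and the project's evaluation harness actually do.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ExperimentLog:
    """Record of every experiment, including the ones that were reverted."""
    baseline: float
    kept: List[Tuple[str, float]] = field(default_factory=list)
    reverted: List[Tuple[str, float]] = field(default_factory=list)


def run_experiments(
    baseline_score: float,
    candidates: List[str],
    evaluate: Callable[[], float],          # hypothetical: scores the current state
    apply_change: Callable[[str], None],    # hypothetical: applies one candidate
    revert_change: Callable[[str], None],   # hypothetical: undoes that candidate
) -> ExperimentLog:
    """Try each candidate once; keep it only if it beats the best score so far."""
    log = ExperimentLog(baseline=baseline_score)
    best = baseline_score
    for name in candidates:
        apply_change(name)
        score = evaluate()
        if score > best:
            best = score
            log.kept.append((name, score))
        else:
            # Ties and regressions are both reverted, but still logged.
            revert_change(name)
            log.reverted.append((name, score))
    return log
```

Logging reverted experiments alongside kept ones is the point of the design: the reverted entries are the evidence for where tuning stops paying off.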

Key Results and Findings

Out of 60 iterations, only 3 changes were kept while 57 were reverted. The overall score improvement was marginal (+0.03), but the knowledge gained was significant:

  • Title matching as a search signal proved to be a net negative, which took just 2 iterations to demonstrate
  • Larger candidate pools had no effect: the problem was ranking, not recall
  • The hand-built adaptive weighting was already doing its job: removing it caused regressions
  • Adjusting the keyword damping formulas barely moved scores
  • Round 2, which targeted the Haiku metadata prompt, yielded zero improvements because the ranking weights from Round 1 were co-optimized to the original prompt's output
  • The run surfaced a Redis caching bug: cache keys were built from the query hash, not the prompt hash, so a changed prompt could still be served results cached under the old one. This would have shipped to production unnoticed
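The caching bug in the last item comes down to how the cache key is built. The source does not show the key format or the fix, so the following is only a minimal sketch under assumptions: the function name `metadata_cache_key`, the `search:meta` prefix, and the truncated SHA-256 digests are all hypothetical. The idea is that the prompt text contributes to the key, so editing the prompt cannot hit entries cached under an earlier prompt.

```python
import hashlib


def _digest(text: str) -> str:
    """Short, stable hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def metadata_cache_key(query: str, prompt: str, prefix: str = "search:meta") -> str:
    """Build a cache key that depends on both the query and the prompt.

    Hypothetical key layout: <prefix>:<prompt digest>:<query digest>.
    Keying on the query alone would return stale output after a prompt change.
    """
    return f"{prefix}:{_digest(prompt)}:{_digest(query)}"
```

With a key like this, an experiment that rewrites the prompt is scored on fresh model output rather than on cached output from the previous prompt.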

Practical Takeaways

The biggest insight was that autoresearch helps map where the ceiling is, not just find improvements. Having 60 data points saying "You can stop tuning this" provides concrete evidence rather than relying on intuition. The developer notes this approach saved manual experimentation time on optimizations that wouldn't have paid off.

The full writeup is available at the blog link, and the open source Claude Code autoresearch skill is on GitHub. The developer is curious about others trying this on non-ML codebases and what metrics they're using.

📖 Read the full source: r/ClaudeAI

