Optimizing AutoResearch on RTX 5090: What Failed and What Worked

Initial Problems and Working Path
The initial setup for running AutoResearch on an RTX 5090 (Blackwell) system was "badly broken": the code technically ran, but throughput was only a few thousand tokens per second and MFU (Model FLOPs Utilization) was too low to be useful.
The working configuration path involved:
- Avoiding the broken full-model compile path on this setup
- Keeping the fused optimizer compile path where it actually helped
- Using the stable SDPA/CuDNN attention path
- Tuning total batch and time budget empirically instead of guessing
- Automating the benchmark/extract/strategize/rerun loop
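The "extract" step of such a loop can be sketched in a few lines. This is a minimal illustration, not the author's tooling: the function names `extract_metrics` and `pick_best` are hypothetical, and it assumes each run ends with a summary line in the `name: value` form quoted later in this article, with lower `val_bpb` being better.

```python
import re

# Matches "name: value" pairs such as "val_bpb: 1.165452" or "mfu: 40.49%".
METRIC_RE = re.compile(r"(\w+):\s*(-?\d+(?:\.\d+)?)%?")


def extract_metrics(log_text: str) -> dict[str, float]:
    """Pull numeric 'name: value' pairs out of a run summary (hypothetical helper)."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_text)}


def pick_best(runs: dict[str, str]) -> str:
    """Return the label of the run with the lowest val_bpb."""
    return min(runs, key=lambda label: extract_metrics(runs[label])["val_bpb"])


# Summary lines copied from the progression reported in this article.
runs = {
    "baseline": "val_bpb: 1.165452, mfu: 40.49%",
    "fused_optimizer": "val_bpb: 1.155400, mfu: 42.88%",
}
best = pick_best(runs)
print(best, extract_metrics(runs[best]))
```

A "strategize" step would then take the ranked results and decide which setting to change for the next run; that policy is not described in the source, so it is left out here.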
What Failed
Several failure modes were misleading:
- A path that was technically correct but catastrophically slow
- An MFU figure that was misleading until its denominator (the assumed peak FLOPs) was corrected for the 5090
- Higher per-device batch settings that looked like they should help but actually made things much worse
- Automation bugs in lock cleanup, completion hooks, and dispatch order
As the developer noted: "There were several ways to get a run that looked alive while doing something stupid."
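To see why the MFU denominator matters, here is a rough sketch of the calculation. It assumes the common estimate of about 6 FLOPs per parameter per token for a forward and backward pass of a dense transformer (attention FLOPs ignored). Every number below is a hypothetical placeholder, not a measured value or an official RTX 5090 spec; the source does not say which peak figure was originally used.

```python
def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization as a fraction of peak_flops.

    Uses the ~6 * N FLOPs-per-token approximation for training a dense
    transformer with N parameters (attention FLOPs are ignored).
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops


# Hypothetical inputs, chosen only to show how the denominator moves the result.
tokens_per_sec = 300_000.0
n_params = 50e6
matching_peak = 200e12     # placeholder: peak FLOPs of the GPU actually in use
mismatched_peak = 1000e12  # placeholder: peak FLOPs taken from a different GPU

print(f"{mfu(tokens_per_sec, n_params, matching_peak):.1%}")    # 45.0%
print(f"{mfu(tokens_per_sec, n_params, mismatched_peak):.1%}")  # 9.0%
```

The same throughput reads as healthy or as badly underperforming depending only on which peak is assumed, which is why an uncorrected denominator makes MFU hard to interpret.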
What Helped
Real improvements came from:
- Re-enabling the fused optimizer compile path
- Reducing total batch from the original larger setting
- Validating 2**17 as the better total batch region
- Increasing time budget once the stable batch regime was found
- Treating automation as part of the benchmark system, not an afterthought
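The source does not show the automation code, so the following is only a sketch of one way to avoid the lock cleanup and dispatch-order problems mentioned above. The names `run_lock` and `run_once` are hypothetical. The idea is that the lock file is always removed, even when a run crashes, and that the completion hook fires only after the lock has been released.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def run_lock(path: str):
    """Hold an exclusive lock file for one run and always remove it afterwards."""
    # O_EXCL makes creation fail if another run already holds the lock.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield path
    finally:
        # Runs on success and on failure, so a crashed run leaves no stale lock.
        os.close(fd)
        os.remove(path)


def run_once(lock_path: str, benchmark, on_complete):
    """Run benchmark under the lock, then call the completion hook after release."""
    with run_lock(lock_path):
        result = benchmark()
    # The lock is already gone here, so the hook can safely dispatch the next run.
    on_complete(result)
    return result


# Usage with a stand-in benchmark that just returns a metrics dict.
lock_path = os.path.join(tempfile.mkdtemp(), "run.lock")
events = []
run_once(lock_path, lambda: {"val_bpb": 1.0}, events.append)
```

Calling the hook outside the `with` block is the design choice that matters: a hook that queues the next run while the lock is still held would see its own lock and refuse to start.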
Performance Progression
The progression of useful runs showed clear improvements:
- Baseline healthy run: val_bpb: 1.165452, mfu: 40.49%
- Fused optimizer compile improvement: val_bpb: 1.155400, mfu: 42.88%
- TOTAL_BATCH_SIZE = 2**18: val_bpb: 1.108381, mfu: 43.18%
- TOTAL_BATCH_SIZE = 2**17 validation: val_bpb: 1.089424, mfu: 43.03%
- Best current auto-loop result: TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0, val_bpb: 0.999445, mfu: 42.56%, total_tokens_M: 387.8, num_steps: 2959
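The figures in the best result are internally consistent, which is a quick sanity check worth running on any reported configuration. The sketch below assumes `TOTAL_BATCH_SIZE` counts tokens per optimizer step and `TIME_BUDGET` is in seconds; the source does not state the units, but the token total only matches under the first assumption.

```python
TOTAL_BATCH_SIZE = 2**17  # assumed: tokens per optimizer step
TIME_BUDGET = 1200        # assumed: seconds
num_steps = 2959          # from the reported result

total_tokens = num_steps * TOTAL_BATCH_SIZE
total_tokens_m = total_tokens / 1e6

# Assumption: the whole time budget was spent training, so this is an
# average over the budget rather than a measured throughput.
avg_tokens_per_sec = total_tokens / TIME_BUDGET

print(f"{total_tokens_m:.1f}M tokens")            # 387.8M tokens
print(f"{avg_tokens_per_sec:,.0f} tokens/s")      # 323,202 tokens/s
```

The computed 387.8M tokens matches the reported `total_tokens_M`, and the implied average of roughly 323k tokens per second is about two orders of magnitude above the "few thousand tokens per second" of the broken initial setup.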
Current Best Configuration
The best result found so far:
- TOTAL_BATCH_SIZE = 2**17
- TIME_BUDGET = 1200
- LR multiplier = 1.0
This combination beat the larger batch variants, a smaller 2**16 variant, a lower-LR test, and shorter training budgets.
Key Takeaways
The main lesson was that the winning configuration wasn't a "max everything" setup. The better path involved a stable batch regime, a longer training horizon, and careful elimination of automation and backend mistakes.
The developer emphasized that if you're working on Blackwell/5090 training and seeing bizarre behavior, "it may not be your imagination. Some paths are simply much worse than they first appear." The useful part of this exercise was finding a path that is stable, automatable, reproducible, and good enough to build real follow-on experiments on top of.
📖 Read the full source: r/LocalLLaMA