Optimizing AutoResearch on RTX 5090: What Failed and What Worked

Initial Problems and Working Path
The initial setup for running AutoResearch on an RTX 5090 (Blackwell) system was "badly broken": the code technically ran, but throughput was only a few thousand tokens per second and MFU (Model FLOPs Utilization) was too low to be useful.
The working configuration path involved:
- Avoiding the broken full-model compile path on this setup
- Keeping the fused optimizer compile path where it actually helped
- Using the stable SDPA/CuDNN attention path
- Tuning total batch and time budget empirically instead of guessing
- Automating the benchmark/extract/strategize/rerun loop
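
The "extract" step of that loop can be sketched as follows. This is a minimal illustration, not the developer's actual script: the helper names `extract_metrics` and `is_better` are hypothetical, and the log format is assumed to be the `key: value` pairs quoted in the results further down.

```python
import re
from typing import Dict, Optional

# Assumed log format: "val_bpb: 1.165452, mfu: 40.49%" (as quoted in the results).
METRIC_RE = re.compile(
    r"(val_bpb|mfu|total_tokens_M|num_steps):\s*([0-9]+(?:\.[0-9]+)?)"
)


def extract_metrics(log_text: str) -> Dict[str, float]:
    """Return the last reported value of each known metric in a run log."""
    metrics: Dict[str, float] = {}
    for key, value in METRIC_RE.findall(log_text):
        metrics[key] = float(value)  # later matches overwrite earlier ones
    return metrics


def is_better(candidate: Dict[str, float], best: Optional[Dict[str, float]]) -> bool:
    """Lower val_bpb wins; a run that reported no val_bpb never replaces the best."""
    if "val_bpb" not in candidate:
        return False
    return best is None or candidate["val_bpb"] < best["val_bpb"]
```

Refusing to promote a run that produced no `val_bpb` is one way to keep a run that merely "looked alive" from being recorded as a result.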
What Failed
Several failure modes were misleading:
- A path that was technically correct but catastrophically slow
- MFU readings that were misleading until the denominator (the assumed peak FLOPs) was corrected for the 5090
- Higher per-device batch settings that looked like they should help but actually made things much worse
- Automation bugs around lock cleanup/completion hooks/dispatch order
As the developer noted: "There were several ways to get a run that looked alive while doing something stupid."
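
To see why the MFU denominator matters, here is a rough sketch. The source does not say how MFU was computed or which peak value was used, so this relies on the common approximation of about 6 FLOPs per parameter per trained token (attention FLOPs ignored). Every number below is a placeholder chosen for round arithmetic, not a 5090 specification or a measurement from these runs.

```python
def mfu(num_params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    """Approximate Model FLOPs Utilization as achieved / peak FLOPs per second.

    Uses the ~6 * N FLOPs-per-token rule of thumb for a forward + backward pass.
    """
    if peak_flops_per_sec <= 0:
        raise ValueError("peak_flops_per_sec must be positive")
    achieved_flops_per_sec = 6.0 * num_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical model size and throughput (placeholders, not from the source).
params = 50e6
throughput = 300_000  # tokens per second

# Same run, two different assumed peaks (both placeholders).
mfu_a = mfu(params, throughput, peak_flops_per_sec=200e12)
mfu_b = mfu(params, throughput, peak_flops_per_sec=1000e12)
print(f"{mfu_a:.0%} vs {mfu_b:.0%}")
```

The same run reports a fivefold difference in MFU depending only on the assumed peak, so an MFU figure is only comparable across runs that share a denominator.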
What Helped
Real improvements came from:
- Re-enabling the fused optimizer compile path
- Reducing total batch from the original larger setting
- Validating 2**17 as the better total batch region
- Increasing time budget once the stable batch regime was found
- Treating automation as part of the benchmark system, not an afterthought
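
One way to reason about the total-batch change is through gradient accumulation. The sketch below assumes the trainer reaches its total token batch by accumulating micro-batches, which the source does not confirm. The function name, the per-device batch of 16, and the sequence length of 2048 are all hypothetical.

```python
def grad_accum_steps(
    total_batch_tokens: int,
    device_batch_size: int,
    seq_len: int,
    world_size: int = 1,
) -> int:
    """Micro-steps needed per optimizer step to reach the total token batch."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of one micro-step's tokens")
    return total_batch_tokens // tokens_per_micro_step


# Hypothetical single-GPU settings (not from the source).
print(grad_accum_steps(2**18, device_batch_size=16, seq_len=2048))  # 8
print(grad_accum_steps(2**17, device_batch_size=16, seq_len=2048))  # 4
```

Under these assumptions, halving the total batch halves the work per optimizer step, so a fixed time budget buys roughly twice as many parameter updates. The per-device batch is a separate knob: it trades memory per micro-step against the number of accumulation steps without changing the total batch.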
Performance Progression
Across the useful runs, val_bpb improved at every step while MFU stayed in the 40-43% range:
- Baseline healthy run: val_bpb: 1.165452, mfu: 40.49%
- Fused optimizer compile improvement: val_bpb: 1.155400, mfu: 42.88%
- TOTAL_BATCH_SIZE = 2**18: val_bpb: 1.108381, mfu: 43.18%
- TOTAL_BATCH_SIZE = 2**17 validation: val_bpb: 1.089424, mfu: 43.03%
- Best current auto-loop result: TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0, val_bpb: 0.999445, mfu: 42.56%, total_tokens_M: 387.8, num_steps: 2959
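
The figures for the best run are internally consistent, which is a quick check worth scripting. The step count, batch size, and time budget come from the result above. Treating `TIME_BUDGET` as seconds covering the whole training run is an assumption, so the throughput is only a rough average.

```python
TOTAL_BATCH_SIZE = 2**17  # tokens per optimizer step (from the best run)
NUM_STEPS = 2959          # from the best run
TIME_BUDGET = 1200        # assumed to be seconds of training time

total_tokens = NUM_STEPS * TOTAL_BATCH_SIZE
total_tokens_m = total_tokens / 1e6
avg_tokens_per_sec = total_tokens / TIME_BUDGET

print(f"{total_tokens_m:.1f}M tokens")                       # 387.8M tokens
print(f"{avg_tokens_per_sec:,.0f} tokens/s (rough average)")  # 323,202 tokens/s (rough average)
```

The product matches the reported `total_tokens_M: 387.8`. If the time budget assumption holds, the average throughput is around two orders of magnitude above the few thousand tokens per second of the broken initial setup.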
Current Best Configuration
The best result found so far:
- TOTAL_BATCH_SIZE = 2**17
- TIME_BUDGET = 1200
- LR multiplier = 1.0
This combination beat the larger batch variants, a smaller 2**16 variant, a lower-LR test, and shorter training budgets.
Key Takeaways
The main lesson was that the winning configuration wasn't a "max everything" setup. The better path involved a stable batch regime, a longer training horizon, and careful elimination of automation and backend mistakes.
The developer emphasized that if you're working on Blackwell/5090 training and seeing bizarre behavior, "it may not be your imagination. Some paths are simply much worse than they first appear." The useful part of this exercise was finding a path that is stable, automatable, reproducible, and good enough to build real follow-on experiments on top of.
📖 Read the full source: r/LocalLLaMA