Optimizing AutoResearch on RTX 5090: What Failed and What Worked

✍️ OpenClawRadar · 📅 Published: March 20, 2026

Initial Problems and Working Path

The initial setup for running AutoResearch on an RTX 5090 (Blackwell) system was described as "badly broken": the code technically ran, but throughput was only a few thousand tokens per second and MFU (Model FLOPs Utilization) was too low to be useful.

The working configuration path involved:

  • Avoiding the broken full-model compile path on this setup
  • Keeping the good fused optimizer compile improvements where they actually helped
  • Using the stable SDPA/CuDNN attention path
  • Tuning total batch and time budget empirically instead of guessing
  • Automating the benchmark/extract/strategize/rerun loop
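
The benchmark/extract/strategize/rerun loop in the last item can be sketched roughly as follows. This is an illustration, not the developer's actual tooling: the function names, the callables passed in, and the log format (modeled on the "val_bpb: ..., mfu: ...%" lines quoted in this article) are all assumptions.

```python
import re

# Hypothetical log format: "name: number" pairs, optionally followed by "%".
METRIC_RE = re.compile(r"(\w+):\s*([-+]?\d+(?:\.\d+)?)%?")


def extract_metrics(log_text: str) -> dict[str, float]:
    """Return the last reported value for each `name: number` pair in the log."""
    metrics = {}
    for name, value in METRIC_RE.findall(log_text):
        metrics[name] = float(value)
    return metrics


def auto_loop(run_benchmark, propose_next, first_config, max_runs):
    """Run benchmark -> extract -> strategize until no config is proposed.

    run_benchmark(config) returns the run's log text; propose_next(history)
    returns the next config to try, or None to stop. Both are supplied by
    the caller and are placeholders here.
    """
    history = []
    config = first_config
    for _ in range(max_runs):
        log_text = run_benchmark(config)       # benchmark
        metrics = extract_metrics(log_text)    # extract
        history.append((config, metrics))
        config = propose_next(history)         # strategize
        if config is None:                     # nothing left to rerun
            break
    return history
```

Keeping the strategy step as a plain function of the run history makes it possible to test the loop with a fake benchmark before spending GPU time on it.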

What Failed

Several failure modes were misleading:

  • A path that was technically correct but catastrophically slow
  • MFU figures that were misleading until the denominator (the assumed peak FLOPs) was corrected for the 5090
  • Higher per-device batch settings that looked like they should help but actually made things much worse
  • Automation bugs around lock cleanup, completion hooks, and dispatch order
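
MFU is the ratio of achieved model FLOPs to the hardware's peak FLOPs, so the reported number is only as good as the peak figure in the denominator. A minimal sketch, assuming the common 6 × parameters × tokens estimate for training FLOPs (which ignores attention cost). The function name and arguments are hypothetical, the numbers are illustrative, and no peak value for the 5090 is asserted here:

```python
def mfu_percent(tokens_per_sec, n_params, peak_flops_per_sec,
                flops_per_param_token=6.0):
    """Estimate MFU in percent from throughput and an assumed peak FLOPs figure."""
    achieved_flops_per_sec = flops_per_param_token * n_params * tokens_per_sec
    return 100.0 * achieved_flops_per_sec / peak_flops_per_sec


# Same measured throughput, two candidate denominators (illustrative only).
right = mfu_percent(tokens_per_sec=100_000, n_params=50e6, peak_flops_per_sec=100e12)
wrong = mfu_percent(tokens_per_sec=100_000, n_params=50e6, peak_flops_per_sec=200e12)
print(f"{right:.1f}% vs {wrong:.1f}%")
```

A peak figure that is twice too large halves the reported MFU without any change in actual training speed, which is why the denominator has to match the card and the precision being used.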

As the developer noted: "There were several ways to get a run that looked alive while doing something stupid."
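
The post does not describe the lock cleanup bug in detail. One common way to avoid stale locks in a benchmark runner, shown here purely as an assumption about what such a fix might look like, is to release the lock in a `finally` block so that a crashed run cannot block the next dispatch. The `run_lock` name and the lock-file approach are hypothetical:

```python
import os
from contextlib import contextmanager


@contextmanager
def run_lock(path):
    """Hold an exclusive lock file for the duration of one benchmark run."""
    # O_EXCL makes creation fail if another run already holds the lock.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield path
    finally:
        # Runs on success and on exceptions, so the lock never goes stale.
        os.close(fd)
        os.remove(path)
```

A run would then be wrapped as `with run_lock("bench.lock"): ...`, where the file name is again only a placeholder.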

What Helped

Real improvements came from:

  • Re-enabling the fused optimizer compile path
  • Reducing total batch from the original larger setting
  • Validating 2**17 as the better total batch region
  • Increasing time budget once the stable batch regime was found
  • Treating automation as part of the benchmark system, not an afterthought
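
Total batch and per-device batch are linked through gradient accumulation, which is why the two can be tuned separately. A small sketch of that relationship, assuming TOTAL_BATCH_SIZE is counted in tokens; the function name, the per-device batch size, and the sequence length are hypothetical values chosen for illustration:

```python
def grad_accum_steps(total_batch_tokens, device_batch_size, seq_len, num_devices=1):
    """Number of micro-steps needed to reach the total batch on each optimizer step."""
    tokens_per_micro_step = device_batch_size * seq_len * num_devices
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of the per-step token count")
    return total_batch_tokens // tokens_per_micro_step


# Hypothetical per-device settings: 16 sequences of 2048 tokens on one GPU.
print(grad_accum_steps(2**17, device_batch_size=16, seq_len=2048))  # 4
print(grad_accum_steps(2**18, device_batch_size=16, seq_len=2048))  # 8
```

Under these assumptions, halving the total batch from 2**18 to 2**17 halves the accumulation steps and doubles the number of optimizer updates that fit in a fixed time budget, without touching the per-device batch.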

Performance Progression

The progression of useful runs showed clear improvements:

  • Baseline healthy run: val_bpb: 1.165452, mfu: 40.49%
  • Fused optimizer compile improvement: val_bpb: 1.155400, mfu: 42.88%
  • TOTAL_BATCH_SIZE = 2**18: val_bpb: 1.108381, mfu: 43.18%
  • TOTAL_BATCH_SIZE = 2**17 validation: val_bpb: 1.089424, mfu: 43.03%
  • Best current auto-loop result: TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0, val_bpb: 0.999445, mfu: 42.56%, total_tokens_M: 387.8, num_steps: 2959
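
The figures in the last entry can be cross-checked with simple arithmetic. The sketch below assumes TOTAL_BATCH_SIZE is measured in tokens per optimizer step and TIME_BUDGET in seconds of training time; neither unit is stated explicitly in the post, so the throughput figure in particular is only an estimate:

```python
TOTAL_BATCH_SIZE = 2**17   # assumed: tokens per optimizer step
NUM_STEPS = 2959
TIME_BUDGET = 1200         # assumed: seconds spent training

total_tokens = TOTAL_BATCH_SIZE * NUM_STEPS
total_tokens_m = total_tokens / 1e6
avg_tokens_per_sec = total_tokens / TIME_BUDGET

print(f"total_tokens_M: {total_tokens_m:.1f}")            # 387.8
print(f"average tokens/sec: {avg_tokens_per_sec:,.0f}")   # 323,202
```

The token total matches the reported total_tokens_M of 387.8, which supports the token-based reading of the batch size. If the time assumption holds, the implied average is roughly 323,000 tokens per second, compared with the few thousand per second of the initial setup.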

Current Best Configuration

The best result found so far:

  • TOTAL_BATCH_SIZE = 2**17
  • TIME_BUDGET = 1200
  • LR multiplier = 1.0

This combination beat the larger-batch variants, the smaller 2**16 variant, a lower-LR test, and shorter training budgets.
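
Choosing a winner from a set of runs is a comparison on val_bpb (lower is better), not on MFU. A minimal sketch using the five runs reported under Performance Progression; the list structure and labels are an illustration, and the runs differ in time budget, so this mirrors the article's comparison rather than a controlled one:

```python
runs = [
    {"label": "baseline healthy run", "val_bpb": 1.165452, "mfu": 40.49},
    {"label": "fused optimizer compile", "val_bpb": 1.155400, "mfu": 42.88},
    {"label": "TOTAL_BATCH_SIZE = 2**18", "val_bpb": 1.108381, "mfu": 43.18},
    {"label": "TOTAL_BATCH_SIZE = 2**17 validation", "val_bpb": 1.089424, "mfu": 43.03},
    {"label": "2**17, TIME_BUDGET = 1200, LR x1.0", "val_bpb": 0.999445, "mfu": 42.56},
]

best_by_loss = min(runs, key=lambda r: r["val_bpb"])
best_by_mfu = max(runs, key=lambda r: r["mfu"])

print("lowest val_bpb:", best_by_loss["label"])
print("highest mfu:   ", best_by_mfu["label"])
```

The two selections disagree: the 2**18 run has the highest MFU, while the longer 2**17 run has the lowest val_bpb, so ranking on hardware utilization alone would have picked a worse model.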

Key Takeaways

The main lesson was that the winning configuration wasn't a "max everything" setup. The better path involved a stable batch regime, a longer training horizon, and careful elimination of automation and backend mistakes.

The developer emphasized that if you're working on Blackwell/5090 training and seeing bizarre behavior, "it may not be your imagination. Some paths are simply much worse than they first appear." The useful part of this exercise was finding a path that is stable, automatable, reproducible, and good enough to build real follow-on experiments on top of.

📖 Read the full source: r/LocalLLaMA
