AutoResearch on RTX 5090: Failed & Working Configs

Initial Problems and Working Path

The initial setup for running AutoResearch on an RTX 5090/Blackwell system was "badly broken" with extremely poor performance—only a few thousand tokens per second and essentially useless MFU (Model FLOPs Utilization), despite the code technically running.

The working configuration path involved:

Avoiding the broken full-model compile path on this setup
Keeping the good fused optimizer compile improvements where they actually helped
Using the stable SDPA/CuDNN attention path
Tuning total batch and time budget empirically instead of guessing
Automating the benchmark/extract/strategize/rerun loop

What Failed

Several failure modes were misleading:

A path that was technically correct but catastrophically slow
Misleading MFU interpretation until the denominator was corrected for the 5090 context
Higher per-device batch settings that looked like they should help but actually made things much worse
Automation bugs around lock cleanup/completion hooks/dispatch order

As the developer noted: "There were several ways to get a run that looked alive while doing something stupid."

What Helped

Real improvements came from:

Re-enabling the fused optimizer compile path
Reducing total batch from the original larger setting
Validating 2**17 as the better total batch region
Increasing time budget once the stable batch regime was found
Treating automation as part of the benchmark system, not an afterthought

Performance Progression

The progression of useful runs showed clear improvements:

Baseline healthy run: val_bpb: 1.165452, mfu: 40.49%
Fused optimizer compile improvement: val_bpb: 1.155400, mfu: 42.88%
TOTAL_BATCH_SIZE = 2**18: val_bpb: 1.108381, mfu: 43.18%
TOTAL_BATCH_SIZE = 2**17 validation: val_bpb: 1.089424, mfu: 43.03%
Best current auto-loop result: TOTAL_BATCH_SIZE = 2**17, TIME_BUDGET = 1200, LR multiplier = 1.0, val_bpb: 0.999445, mfu: 42.56%, total_tokens_M: 387.8, num_steps: 2959

Current Best Configuration

The best result found so far:

TOTAL_BATCH_SIZE = 2**17
TIME_BUDGET = 1200
LR multiplier = 1.0

This combination beat larger batch variants, smaller 2**16 variant, a lower-LR test, and shorter training budgets.

Key Takeaways

The main lesson was that the winning configuration wasn't a "max everything" setup. The better path involved a stable batch regime, a longer training horizon, and careful elimination of automation and backend mistakes.

The developer emphasized that if you're working on Blackwell/5090 training and seeing bizarre behavior, "it may not be your imagination. Some paths are simply much worse than they first appear." The useful part of this exercise was finding a path that is stable, automatable, reproducible, and good enough to build real follow-on experiments on top of.

📖 Read the full source: r/LocalLLaMA