Blackwell LLM Toolkit: NVFP4 Configs, Wheels, and Benchmarks for TensorRT-LLM on RTX Pro 6000

✍️ OpenClawRadar · 📅 Published: May 12, 2026

A new GitHub repository, blackwell-llm-toolkit, collects TensorRT-LLM configs, prebuilt wheels, and benchmark results for running LLMs on NVIDIA Blackwell GPUs (RTX Pro 6000, 5090, 5080, 5070 Ti). It focuses on NVFP4 quantization and on Blackwell-specific problems, such as published wheels that ship without sm_120 kernels.

Key Features

  • TensorRT-LLM configs: Includes a YAML file (configs/trtllm/nemotron-omni-v3-sm120.yaml) with the obscure launch flags needed to run Mamba-hybrid models on Blackwell.
  • LMCache wheels: The LMCache wheel on PyPI crashed on Blackwell because it lacked sm_120 cubins. The repo provides a rebuilt wheel and a build script, tested with an Optane SSD for KV cache offloading.
  • Research docs: AI-generated deep dives on architecture differences in Nemotron Omni V3, Qwen 3.5/3.6, and Gemma 4. Notably, Qwen 3.5/3.6 are not just renamed Qwen3-VL models; they use a completely different architecture.
  • Benchmark harnesses: rapid_bench.py runs a 41-prompt quality eval (intelligence, tool-use, calibration, orchestration, creative writing). bench_harness.py measures sustained decode, TTFT, prefill, and concurrency, with a --prompt-tokens N mode for long context.
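
The metrics named in the last bullet can be derived from per-token arrival timestamps of a streamed response. The following is a minimal sketch of that arithmetic, not the repo's bench_harness.py: the function and field names are hypothetical, and prefill speed is approximated as prompt tokens divided by TTFT, which also counts queueing and network time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float         # time to first token, in seconds
    prefill_tok_s: float  # prompt tokens / TTFT (approximation)
    decode_tok_s: float   # tokens after the first / time after the first


def stream_metrics(
    request_start: float,
    token_times: Sequence[float],
    prompt_tokens: int,
) -> StreamMetrics:
    """Compute TTFT, approximate prefill speed, and decode speed.

    request_start and token_times are timestamps in seconds from the same
    clock (for example time.perf_counter()), one entry per generated token.
    """
    if not token_times:
        raise ValueError("no tokens were generated")

    ttft = token_times[0] - request_start

    # Decode speed excludes the first token, whose latency is mostly prefill.
    decode_window = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    decode_rate = decode_tokens / decode_window if decode_window > 0 else 0.0

    prefill_rate = prompt_tokens / ttft if ttft > 0 else 0.0
    return StreamMetrics(ttft, prefill_rate, decode_rate)
```

Separating the first token from the rest matters for long prompts: with a large --prompt-tokens value, TTFT is dominated by prefill, and folding it into a single tokens-per-second figure would understate decode speed.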

Benchmark Highlights (Single RTX Pro 6000 96GB, no tensor parallelism)

  • Nemotron-3-Nano-Omni V3 (multimodal, NVFP4, 8k context): 270 tok/s. Fastest model tested; handles image, video, and audio input alongside text. Requires TRT-LLM v1.3.0rc13.
  • Nemotron-3-Nano (text-only, NVFP4, 8k context): 249 tok/s. Best for tool-calling agents (10/10 on tool use).
  • DeepSeek-V4-Flash (IQ2_XXS-XL GGUF, 65k context): 31 tok/s. Best for complex reasoning (9/10 intelligence, 10/10 tool use, 13/13 calibration).
  • MiniMax-M2.7-REAP-172B (Q3_K_S GGUF, 196k context): 117 tok/s. Good for long conversations.
  • MiniMax-M2.7 W4A16 (with LMCache on Optane SSD, 154k context): 20-22 tok/s. W4A16 quality at long context.
  • MiniMax-M2.7 W4A16 (no LMCache, 64k context): 22-25 tok/s. Highest-quality short answers (10/10 intelligence).

Full results with TTFT, prefill speeds, concurrency, and eval scores are in bench/results.md.
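
One way to use these numbers for model selection is to filter by the context length a workload needs and then rank by decode speed. The sketch below encodes only the highlights listed above. The `Result` type and `fastest_with_context` helper are hypothetical, ranges such as 20-22 tok/s are represented by their lower bound, and the context values are the lengths tested in the article, not the models' maximums.

```python
from typing import NamedTuple, Optional, Sequence


class Result(NamedTuple):
    model: str
    context_k: int       # tested context length, in thousands of tokens
    decode_tok_s: float  # lower bound where the article gives a range


# Figures copied from the benchmark highlights (single RTX Pro 6000 96GB).
RESULTS = [
    Result("Nemotron-3-Nano-Omni V3 (NVFP4)", 8, 270.0),
    Result("Nemotron-3-Nano (NVFP4)", 8, 249.0),
    Result("DeepSeek-V4-Flash (IQ2_XXS-XL GGUF)", 65, 31.0),
    Result("MiniMax-M2.7-REAP-172B (Q3_K_S GGUF)", 196, 117.0),
    Result("MiniMax-M2.7 W4A16 + LMCache", 154, 20.0),
    Result("MiniMax-M2.7 W4A16", 64, 22.0),
]


def fastest_with_context(
    min_context_k: int,
    results: Sequence[Result] = RESULTS,
) -> Optional[Result]:
    """Return the fastest listed result tested at >= min_context_k, or None."""
    candidates = [r for r in results if r.context_k >= min_context_k]
    return max(candidates, key=lambda r: r.decode_tok_s, default=None)


if __name__ == "__main__":
    print(fastest_with_context(32))
```

Throughput is only one axis. The eval scores reported alongside each entry (intelligence, tool use, calibration) would need to be added as fields before this could rank on quality.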

Who It's For

Developers and researchers running LLM inference on Blackwell GPUs who need optimized TensorRT-LLM configs, prebuilt LMCache for long-context offloading, or real-world benchmark data for model selection.

📖 Read the full source: r/LocalLLaMA
