Blackwell LLM Toolkit: NVFP4 Configs, Wheels, and Benchmarks for TensorRT-LLM on RTX Pro 6000

A new GitHub repository, blackwell-llm-toolkit, collects TensorRT-LLM configs, prebuilt wheels, and benchmark results for running LLMs on NVIDIA Blackwell GPUs (RTX Pro 6000, 5090, 5080, 5070 Ti). The focus is on NVFP4 quantization and on working around platform-specific hurdles, such as prebuilt wheels that ship without sm_120 kernels.
Key Features
- TensorRT-LLM configs: Includes a YAML file (configs/trtllm/nemotron-omni-v3-sm120.yaml) with the obscure launch flags needed to run Mamba-hybrid models on Blackwell.
- LMCache wheels: The PyPI wheel crashed on Blackwell due to missing sm_120 cubins. The repo provides a rebuilt wheel and a build script, tested with Optane SSD for KV cache offloading.
- Research docs: AI-generated deep-dives on architecture differences in Nemotron Omni V3, Qwen 3.5/3.6, and Gemma 4. Notably, Qwen 3.5/3.6 are not just a renamed Qwen3-VL; they have a completely different architecture.
- Benchmark harnesses: rapid_bench.py runs a 41-prompt quality eval (intelligence, tool-use, calibration, orchestration, creative writing). bench_harness.py measures sustained decode, TTFT, prefill, and concurrency, with a --prompt-tokens N mode for long context.
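
To make the harness metrics concrete, here is a minimal sketch of how TTFT, prefill speed, and sustained decode speed could be derived from per-token arrival timestamps. It is not taken from bench_harness.py; the `summarize` function, the `DecodeStats` type, and the approximation of prefill speed as prompt tokens divided by TTFT are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DecodeStats:
    ttft_s: float         # time to first token, in seconds
    prefill_tok_s: float  # prompt tokens / TTFT (an approximation)
    decode_tok_s: float   # tokens after the first / time after the first


def summarize(request_start: float,
              token_times: Sequence[float],
              prompt_tokens: int) -> DecodeStats:
    """Hypothetical helper: derive throughput metrics from token timestamps.

    request_start and token_times are wall-clock seconds on the same clock.
    """
    if not token_times:
        raise ValueError("no tokens were generated")

    # TTFT covers queueing plus prefill, so prompt_tokens / TTFT is only
    # a lower bound on the true prefill speed.
    ttft = token_times[0] - request_start
    prefill = prompt_tokens / ttft if ttft > 0 else float("inf")

    # Sustained decode excludes the first token, which is dominated by prefill.
    decode_window = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    decode = decode_tokens / decode_window if decode_window > 0 else 0.0

    return DecodeStats(ttft, prefill, decode)


# Example: 4000-token prompt, first token at 0.5 s, then one token every 0.25 s.
stats = summarize(0.0, [0.5, 0.75, 1.0, 1.25, 1.5], prompt_tokens=4000)
print(stats)
```

Separating the first token from the rest matters most in long-context runs, where prefill time would otherwise drag down the reported decode rate.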
Benchmark Highlights (Single RTX Pro 6000 96GB, no TP)
- Nemotron-3-Nano-Omni V3 (multimodal, NVFP4, 8k context): 270 tok/s. Fastest model tested, handles image/video/audio+text. Requires TRT-LLM v1.3.0rc13.
- Nemotron-3-Nano (text-only, NVFP4, 8k context): 249 tok/s. Best for tool-calling agents (10/10 on tools).
- DeepSeek-V4-Flash (IQ2_XXS-XL GGUF, 65k context): 31 tok/s. Best for complex reasoning (9/10 intel, 10/10 tools, 13/13 calibration).
- MiniMax-M2.7-REAP-172B (Q3_K_S GGUF, 196k context): 117 tok/s. Good for long conversations.
- MiniMax-M2.7 W4A16 (with LMCache on Optane SSD, 154k context): 20-22 tok/s. Long-context W4A16 quality.
- MiniMax-M2.7 W4A16 (short context, no LMCache, 64k context): 22-25 tok/s. Highest quality short answers (10/10 intel).
Full results with TTFT, prefill speeds, concurrency, and eval scores are in bench/results.md.
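
As a rough illustration of using these numbers for model selection, the sketch below encodes the highlights above and picks the fastest entry that was benchmarked at or above a required context length. The `RESULTS` layout and the `fastest_for_context` helper are hypothetical. The sketch assumes the listed context is the usable context, compares ranges by their low end, and ignores the quality scores, so treat it as a starting point only.

```python
# (name, quantization, context in k tokens, low tok/s, high tok/s)
# Values are copied from the highlights above; single figures repeat as low == high.
RESULTS = [
    ("Nemotron-3-Nano-Omni V3", "NVFP4", 8, 270, 270),
    ("Nemotron-3-Nano", "NVFP4", 8, 249, 249),
    ("DeepSeek-V4-Flash", "IQ2_XXS-XL GGUF", 65, 31, 31),
    ("MiniMax-M2.7-REAP-172B", "Q3_K_S GGUF", 196, 117, 117),
    ("MiniMax-M2.7 W4A16 + LMCache", "W4A16", 154, 20, 22),
    ("MiniMax-M2.7 W4A16", "W4A16", 64, 22, 25),
]


def fastest_for_context(min_context_k: int):
    """Return the fastest entry benchmarked at >= min_context_k, or None."""
    candidates = [r for r in RESULTS if r[2] >= min_context_k]
    # Compare on the low end of the reported tok/s range (index 3).
    return max(candidates, key=lambda r: r[3], default=None)


best_long = fastest_for_context(100)
best_short = fastest_for_context(8)
print(best_long[0], best_short[0])
```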
Who It's For
Developers and researchers running LLM inference on Blackwell GPUs who need optimized TensorRT-LLM configs, prebuilt LMCache for long-context offloading, or real-world benchmark data for model selection.
📖 Read the full source: r/LocalLLaMA