DeepSeek-V4-Flash W4A16+FP8 with MTP Self-Speculation: 85 tok/s on 2x RTX PRO 6000 Max-Q

DeepSeek-V4-Flash running at 85.52 tok/s @ 524k context and ~111 tok/s @ 128k single-stream on 2× RTX PRO 6000 Max-Q (96 GB each, no NVLink). The quant uses pasta-paul's W4A16-FP8 base but with a retrofitted MTP head (original quant silently strips MTP at load time). Key details below.
Benchmarks
- pasta-paul base, no MTP, 524k: 52.85 tok/s, 91 ms TTFT (reference)
- This model, 524k 2-stream: 85.52 tok/s, 155 ms TTFT (+62%)
- This model, 128k single-stream: ~111 tok/s, ~310 ms TTFT (+110%)
- Sanity benchmarks (small samples): GSM8K 93%, MMLU 53%, HumanEval (syntactic) 90%
Quantization Details
- 768 routed-expert tensors (256 experts × {w1, w2, w3}): W4A16 INT4 group=128 sym, GPTQ (Frantar with Cholesky H⁻¹). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens – 17,701 MTP forward dumps, 473k tokens.
- 5 attention projections: FP8_BLOCK (upstream FP8 weights, renamed scale → weight_scale for compressed-tensors compat).
- Shared experts, e_proj, h_proj, norms, gate, attn_sink: BF16 / FP32.
Max-Q Specific Fixes
Pass --disable-custom-all-reduce on Max-Q workstation cards (no NVLink). vLLM's CustomAllreduce uses CUDA P2P and deadlocks on PCIe-only topology. NCCL tuning for lower TTFT (~91 ms vs ~155 ms):
NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512How to Run
Needs the patched vLLM fork from pasta-paul's workspace with MTP patches. Example command:
vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
--tensor-parallel-size 2 --kv-cache-dtype fp8 --block-size 256 \
--max-model-len 524288 --max-num-seqs 2 \
--gpu-memory-utilization 0.93 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code \
--disable-custom-all-reduce \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--host 0.0.0.0 --port 8000The model also includes an AGENTS.md runbook for setting up via AI coding agents (Claude/Codex/Cursor).
📖 Read the full source: r/LocalLLaMA
👀 See Also

Developer shares 25 tested Claude prompts for SaaS development workflows
A developer has shared 25 specific prompts they use daily for SaaS development, covering backend architecture, API design, frontend copy, product documentation, and go-to-market tasks. The prompts are designed to save time on repetitive tasks like code review, documentation generation, and edge case testing.

Running OpenClaw Locally with Ollama to Avoid API Costs
A Reddit user shares their experience switching from API-based OpenClaw to running it locally with Ollama, eliminating API costs while maintaining workflows. They created a step-by-step installation video guide.

Research Shows Effective AI Prompting Is Cooperative Communication, Not Engineering
Peer-reviewed research indicates that effective prompting with AI models follows the same cooperative communication principles humans use, with Lakera's analysis showing most prompt failures stem from ambiguity rather than model limitations.

End-to-End LLM Stack Trace: From Keystroke to Streamed Token
A software engineer has created a comprehensive document tracing every layer of the stack when sending a prompt to an LLM, covering client-side token counting, network protocols, API gateways, safety classifiers, tokenization, KV cache, sampling pipeline, and streaming mechanics.