vLLM Setup and Testing on 10x NVIDIA V100 Server with 320GB VRAM

✍️ OpenClawRadar | 📅 Published: April 15, 2026

Hardware Configuration and Build Notes

A developer has built a local AI server with 10x Tesla V100 SXM2 32GB GPUs (320GB VRAM total) on an AMD Threadripper PRO system. The setup runs headless Ubuntu 24.04 with NVIDIA driver 580.126.20. The GPU topology consists of two NVLink quad meshes (GPUs 0-3 and GPUs 4, 5, 8, 9) plus an NV6 pair (GPUs 6-7).
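To keep a tensor-parallel group inside one NVLink mesh, the topology above can be encoded and turned into environment variables. This is a minimal sketch: the group names and the helper visible_devices_env are hypothetical, and it assumes the GPU indices in the article follow PCI bus order (the order nvidia-smi reports by default).

```python
# NVLink groups as described in the article (indices as reported on that server).
NVLINK_GROUPS = {
    "quad_a": [0, 1, 2, 3],
    "quad_b": [4, 5, 8, 9],
    "pair": [6, 7],
}


def visible_devices_env(group_names):
    """Build env vars that restrict a process to the GPUs of the named groups."""
    gpu_ids = []
    for name in group_names:
        gpu_ids.extend(NVLINK_GROUPS[name])
    return {
        # Make CUDA enumerate GPUs in PCI bus order so indices match nvidia-smi.
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids),
    }


if __name__ == "__main__":
    # One quad for a TP=4 run; both quads for a TP=4, PP=2 run.
    print(visible_devices_env(["quad_a"]))
    print(visible_devices_env(["quad_a", "quad_b"]))
```

Whether the original build pinned devices this way is not stated in the source; the point is only that each group of four tensor-parallel ranks lands on GPUs that share an NVLink mesh.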

What Works on V100 with vLLM

  • FP16 unquantized: Primary path using --dtype half
  • bitsandbytes 4-bit: Works for models too large for FP16
  • TRITON_ATTN: Automatic fallback, since FlashAttention2 requires SM 8.0+
  • Tensor/pipeline parallelism: TP=4 and TP=4 with PP=2 were both tested successfully
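The working options above map onto a vLLM launch command roughly as follows. This is a sketch rather than the author's exact invocation: the helper build_vllm_command and the model name are placeholders, and the flags should be checked against the installed vLLM version.

```python
import shlex


def build_vllm_command(model, tensor_parallel=4, pipeline_parallel=1, use_bnb_4bit=False):
    """Assemble a `vllm serve` argument list for a V100-friendly configuration."""
    cmd = [
        "vllm", "serve", model,
        "--dtype", "half",  # FP16 is the primary path on V100
        "--tensor-parallel-size", str(tensor_parallel),
    ]
    if pipeline_parallel > 1:
        cmd += ["--pipeline-parallel-size", str(pipeline_parallel)]
    if use_bnb_4bit:
        # bitsandbytes 4-bit for models too large for FP16
        cmd += ["--quantization", "bitsandbytes"]
    return cmd


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a model named in the article.
    print(shlex.join(build_vllm_command("your-org/your-model")))
    print(shlex.join(build_vllm_command("your-org/your-model", pipeline_parallel=2)))
```

The attention backend is not set here because, per the list above, vLLM falls back to TRITON_ATTN automatically on this hardware.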

What Does Not Work on V100

  • GPTQ: ExLlamaV2 kernels broken on SM 7.0 (vLLM issue #2165)
  • AWQ: Requires SM 7.5+
  • FP8: Requires SM 7.5+. MiniMax M2.5 uses FP8 internally, so it is dead on arrival.
  • FlashAttention2: Requires SM 8.0+
  • DeepSeek MLA: Hopper/Blackwell only. Full DeepSeek V3/R1 cannot run on vLLM + V100.
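The compute-capability thresholds in this list can be written as a small lookup to check a GPU before picking a quantization format. This is an illustrative sketch: the table only restates the minimums quoted above (it is not read from vLLM), and unsupported_features is a hypothetical helper.

```python
# Minimum compute capability per feature, as quoted in the list above.
MIN_COMPUTE_CAPABILITY = {
    "AWQ": (7, 5),
    "FP8": (7, 5),
    "FlashAttention2": (8, 0),
}


def unsupported_features(capability):
    """Return the features whose minimum capability exceeds the given (major, minor)."""
    return sorted(
        name
        for name, minimum in MIN_COMPUTE_CAPABILITY.items()
        if capability < minimum
    )


if __name__ == "__main__":
    # V100 is compute capability 7.0.
    print(unsupported_features((7, 0)))
```

On a machine with PyTorch installed, torch.cuda.get_device_capability(0) returns the (major, minor) tuple to pass in. GPTQ and DeepSeek MLA are left out of the table because the article describes them as broken or architecture-specific rather than giving a numeric threshold.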

Build Requirements and Critical Fixes

PyTorch 2.11.0+cu126 is required: cu126 is the last CUDA build with V100 support, since cu128 and later drop Volta. Source compilation requires TORCH_CUDA_ARCH_LIST="7.0" and MAX_JOBS=20. A MoE kernel patch is needed for issue #36008: change B.size(1) to B.size(0) in fused_moe.py (2 lines). PYTHONNOUSERSITE=1 is required to isolate the conda environment from stale packages installed outside it, in the user site-packages directory.
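The three environment variables named above can be collected in one place before starting the build. A minimal sketch, assuming the build is launched from Python; vllm_build_env is a hypothetical helper, and the values are the ones quoted in the article.

```python
import os


def vllm_build_env(base_env=None, max_jobs=20):
    """Return a copy of the environment with the V100 build settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = "7.0"  # compile kernels for Volta only
    env["MAX_JOBS"] = str(max_jobs)      # number of parallel compile jobs
    env["PYTHONNOUSERSITE"] = "1"        # skip the user site-packages directory
    return env


if __name__ == "__main__":
    env = vllm_build_env()
    for key in ("TORCH_CUDA_ARCH_LIST", "MAX_JOBS", "PYTHONNOUSERSITE"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the env argument of subprocess.run when invoking the build, so the settings apply to that process without changing the calling shell.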

Critical NCCL Dependency Fix: pip install -e . pulls in nvidia-nccl-cu13 alongside nvidia-nccl-cu12. The cu13 library gets loaded at runtime and references CUDA 13 symbols that don't exist in the cu126 runtime, resulting in "NCCL error: unhandled cuda error" on every multi-GPU launch. The fix involves uninstalling all nvidia-* packages and managing dependencies carefully.
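A quick way to spot this conflict is to list the installed distributions and look for more than one nvidia-nccl-* package. The sketch below uses only the standard library; the helper names are hypothetical, and it only detects the situation, it does not perform the fix.

```python
from importlib import metadata


def nccl_packages(names):
    """Return the distribution names that are NCCL wheels, sorted."""
    return sorted(n for n in names if n.lower().startswith("nvidia-nccl-"))


def has_nccl_conflict(names):
    """True when more than one NCCL wheel is installed side by side."""
    return len(nccl_packages(names)) > 1


def installed_distribution_names():
    """Names of all distributions visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            names.append(name)
    return names


if __name__ == "__main__":
    names = installed_distribution_names()
    print("NCCL packages:", nccl_packages(names))
    print("Conflict:", has_nccl_conflict(names))
```

Running it inside the conda environment after pip install -e . should show whether both the cu12 and cu13 wheels are present.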

📖 Read the full source: r/LocalLLaMA
