vLLM Setup and Testing on 10x NVIDIA V100 Server with 320GB VRAM

Hardware Configuration and Build Notes
A developer has built a local AI server with 10x Tesla V100 SXM2 32GB GPUs (320GB VRAM total) on an AMD Threadripper PRO system. The setup uses Ubuntu 24.04 headless with NVIDIA driver 580.126.20. GPU topology consists of two NVLink quad meshes (GPUs 0-3, 4/5/8/9) plus an NV6 pair (GPUs 6-7).
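One way to make use of the topology described above is to keep each tensor-parallel group inside a single NVLink mesh. The following is a minimal sketch under that assumption; the source does not say how devices were selected, `pick_devices` is a hypothetical helper, and it assumes vLLM assigns tensor-parallel ranks to consecutive visible devices.

```python
# NVLink groups as listed in the hardware description above.
NVLINK_GROUPS = [
    (0, 1, 2, 3),  # first quad mesh
    (4, 5, 8, 9),  # second quad mesh
    (6, 7),        # NV6 pair
]


def pick_devices(tp_size: int, pp_size: int = 1) -> str:
    """Build a CUDA_VISIBLE_DEVICES value with one NVLink group per pipeline stage."""
    # Only groups large enough to hold a full tensor-parallel group qualify.
    candidates = [g for g in NVLINK_GROUPS if len(g) >= tp_size]
    if len(candidates) < pp_size:
        raise ValueError(
            f"need {pp_size} NVLink groups of size >= {tp_size}, "
            f"found {len(candidates)}"
        )
    chosen = [gpu for group in candidates[:pp_size] for gpu in group[:tp_size]]
    return ",".join(str(gpu) for gpu in chosen)


print(pick_devices(4))     # one quad for TP=4
print(pick_devices(4, 2))  # both quads for TP=4 PP=2
```

The resulting string could be exported as CUDA_VISIBLE_DEVICES before launching, so that traffic within a tensor-parallel group stays on NVLink.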

What Works on V100 with vLLM
- FP16 unquantized: Primary path using --dtype half
- bitsandbytes 4-bit: Works for models too large for FP16
- TRITON_ATTN: Automatic fallback since FlashAttention2 requires SM 80+
- Tensor/Pipeline parallel: TP=4 and TP=4 PP=2 both tested successfully
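The working configurations above can be sketched as a small command builder. This is an illustration only: the model name is a placeholder, `build_serve_command` is a hypothetical helper, and the exact command line used in the original post is not given. The flags shown (`--dtype`, `--tensor-parallel-size`, `--pipeline-parallel-size`, `--quantization`) are standard vLLM options, though some vLLM versions may need additional flags for bitsandbytes.

```python
import shlex

TOTAL_GPUS = 10  # GPU count of the server described in this article


def build_serve_command(model, tp=4, pp=1, quantization=None):
    """Return an argv list for `vllm serve` using FP16 and TP/PP parallelism."""
    if tp * pp > TOTAL_GPUS:
        raise ValueError(f"TP={tp} x PP={pp} needs more than {TOTAL_GPUS} GPUs")
    cmd = [
        "vllm", "serve", model,
        "--dtype", "half",                  # FP16 path; V100 has no BF16 support
        "--tensor-parallel-size", str(tp),
    ]
    if pp > 1:
        cmd += ["--pipeline-parallel-size", str(pp)]
    if quantization is not None:
        cmd += ["--quantization", quantization]  # e.g. "bitsandbytes"
    return cmd


# "my-org/my-model" is a placeholder, not a model named in the source.
print(shlex.join(build_serve_command("my-org/my-model", tp=4, pp=2)))
```

Building the argument list in code makes it easy to reject layouts that exceed the ten available GPUs before a launch is attempted.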

What Does Not Work on V100
- GPTQ: ExLlamaV2 kernels broken on SM 7.0 (vLLM issue #2165)
- AWQ: Requires SM 75+
- FP8: Requires SM 75+. MiniMax M2.5 uses FP8 internally, so it is dead on arrival.
- FlashAttention2: Requires SM 80+
- DeepSeek MLA: Hopper/Blackwell only. Full DeepSeek V3/R1 cannot run on vLLM + V100.
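The numeric thresholds in the list above can be expressed as a simple lookup. This sketch encodes only the minimum compute capabilities stated in this article (it does not model the GPTQ kernel bug or the MLA restriction), and `unsupported_features` is a hypothetical helper rather than a vLLM API.

```python
# Minimum compute capability per feature, as stated in the list above
# (75 means SM 7.5, 80 means SM 8.0).
MIN_SM = {
    "AWQ": 75,
    "FP8": 75,
    "FlashAttention2": 80,
}


def unsupported_features(sm: int) -> list:
    """Return the features whose minimum SM is above the given capability."""
    return sorted(name for name, minimum in MIN_SM.items() if sm < minimum)


# V100 is SM 7.0, so every feature in the table is unavailable.
print(unsupported_features(70))
```

On a real machine the capability could be read with `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple, and converted with `major * 10 + minor`.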

Build Requirements and Critical Fixes
PyTorch 2.11.0+cu126 is required: cu126 is the last CUDA build with V100 support, since cu128 and later drop Volta. Source compilation requires TORCH_CUDA_ARCH_LIST="7.0" and MAX_JOBS=20. A MoE kernel patch is needed for issue #36008: change B.size(1) to B.size(0) in fused_moe.py (2 lines). PYTHONNOUSERSITE=1 is required to isolate the conda environment from stale system packages.
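The environment settings named above can be collected in one place before a source build. This is a minimal sketch: `build_env` is a hypothetical helper, and the commented-out pip invocation is an assumption about how the build would be started, not a command quoted from the source.

```python
import os


def build_env(base=None, max_jobs=20):
    """Return a copy of the environment with the V100 build settings applied."""
    env = dict(os.environ if base is None else base)
    env["TORCH_CUDA_ARCH_LIST"] = "7.0"  # compile kernels for Volta only
    env["MAX_JOBS"] = str(max_jobs)      # parallel compile jobs
    env["PYTHONNOUSERSITE"] = "1"        # ignore packages in the user site directory
    return env


env = build_env()
print({k: env[k] for k in ("TORCH_CUDA_ARCH_LIST", "MAX_JOBS", "PYTHONNOUSERSITE")})

# Assumed usage from inside the source checkout (not run here):
# import subprocess, sys
# subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."],
#                env=env, check=True)
```

Passing the dictionary through `env=` keeps the settings scoped to the build process instead of the whole shell session.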
Critical NCCL Dependency Fix: pip install -e . pulls in nvidia-nccl-cu13 alongside nvidia-nccl-cu12. The cu13 library is loaded at runtime and references CUDA 13 symbols that do not exist in the cu126 runtime, which produces "NCCL error: unhandled cuda error" on every multi-GPU launch. The fix is to uninstall all nvidia-* packages and then manage the dependencies carefully.
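A quick check for the conflict described above might look like the following. It is a sketch, not the author's procedure: `find_cu13_packages` and `installed_names` are hypothetical helpers, and matching on an `nvidia-` prefix with a `-cu13` suffix is an assumption generalized from the one package the source names (nvidia-nccl-cu13).

```python
from importlib.metadata import distributions


def find_cu13_packages(names):
    """Return nvidia-* package names that target CUDA 13 (suffix -cu13)."""
    hits = []
    for name in names:
        normalized = name.lower().replace("_", "-")
        if normalized.startswith("nvidia-") and normalized.endswith("-cu13"):
            hits.append(normalized)
    return sorted(hits)


def installed_names():
    """List the names of all distributions visible to this interpreter."""
    return [d.metadata["Name"] for d in distributions() if d.metadata["Name"]]


conflicts = find_cu13_packages(installed_names())
print(conflicts if conflicts else "no cu13 nvidia packages found")
```

Running this inside the conda environment after installation would show whether any CUDA 13 wheels were pulled in next to the cu12 ones.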
📖 Read the full source: r/LocalLLaMA