vLLM Setup and Testing on 10x NVIDIA V100 Server with 320GB VRAM

OpenClawRadar | Published: April 15, 2026

Hardware Configuration and Build Notes

A developer has built a local AI server with 10x Tesla V100 SXM2 32GB GPUs (320GB VRAM total) on an AMD Threadripper PRO system. The setup runs headless Ubuntu 24.04 with NVIDIA driver 580.126.20. The GPU topology consists of two NVLink quad meshes (GPUs 0-3 and GPUs 4, 5, 8, 9) plus an NV6 pair (GPUs 6-7).
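As a rough illustration of how this topology might guide GPU selection, the sketch below picks devices so that each tensor-parallel group stays inside one NVLink mesh. The helper name `visible_devices` is hypothetical, and the idea that consecutive devices in CUDA_VISIBLE_DEVICES end up in the same tensor-parallel group is an assumption about rank ordering, not something the source states.

```python
# NVLink groups as described in the article: two quad meshes and one NV6 pair.
NVLINK_GROUPS = [(0, 1, 2, 3), (4, 5, 8, 9), (6, 7)]


def visible_devices(tp, pp=1, groups=NVLINK_GROUPS):
    """Return a CUDA_VISIBLE_DEVICES string with `pp` blocks of `tp` GPUs.

    Each block is taken from a single NVLink group, so (assuming consecutive
    devices form a tensor-parallel group) TP traffic stays on NVLink.
    """
    fitting = [g for g in groups if len(g) >= tp]
    if len(fitting) < pp:
        raise ValueError(f"need {pp} NVLink groups of size >= {tp}, found {len(fitting)}")
    chosen = [gpu for group in fitting[:pp] for gpu in group[:tp]]
    return ",".join(str(gpu) for gpu in chosen)


if __name__ == "__main__":
    print(visible_devices(tp=4))        # one quad mesh
    print(visible_devices(tp=4, pp=2))  # both quad meshes, one per pipeline stage
```

The returned string could be exported as CUDA_VISIBLE_DEVICES before launching a server. Whether this ordering actually matters on a given machine is worth checking against the output of `nvidia-smi topo -m`.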

What Works on V100 with vLLM

FP16 unquantized: Primary path using --dtype half
bitsandbytes 4-bit: Works for models too large for FP16
TRITON_ATTN: Automatic fallback since FlashAttention2 requires SM 80+
Tensor/Pipeline parallel: TP=4 and TP=4 PP=2 both tested successfully
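
The working combinations above map onto standard vLLM command-line options. The following is a minimal sketch that assembles such a command; the helper `vllm_serve_command` and the model name `org/model-name` are placeholders, not taken from the source, and some vLLM versions may need additional flags for bitsandbytes.

```python
import shlex


def vllm_serve_command(model, tp=4, pp=1, quantization=None):
    """Build a `vllm serve` argument list for a V100-style setup.

    --dtype half forces FP16, the primary path on V100 per the article.
    """
    cmd = ["vllm", "serve", model, "--dtype", "half",
           "--tensor-parallel-size", str(tp)]
    if pp > 1:
        cmd += ["--pipeline-parallel-size", str(pp)]
    if quantization:
        # e.g. "bitsandbytes" for models too large for FP16
        cmd += ["--quantization", quantization]
    return cmd


if __name__ == "__main__":
    # "org/model-name" is a placeholder model identifier.
    print(shlex.join(vllm_serve_command("org/model-name", tp=4, pp=2)))
    print(shlex.join(vllm_serve_command("org/model-name", tp=4,
                                        quantization="bitsandbytes")))
```

Building the command as a list keeps it usable with `subprocess.run` without shell quoting issues. TP=4 with PP=2 uses eight of the ten GPUs.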

What Does Not Work on V100

GPTQ: ExLlamaV2 kernels broken on SM 7.0 (vLLM issue #2165)
AWQ: Requires SM 75+
FP8: Requires SM 75+. MiniMax M2.5 uses FP8 internally, so it cannot run on V100.
FlashAttention2: Requires SM 80+
DeepSeek MLA: Hopper/Blackwell only. Full DeepSeek V3/R1 cannot run on vLLM + V100.
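
The minimum-capability entries in this list can be expressed as a small lookup, sketched below. The thresholds are copied from the list; reading "Hopper/Blackwell only" as SM 90+ is an assumption, and GPTQ is left out because the article attributes its failure to a kernel bug rather than a capability minimum. The function name `unsupported_features` is hypothetical.

```python
# Minimum compute capability per feature, encoded as major * 10 + minor
# (SM 7.0 -> 70), taken from the list above.
MIN_SM = {
    "awq": 75,
    "fp8": 75,
    "flash_attention_2": 80,
    "deepseek_mla": 90,  # assumption: "Hopper/Blackwell only" read as SM 90+
}


def unsupported_features(sm):
    """Return the features whose stated minimum exceeds the given capability."""
    return sorted(name for name, minimum in MIN_SM.items() if sm < minimum)


def sm_from_capability(capability):
    """Convert a (major, minor) tuple, such as the one returned by
    torch.cuda.get_device_capability(), to the integer form used above."""
    major, minor = capability
    return major * 10 + minor


if __name__ == "__main__":
    v100 = sm_from_capability((7, 0))
    print(unsupported_features(v100))
```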

Build Requirements and Critical Fixes

PyTorch 2.11.0+cu126 is required: cu126 is the last CUDA build with V100 support, since cu128 and later drop Volta. Compiling vLLM from source requires TORCH_CUDA_ARCH_LIST="7.0" and MAX_JOBS=20. A MoE kernel patch is needed for issue #36008, changing B.size(1) to B.size(0) in fused_moe.py (2 lines). PYTHONNOUSERSITE=1 is required to isolate the conda environment from stale system packages.
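
The environment variables named above could be collected in one place before starting a build. This is a minimal sketch; the helper `v100_build_env` is hypothetical, and the commented-out build command only illustrates how the environment might be passed along.

```python
import os


def v100_build_env(base=None, max_jobs=20):
    """Return a copy of the environment with the V100 build settings applied."""
    env = dict(os.environ if base is None else base)
    env.update({
        "TORCH_CUDA_ARCH_LIST": "7.0",  # compile kernels for Volta only
        "MAX_JOBS": str(max_jobs),      # parallel compile jobs (20 in the article)
        "PYTHONNOUSERSITE": "1",        # ignore packages in the user site directory
    })
    return env


if __name__ == "__main__":
    env = v100_build_env()
    for key in ("TORCH_CUDA_ARCH_LIST", "MAX_JOBS", "PYTHONNOUSERSITE"):
        print(f"{key}={env[key]}")
    # Illustrative use, run from the source checkout:
    # subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."],
    #                env=env, check=True)
```

Passing an explicit environment to the build process avoids relying on whatever happens to be exported in the current shell.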

Critical NCCL Dependency Fix: pip install -e . pulls in nvidia-nccl-cu13 alongside nvidia-nccl-cu12. The cu13 library gets loaded at runtime and references CUDA 13 symbols that don't exist in the cu126 runtime, resulting in "NCCL error: unhandled cuda error" on every multi-GPU launch. The fix involves uninstalling all nvidia-* packages and managing dependencies carefully.
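
One way to spot this conflict before a multi-GPU launch is to list the installed NCCL wheels and flag any that do not match the CUDA 12 runtime. The sketch below uses only the standard library; the function names are hypothetical, and it only detects the mismatch rather than reproducing the author's exact fix.

```python
from importlib import metadata


def installed_distribution_names():
    """Return lower-cased names of all distributions visible to this interpreter."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.add(name.lower())
    return sorted(names)


def mismatched_nccl(names, wanted_suffix="cu12"):
    """Return nvidia-nccl-* packages that do not match the wanted CUDA suffix."""
    return sorted(
        n for n in names
        if n.startswith("nvidia-nccl-") and not n.endswith(wanted_suffix)
    )


if __name__ == "__main__":
    bad = mismatched_nccl(installed_distribution_names())
    if bad:
        print("Conflicting NCCL packages:", ", ".join(bad))
    else:
        print("No conflicting NCCL packages found.")
```

If the check reports nvidia-nccl-cu13 in a cu126 environment, that matches the failure mode described above.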

Read the full source: r/LocalLLaMA
