V100 Cluster vs. MoE: 12x SXM2 32GB Build with Claude Code Orchestration

OpenClawRadar | Published: June 8, 2026

A lawyer running a 12x V100 32GB SXM2 cluster on a Threadripper Pro reports that on Volta GPUs (compute capability 7.0), only MoE models deliver usable decode speeds. Dense models are a trap: even a 27-32B dense model manages only 20-28 tok/s, well below his 40 tok/s floor. In contrast, Qwen3.5-122B-A10B (122B total, 10B active) achieves ~50 tok/s on a single 4-GPU NVLink board, and Gemma-4-26B-A4B hits ~113 tok/s. All benchmarks use Q8 GGUF with Q4 KV cache and flash-attention enabled.
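One plausible reading of these numbers, not stated in the source, is that single-stream decode is limited by memory bandwidth: each generated token requires streaming every active weight once. The sketch below assumes roughly 900 GB/s of HBM2 bandwidth per V100 SXM2, about 8.5 bits per weight for Q8_0, and a layer split in which only one GPU streams weights at a time. It ignores KV-cache reads and all overheads, so it gives a ceiling, not a prediction.

```python
# Rough decode ceiling from memory bandwidth alone.
# Assumptions (not from the source): ~900 GB/s per V100 SXM2,
# Q8_0 at ~8.5 bits per weight, one GPU streaming weights at a time.

V100_BANDWIDTH_GBPS = 900.0
Q8_0_BITS_PER_WEIGHT = 8.5


def decode_ceiling_tok_s(active_params_billion: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_per_token = active_params_billion * Q8_0_BITS_PER_WEIGHT / 8.0
    return V100_BANDWIDTH_GBPS / gb_per_token


if __name__ == "__main__":
    for name, active_b in [
        ("dense 27B", 27.0),
        ("Qwen3.5-122B-A10B (10B active)", 10.0),
        ("Gemma-4-26B-A4B (4B active)", 4.0),
    ]:
        print(f"{name}: ceiling ~{decode_ceiling_tok_s(active_b):.0f} tok/s")
```

Under these assumptions a 27B dense model tops out near 31 tok/s, already under the 40 tok/s floor, while 10B active parameters allow roughly 85 tok/s. The reported 20-28 tok/s and ~50 tok/s both sit below their respective ceilings, which is consistent with this simple model.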

Hardware Configuration

The final build has twelve V100-SXM2 32GB cards on a Threadripper Pro: two NVLink boards (4 GPUs each) plus two additional pairs. Board A occupies GPUs {4,5,8,9} and Board B {6,7,10,11}. An NVLink pair sits on {0,1}, and a mixed pair on {2,3}, where one card is 16GB. Cross-board hops go over PCIe/NUMA instead of NVLink, which kills throughput, so every model is kept inside a single board.
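Keeping a model inside one board can be sketched by restricting each server process to that board's GPUs through the `CUDA_VISIBLE_DEVICES` environment variable. The GPU index sets below come from the article; the group names and helper functions are hypothetical, and the sketch assumes the indices match the order in which CUDA enumerates the devices.

```python
import os

# GPU index sets as reported in the article; the group names are hypothetical.
GPU_GROUPS = {
    "board_a": (4, 5, 8, 9),
    "board_b": (6, 7, 10, 11),
    "nvlink_pair": (0, 1),
    "mixed_pair": (2, 3),
}


def check_disjoint(groups: dict[str, tuple[int, ...]]) -> None:
    """Raise if any GPU index is assigned to more than one group."""
    seen: dict[int, str] = {}
    for name, gpus in groups.items():
        for gpu in gpus:
            if gpu in seen:
                raise ValueError(f"GPU {gpu} is in both {seen[gpu]} and {name}")
            seen[gpu] = name


def env_for(group: str) -> dict[str, str]:
    """Copy of the current environment restricted to one group's GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in GPU_GROUPS[group])
    return env


check_disjoint(GPU_GROUPS)
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` when launching whichever inference server runs on that board. CUDA's default enumeration order can differ from the order shown by `nvidia-smi`, so setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` may be needed for the indices to line up.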

A second box was added: EPYC 7302P, 512GB RAM, 4x RTX 3090 + 2x V100-PCIe, running Ollama for smaller models.

Stack Switch: vLLM → llama.cpp

The operator abandoned vLLM because the models he actually wants are MoE GGUFs, and vLLM on Volta is a dead end for them: by his account, FP8/AWQ/Marlin kernels require SM75+, and GPTQ kernels are broken on compute 7.0. He moved to mainline llama.cpp, which recently fixed a Gemma chat-parser bug that was mangling long prompts.

Orchestration with Claude Code

The system is not a single model answering a chat. An orchestrator driven by Claude Code routes legal tasks across several local models, each pinned to its own board to avoid GPU contention. For the heaviest job (a full affidavit or motion, from intake to finished document), all 16 GPUs across both boxes are active:

  • Workhorse drafting: Qwen3.6-35B-A3B on Board A
  • Heavy reasoning + high-stakes drafting: Qwen3.5-122B-A10B on Board B
  • Gate model: a small model on the {0,1} pair checks whether there are grounds
  • Adversarial reviewer: a model on the {2,3} pair attacks the draft
  • Financial/extraction: Gemma-4-26B on the 3090s via Ollama

This is a sequential pipeline, so the models don't all run at once, but every model stays resident in GPU memory across the 16 GPUs.
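The sequential flow described above might look like the following minimal sketch. The stage order (gate, then draft, then review), the yes/no gate convention, and all names are assumptions for illustration; the source only lists the roles and says they run one after another. Each model is represented as a plain function from prompt text to response text, standing in for a call to a local inference server.

```python
from dataclasses import dataclass
from typing import Callable

# Any function from prompt text to response text. In a real setup each one
# would wrap a request to a local model server; here they are stand-ins.
ModelFn = Callable[[str], str]


@dataclass
class Pipeline:
    gate: ModelFn
    drafter: ModelFn
    reviewer: ModelFn

    def run(self, intake: str) -> dict[str, str]:
        """Run gate -> draft -> review, one stage at a time."""
        verdict = self.gate(intake)
        # Assumed convention: the gate's answer starts with "yes" to proceed.
        if not verdict.strip().lower().startswith("yes"):
            return {"status": "rejected", "gate": verdict}
        draft = self.drafter(intake)
        critique = self.reviewer(draft)
        return {
            "status": "drafted",
            "gate": verdict,
            "draft": draft,
            "critique": critique,
        }


def choose_drafter(high_stakes: bool, workhorse: ModelFn, heavy: ModelFn) -> ModelFn:
    """Send high-stakes work to the larger model, everything else to the workhorse."""
    return heavy if high_stakes else workhorse
```

Because each stage finishes before the next begins, no two models compete for the same GPUs at the same moment, which matches the article's point that the models stay loaded but do not run concurrently.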

Practical Lessons

  • Hallucination: Local models confidently fabricate citations and dates. A verifier checks every cite, date, and Bates number against source material and blocks ungrounded content. An adversarial reviewer runs on top.
  • Pipeline poisoning: The evidence bundle builder was scooping up its own prior outputs as client evidence, causing the models to "ground" on slop they'd written earlier. One draft cited an RTX 3060 as a Bates number. Fixed by scrubbing the builder's input history.
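Both lessons can be illustrated with a small sketch. The first function flags Bates numbers that appear in a draft but not in the source material; the second keeps pipeline-generated files out of the evidence bundle. The Bates format, the function names, and the directory-based filter are assumptions: the source does not describe the verifier's implementation, and its actual fix was scrubbing the builder's input history, not this filter.

```python
import re

# Hypothetical Bates format: 2-6 capital letters followed by 4-8 digits,
# for example "SMITH000123". Real matters use whatever format was assigned.
BATES_RE = re.compile(r"\b[A-Z]{2,6}\d{4,8}\b")


def ungrounded_bates(draft: str, source_text: str) -> list[str]:
    """Bates numbers cited in the draft that never appear in the source material."""
    known = set(BATES_RE.findall(source_text))
    cited = BATES_RE.findall(draft)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return [b for b in dict.fromkeys(cited) if b not in known]


def client_evidence_only(paths: list[str], generated_dir: str) -> list[str]:
    """Drop any file that lives under the pipeline's own output directory."""
    prefix = generated_dir.rstrip("/") + "/"
    return [p for p in paths if not p.startswith(prefix)]
```

A non-empty result from `ungrounded_bates` would be a reason to block the draft. The same pattern extends to dates and case citations, given a pattern for each.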

Lighter tasks use far less hardware: combining and Bates-stamping exhibits is pure CPU work (PyMuPDF + Tesseract), and plain summaries hit only Gemma and the router.
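As a rough illustration of the CPU-only exhibit step, the sketch below merges PDFs and stamps a sequential label on each page with PyMuPDF. The label format, stamp position, and function names are assumptions, not the operator's code, and the Tesseract OCR step is left out.

```python
def bates_label(prefix: str, number: int, width: int = 6) -> str:
    """Zero-padded label: prefix 'EX' and number 12 give 'EX000012'."""
    return f"{prefix}{number:0{width}d}"


def combine_and_stamp(
    pdf_paths: list[str], out_path: str, prefix: str, start: int = 1
) -> int:
    """Merge PDFs in order and stamp a sequential label on every page.

    Returns the next unused number so a later batch can continue the sequence.
    """
    import fitz  # PyMuPDF; imported here so bates_label works without it

    merged = fitz.open()  # new, empty PDF
    for path in pdf_paths:
        with fitz.open(path) as src:
            merged.insert_pdf(src)

    number = start
    for page in merged:
        # Near the lower-right corner; offsets are arbitrary and ignore rotation.
        point = fitz.Point(page.rect.width - 110, page.rect.height - 20)
        page.insert_text(point, bates_label(prefix, number), fontsize=9)
        number += 1

    merged.save(out_path)
    merged.close()
    return number
```

Returning the next unused number lets several exhibit bundles share one continuous numbering range without tracking state elsewhere.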

📖 Read the full source: r/LocalLLaMA
