Setting Up Qwen3.5-27B Locally: vLLM vs llama.cpp Comparison

By OpenClawRadar | Published: March 15, 2026

Qwen3.5-27B Performance and Capabilities

According to the source, Qwen3.5-27B posts strong benchmark results: MMLU-Pro 85.3, MMLU-Redux 93.3, and C-Eval 90.2. It also reports an overall intelligence score of 42.1 (better than 91% of compared models) and a coding index of 34.9 (better than 88% of compared models). The model uses a dense architecture with a native 262k-token context window that is extensible to more than 1M tokens.

Backend Comparison: llama.cpp vs vLLM

The source compares two main approaches for local deployment:

Option 1: llama.cpp

  • Pros: Low footprint, easy setup, and support for a q4 (4-bit quantized) KV cache, which keeps VRAM usage reasonable
  • Cons: The KV cache gets wiped at random, forcing a full prompt reprocess mid-session; this is a known bug with no solid fix yet. Speculative decoding via MTP (multi-token prediction) does not work.

Option 2: vLLM

  • Pros: Stable sessions with no KV cache wipes, plus support for MTP speculative decoding for faster generation
  • Cons: No q4 KV cache support, so VRAM usage spikes at 256k context. Tool call parsing for Qwen3.5 is buggy in v0.17.1; fixes exist in open GitHub PRs but have not been merged yet. The bug produces malformed JSON output, which breaks agentic coding flows.
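The VRAM trade-off between the two backends comes down to KV cache size, which scales linearly with context length and bytes per cached element. The sketch below uses the standard full-attention estimate (keys plus values, per layer, per KV head). The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.5-27B's published configuration, and the q4 figure ignores the small overhead of quantization scales.

```python
def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Estimate KV cache size in GiB for a standard full-attention transformer."""
    # Factor 2 covers keys and values; one vector per layer, KV head, and token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return elems * bytes_per_elem / 2**30


# HYPOTHETICAL architecture values, chosen only to illustrate the scaling.
# Substitute the real numbers from the model's config before relying on this.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT = 262_144  # 256k tokens

if __name__ == "__main__":
    for label, nbytes in [("fp16", 2.0), ("fp8", 1.0), ("q4", 0.5)]:
        size = kv_cache_gib(CONTEXT, N_LAYERS, N_KV_HEADS, HEAD_DIM, nbytes)
        print(f"{label:>4} KV cache at 256k context: {size:.1f} GiB")
```

Whatever the real architecture values are, the ratio holds: a q4 cache takes roughly a quarter of the memory of an fp16 cache at the same context length, which is why the missing q4 option matters most at 256k.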

Recommended vLLM Configuration

The source gives specific configuration recommendations for stable, high-speed runs with the Hugging Face checkpoint osoleve/Qwen3.5-27B-Text-NVFP4-MTP:

  • Use the FlashInfer CUTLASS backend for optimized performance
  • Set the context window to 128k (balances VRAM and usability; bump to 256k if you have the hardware)
  • Limit GPU memory utilization to 0.82 (--gpu-memory-utilization) to avoid OOM crashes
  • Set --max-num-seqs to 2 (handles a single session fine without overcommitting)
  • Enable MTP speculative decoding for speed improvements
  • Patch vLLM with the Qwen tool call parsing fixes from the open PRs
  • Use the Claude Code CLI; OpenCode still has tool call parsing issues that do not appear in Claude Code after the patch
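As a rough illustration, the numeric settings above can be assembled into a `vllm serve` command line. This is a sketch, not the source's exact invocation: `--max-model-len`, `--gpu-memory-utilization`, `--max-num-seqs` (note the trailing "s" in vLLM's spelling), and `--speculative-config` are real vLLM options, but the speculative method name `"mtp"` and the speculative token count are assumptions the source does not specify. Backend selection and the tool-call patches are left out because the source does not say how they are applied.

```python
import json
import shlex

MODEL = "osoleve/Qwen3.5-27B-Text-NVFP4-MTP"  # checkpoint named in the source


def build_vllm_command(max_model_len=131072, gpu_mem_util=0.82,
                       max_num_seqs=2, num_spec_tokens=1):
    """Return the argument list for a `vllm serve` launch."""
    # ASSUMPTION: the method name and token count are illustrative only;
    # check the vLLM docs for the value your version expects for this model.
    spec_config = {"method": "mtp", "num_speculative_tokens": num_spec_tokens}
    return [
        "vllm", "serve", MODEL,
        "--max-model-len", str(max_model_len),          # 128k context
        "--gpu-memory-utilization", str(gpu_mem_util),  # headroom against OOM
        "--max-num-seqs", str(max_num_seqs),            # concurrent sequences
        "--speculative-config", json.dumps(spec_config),
    ]


if __name__ == "__main__":
    # Print a shell-safe command line; run it manually once it looks right.
    print(shlex.join(build_vllm_command()))
```

Passing `max_model_len=262144` gives the 256k variant for cards with enough VRAM.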

Performance Results

According to the source, performance varies by hardware:

  • On an RTX 5090 (32GB VRAM): about 50 tokens per second (TPS)
  • On an RTX Pro 6000 (96GB VRAM): 70 TPS at the full 256k context
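To put those throughput figures in practical terms, the short sketch below converts TPS into wall-clock time for a single response. The 2,000-token response length is an arbitrary example rather than a figure from the source, and the estimate covers generation only, not prompt processing.

```python
def generation_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    RESPONSE_TOKENS = 2000  # hypothetical response length
    for gpu, tps in [("RTX 5090", 50), ("RTX Pro 6000", 70)]:
        secs = generation_seconds(RESPONSE_TOKENS, tps)
        print(f"{gpu}: {secs:.1f} s for {RESPONSE_TOKENS} tokens")
```

At these rates the RTX Pro 6000 finishes the same response in about 71% of the time the RTX 5090 needs (50/70).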

📖 Read the full source: r/LocalLLaMA
