Qwen3.5-27B Local Setup: vLLM vs llama.cpp

Qwen3.5-27B Performance and Capabilities

According to the source, Qwen3.5-27B performs strongly across several benchmarks:

MMLU-Pro: 85.3
MMLU-Redux: 93.3
C-Eval: 90.2
Overall intelligence score: 42.1 (better than 91% of compared models)
Coding index: 34.9 (better than 88% of compared models in coding)

The model uses a dense architecture with a native 262k-token context window that is extensible to 1M+ tokens.

Backend Comparison: llama.cpp vs vLLM

The source compares two main approaches for local deployment:

Option 1: llama.cpp

Pros: Low footprint, easy setup, supports q4 KV cache for reasonable VRAM usage
Cons: The KV cache gets wiped at random, which forces full prompt reprocessing mid-session, and speculative decoding via MTP does not work. The source describes this as a known bug with no solid fixes yet.
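
To get a feel for why a q4 KV cache matters for VRAM, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions: the layer count, KV head count, and head dimension below are placeholders, not the real Qwen3.5-27B values (read those from the model's config.json), and the q4 cost assumes llama.cpp's q4_0 layout of 18 bytes per block of 32 values.

```python
# Rough KV cache size estimate. All model dimensions are PLACEHOLDERS
# chosen for illustration, not the actual Qwen3.5-27B configuration.

F16_BYTES_PER_VALUE = 2.0
# q4_0 stores 32 values in 18 bytes (16 bytes of 4-bit values + 2-byte scale).
Q4_0_BYTES_PER_VALUE = 18 / 32


def kv_cache_bytes(n_tokens, n_attn_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor 2: each attention layer stores one K and one V tensor.
    values_per_token = 2 * n_attn_layers * n_kv_heads * head_dim
    return values_per_token * bytes_per_value * n_tokens


if __name__ == "__main__":
    # Hypothetical dimensions; replace with values from config.json.
    dims = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)
    context = 262_144  # "262k" tokens

    for label, cost in (("f16", F16_BYTES_PER_VALUE), ("q4_0", Q4_0_BYTES_PER_VALUE)):
        gib = kv_cache_bytes(context, bytes_per_value=cost, **dims) / 2**30
        print(f"{label}: {gib:.2f} GiB")
```

Whatever the real dimensions are, the ratio between the two formats stays the same under these assumptions: the f16 cache is about 3.6 times larger than the q4_0 cache.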

Option 2: vLLM

Pros: Stable sessions, no KV wipeouts, supports speculative decoding with MTP for faster generation
Cons: No q4 KV cache support, so VRAM usage spikes at 256k context. Tool call parsing for Qwen3.5 is buggy in v0.17.1; fixes exist in open GitHub PRs but have not been merged yet. The bug produces malformed JSON output, which breaks agentic coding flows.
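
While the parser fixes remain unmerged, a client-side guard can catch malformed tool calls before they reach an agent loop. The following is a minimal sketch, not part of vLLM: it assumes each tool call arrives as a JSON object with "name" and "arguments" fields (the OpenAI-style shape, where "arguments" may itself be a JSON-encoded string), and the helper name parse_tool_call is hypothetical.

```python
import json


def parse_tool_call(raw):
    """Return {"name": ..., "arguments": {...}} or None if the call is malformed.

    Hypothetical helper for illustration; the expected shape is an assumption.
    """
    try:
        call = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # not valid JSON at all

    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return None  # valid JSON, but not a tool call object

    args = call["arguments"]
    # OpenAI-style responses carry arguments as a JSON-encoded string.
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return None

    if not isinstance(call["name"], str) or not isinstance(args, dict):
        return None
    return {"name": call["name"], "arguments": args}


if __name__ == "__main__":
    good = '{"name": "read_file", "arguments": "{\\"path\\": \\"main.py\\"}"}'
    bad = '{"name": "read_file", "arguments": {"path": "main.py"'  # truncated
    print(parse_tool_call(good))
    print(parse_tool_call(bad))
```

A caller can treat a None result as a signal to retry the request or surface an error instead of executing a half-parsed call.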

Recommended vLLM Configuration

The source recommends the following configuration for stable, high-speed runs, using the Hugging Face model osoleve/Qwen3.5-27B-Text-NVFP4-MTP:

Use the flashinfer cutlass backend for optimized performance
Set context window to 128k (balances VRAM and usability; bump to 256k if you have the hardware)
Limit GPU memory utilization to 0.82 to avoid OOM crashes
Set max-num-seqs to 2 (handles a single session fine without overcommitting)
Enable MTP speculative decoding for speed improvements
Patch vLLM with the Qwen tool call parsing fixes from the open PRs
Use the Claude Code CLI as the client; OpenCode still has tool call parsing issues that do not appear with Claude Code after the patch
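
The settings above could be assembled into a launch command roughly as follows. This is a sketch, not the source's exact invocation, and several details are assumptions: that "128k" means 131072 tokens, that the flashinfer cutlass backend is selected through the VLLM_NVFP4_GEMM_BACKEND environment variable, that the MTP method is named "mtp" in the speculative config, and that one speculative token is used (the source gives no count). The tool call parser patch is not covered here because the source does not identify the PRs.

```python
import json
import shlex

MODEL = "osoleve/Qwen3.5-27B-Text-NVFP4-MTP"

# ASSUMPTION: this variable and value select the flashinfer cutlass backend;
# check the environment variable list of your vLLM version.
ENV = {"VLLM_NVFP4_GEMM_BACKEND": "flashinfer-cutlass"}


def build_vllm_command(max_model_len=131072, gpu_mem_util=0.82,
                       max_num_seqs=2, num_spec_tokens=1):
    """Build the argv list for `vllm serve` with the recommended settings."""
    # ASSUMPTION: method name and token count are not given by the source.
    spec_config = {"method": "mtp", "num_speculative_tokens": num_spec_tokens}
    return [
        "vllm", "serve", MODEL,
        "--max-model-len", str(max_model_len),           # 128k context
        "--gpu-memory-utilization", str(gpu_mem_util),   # headroom against OOM
        "--max-num-seqs", str(max_num_seqs),             # one session, no overcommit
        "--speculative-config", json.dumps(spec_config), # MTP speculative decoding
    ]


if __name__ == "__main__":
    env_prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in ENV.items())
    print(env_prefix, shlex.join(build_vllm_command()))
```

Passing max_model_len=262144 to build_vllm_command gives the 256k variant for hardware with enough VRAM.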