Setting Up Qwen3.5-27B Locally: vLLM vs llama.cpp Comparison

Qwen3.5-27B Performance and Capabilities
The Qwen3.5-27B model demonstrates strong performance in various benchmarks according to the source: MMLU-Pro: 85.3, MMLU-Redux: 93.3, C-Eval: 90.2, overall intelligence score: 42.1 (better than 91% of compared models), and coding index: 34.9 (tops 88% in coding capabilities). The model features a dense architecture with native 262k context that's extensible to 1M+ tokens.
Backend Comparison: llama.cpp vs vLLM
The source compares two main approaches for local deployment:
Option 1: llama.cpp
- Pros: Low footprint, easy setup, supports q4 KV cache for reasonable VRAM usage
- Cons: Major issue with KV cache getting wiped randomly, forcing full prompt reprocessing mid-session. Speculative decoding via MTP doesn't work. Known bug with no solid fixes yet.
Option 2: vLLM
- Pros: Stable sessions, no KV wipeouts, supports speculative decoding with MTP for faster generations
- Cons: No q4 KV support, so VRAM spikes at 256k context. Tool call parsing is buggy for Qwen3.5 in v0.17.1, with fixes in open GitHub PRs but not merged yet. This breaks agentic coding flows with malformed JSON outputs.
Recommended vLLM Configuration
The source provides specific configuration recommendations for stable, high-speed runs using the model from HF: osoleve/Qwen3.5-27B-Text-NVFP4-MTP:
- Use the flashinfer cutlass backend for optimized performance
- Set context window to 128k (balances VRAM and usability; bump to 256k if you have the hardware)
- Limit GPU utilization to 0.82 to avoid OOM crashes
- Set max-num-seq to 2 (handles a single session fine without overcommitting)
- Enable MTP speculative decoding for speed improvements
- Patch vLLM with the Qwen tool call parsing fixes from the open PRs
- Use Claude code cli - open code still has tool call parsing issues that don't appear on Claude code after the patch
Performance Results
According to the source, performance varies by hardware:
- On an RTX 5090 (32GB VRAM): ~50 TPS
- On an RTX Pro 6000 (96GB VRAM): 70 TPS at full 256k context
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw Multi-Agent Playbook: 7 Isolated Agents for 5/Month
Complete architecture guide for running specialized AI agents with focused memory, least-privilege permissions, and smart model routing.

How to fix OpenClaw 'Cannot find module' error after update
After updating OpenClaw from version 2026.3.24 to 2026.4.5, users are encountering a 'Cannot find module @buape/carbon' error. The solution involves manually running a post-installation script instead of installing the package globally.

OpenClaw's Gateway and Skills: Moving Beyond Chat to Automated Execution
OpenClaw's Gateway connects channels like Telegram and WhatsApp to skills that execute real-world actions such as running tests, calling APIs, and managing files, with cron jobs enabling scheduled background automation.

Building 9 Claude Skills for Solo Studio: Stacking Instructions for Real Work
A solo developer built nine Claude skills for video production, analytics, SEO, financial modeling, and more. Key insight: write skills as instructions to an experienced colleague, not as documentation. Skills auto-trigger and stack when tasks overlap.