Speculative Decoding Benchmarks on RTX 3090 with Qwen Models for HVAC Business Use

✍️ OpenClawRadar · 📅 Published: March 28, 2026

Hardware and Setup

The developer used an RTX 3090 24GB, Ryzen 7600X, 32GB RAM, and WSL2 Ubuntu. They moved from Ollama on Windows to llama.cpp on WSL Linux with speculative decoding for an internal AI platform handling customer lookups, quote formatting, equipment research, and parsing messy job notes.

Testing Methodology

They tested 16 GGUF models across the Qwen2.5, Qwen3, and Qwen3.5 families, covering every target+draft combination that fits in 24GB VRAM, including cross-generation draft pairings (Qwen2.5 drafts on Qwen3 targets and vice versa). VRAM was monitored on every combo to catch CPU offloading. Quality evaluation used real HVAC business prompts for SQL generation, quote formatting, messy field note parsing, and equipment compatibility reasoning. Speed sweeps ran on draftbench and llama-throughput-lab, with Claude Code automating the process overnight.
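The VRAM check described above could be scripted along the following lines. This is a minimal sketch, not the developer's actual tooling: the helper names are hypothetical, and the 512 MiB headroom threshold is an assumed heuristic for "close enough to full that spilling to system memory is plausible".

```python
import subprocess


def parse_vram_mib(csv_line: str) -> tuple[int, int]:
    """Parse one 'used, total' line of nvidia-smi CSV output (values in MiB)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def query_vram_mib(gpu_index: int = 0) -> tuple[int, int]:
    """Ask nvidia-smi for used and total memory on one GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_vram_mib(out.strip().splitlines()[0])


def likely_offloading(used_mib: int, total_mib: int, headroom_mib: int = 512) -> bool:
    """Assumed heuristic: flag a run when free VRAM drops below headroom_mib."""
    return total_mib - used_mib < headroom_mib
```

Calling `query_vram_mib()` once per target+draft combo while the server is loaded, and discarding any run where `likely_offloading(...)` is true, would keep spilled configurations out of the speed table.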

Top Speed Results

  • Qwen3-8B Q8_0 + Qwen3-1.7B Q4_K_M: 279.9 tok/s (+236% speedup, 13.6 GB VRAM)
  • Qwen2.5-7B Q4_K_M + Qwen2.5-0.5B Q8_0: 205.4 tok/s (+50% speedup, ~6 GB VRAM)
  • Qwen3-8B Q8_0 + Qwen3-0.6B Q4_0: 190.5 tok/s (+129% speedup, 12.9 GB VRAM)
  • Qwen3-14B Q4_K_M + Qwen3-0.6B Q4_0: 159.1 tok/s (+115% speedup, 13.5 GB VRAM)
  • Qwen2.5-14B Q8_0 + Qwen2.5-0.5B Q4_K_M: 137.5 tok/s (+186% speedup, ~16 GB VRAM)
  • Qwen3.5-35B-A3B Q4_K_M (baseline, no draft): 133.6 tok/s (22 GB VRAM)
  • Qwen2.5-32B Q4_K_M + Qwen2.5-1.5B Q4_K_M: 91.0 tok/s (+156% speedup, ~20 GB VRAM)
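The speedup percentages in the list above can be turned back into an implied no-draft baseline. The sketch below assumes that "+N% speedup" is measured against the same target model and quantization running without a draft; the source does not spell this out.

```python
def implied_baseline(tok_s: float, speedup_pct: float) -> float:
    """Throughput without a draft, assuming tok_s = baseline * (1 + pct / 100)."""
    return tok_s / (1 + speedup_pct / 100)


# (target + draft, measured tok/s, reported speedup in percent), from the list above
results = [
    ("Qwen3-8B Q8_0 + Qwen3-1.7B Q4_K_M", 279.9, 236),
    ("Qwen3-8B Q8_0 + Qwen3-0.6B Q4_0", 190.5, 129),
    ("Qwen2.5-32B Q4_K_M + Qwen2.5-1.5B Q4_K_M", 91.0, 156),
]

for name, tok_s, pct in results:
    print(f"{name}: ~{implied_baseline(tok_s, pct):.1f} tok/s without a draft")
```

Under that reading, both Qwen3-8B Q8_0 rows imply a baseline of roughly 83 tok/s, which suggests the two percentages were computed against the same reference run.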

The Qwen3-8B + 1.7B draft combo achieved a 100% acceptance rate: every token the 1.7B draft proposed matched what the 8B target would have generated.
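To make "acceptance rate" concrete, here is a simplified sketch of greedy verification: the draft proposes a run of tokens, the target checks them, and only the matching prefix is kept. The function names are hypothetical, and the source does not say how draftbench or llama.cpp computed the reported figure, so treat this as an illustration of the concept only.

```python
def accepted_prefix_len(draft_tokens: list[int], target_tokens: list[int]) -> int:
    """Count draft tokens accepted before the first mismatch with the target."""
    accepted = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break
        accepted += 1
    return accepted


def acceptance_rate(rounds: list[tuple[list[int], list[int]]]) -> float:
    """Fraction of drafted tokens accepted across all (draft, target) rounds."""
    drafted = sum(len(draft) for draft, _ in rounds)
    accepted = sum(accepted_prefix_len(draft, target) for draft, target in rounds)
    return accepted / drafted if drafted else 0.0
```

A rate of 1.0 means no drafted token was ever rejected. In the second test case below, a mismatch at the third of four drafted tokens discards that token and everything after it, giving 0.5.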

Qwen3.5 Thinking Mode Issue

Qwen3.5 models enter thinking mode by default on llama.cpp, generating hidden reasoning tokens before responding. This caused erratic benchmark results: throughput alternating between 0 tok/s and 700 tok/s, and TTFT jumping between 1s and 28s. Of the methods tried, only two disabled it:

  • --jinja + patched chat template with enable_thinking=false hardcoded ✅
  • Raw /completion endpoint (bypasses chat template entirely) ✅
  • Everything else (system prompts, /no_think suffix, temperature tricks) ❌

If running Qwen3.5 on llama.cpp, use the patched template or the raw /completion endpoint; otherwise the benchmark numbers will be unreliable.
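A request against the raw /completion endpoint might look like the sketch below. It assumes a llama.cpp server listening on 127.0.0.1:8080 (a hypothetical address) and uses the `prompt`, `n_predict`, and `temperature` request fields. Because the chat template is bypassed, the caller has to supply the full prompt text, including any role markers the model expects.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    n_predict: int = 256,
    base_url: str = "http://127.0.0.1:8080",  # assumed local server address
) -> urllib.request.Request:
    """Build a POST to /completion; the prompt is sent verbatim, with no chat template."""
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_completion(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running, `run_completion(build_completion_request("..."))` returns a JSON object. The generated text is expected under the `content` key, but check the field names against the server version in use.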

Quality Evaluation Findings

They ran four hard HVAC-specific prompts testing ambiguous customer requests, complex quotes, messy notes with typos, and equipment compatibility reasoning. Key findings:

  • Every single model failed the pricing formula math: none of the 8B, 14B, 32B, or 35B models correctly computed $4,811 / (1 - 0.47) ≈ $9,077. The tested models could not be trusted with business math, so put your formulas in code.
  • The 8B handled 3/4 hard prompts—good on ambiguous requests, messy notes, daily tasks—but failed on technical equipment reasoning.
  • The 35B-A3B was the only model with real HVAC domain knowledge—correctly sized a mini split for an uninsulated Chicago garage, knew to recommend Hyper-Heat series for cold climate, correctly said no branch box needed for single zone—but missed a model number in messy notes and failed the math.
  • Bigger ≠ better across the board: The Qwen3-14B Q4_K_M (159 tok/s) performed worse than the 8B on most prompts. The 32B recommended a 5-ton unit for a 400 sqft garage.
  • Qwen2.5-7B hallucinated on every note parsing test—consistently invented details.
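The pricing formula from the first finding is easy to keep in code instead. The sketch below uses `Decimal` to avoid float rounding surprises. The function name is hypothetical, and reading the 0.47 as a target margin and rounding to cents are assumptions, since the source only gives the arithmetic.

```python
from decimal import Decimal, ROUND_HALF_UP


def price_for_margin(cost: Decimal, margin: Decimal) -> Decimal:
    """Return cost / (1 - margin), rounded to cents."""
    if not (Decimal("0") <= margin < Decimal("1")):
        raise ValueError("margin must be in [0, 1)")
    price = cost / (Decimal("1") - margin)
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


price = price_for_margin(Decimal("4811"), Decimal("0.47"))
print(price)  # 9077.36
```

The model can then be limited to extracting the cost and margin from a request, with the deterministic function producing the number that goes on the quote.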

📖 Read the full source: r/LocalLLaMA
