Speculative Decoding Benchmarks on RTX 3090 with Qwen Models for HVAC Business Use

Hardware and Setup
The developer used an RTX 3090 24GB, Ryzen 7600X, 32GB RAM, and WSL2 Ubuntu. They moved from Ollama on Windows to llama.cpp on WSL Linux with speculative decoding for an internal AI platform handling customer lookups, quote formatting, equipment research, and parsing messy job notes.
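A llama.cpp server with a draft model attached can be launched along these lines. This is a minimal sketch, not the developer's actual command: the GGUF file names are hypothetical, and the draft-length values are placeholder assumptions rather than settings reported in the source.

```python
import shlex


def build_server_command(target_gguf, draft_gguf, port=8080,
                         draft_max=16, draft_min=1):
    """Return an argv list for llama-server with a draft model attached.

    File names and draft-length values are illustrative assumptions.
    """
    return [
        "llama-server",
        "-m", target_gguf,               # target (large) model
        "-md", draft_gguf,               # draft (small) model
        "-ngl", "99",                    # offload all target layers to the GPU
        "-ngld", "99",                   # offload all draft layers to the GPU
        "--draft-max", str(draft_max),   # max tokens drafted per step
        "--draft-min", str(draft_min),   # min tokens drafted per step
        "--port", str(port),
    ]


# Hypothetical file names for the fastest pairing in the results.
cmd = build_server_command("Qwen3-8B-Q8_0.gguf", "Qwen3-1.7B-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Building the command as a list keeps each target+draft pairing easy to generate in a loop and pass to a process launcher.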
Testing Methodology
They tested 16 GGUF models across the Qwen2.5, Qwen3, and Qwen3.5 families, covering every target+draft combination that fits in 24GB VRAM, including cross-generation draft pairings (Qwen2.5 drafts on Qwen3 targets and vice versa). VRAM was monitored on every combo to catch CPU offloading. Quality evaluation used real HVAC business prompts for SQL generation, quote formatting, messy field note parsing, and equipment compatibility reasoning. They used draftbench and llama-throughput-lab for speed sweeps, with Claude Code automating the process overnight.
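The VRAM check described above could be scripted roughly as follows. This is a sketch under assumptions: it shells out to nvidia-smi, the helper names are hypothetical, and the 512 MiB headroom threshold is an arbitrary illustration. Usage close to the card's limit is only a warning sign worth investigating, not proof of CPU offloading.

```python
import subprocess


def parse_vram_mib(csv_text):
    """Parse 'used, total' CSV lines (MiB, no header) into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_vram_mib():
    """Ask nvidia-smi for per-GPU memory usage in MiB."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram_mib(result.stdout)


def near_limit(used_mib, total_mib, headroom_mib=512):
    """Flag a combo whose free VRAM is below the (assumed) headroom threshold."""
    return total_mib - used_mib < headroom_mib
```

Calling query_vram_mib() after each model pair finishes loading, and logging the result next to the throughput number, makes it easier to spot runs that should be excluded.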
Top Speed Results
- Qwen3-8B Q8_0 + Qwen3-1.7B Q4_K_M: 279.9 tok/s (+236% speedup, 13.6 GB VRAM)
- Qwen2.5-7B Q4_K_M + Qwen2.5-0.5B Q8_0: 205.4 tok/s (+50% speedup, ~6 GB VRAM)
- Qwen3-8B Q8_0 + Qwen3-0.6B Q4_0: 190.5 tok/s (+129% speedup, 12.9 GB VRAM)
- Qwen3-14B Q4_K_M + Qwen3-0.6B Q4_0: 159.1 tok/s (+115% speedup, 13.5 GB VRAM)
- Qwen2.5-14B Q8_0 + Qwen2.5-0.5B Q4_K_M: 137.5 tok/s (+186% speedup, ~16 GB VRAM)
- Qwen3.5-35B-A3B Q4_K_M (baseline, no draft): 133.6 tok/s (22 GB VRAM)
- Qwen2.5-32B Q4_K_M + Qwen2.5-1.5B Q4_K_M: 91.0 tok/s (+156% speedup, ~20 GB VRAM)
The Qwen3-8B + 1.7B draft combo achieved a 100% acceptance rate: a perfect draft match, where the 1.7B predicts exactly what the 8B would generate.
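The speedup percentages above can be turned back into implied no-draft baselines as a quick sanity check. The sketch below assumes "+N% speedup" means the draft run is (1 + N/100) times the same target's throughput without a draft; the source does not spell out that definition.

```python
def implied_baseline(tok_s, speedup_pct):
    """Baseline tok/s implied by a measured speed and a '+N%' speedup figure."""
    return tok_s / (1 + speedup_pct / 100)


# (pairing, tok/s with draft, reported speedup in percent), from the list above
results = [
    ("Qwen3-8B Q8_0 + Qwen3-1.7B Q4_K_M", 279.9, 236),
    ("Qwen3-8B Q8_0 + Qwen3-0.6B Q4_0", 190.5, 129),
    ("Qwen3-14B Q4_K_M + Qwen3-0.6B Q4_0", 159.1, 115),
    ("Qwen2.5-32B Q4_K_M + Qwen2.5-1.5B Q4_K_M", 91.0, 156),
]

for name, tok_s, pct in results:
    print(f"{name}: implied baseline {implied_baseline(tok_s, pct):.1f} tok/s")
```

Under that assumption, both Qwen3-8B Q8_0 rows imply roughly the same baseline of about 83 tok/s, which suggests the two figures were measured against the same no-draft run.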
Qwen3.5 Thinking Mode Issue
Qwen3.5 models enter thinking mode by default on llama.cpp, generating hidden reasoning tokens before responding. This caused erratic benchmark results: 0 tok/s alternating with 700 tok/s, and TTFT jumping between 1s and 28s. Of the approaches tried, only two disabled it:
- --jinja + patched chat template with enable_thinking=false hardcoded ✅
- Raw /completion endpoint (bypasses chat template entirely) ✅
- Everything else (system prompts, /no_think suffix, temperature tricks) ❌
If running Qwen3.5 on llama.cpp, you need the patched template or you'll get garbage benchmarks.
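The raw /completion route can be exercised with a small client like the one below. It is a sketch assuming a llama-server instance listening on 127.0.0.1:8080; the function names, default values, and sampling settings are hypothetical choices, not the developer's benchmark harness.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://127.0.0.1:8080",
                             n_predict=256):
    """Build a POST to llama-server's raw /completion endpoint.

    No chat template is applied on this route, so the prompt string is sent
    to the model exactly as given.
    """
    payload = {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_completion(request, timeout=120):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the chat template is bypassed, any role markers the model expects have to be written into the prompt by the caller. The generated text should appear under the response's content key.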
Quality Evaluation Findings
They ran four hard HVAC-specific prompts testing ambiguous customer requests, complex quotes, messy notes with typos, and equipment compatibility reasoning. Key findings:
- Every single model failed the pricing formula math. None of the 8B, 14B, 32B, or 35B models could correctly compute $4,811 / (1 - 0.47) = $9,077. LLMs cannot do business math reliably, so put your formulas in code.
- The 8B handled 3/4 hard prompts. It did well on ambiguous requests, messy notes, and daily tasks, but failed on technical equipment reasoning.
- The 35B-A3B was the only model with real HVAC domain knowledge: it correctly sized a mini split for an uninsulated Chicago garage, knew to recommend the Hyper-Heat series for a cold climate, and correctly said no branch box is needed for a single zone. However, it missed a model number in the messy notes and failed the math.
- Bigger ≠ better across the board: The Qwen3-14B Q4_K_M (159 tok/s) performed worse than the 8B on most prompts. The 32B recommended a 5-ton unit for a 400 sqft garage.
- Qwen2.5-7B hallucinated on every note parsing test, consistently inventing details.
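The "put your formulas in code" advice can be illustrated with the pricing calculation from the first finding. This is a minimal sketch: reading 0.47 as a target margin is an assumption, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def price_for_margin(cost, margin):
    """Compute cost / (1 - margin), rounded to cents.

    Treating the 0.47 in the source's formula as a target margin is an
    assumption; the arithmetic itself matches the formula quoted above.
    """
    cost = Decimal(str(cost))
    margin = Decimal(str(margin))
    if not (Decimal("0") <= margin < Decimal("1")):
        raise ValueError("margin must be in [0, 1)")
    price = cost / (Decimal("1") - margin)
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(price_for_margin("4811", "0.47"))  # 9077.36
```

With the arithmetic done in code, the model only has to extract the cost and margin from the request and hand them to the function.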
📖 Read the full source: r/LocalLLaMA