10.33 t/s on Qwen 3.5 35B with a $300 Laptop: Full Optimization Breakdown

✍️ OpenClawRadar | 📅 Published: June 14, 2026

A Reddit user pushed CPU-only Qwen 3.5 35B inference to 10.33 tokens per second (t/s) on a $300 Lenovo IdeaPad Slim 3i (12th Gen i3-1215U, 8GB soldered + 32GB DDR4 expansion). The setup uses a Q4_K_S-quantized MoE model with only ~3B active parameters, running on ik_llama.cpp build 4509.

Hardware & Model

  • Laptop: Lenovo IdeaPad Slim 3i 2023 (~$300)
  • CPU: Intel Core i3-1215U (6 cores: 2 performance + 4 efficiency; only the 2 performance cores used)
  • RAM: 8GB soldered + 32GB DDR4 SO-DIMM (40GB total, Flex mode)
  • OS: Linux Mint
  • Model: Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf (35B MoE, 3B active params per token)
  • Backend: ik_llama.cpp commit 40aae0b6, compiled with GCC 13.3.0

Optimizations Applied

  • BIOS: power mode set to Extreme Performance; quiet fan mode off
  • OS power profile: performance
  • Core pinning: threads pinned to logical CPUs 0 and 2 (one per performance core) via taskset -c 0,2
  • Quantization: Q4_K_S
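  • Micro-batch size: 64 (-ub 64)
  • Speculative decoding: MTP (multi-token prediction) via --spec-type mtp, up to 3 draft tokens (--draft-max 3)
  • Flash attention, fused MoE (fmoe), and run-time repacking (rtr): all enabled by default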
  • Fresh restart before benchmark
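
The core-pinning step depends on knowing which logical CPUs sit on separate performance cores. The following is a minimal sketch, not taken from the original post. It assumes that the hyperthreaded cores are the performance cores (as on the i3-1215U) and that Linux exposes sibling lists under /sys/devices/system/cpu/cpu*/topology/thread_siblings_list. The helper name pick_pinning_cpus is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose one logical CPU per hyperthreaded core.
# Input: one thread_siblings_list value per line, e.g. "0-1", "0,1" or "4".
# Assumption: cores with more than one sibling are the performance cores.
pick_pinning_cpus() {
  awk '
    /[-,]/ && !seen[$0]++ {          # multi-sibling group, first time seen
      split($0, parts, /[-,]/)       # parts[1] is the first sibling
      out = out (out == "" ? "" : ",") parts[1]
    }
    END { print out }
  '
}

# Sample sibling lists for a 2P + 4E layout (illustrative input, not measured).
printf '0-1\n0-1\n2-3\n2-3\n4\n5\n6\n7\n' | pick_pinning_cpus   # prints 0,2
```

On a real machine the input could come from cat /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list, and the printed list would then be passed to taskset -c.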
Ad

Command Used

taskset -c 0,2 ./build/bin/llama-cli \
  -m "/home/default/LLM Models/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf" \
  -p "User: Please explain the history of france \nAI:" \
  -n 1028 \
  --spec-type mtp \
  --draft-max 3 \
  -t 2 \
  -ub 64 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0

Results

  • Prompt eval: 22.49 t/s
  • Inference: 10.33 t/s (over 1028 tokens)
  • Thermals: ~90°C with no wattage cap on ik_llama.cpp (llama.cpp previously required a 17.5W cap)
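
To put the generation rate in wall-clock terms, the token count and rate above can be turned into a duration. This is a small arithmetic sketch; the helper name gen_seconds is hypothetical, and prompt processing time is not included.

```shell
#!/usr/bin/env bash
# Hypothetical helper: seconds needed to generate N tokens at a given rate.
# Args: token count, tokens per second.
gen_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

# 1028 tokens at 10.33 t/s, both figures from the results above.
gen_seconds 1028 10.33   # prints 99.5
```

So the 1028-token answer took roughly a minute and forty seconds to generate.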

Why Qwen 3.5 MoE is Fast

Qwen 3.5 35B is a mixture-of-experts (MoE) model that activates only ~3B of its 35B parameters per token, whereas a dense model uses every weight for every token. For comparison, Gemma 4 26B (4B active) reached only ~3 t/s under similar settings, which suggests that the MoE routing and sparse compute in Qwen 3.5 are particularly CPU-friendly.
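
A rough way to see why the active-parameter count matters on a CPU is to estimate how many gigabytes of weights must be read per second at a given decode rate. The sketch below is a back-of-envelope estimate, not a measurement. It assumes about 4.5 bits per weight for Q4_K_S (an approximation, not a figure from the post), and it ignores the KV cache, activations, and the fact that MTP speculative decoding can verify several tokens per weight pass, which would lower the real traffic. The helper name weight_gbps is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: weight traffic in GB/s implied by a decode rate.
# Args: active parameters (billions), bits per weight, tokens per second.
weight_gbps() {
  awk -v p="$1" -v bpw="$2" -v tps="$3" \
    'BEGIN { printf "%.2f\n", p * bpw / 8 * tps }'
}

# 3B active params, assumed 4.5 bits/weight, 10.33 t/s from the results.
weight_gbps 3 4.5 10.33   # prints 17.43
```

Under the same assumptions, a dense 35B model would need to read roughly twelve times as many weight bytes per token, which is why it would be far slower on the same memory.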

Potential Further Gains

  • Custom BIOS for XMP memory timings → estimated +10% t/s
  • Thermal repaste with high-end compound
  • Upgrade from DDR4 to DDR5 laptop RAM (combined with repaste → estimated +20% t/s)
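
Applied to the measured 10.33 t/s, the percentage estimates above translate into the following rates. This is plain arithmetic on the post's own figures, with a hypothetical helper name (project_tps); the gains themselves are untested projections.

```shell
#!/usr/bin/env bash
# Hypothetical helper: projected t/s after a percentage speedup.
# Args: baseline tokens per second, expected gain in percent.
project_tps() {
  awk -v base="$1" -v pct="$2" \
    'BEGIN { printf "%.2f\n", base * (1 + pct / 100) }'
}

project_tps 10.33 10   # prints 11.36 (memory timings)
project_tps 10.33 20   # prints 12.40 (DDR5 plus repaste)
```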

Who it's for: Developers running local LLMs on budget hardware who want to squeeze maximum performance from Qwen MoE models using CPU-only inference.

📖 Read the full source: r/LocalLLaMA
