Qwen 3.5 35B at 10.33 t/s on a $300 Laptop

A Reddit user pushed Qwen 3.5 35B inference to 10.33 t/s on a $300 Lenovo Ideapad Slim 3i (12th Gen i3-1215U, 8GB soldered + 32GB DDR4 expansion). The setup uses a Q4_K_S quantized MoE model with only ~3B active parameters and ik_llama.cpp build 4509.

Hardware & Model

Laptop: Lenovo Ideapad Slim 3i 2023 (~$300)
CPU: Intel i3-1215U (6 cores: 2 performance + 4 efficiency; only the 2 performance cores used)
RAM: 8GB soldered + 32GB DDR4 SO-DIMM (Flex mode)
OS: Linux Mint
Model: Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf (35B MoE, 3B active params per token)
Backend: ik_llama.cpp commit 40aae0b6, compiled with GCC 13.3.0

Optimizations Applied

BIOS: Battery → Extreme performance mode; fan set to quiet (off)
OS power profile: performance
Core pinning: threads pinned to logical CPUs 0 and 2, one per performance core, via taskset -c 0,2
Quantization: Q4_K_S
Micro-batch size: 64 (-ub 64)
Speculative decoding: MTP type, draft max 3
Flash attention, fmoe, rtr: all enabled by default
Fresh restart before benchmark
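
The core-pinning step depends on knowing which logical CPU numbers belong to the performance cores, since taskset -c takes logical CPU IDs. Below is a minimal sketch, assuming lscpu from util-linux is installed and that the performance cores are the ones reporting the highest MAXMHZ. That assumption is typical for hybrid Intel chips but is not confirmed by the source. pick_fast_cpus is a hypothetical helper name, and model.gguf in the usage comment is a placeholder.

```shell
# Hypothetical helper: reads `lscpu -e=CPU,CORE,MAXMHZ` output on stdin and
# prints one logical CPU per physical core among the cores with the highest
# max clock (assumed here to be the performance cores).
pick_fast_cpus() {
  awk 'NR > 1 && $3 ~ /^[0-9]/ {
         mhz = $3 + 0
         if (mhz > best) {          # faster core class found: start over
           best = mhz
           split("", seen)          # portable way to clear an array
           list = ""
         }
         if (mhz == best && !($2 in seen)) {
           seen[$2] = 1             # keep only the first thread of each core
           list = (list == "" ? $1 : list "," $1)
         }
       }
       END { print list }'
}

# Example (not run here): pin a command to the detected CPUs.
# taskset -c "$(lscpu -e=CPU,CORE,MAXMHZ | pick_fast_cpus)" ./build/bin/llama-cli -t 2 -m model.gguf
```

On a machine where the two performance cores expose logical CPUs 0-1 and 2-3, this should yield 0,2, matching the pinning used here. If the performance cores report different maximum clocks, only the fastest would be selected, so the result is worth checking by eye.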

Command Used

taskset -c 0,2 ./build/bin/llama-cli \
  -m "/home/default/LLM Models/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf" \
  -p "User: Please explain the history of france \nAI:" \
  -n 1028 \
  --spec-type mtp \
  --draft-max 3 \
  -t 2 \
  -ub 64 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0

Results

Prompt eval: 22.49 t/s
Inference: 10.33 t/s (over 1028 tokens)
Thermals: ~90°C, with no wattage cap needed on ik_llama.cpp (llama.cpp previously required a 17.5 W cap)

Why Qwen 3.5 MoE is Fast

The Qwen 3.5 35B MoE architecture activates only ~3B parameters per token, whereas a dense model uses all of its weights for every token. For comparison, Gemma 4 26B (4B active) yielded only ~3 t/s under similar settings, which suggests that the MoE routing and sparse compute in Qwen 3.5 are particularly CPU-friendly.
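
To show why the active-parameter count matters on a CPU, here is a rough sketch of how much weight data has to be read per generated token. The 4.5 bits per weight is an assumed average for a 4-bit K-quant, not a figure from the source. weights_gb_per_token is a hypothetical helper, and the estimate ignores the KV cache, shared layers, and speculative decoding.

```shell
# Rough weight traffic per generated token, in GB:
#   active parameters (billions) * bits per weight / 8
# Bits per weight is an assumed average, not a measured value.
weights_gb_per_token() {
  LC_ALL=C awk -v params_b="$1" -v bpw="$2" \
    'BEGIN { printf "%.2f\n", params_b * bpw / 8 }'
}

weights_gb_per_token 3 4.5   # ~3B active parameters (Qwen 3.5 35B MoE)
weights_gb_per_token 4 4.5   # ~4B active parameters (Gemma figure quoted above)
```

Under these assumptions the two calls print 1.69 and 2.25. That gap is much smaller than the reported difference in t/s, so the active-parameter count alone probably does not explain the whole result.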

Potential Further Gains

Custom BIOS to unlock XMP memory timings: an estimated +10% t/s
Thermal repaste with high-end compound
Upgrade from DDR4 to DDR5 laptop RAM (combined with the repaste, an estimated +20% t/s)
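
As a quick sanity check on what those percentages would mean in practice, the sketch below applies each one to the measured 10.33 t/s. It assumes each gain is relative to the current baseline rather than cumulative, which the source does not specify. project_tps is a hypothetical helper name.

```shell
# Project throughput after a percentage gain: base t/s * (1 + pct / 100).
project_tps() {
  LC_ALL=C awk -v base="$1" -v pct="$2" \
    'BEGIN { printf "%.2f\n", base * (1 + pct / 100) }'
}

project_tps 10.33 10   # XMP timings estimate
project_tps 10.33 20   # DDR5 plus repaste estimate
```

This prints 11.36 and 12.40, so even the optimistic case adds roughly 2 t/s to the current result.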

Who it's for: Developers running local LLMs on budget hardware who want to squeeze maximum performance from Qwen MoE models using CPU-only inference.

📖 Read the full source: r/LocalLLaMA