10.33 t/s on Qwen 3.5 35B with a $300 Laptop: Full Optimization Breakdown

A Reddit user pushed Qwen 3.5 35B inference to 10.33 t/s on a $300 Lenovo Ideapad Slim 3i (12th Gen i3-1215U, 8GB soldered + 32GB DDR4 expansion). The setup uses a Q4_K_S quantized MoE model with only ~3B active parameters and ik_llama.cpp build 4509.
Hardware & Model
- Laptop: Lenovo Ideapad Slim 3i 2023 (~$300)
- CPU: Intel i3-1215U (6 cores: 2 performance + 4 efficiency; only the 2 performance cores are used)
- RAM: 8GB soldered + 32GB DDR4 SO-DIMM (Flex mode)
- OS: Linux Mint
- Model: Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf (35B MoE, 3B active params per token)
- Backend: ik_llama.cpp commit 40aae0b6, compiled with GCC 13.3.0
Optimizations Applied
- BIOS: Battery → Extreme performance mode; fan set to quiet (off)
- OS power profile: performance
- Core pinning: threads pinned to logical CPUs 0 and 2 (one per performance core) via taskset -c 0,2
- Quantization: Q4_K_S
- Micro-batch size: 64 (-ub 64)
- Speculative decoding: MTP type, draft max 3
- Flash attention, fmoe, rtr — all default-enabled
- Fresh restart before benchmark
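The CPU numbers passed to taskset depend on how Linux enumerates the cores, so it is worth checking them on your own machine rather than copying 0,2. The sketch below is one possible way to do that, assuming the performance cores are the ones reporting the highest maximum clock in lscpu; the helper name pick_fast_cpus is hypothetical and not part of the original post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick one logical CPU per physical core among the
# cores that report the highest max clock (ASSUMPTION: those are the P-cores).
# Input on stdin: output of `lscpu -e=CPU,CORE,MAXMHZ` (header + one row per CPU).
pick_fast_cpus() {
  awk 'NR > 1 && $3 != "-" {
         mhz = $3 + 0
         if (mhz > best) best = mhz          # track the highest max clock seen
         cpu[NR] = $1; core[NR] = $2; freq[NR] = mhz
       }
       END {
         for (i = 2; i <= NR; i++) {
           # keep only the first logical CPU of each fastest physical core
           if (freq[i] == best && !(core[i] in seen)) {
             seen[core[i]] = 1
             out = out sep cpu[i]
             sep = ","
           }
         }
         print out
       }'
}
```

Piping `lscpu -e=CPU,CORE,MAXMHZ` into pick_fast_cpus yields a list you can hand to taskset -c. On chips where individual performance cores report slightly different maximum clocks, this heuristic may select fewer cores than expected, so compare the result against the full lscpu table.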
Command Used
taskset -c 0,2 ./build/bin/llama-cli \
-m "/home/default/LLM Models/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf" \
-p "User: Please explain the history of france \nAI:" \
-n 1028 \
--spec-type mtp \
--draft-max 3 \
-t 2 \
-ub 64 \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0
Results
- Prompt eval: 22.49 t/s
- Inference: 10.33 t/s (over 1028 tokens)
- Thermals: ~90°C; no wattage cap needed with ik_llama.cpp (mainline llama.cpp previously required a 17.5 W cap)
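To put the decode speed in wall-clock terms, a quick calculation (a sketch, not something from the original post; the function name gen_seconds is made up for illustration) divides the token budget by the reported rate:

```shell
# Rough generation time for a token budget at a given decode speed.
# usage: gen_seconds TOKENS TOKENS_PER_SEC
gen_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

gen_seconds 1028 10.33   # prints 99.5
```

So the 1028-token run takes roughly 100 seconds of generation, excluding model load and prompt evaluation.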
Why Qwen 3.5 MoE is Fast
The Qwen 3.5 35B MoE architecture activates only ~3B parameters per token, so each token touches a small fraction of the weights a dense 35B model would. For comparison, Gemma 4 26B (4B active) yielded only ~3 t/s under similar settings, which suggests the MoE routing and sparse compute in Qwen 3.5 are particularly CPU-friendly.
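One common way to reason about the active-parameter advantage is to estimate how much weight data must be read per generated token. The sketch below assumes decode speed is dominated by weight reads and uses a round figure of about 4.5 bits per weight for a 4-bit K-quant; both are assumptions for illustration, not measurements from the post, and the function name is hypothetical.

```shell
# Approximate weight data read per generated token, in GB.
# usage: weights_gb_per_token PARAMS_IN_BILLIONS BITS_PER_WEIGHT
weights_gb_per_token() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

# ASSUMPTION: ~4.5 bits per weight as a round figure for a 4-bit K-quant.
weights_gb_per_token 3 4.5    # ~3B active parameters (MoE): prints 1.7
weights_gb_per_token 35 4.5   # all 35B parameters (hypothetical dense model): prints 19.7
```

Under these assumptions the MoE model reads roughly a twelfth of the weight data per token that a dense model of the same total size would. The estimate ignores the KV cache, expert routing overhead, and speculative decoding, and it does not by itself explain the gap to Gemma.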
Potential Further Gains
- Custom BIOS for XMP memory timings → +10% t/s
- Thermal repaste with high-end compound
- Upgrade from DDR4 to DDR5 laptop RAM (combined with repaste → +20% t/s)
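Applied to the measured 10.33 t/s baseline, the percentage estimates above translate into the following projected speeds. This is plain arithmetic on the post's own figures, not a benchmark; the function name projected_tps is made up for illustration.

```shell
# Projected tokens/s after a percentage gain over a baseline.
# usage: projected_tps BASELINE_TPS GAIN_PERCENT
projected_tps() {
  awk -v base="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", base * (1 + pct / 100) }'
}

projected_tps 10.33 10   # XMP timings estimate: prints 11.36
projected_tps 10.33 20   # DDR5 plus repaste estimate: prints 12.40
```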
Who it's for: Developers running local LLMs on budget hardware who want to squeeze maximum performance from Qwen MoE models using CPU-only inference.
📖 Read the full source: r/LocalLLaMA