Qwen 3.6 27B Speed Test on M2 MacBook Pro: 7.9 to 3.1 t/s

A developer on r/LocalLLaMA tested Qwen 3.6 27B (IQ4_XS unsloth quant) on an M2 MacBook Pro with 32GB RAM. As expected, the machine is under-spec'd for a dense 27B model, but the field report provides concrete numbers and a realistic take on performance and output quality.

Command and Setup

The model was served with llama-server using the following command:

llama-server -m ~/models/unsloth/Qwen3.6-27B-IQ4_XS.gguf --mmproj ~/models/unsloth/Qwen3.6-27B-mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 --host 127.0.0.1 --port 8899 -ctk q8_0 -ctv q8_0 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48

Notable choices: single process (-np 1) to avoid overloading the GPU, speculative decoding with ngram-mod, and a context window of 131072 tokens.

Performance Breakdown

Initial speeds: 80 t/s prompt processing, 7.9 t/s token generation. At 52,000 tokens of context, performance collapsed to 4 t/s prompt processing — which the author confirms is not a typo — and 3.1 t/s token generation. Memory pressure never entered the red zone, indicating the bottleneck is memory bandwidth, not swap.

Speculative Decoding Not Effective

The reporter enabled ngram-mod speculative decoding but saw no real benefit. Logs showed:

accept: low acceptance streak (3) – resetting ngram_mod ... draft acceptance rate = 1.00000 ( 2 accepted / 2 generated)

The model resets constantly due to low n-gram matches; the apparent 100% acceptance rate is an artifact of tiny sample sizes. The author concludes that dense models like this don't repeat themselves enough for the ngram-mod approach to work well.

Code Quality

Despite the slowness, the code generated by Qwen 3.6 27B was rated as excellent. It analyzed a significant codebase with no additional prompting beyond the initial task and outperformed the Qwen 35B A3B (MoE) model in quality. The author compares the output to what one would expect from a self-hosted Claude Sonnet, and notes that even Claude Opus 4.7 was impressed.

Key Takeaways

Memory bandwidth rules dense models: On Apple Silicon, token generation halved as context grew. Even without swapping, bandwidth throttling killed performance.
Single process is the way to go: Running concurrent agent tasks on this hardware offers no win — just serial queuing.
Speculative decoding is model-dependent: Ngram-mod didn't help here; the model's low repetitiveness prevented draft matches.

The author plans to test Qwen 3.6 27B on a cloud GPU with specs comparable to the R9700 (current price ~$1,400 on Amazon, higher on eBay) to get a true sense of its capability on their own programming tasks.

📖 Read the full source: r/LocalLLaMA