Qwen 3.6 27B Quantization Benchmark: Q4_K_M Beats Q8_0 on Practical Tradeoffs

✍️ OpenClawRadar | 📅 Published: April 28, 2026

A Reddit user benchmarked Qwen 3.6 27B in three GGUF variants (unquantized BF16, plus the Q4_K_M and Q8_0 quantizations) using llama-cpp-python via the Neo AI Engineer framework. The evaluation covered 664 samples across three tasks: HumanEval (code generation, 164 samples), HellaSwag (commonsense reasoning, 100 samples), and BFCL (function calling, 400 samples).

Benchmark Results

  • BF16 (model size 53.8 GB, peak RAM 54 GB, throughput 15.5 tok/s): HumanEval 56.10% (92/164), HellaSwag 90.00% (90/100), BFCL 63.25% (253/400). Average accuracy: 69.78%.
  • Q4_K_M (model size 16.8 GB, peak RAM 28 GB, throughput 22.5 tok/s): HumanEval 50.61% (83/164), HellaSwag 86.00% (86/100), BFCL 63.00% (252/400). Average accuracy: 66.54%.
  • Q8_0 (model size 28.6 GB, peak RAM 42 GB, throughput 18.0 tok/s): HumanEval 52.44% (86/164), HellaSwag 83.00% (83/100), BFCL 63.00% (252/400). Average accuracy: 66.15%.
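
The percentages and averages in the list above can be recomputed from the raw pass counts. The following is a minimal sketch, assuming the reported average is the unweighted mean of the three per-task accuracies; the dictionary layout and function names are illustrative and not taken from the original pipeline.

```python
# Raw (passed, total) counts as reported in the benchmark results.
RESULTS = {
    "BF16":   {"HumanEval": (92, 164), "HellaSwag": (90, 100), "BFCL": (253, 400)},
    "Q4_K_M": {"HumanEval": (83, 164), "HellaSwag": (86, 100), "BFCL": (252, 400)},
    "Q8_0":   {"HumanEval": (86, 164), "HellaSwag": (83, 100), "BFCL": (252, 400)},
}


def task_accuracy(passed, total):
    """Accuracy of a single task, in percent."""
    return 100.0 * passed / total


def macro_average(tasks):
    """Unweighted mean of per-task accuracies (assumed averaging scheme)."""
    accuracies = [task_accuracy(p, t) for p, t in tasks.values()]
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    for name, tasks in RESULTS.items():
        per_task = ", ".join(
            f"{task} {task_accuracy(p, t):.2f}%" for task, (p, t) in tasks.items()
        )
        print(f"{name}: {per_task} | average {macro_average(tasks):.2f}%")
```

Because BFCL contributes 400 of the 664 samples, a sample-weighted average over all items would give different figures from the unweighted ones reported here.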

Key Takeaways

Q4_K_M is the standout practical variant. It matches BF16 on BFCL to within a single sample (63.00% vs 63.25%), drops about 5.5 points on HumanEval, and trails by 4 points on HellaSwag. In exchange, it is 1.45x faster than BF16, uses 48% less peak RAM, and has a 68.8% smaller file. Q8_0 was underwhelming: it improved HumanEval by only ~1.8 points over Q4_K_M but needed 42 GB of RAM instead of 28 GB, ran slower (18.0 vs 22.5 tok/s), and scored 3 points lower on HellaSwag.
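
The relative figures quoted above (speedup, RAM saving, file-size saving) are simple ratios against the BF16 baseline. A short sketch of that arithmetic follows; the `Variant` class and `compare` helper are hypothetical names introduced for illustration, and only the input numbers come from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    size_gb: float      # GGUF file size
    peak_ram_gb: float  # peak RAM during evaluation
    tok_per_s: float    # generation throughput


BF16 = Variant(size_gb=53.8, peak_ram_gb=54.0, tok_per_s=15.5)
Q4_K_M = Variant(size_gb=16.8, peak_ram_gb=28.0, tok_per_s=22.5)
Q8_0 = Variant(size_gb=28.6, peak_ram_gb=42.0, tok_per_s=18.0)


def compare(candidate, baseline):
    """Return speedup and percentage savings of candidate relative to baseline."""
    return {
        "speedup": candidate.tok_per_s / baseline.tok_per_s,
        "ram_saved_pct": 100.0 * (1 - candidate.peak_ram_gb / baseline.peak_ram_gb),
        "size_saved_pct": 100.0 * (1 - candidate.size_gb / baseline.size_gb),
    }


if __name__ == "__main__":
    for name, variant in (("Q4_K_M", Q4_K_M), ("Q8_0", Q8_0)):
        c = compare(variant, BF16)
        print(
            f"{name} vs BF16: {c['speedup']:.2f}x throughput, "
            f"{c['ram_saved_pct']:.1f}% less peak RAM, "
            f"{c['size_saved_pct']:.1f}% smaller file"
        )
```

Running the same comparison for Q8_0 shows how much smaller its savings are relative to BF16 than those of Q4_K_M.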

For local/CPU deployment, Q4_K_M is recommended unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation Setup

All three GGUF variants were run through llama-cpp-python with n_ctx set to 32768. The Neo AI Engineer framework built the GGUF eval pipeline, handled checkpointed runs, and consolidated the results. The complete case study, with code snippets, is linked in the original Reddit comments.
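
The original pipeline code is not reproduced in this article, so the following is only a rough sketch of what a checkpointed evaluation loop around llama-cpp-python could look like. It assumes a JSON checkpoint file keyed by sample ID; the function names, the checkpoint format, and the `max_tokens` and `temperature` values are illustrative assumptions, and only `n_ctx=32768` comes from the described setup.

```python
import json
import os
from typing import Callable, Dict, Iterable, Tuple


def load_checkpoint(path: str) -> Dict[str, str]:
    """Return previously saved outputs, or an empty dict on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path: str, done: Dict[str, str]) -> None:
    """Write to a temp file, then rename, so a crash cannot truncate the file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(done, f)
    os.replace(tmp_path, path)


def run_checkpointed(
    samples: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    checkpoint_path: str,
) -> Dict[str, str]:
    """Generate an output per (sample_id, prompt), skipping finished samples."""
    done = load_checkpoint(checkpoint_path)
    for sample_id, prompt in samples:
        if sample_id in done:
            continue  # already evaluated in an earlier, interrupted run
        done[sample_id] = generate(prompt)
        save_checkpoint(checkpoint_path, done)
    return done


def make_llama_generator(model_path: str, max_tokens: int = 512) -> Callable[[str], str]:
    """Build a generate() callable backed by llama-cpp-python.

    max_tokens and temperature are assumptions, not values from the benchmark.
    """
    from llama_cpp import Llama  # imported lazily; requires llama-cpp-python

    llm = Llama(model_path=model_path, n_ctx=32768, verbose=False)

    def generate(prompt: str) -> str:
        out = llm(prompt, max_tokens=max_tokens, temperature=0.0)
        return out["choices"][0]["text"]

    return generate
```

Passing the generator in as a callable keeps the resume logic independent of the model backend, so the loop can be exercised with a stub before committing to a multi-hour run over all 664 samples.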

📖 Read the full source: r/LocalLLaMA
