Qwen 3.6 27B Quantization Benchmark: Q4_K_M Beats Q8_0 on Practical Tradeoffs

A Reddit user benchmarked Qwen 3.6 27B in three GGUF quantization variants (BF16, Q4_K_M, Q8_0) using llama-cpp-python via the Neo AI Engineer framework. The evaluation covered 664 total samples across three tasks: HumanEval (code generation, 164 samples), HellaSwag (commonsense reasoning, 100 samples), and BFCL (function calling, 400 samples).
Benchmark Results
- BF16 (model size 53.8 GB, peak RAM 54 GB, throughput 15.5 tok/s): HumanEval 56.10% (92/164), HellaSwag 90.00% (90/100), BFCL 63.25% (253/400). Average accuracy: 69.78%.
- Q4_K_M (16.8 GB, 28 GB RAM, 22.5 tok/s): HumanEval 50.61% (83/164), HellaSwag 86.00% (86/100), BFCL 63.00% (252/400). Average: 66.54%.
- Q8_0 (28.6 GB, 42 GB RAM, 18.0 tok/s): HumanEval 52.44% (86/164), HellaSwag 83.00% (83/100), BFCL 63.00% (252/400). Average: 66.15%.
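
The per-task percentages and averages above can be rechecked from the raw pass counts. The sketch below is only a verification aid: the reported averages match an unweighted mean over the three tasks, and the pooled (sample-weighted) figure is an extra view that the original post does not report. The names `RESULTS`, `macro_average`, and `pooled_accuracy` are illustrative.

```python
# Raw pass counts reported in the post: (correct, total) per task.
RESULTS = {
    "BF16":   {"HumanEval": (92, 164), "HellaSwag": (90, 100), "BFCL": (253, 400)},
    "Q4_K_M": {"HumanEval": (83, 164), "HellaSwag": (86, 100), "BFCL": (252, 400)},
    "Q8_0":   {"HumanEval": (86, 164), "HellaSwag": (83, 100), "BFCL": (252, 400)},
}


def task_accuracy(correct, total):
    """Accuracy of a single task, in percent."""
    return 100.0 * correct / total


def macro_average(tasks):
    """Unweighted mean of the per-task accuracies (matches the reported averages)."""
    scores = [task_accuracy(c, t) for c, t in tasks.values()]
    return sum(scores) / len(scores)


def pooled_accuracy(tasks):
    """Accuracy over all samples pooled together (not reported in the post)."""
    correct = sum(c for c, _ in tasks.values())
    total = sum(t for _, t in tasks.values())
    return 100.0 * correct / total


for name, tasks in RESULTS.items():
    print(f"{name}: macro {macro_average(tasks):.2f}%  pooled {pooled_accuracy(tasks):.2f}%")
```

Because BFCL contributes 400 of the 664 samples, the pooled figure weights function calling much more heavily than the unweighted mean does, so the two views can rank the quantized variants differently.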
Key Takeaways
Q4_K_M is the standout practical variant. Relative to BF16 it preserves BFCL accuracy (63.00% vs 63.25%), drops ~5.5 points on HumanEval (50.61% vs 56.10%), and trails by 4 points on HellaSwag (86.00% vs 90.00%). In return it runs 1.45x faster (22.5 vs 15.5 tok/s), uses 48% less peak RAM (28 GB vs 54 GB), and is 68.8% smaller on disk (16.8 GB vs 53.8 GB). Q8_0 was underwhelming: it improved HumanEval by only ~1.8 points over Q4_K_M, yet it needed 42 GB of RAM instead of 28 GB, ran slower (18.0 vs 22.5 tok/s), and scored 3 points lower on HellaSwag.
For local/CPU deployment, Q4_K_M is recommended unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.
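
The resource ratios quoted in this section follow directly from the reported sizes, peak RAM, and throughput. A minimal sketch of that arithmetic, with the `Variant` dataclass and `compare` helper being illustrative names rather than anything from the original benchmark code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    size_gb: float      # GGUF file size
    peak_ram_gb: float  # peak RAM during evaluation
    tok_per_s: float    # generation throughput


# Figures as reported in the benchmark results.
VARIANTS = {
    "BF16": Variant(53.8, 54.0, 15.5),
    "Q4_K_M": Variant(16.8, 28.0, 22.5),
    "Q8_0": Variant(28.6, 42.0, 18.0),
}


def compare(name, baseline="BF16"):
    """Speedup and percentage savings of one variant relative to a baseline."""
    v, b = VARIANTS[name], VARIANTS[baseline]
    return {
        "speedup": v.tok_per_s / b.tok_per_s,
        "ram_saved_pct": 100.0 * (1 - v.peak_ram_gb / b.peak_ram_gb),
        "size_saved_pct": 100.0 * (1 - v.size_gb / b.size_gb),
    }


for name in ("Q4_K_M", "Q8_0"):
    r = compare(name)
    print(
        f"{name}: {r['speedup']:.2f}x speed, "
        f"{r['ram_saved_pct']:.0f}% less RAM, "
        f"{r['size_saved_pct']:.1f}% smaller file"
    )
```

Passing `baseline="Q8_0"` to `compare("Q4_K_M")` gives the same comparison between the two quantized variants.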
Evaluation Setup
All three GGUF variants were run through llama-cpp-python with n_ctx: 32768, using checkpointed evaluation. The Neo AI Engineer framework built the GGUF eval pipeline, handled the checkpointed runs, and consolidated the results. The complete case study, with code snippets, is linked in the original Reddit comments.
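
The post does not show the pipeline itself, so the following is only a rough sketch of what a checkpointed evaluation loop around llama-cpp-python might look like, not the Neo AI Engineer implementation. The sample schema (`id`, `prompt`), the JSON checkpoint file, the scoring callback, and the `max_tokens` and `temperature` values are all assumptions made for illustration; only `n_ctx=32768` comes from the described setup.

```python
import json
import os


def load_checkpoint(path):
    """Return the {sample_id: passed} map saved so far, or {} on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, done):
    """Write to a temp file and rename, so an interrupted write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(done, f)
    os.replace(tmp, path)


def run_eval(samples, generate, is_correct, ckpt_path):
    """Score each sample once, skipping ids already recorded in the checkpoint.

    `samples` is an iterable of dicts with hypothetical "id" and "prompt" keys.
    `generate` maps a prompt string to model output; `is_correct` scores it.
    Returns (number correct, number evaluated).
    """
    done = load_checkpoint(ckpt_path)
    for sample in samples:
        sid = str(sample["id"])
        if sid in done:
            continue  # already evaluated in an earlier (possibly interrupted) run
        output = generate(sample["prompt"])
        done[sid] = bool(is_correct(sample, output))
        save_checkpoint(ckpt_path, done)
    return sum(done.values()), len(done)


def make_llama_generate(model_path, n_ctx=32768, max_tokens=512):
    """Build a `generate` callable backed by llama-cpp-python.

    n_ctx matches the setup described in the post; max_tokens and
    temperature=0.0 are assumptions, not values from the post.
    """
    from llama_cpp import Llama  # requires the llama-cpp-python package

    llm = Llama(model_path=model_path, n_ctx=n_ctx, verbose=False)

    def generate(prompt):
        out = llm(prompt, max_tokens=max_tokens, temperature=0.0)
        return out["choices"][0]["text"]

    return generate
```

Keeping the model call behind a plain callable means the loop can be exercised with a stub before committing to a multi-hour run, and a run that dies partway through a 400-sample task resumes from the last saved sample instead of starting over.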
📖 Read the full source: r/LocalLLaMA