Qwen 3.6-35B-A3B KV Cache Bench: f16 vs q8_0 vs Turbo3 vs Turbo4 on M5 Max Up to 1M Context

A Reddit user ran a depth sweep on Qwen 3.6-35B-A3B Q8 using TheTom's TurboQuant Metal fork of llama.cpp (GitHub: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) on a MacBook Pro M5 Max with 128 GB unified memory. They tested four KV cache types: f16, q8_0, turbo3 (3-bit), and turbo4 (4-bit). Each run used the same type for K and V, with flash-attn and mlock on, at context depths from 0 to 1M tokens.
Hardware & Build
M5 Max, 128 GB unified memory. Built with cmake -B build -DGGML_METAL=ON. Benchmarks ran under llama-bench with 3 repetitions per cell, flash-attn on, and mlock on. The full sweep took 8 hours of wall-clock time overnight.
Generation Throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 89.4 | 87.4 | 79.5 | 79.7 |
| 8K | 84.2 | 79.2 | 72.2 | 71.2 |
| 32K | 72.6 | 67.8 | 61.5 | 61.8 |
| 128K | 44.4 | 40.7 | 36.0 | 37.7 |
| 256K | OOM | 26.6 | 22.9 | 25.5 |
| 512K | OOM | OOM | 13.3 | 16.0 |
| 1M | OOM | OOM | 6.5 | OOM |
Prompt Processing Throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 2962 | 2948 | 2904 | 2854 |
| 8K | 2098 | 1623 | 1653 | 1439 |
| 32K | 1063 | 802 | 784 | 678 |
| 128K | 321 | 245 | 253 | 206 |
| 256K | OOM | 124 | 128 | 101 |
| 512K | OOM | OOM | 66 | 56 |
| 1M | OOM | OOM | 30 | OOM |
Key Takeaways
- At depth 0, f16 leads by a hair on prefill (2962 vs 2948 tok/s for q8_0); turbo3 is ~11% slower than f16 on decode (79.5 vs 89.4 tok/s).
- At 128K, turbo3 prefill (253 tok/s) roughly matches q8_0 (245 tok/s); the smaller cache reduces bandwidth pressure.
- At 256K, turbo3 is 27% faster than turbo4 on prefill (128 vs 101 tok/s), but turbo4 is 11% faster on decode (25.5 vs 22.9 tok/s). At 512K, the decode gap widens to 20% (turbo4 16.0 vs turbo3 13.3 tok/s).
- turbo3 is the only cache type that fits 1M context (6.5 tok/s decode). Memory at 1M: ~89 GB (37 GB weights, ~52 GB KV cache).
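
The percentages in these takeaways can be re-derived from the table values. The short Python sketch below does that arithmetic and adds a rough per-token KV footprint at 1M context. The helper name pct_faster is made up for illustration, and the per-token figure assumes "1M" means 1,000,000 tokens and "GB" means 10^9 bytes, neither of which the post specifies.

```python
# Throughput figures copied from the tables in this article (tok/s).
decode = {
    "256K": {"turbo3": 22.9, "turbo4": 25.5},
    "512K": {"turbo3": 13.3, "turbo4": 16.0},
}
prefill = {"256K": {"turbo3": 128, "turbo4": 101}}


def pct_faster(winner: float, loser: float) -> float:
    """Return how much faster `winner` is than `loser`, in percent."""
    return (winner / loser - 1.0) * 100.0


prefill_256k = pct_faster(prefill["256K"]["turbo3"], prefill["256K"]["turbo4"])
decode_256k = pct_faster(decode["256K"]["turbo4"], decode["256K"]["turbo3"])
decode_512k = pct_faster(decode["512K"]["turbo4"], decode["512K"]["turbo3"])

print(f"256K prefill, turbo3 over turbo4: +{prefill_256k:.0f}%")
print(f"256K decode,  turbo4 over turbo3: +{decode_256k:.0f}%")
print(f"512K decode,  turbo4 over turbo3: +{decode_512k:.0f}%")

# Rough KV footprint per token at 1M context with turbo3.
# ASSUMPTION: "1M" = 1,000,000 tokens and "GB" = 10**9 bytes.
kv_bytes_per_token = 52e9 / 1_000_000
print(f"~{kv_bytes_per_token / 1e3:.0f} kB of KV cache per token")
```

Under those assumptions the ~52 GB turbo3 cache works out to roughly 52 kB per token; with a binary definition of 1M or GB the figure shifts by a few percent.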
Workload Recommendations
- Coding agents (deep context, many generated tokens): turbo4
- RAG / batch QA (heavy prefill, short answers): turbo3
- 1M context: turbo3 only
- Short interactive (<32K): f16 if it fits, else q8_0
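
To check these recommendations against the raw numbers, the sketch below encodes both tables and picks the fastest cache type that did not run out of memory at each depth. The function name fastest and the None-as-OOM convention are illustrative choices, not part of llama-bench or the original post, and raw throughput is only one input to a recommendation.

```python
from typing import Dict, Optional

OOM = None  # marker for cells that ran out of memory

Table = Dict[str, Dict[str, Optional[float]]]

# Throughput in tok/s, copied from the two tables in this article.
DECODE: Table = {
    "0":    {"f16": 89.4, "q8_0": 87.4, "turbo3": 79.5, "turbo4": 79.7},
    "8K":   {"f16": 84.2, "q8_0": 79.2, "turbo3": 72.2, "turbo4": 71.2},
    "32K":  {"f16": 72.6, "q8_0": 67.8, "turbo3": 61.5, "turbo4": 61.8},
    "128K": {"f16": 44.4, "q8_0": 40.7, "turbo3": 36.0, "turbo4": 37.7},
    "256K": {"f16": OOM,  "q8_0": 26.6, "turbo3": 22.9, "turbo4": 25.5},
    "512K": {"f16": OOM,  "q8_0": OOM,  "turbo3": 13.3, "turbo4": 16.0},
    "1M":   {"f16": OOM,  "q8_0": OOM,  "turbo3": 6.5,  "turbo4": OOM},
}
PREFILL: Table = {
    "0":    {"f16": 2962, "q8_0": 2948, "turbo3": 2904, "turbo4": 2854},
    "8K":   {"f16": 2098, "q8_0": 1623, "turbo3": 1653, "turbo4": 1439},
    "32K":  {"f16": 1063, "q8_0": 802,  "turbo3": 784,  "turbo4": 678},
    "128K": {"f16": 321,  "q8_0": 245,  "turbo3": 253,  "turbo4": 206},
    "256K": {"f16": OOM,  "q8_0": 124,  "turbo3": 128,  "turbo4": 101},
    "512K": {"f16": OOM,  "q8_0": OOM,  "turbo3": 66,   "turbo4": 56},
    "1M":   {"f16": OOM,  "q8_0": OOM,  "turbo3": 30,   "turbo4": OOM},
}


def fastest(table: Table, depth: str) -> Optional[str]:
    """Return the cache type with the highest throughput that fit in memory."""
    fitting = {name: tps for name, tps in table[depth].items() if tps is not OOM}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for depth in DECODE:
    print(f"{depth:>5}: decode={fastest(DECODE, depth)}, "
          f"prefill={fastest(PREFILL, depth)}")
```

On these numbers f16 is fastest wherever it fits (through 128K), turbo3 leads prefill from 256K up, and turbo4 leads decode at 512K. At 256K, q8_0 still decodes slightly faster than turbo4 (26.6 vs 25.5 tok/s), so the turbo4 recommendation appears to matter most once the context grows past what q8_0 can hold.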
Caveats
These results come from a single M5 Max. Crossover points will likely shift with memory bandwidth and GPU core count. Only symmetric K/V configurations were tested; asymmetric combinations (e.g., -ctk q8_0 -ctv turbo4) were not benchmarked. TheTom's fork is research-grade and is not part of upstream llama.cpp main.
📖 Read the full source: r/LocalLLaMA