Qwen 3.6-35B-A3B KV Cache Bench: Turbo3 Hits 1M at 6.5 tok/s

A Reddit user ran a depth sweep on Qwen 3.6-35B-A3B Q8 using TheTom's TurboQuant Metal fork of llama.cpp (GitHub: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) on a MacBook Pro M5 Max with 128 GB unified memory. They tested four KV cache types: f16, q8_0, turbo3 (3-bit), and turbo4 (4-bit), symmetric K and V, with flash-attn on and mlock on, from 0 to 1M context tokens.

Hardware & Build

M5 Max, 128 GB unified memory. Built with cmake -B build -DGGML_METAL=ON. Used llama-bench, 3 reps per cell, flash-attn on, mlock on. 8 hours wall-clock overnight.

Generation Throughput (tok/s)

Depth	f16	q8_0	turbo3	turbo4
0	89.4	87.4	79.5	79.7
8K	84.2	79.2	72.2	71.2
32K	72.6	67.8	61.5	61.8
128K	44.4	40.7	36.0	37.7
256K	OOM	26.6	22.9	25.5
512K	OOM	OOM	13.3	16.0
1M	OOM	OOM	6.5	OOM

Prompt Processing Throughput (tok/s)

Depth	f16	q8_0	turbo3	turbo4
0	2962	2948	2904	2854
8K	2098	1623	1653	1439
32K	1063	802	784	678
128K	321	245	253	206
256K	OOM	124	128	101
512K	OOM	OOM	66	56
1M	OOM	OOM	30	OOM

Key Takeaways

At depth 0, f16 leads by a hair on prefill; turbo3 is ~10% slower on decode.
At 128K, turbo3 prefill (253 tok/s) matches q8_0 (245 tok/s) — smaller cache reduces bandwidth pressure.
At 256K, turbo3 wins prefill +27% over turbo4 (128 vs 101), but turbo4 wins decode +11% (25.5 vs 22.9). At 512K, decode gap widens to +20% (turbo4 16.0 vs turbo3 13.3).
turbo3 is the only cache type that fits 1M context (6.5 tok/s decode). Memory at 1M: ~89 GB (37 GB weights, ~52 GB KV cache).

Workload Recommendations

Coding agents (deep context, many generated tokens): turbo4
RAG / batch QA (heavy prefill, short answers): turbo3
1M context: turbo3 only
Short interactive (<32K): f16 if it fits, else q8_0

Caveats

This is one M5 Max. Crossovers likely shift with memory bandwidth and GPU cores. Only symmetric K/V tested. Asymmetric combos (e.g., -ctk q8_0 -ctv turbo4) not benched. TheTom's fork is research-grade, not upstream in llama.cpp main.

📖 Read the full source: r/LocalLLaMA

Qwen 3.6-35B-A3B KV Cache Bench: f16 vs q8_0 vs Turbo3 vs Turbo4 on M5 Max Up to 1M Context

Hardware & Build

Generation Throughput (tok/s)

Prompt Processing Throughput (tok/s)

Key Takeaways

Workload Recommendations

Caveats

👀 See Also

Claude Fable 5 benchmarks: 59.8% functional, 19% security, record cheating and timeouts

Anthropic Acquires Stainless for $300M+ — Now Owns Dominant MCP Server Generator

The 100,000 Whys of AI: How Quasi-Deterministic LLM Output Creates Telltale Slop

Claude-Code v2.1.72: SSH improvements, permission prompt reductions, and bug fixes