Qwen KV Cache Quantization Deep Dive: PPL, KL Divergence, and Asymmetric K/V Results

Follow-up benchmarks for Qwen 3.6-35B-A3B Q8 with KV cache quantization, run with TheTom's TurboQuant fork (feature/turboquant-kv-cache) on an M5 Max. This round covers perplexity, KL divergence, asymmetric K/V combinations, and a 64K depth data point.
Quality Results (Perplexity + KL Divergence)
Context size 4096 on wikitext-2. f16 used as baseline for logits.
- q8_0: PPL 5.7433, KL 0.0016, top-1 token agreement 98.64%. Essentially free at 4K context (the PPL delta of -0.0005 is well within the ±0.036 stderr).
- turbo3 (~4.9x): PPL 5.8092, KL 0.0199, top-1 agreement 93.93%. About a 1% PPL increase, with top-1 agreement 4.7pp below q8_0.
- turbo4 (~3.8x): PPL 5.7810, KL 0.0131, top-1 agreement 95.28%. Sits between q8_0 and turbo3, consistent with its compression ratio.
Quality cost scales with compression, no surprises.
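For readers less familiar with the two metrics, here is a minimal sketch of how KL divergence and top-1 agreement can be computed from baseline (f16) and quantized-cache logits. The function names and the toy logits are invented for illustration; this is not the tooling that produced the numbers above.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(base_logits, test_logits):
    # KL(P_base || P_test) for a single token position, in nats.
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def top1_agreement(base_rows, test_rows):
    # Fraction of positions where both runs pick the same most-likely token.
    same = sum(argmax(b) == argmax(t) for b, t in zip(base_rows, test_rows))
    return same / len(base_rows)

# Toy data (hypothetical): 3 positions, vocabulary of 3 tokens.
base_rows = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.1, 3.0]]
test_rows = [[1.9, 1.1, 0.1], [0.6, 2.4, 0.2], [3.1, 1.0, 3.0]]

mean_kl = sum(kl_divergence(b, t) for b, t in zip(base_rows, test_rows)) / len(base_rows)
print(f"mean KL: {mean_kl:.4f}, top-1 agreement: {top1_agreement(base_rows, test_rows):.2%}")
```

In a real run the rows would be per-token logits over the full vocabulary for the whole evaluation text, with the f16 run serving as the baseline distribution.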
Asymmetric K/V Sweep
Decode tok/s with llama-bench, same flags as symmetric sweep. Key configs:
- -ctk q8_0 -ctv turbo4 is the standout: at 256K it matches symmetric q8_0 throughput (27.1 vs 26.6 tg) and fits 512K, where symmetric q8_0 OOM'd. Gives q8_0-grade prefill with a turbo4-grade context ceiling.
- -ctk q8_0 -ctv turbo3: similar trick but worse decode (the tighter V quant taxes generation).
- -ctk f16 -ctv turbo4: broken on Metal. The FlashAttention kernel doesn't fast-path this combo and falls back to generic dequant-attention. At 8K it's 34x slower than symmetric f16; at 128K it's 78x slower (4.1 t/s pp). Do not use.
Sample decode tok/s at depth 128K: q8_0 K/turbo4 V 41.0, q8_0 K/turbo3 V 38.2, f16 K/turbo4 V 2.8.
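A rough back-of-the-envelope sketch of why a mixed K/V config can fit deeper contexts than symmetric q8_0. It assumes the ~3.8x and ~4.9x figures quoted earlier are relative to f16, and that K and V store the same number of elements; both are assumptions, not statements from the benchmark post. The q8_0 figure follows from the ggml block layout (32 int8 values plus one fp16 scale per block).

```python
# Bits per stored cache element.
BITS = {
    "f16": 16.0,
    "q8_0": 8.5,           # 34 bytes per block of 32 elements
    "turbo4": 16.0 / 3.8,  # assumption: derived from the quoted ~3.8x ratio
    "turbo3": 16.0 / 4.9,  # assumption: derived from the quoted ~4.9x ratio
}

def kv_compression(k_type, v_type):
    """Whole-cache size reduction vs f16/f16, assuming equal K and V element counts."""
    return (2 * BITS["f16"]) / (BITS[k_type] + BITS[v_type])

for k, v in [("q8_0", "q8_0"), ("q8_0", "turbo4"), ("q8_0", "turbo3"), ("turbo4", "turbo4")]:
    print(f"{k:>6} K / {v:<6} V: {kv_compression(k, v):.2f}x smaller than f16")
```

Under these assumptions the q8_0 K / turbo4 V cache comes out roughly 2.5x smaller than f16, against roughly 1.9x for symmetric q8_0, which is consistent with the mixed config fitting 512K where symmetric q8_0 did not.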
64K Depth Row
All seven configs at depth 65536 (pp512 / tg128 tok/s):
- f16 symmetric: 602.0 / 59.8
- q8_0 symmetric: 479.2 / 57.9
- turbo3 symmetric: 469.8 / 49.9
- turbo4 symmetric: 418.0 / 55.2
- q8_0 K / turbo4 V: 468.2 / 55.9
- q8_0 K / turbo3 V: 465.6 / 52.6
- f16 K / turbo4 V: 8.3 / 4.9
Prefill for q8_0 and turbo3 has nearly converged at 64K: turbo3 (470) is within 2% of q8_0 (479). The bandwidth-bound regime kicks in between 64K and 128K.
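The relative gaps in the 64K row are easy to recompute. The sketch below copies the seven rows into a dictionary and prints each config's prefill and decode delta against symmetric f16; the helper name pct_delta is made up for this example.

```python
# (pp512, tg128) tok/s at depth 65536, copied from the list above.
DEPTH_64K = {
    "f16 symmetric": (602.0, 59.8),
    "q8_0 symmetric": (479.2, 57.9),
    "turbo3 symmetric": (469.8, 49.9),
    "turbo4 symmetric": (418.0, 55.2),
    "q8_0 K / turbo4 V": (468.2, 55.9),
    "q8_0 K / turbo3 V": (465.6, 52.6),
    "f16 K / turbo4 V": (8.3, 4.9),
}

def pct_delta(config, baseline, metric):
    """Percent change of `config` relative to `baseline` for 'pp' or 'tg'."""
    idx = {"pp": 0, "tg": 1}[metric]
    return 100.0 * (DEPTH_64K[config][idx] / DEPTH_64K[baseline][idx] - 1.0)

for name in DEPTH_64K:
    pp = pct_delta(name, "f16 symmetric", "pp")
    tg = pct_delta(name, "f16 symmetric", "tg")
    print(f"{name:<20} pp {pp:+7.1f}%   tg {tg:+7.1f}%")
```

Swapping the baseline to "q8_0 symmetric" reproduces the turbo3-within-2% prefill comparison directly.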
Updated Recommendation
For coding agents (deep context, many generated tokens), use -ctk q8_0 -ctv turbo4: q8_0 quality on K, turbo4 savings on V, and it fits 512K. For RAG or batch QA (heavy prefill, smaller decode), symmetric q8_0 or turbo4 remain viable.
📖 Read the full source: r/LocalLLaMA