Qwen KV Cache Quantization Deep Dive: PPL, KL Divergence, and Asymmetric K/V Results

✍️ OpenClawRadar📅 Published: April 29, 2026🔗 Source
Qwen KV Cache Quantization Deep Dive: PPL, KL Divergence, and Asymmetric K/V Results
Ad

Follow-up benchmarks for Qwen 3.6-35B-A3B Q8 with KV cache quantization using TheTom TurboQuant fork (feature/turboquant-kv-cache) on an M5 Max. This round covers perplexity, KL divergence, asymmetric K/V combinations, and a 64K depth data point.

Quality Results (Perplexity + KL Divergence)

Context size 4096 on wikitext-2. f16 used as baseline for logits.

  • q8_0: PPL 5.7433, KL 0.0016, top-1 token agreement 98.64% — essentially free at 4K context (PPL delta -0.0005 within ±0.036 stderr).
  • turbo3 (~4.9x): PPL 5.8092, KL 0.0199, top-1 agreement 93.93% — ~1% PPL increase, 5pp token disagreement.
  • turbo4 (~3.8x): PPL 5.7810, KL 0.0131, top-1 agreement 95.28% — sits between q8_0 and turbo3, consistent with compression ratio.

Quality cost scales with compression, no surprises.

Asymmetric K/V Sweep

Decode tok/s with llama-bench, same flags as symmetric sweep. Key configs:

  • -ctk q8_0 -ctv turbo4 is standout: at 256K matches symmetric q8_0 throughput (27.1 vs 26.6 tg), fits 512K where symmetric q8_0 OOM'd. Gives q8_0-grade prefill with turbo4-grade context ceiling.
  • -ctk q8_0 -ctv turbo3: similar trick but worse decode (tighter V quant taxes generation).
  • -ctk f16 -ctv turbo4: broken on Metal — FlashAttention kernel doesn't fast-path this combo, falls back to generic dequant-attention. At 8K it's 34x slower than symmetric f16; at 128K it's 78x slower (4.1 t/s pp). Do not use.

Sample decode tok/s at depth 128K: q8_0 K/turbo4 V 41.0, q8_0 K/turbo3 V 38.2, f16 K/turbo4 V 2.8.

Ad

64K Depth Row

All seven configs at depth 65536 (pp512 / tg128 tok/s):

  • f16 symmetric: 602.0 / 59.8
  • q8_0 symmetric: 479.2 / 57.9
  • turbo3 symmetric: 469.8 / 49.9
  • turbo4 symmetric: 418.0 / 55.2
  • q8_0 K / turbo4 V: 468.2 / 55.9
  • q8_0 K / turbo3 V: 465.6 / 52.6
  • f16 K / turbo4 V: 8.3 / 4.9

Prefill curves nearly converged at 64K: turbo3 (470) within 2% of q8_0 (479). Bandwidth-bound regime kicks in between 64K and 128K.

Updated Recommendation

For coding agents (deep context, many generated tokens): use -ctk q8_0 -ctv turbo4. q8_0 quality on K, turbo4 savings on V, fits 512K. For RAG or batch QA (heavy prefill, smaller decode), symmetric q8_0 or turbo4 remains viable.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also