llama.cpp Q8_0 quantization gets 3.1x speedup on Intel Arc GPUs with SYCL reorder fix

✍️ OpenClawRadar · 📅 Published: April 16, 2026

A fix to llama.cpp's SYCL backend makes Q8_0 quantized models run about 3.1x faster on Intel Arc GPUs. It addresses a memory access pattern issue that was limiting Q8_0 to only 21% of the GPU's theoretical memory bandwidth.

Performance Problem and Root Cause

On an Intel Arc Pro B70 GPU with 32GB GDDR6 and 608 GB/s bandwidth, Q8_0 models were running at only 4.88 tokens/second while Q4_K_M achieved 20.56 tokens/second. This roughly 4x performance gap was unexpected: token generation is largely limited by how fast the weights can be read from memory, and Q8_0 holds only about 1.7x as much data as Q4_K_M.
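
The size of the mismatch can be worked out from the numbers above. The following is a minimal sketch, assuming decode throughput scales inversely with the bytes read per token (a simplification, not a measured model); the function name is illustrative only.

```python
# Figures quoted in the article (Intel Arc Pro B70).
Q4_K_M_TPS = 20.56   # tokens/second
Q8_0_TPS = 4.88      # tokens/second, before the fix
DATA_RATIO = 1.7     # Q8_0 data volume relative to Q4_K_M, as stated above


def expected_tps(reference_tps: float, data_ratio: float) -> float:
    """Throughput expected if speed scaled inversely with bytes read (assumption)."""
    return reference_tps / data_ratio


expected = expected_tps(Q4_K_M_TPS, DATA_RATIO)   # what Q8_0 "should" reach
observed_gap = Q4_K_M_TPS / Q8_0_TPS              # gap actually measured
shortfall = expected / Q8_0_TPS                   # slowdown not explained by size

print(f"expected Q8_0 throughput: {expected:.2f} t/s")
print(f"observed Q4_K_M / Q8_0 gap: {observed_gap:.2f}x")
print(f"unexplained shortfall: {shortfall:.2f}x")
```

Under that assumption, Q8_0 would be expected to land near 12 tokens/second, so the extra data alone accounts for well under half of the observed gap.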

After ruling out VRAM pressure, driver issues, and backend problems, the investigation traced the bottleneck to llama.cpp's SYCL kernel dispatch path. The SYCL backend includes a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This optimization was implemented for Q4_0, Q4_K, and Q6_K quantizations, but Q8_0 was never added to the reorder framework.
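
The idea behind the reorder can be sketched on plain bytes. This is a minimal illustration, not llama.cpp's code: it assumes the Q8_0 block is a 2-byte half-precision scale followed by 32 signed 8-bit weights, and it assumes the reordered buffer stores all weights first and all scales afterwards. The function names are hypothetical.

```python
import struct

QK8_0 = 32                  # weights per Q8_0 block
BLOCK_BYTES = 2 + QK8_0     # fp16 scale + 32 int8 weights = 34 bytes


def pack_blocks(scales, quants):
    """Interleaved layout: [scale0][32 weights][scale1][32 weights]..."""
    out = bytearray()
    for d, qs in zip(scales, quants):
        out += struct.pack("<e32b", d, *qs)   # 'e' = IEEE half float
    return bytes(out)


def reorder(interleaved: bytes) -> bytes:
    """Split scales from weights: [all weights][all scales] (assumed order)."""
    n_blocks = len(interleaved) // BLOCK_BYTES
    quant_region = bytearray()
    scale_region = bytearray()
    for i in range(n_blocks):
        block = interleaved[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES]
        scale_region += block[:2]    # 2-byte scale
        quant_region += block[2:]    # 32 one-byte weights
    return bytes(quant_region + scale_region)


scales = [0.5, 0.25]
quants = [[i - 16 for i in range(QK8_0)], [1] * QK8_0]
packed = pack_blocks(scales, quants)
reordered = reorder(packed)

# Weights for block i now start at byte 32 * i; scales follow as a dense array.
print(len(packed), len(reordered))
print(struct.unpack("<2e", reordered[2 * QK8_0:]))
```

After the split, consecutive threads reading consecutive weights touch one contiguous region, which is what makes coalesced access possible.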

Q8_0's 34-byte blocks (a 2-byte scale followed by 32 one-byte weights, so not a power-of-two size) made the non-reordered layout particularly cache-unfriendly on the GPU.
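
One way to see why a 34-byte stride is awkward is to count how often a block crosses a cache-line boundary. The sketch below assumes 64-byte cache lines and a buffer that starts on a line boundary; both are assumptions for illustration and not figures from the source.

```python
def straddle_fraction(stride: int, size: int, line: int = 64,
                      n_blocks: int = 1024) -> float:
    """Fraction of blocks whose bytes span two cache lines.

    stride: distance in bytes between the starts of consecutive blocks
    size:   bytes read per block
    line:   cache line size in bytes (64 is an assumption)
    """
    crossing = 0
    for i in range(n_blocks):
        first = i * stride             # first byte of the block
        last = first + size - 1        # last byte of the block
        if first // line != last // line:
            crossing += 1
    return crossing / n_blocks


# Interleaved Q8_0: 34-byte blocks packed back to back.
interleaved = straddle_fraction(stride=34, size=34)
# Reordered weights: 32-byte runs packed back to back.
reordered = straddle_fraction(stride=32, size=32)

print(f"interleaved blocks crossing a line: {interleaved:.0%}")
print(f"reordered weight runs crossing a line: {reordered:.0%}")
```

With these assumptions, a large share of the interleaved blocks need two cache lines per read, while the 32-byte weight runs in a separated layout never do.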

The Fix and Results

The solution involved approximately 200 lines of code extending the existing reorder framework to support Q8_0. The most critical bug was a single-line issue: Q8_0 tensors weren't getting the "extra" struct allocated during buffer initialization, so the reorder flag was never set.
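
This failure mode is easy to miss because nothing crashes. The toy model below is not llama.cpp code: every name in it (`Extra`, `Tensor`, `init_tensor`, `try_reorder`, `pick_kernel`, `TYPES_WITH_EXTRA`) is hypothetical and exists only to show how a type missing from an allocation check silently falls back to the slow path.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical: quantization types for which the toy allocator attaches an "extra".
TYPES_WITH_EXTRA = frozenset({"Q4_0", "Q4_K", "Q6_K"})


@dataclass
class Extra:
    reordered: bool = False   # set once the tensor data has been reordered


@dataclass
class Tensor:
    qtype: str
    extra: Optional[Extra] = None


def init_tensor(qtype: str, types_with_extra=TYPES_WITH_EXTRA) -> Tensor:
    """Buffer initialization: only listed types get an Extra record."""
    tensor = Tensor(qtype)
    if qtype in types_with_extra:
        tensor.extra = Extra()
    return tensor


def try_reorder(tensor: Tensor) -> None:
    """Reorder step: has nowhere to record the flag if Extra is missing."""
    if tensor.extra is not None:
        tensor.extra.reordered = True


def pick_kernel(tensor: Tensor) -> str:
    """Dispatch: the fast path is taken only when the flag is set."""
    if tensor.extra is not None and tensor.extra.reordered:
        return "reordered"
    return "generic"


before = init_tensor("Q8_0")
try_reorder(before)
after = init_tensor("Q8_0", TYPES_WITH_EXTRA | {"Q8_0"})   # the one-line change
try_reorder(after)

print(pick_kernel(before), pick_kernel(after))
```

In this model the reordered kernel can exist and be correct, yet it is never selected until the allocation check also covers Q8_0.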

Results on Qwen3.5-27B (Intel Arc Pro B70):

  • Q8_0 before: 4.88 t/s (21% bandwidth)
  • Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster
  • Q4_K_M: 20.12 t/s (unchanged)
  • Q6_K: 13.83 t/s (no reorder)

With this fix, Q8_0 now outperforms Q6_K (15.24 vs 13.83 tokens/second) while providing higher quality than lower-bit quantizations.
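
The quoted Q8_0 results can be cross-checked against each other. This is a rough sketch that assumes the bandwidth percentages are relative to the 608 GB/s peak and that each generated token reads the same amount of data before and after the fix; the helper names are illustrative.

```python
PEAK_BANDWIDTH = 608e9   # bytes/second, Arc Pro B70 figure from the article


def implied_bytes_per_token(tokens_per_s: float, utilization: float,
                            peak: float = PEAK_BANDWIDTH) -> float:
    """Bytes read per token implied by a throughput and a bandwidth share."""
    return utilization * peak / tokens_per_s


def speedup(before_tps: float, after_tps: float) -> float:
    return after_tps / before_tps


bytes_before = implied_bytes_per_token(4.88, 0.21)
bytes_after = implied_bytes_per_token(15.24, 0.66)

print(f"speedup: {speedup(4.88, 15.24):.2f}x")
print(f"implied data per token before: {bytes_before / 1e9:.1f} GB")
print(f"implied data per token after:  {bytes_after / 1e9:.1f} GB")
```

Both rows imply roughly the same amount of data read per token (about 26 GB), so the throughput and bandwidth figures agree with each other and with the 3.1x headline.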

Validation and Implementation

Before implementing the fix, the team binary-patched Intel's closed-source IPEX-LLM to run on the B70 GPU, whose PCI device ID it does not officially support. IPEX-LLM's optimized Q8_0 kernels achieved 61% bandwidth, confirming the problem was solvable. The open-source implementation in llama.cpp achieves 66% bandwidth.

The fix has been submitted as a pull request to the llama.cpp repository.

📖 Read the full source: r/LocalLLaMA
