llama.cpp Q8_0 Gets 3.1x Speedup on Intel Arc with SYCL Fix

A fix to llama.cpp's SYCL backend delivers a large speedup for Q8_0 quantized models running on Intel Arc GPUs. It addresses a memory access pattern issue that was limiting Q8_0 to about 21% of the GPU's theoretical memory bandwidth.

Performance Problem and Root Cause

On an Intel Arc Pro B70 GPU with 32GB GDDR6 and 608 GB/s of memory bandwidth, Q8_0 models were running at only 4.88 tokens/second while Q4_K_M achieved 20.56 tokens/second. This roughly 4.2x performance gap was unexpected, since Q8_0 holds only about 1.7x more data than Q4_K_M.
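The mismatch can be checked with a few lines of arithmetic. This is only a sketch that reuses the figures quoted above, under the simplifying assumption that token generation is memory-bound, so throughput should scale inversely with the bytes read per token. The variable and function names are illustrative.

```python
# Figures quoted in the article (Intel Arc Pro B70).
Q8_0_TPS = 4.88      # tokens/s for Q8_0 before the fix
Q4_K_M_TPS = 20.56   # tokens/s for Q4_K_M
DATA_RATIO = 1.7     # Q8_0 data size relative to Q4_K_M, as stated in the article


def throughput_gap(fast_tps: float, slow_tps: float) -> float:
    """How many times faster the fast configuration is."""
    return fast_tps / slow_tps


observed_gap = throughput_gap(Q4_K_M_TPS, Q8_0_TPS)
# Assumption: a purely memory-bound workload would show a gap equal to DATA_RATIO.
unexplained = observed_gap / DATA_RATIO

print(f"observed gap: {observed_gap:.2f}x, expected from data size: {DATA_RATIO:.1f}x")
print(f"slowdown not explained by data size: {unexplained:.2f}x")
```

Under that assumption, roughly 2.5x of the slowdown has to come from something other than the amount of data, which is what pointed the investigation toward the kernel path.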

After ruling out VRAM pressure, driver issues, and backend problems, the investigation traced the bottleneck to llama.cpp's SYCL kernel dispatch path. The SYCL backend includes a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This optimization was implemented for Q4_0, Q4_K, and Q6_K quantizations, but Q8_0 was never added to the reorder framework.

Q8_0's 34-byte blocks (a 2-byte fp16 scale followed by 32 int8 weights) are not a power of two in size, which made the non-reordered layout particularly inefficient for GPU cache performance.
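To make the layout change concrete, here is a minimal Python sketch of what a reorder does to Q8_0 data. It assumes the reordered buffer stores every block's int8 weights first and all fp16 scales after them; that ordering, and every function name below, is an illustration rather than llama.cpp's actual code, which is written in C++ for SYCL. Only the 34-byte block format (one fp16 scale plus 32 int8 weights) is taken from the Q8_0 format itself.

```python
import struct

QK8_0 = 32                               # weights per Q8_0 block
BLOCK_FMT = "<e32b"                      # fp16 scale followed by 32 int8 quants
BLOCK_SIZE = struct.calcsize(BLOCK_FMT)  # 34 bytes


def pack_blocks(blocks):
    """Pack (scale, [32 ints]) pairs into the interleaved on-disk layout."""
    return b"".join(struct.pack(BLOCK_FMT, d, *qs) for d, qs in blocks)


def reorder(interleaved: bytes) -> bytes:
    """Split interleaved blocks into [all quants][all scales] (assumed order)."""
    n_blocks = len(interleaved) // BLOCK_SIZE
    quants, scales = bytearray(), bytearray()
    for i in range(n_blocks):
        block = interleaved[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        scales += block[:2]   # fp16 scale
        quants += block[2:]   # 32 int8 weights
    return bytes(quants + scales)


def dequantize_reordered(buf: bytes, n_blocks: int):
    """Read weights back from the split layout: weight = scale * quant."""
    scale_offset = n_blocks * QK8_0
    out = []
    for i in range(n_blocks):
        (d,) = struct.unpack_from("<e", buf, scale_offset + 2 * i)
        qs = struct.unpack_from("<32b", buf, i * QK8_0)
        out.extend(d * q for q in qs)
    return out


blocks = [(0.5, list(range(-16, 16))), (2.0, list(range(32)))]
raw = pack_blocks(blocks)
split = reorder(raw)
weights = dequantize_reordered(split, len(blocks))
```

In the interleaved layout, block i starts at byte 34*i, so only every other block begins on a 4-byte boundary. In the split layout, each block's weights start at byte 32*i, so neighbouring GPU threads read contiguous, evenly aligned runs of weights.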

The Fix and Results

The solution was approximately 200 lines of code extending the existing reorder framework to support Q8_0. The most critical bug came down to a single line: Q8_0 tensors weren't getting the "extra" struct allocated during buffer initialization, so the reorder flag was never set.
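The failure mode described here, where a missing per-tensor struct silently disables the fast path, can be modelled with a toy example. Everything below is hypothetical: the class names, the function names, and the kernel labels are invented for illustration and do not correspond to identifiers in llama.cpp. Only the overall shape (no "extra" struct means no reorder flag, which means the generic kernel) follows the article.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Extra:
    """Hypothetical stand-in for the per-tensor "extra" struct."""
    reordered: bool = False


@dataclass
class Tensor:
    qtype: str
    extra: Optional[Extra] = None


def init_tensor(t: Tensor, types_with_extra: Set[str]) -> None:
    """Buffer init: only listed types get an extra struct allocated."""
    if t.qtype in types_with_extra:
        t.extra = Extra()


def try_reorder(t: Tensor, reorder_types: Set[str]) -> None:
    """The reorder flag can only be set when the extra struct exists."""
    if t.extra is not None and t.qtype in reorder_types:
        t.extra.reordered = True


def pick_kernel(t: Tensor) -> str:
    """Dispatch: fall back to the generic kernel unless the flag is set."""
    if t.extra is not None and t.extra.reordered:
        return "reordered"
    return "generic"


def run(qtype: str, types_with_extra: Set[str], reorder_types: Set[str]) -> str:
    t = Tensor(qtype)
    init_tensor(t, types_with_extra)
    try_reorder(t, reorder_types)
    return pick_kernel(t)


OLD = {"Q4_0", "Q4_K", "Q6_K"}   # types the article lists as already supported
NEW = OLD | {"Q8_0"}

# Reorder support added, but allocation still missing: the slow path is kept.
print(run("Q8_0", types_with_extra=OLD, reorder_types=NEW))  # generic
# Allocation fixed as well: the fast path is taken.
print(run("Q8_0", types_with_extra=NEW, reorder_types=NEW))  # reordered
```

The point of the sketch is that nothing fails loudly in the broken case. The tensor still produces correct output through the generic kernel, which is why a bug like this shows up only as low throughput.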

Results on Qwen3.5-27B (Intel Arc Pro B70):

Q8_0 before: 4.88 t/s (21% bandwidth)
Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster
Q4_K_M: 20.12 t/s (unchanged)
Q6_K: 13.83 t/s (no reorder)

With this fix, Q8_0 now outperforms Q6_K (15.24 vs 13.83 tokens/second) while providing higher quality than lower-bit quantizations.
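The quoted Q8_0 numbers can be cross-checked against each other. The sketch below uses only figures from this article (608 GB/s peak, 21% and 66% utilization, 4.88 and 15.24 t/s) and assumes that utilization means achieved bandwidth divided by peak bandwidth. If that reading is right, both measurements should imply about the same number of bytes read per token.

```python
PEAK_GBPS = 608.0  # B70 memory bandwidth quoted in the article


def speedup(before_tps: float, after_tps: float) -> float:
    return after_tps / before_tps


def achieved_gbps(utilization: float) -> float:
    """Effective bandwidth, assuming utilization = achieved / peak."""
    return utilization * PEAK_GBPS


def implied_gb_per_token(utilization: float, tps: float) -> float:
    """Data read per generated token implied by a (utilization, t/s) pair."""
    return achieved_gbps(utilization) / tps


gain = speedup(4.88, 15.24)
before = implied_gb_per_token(0.21, 4.88)
after = implied_gb_per_token(0.66, 15.24)

print(f"speedup: {gain:.1f}x")
print(f"effective bandwidth: {achieved_gbps(0.21):.0f} -> {achieved_gbps(0.66):.0f} GB/s")
print(f"implied data per token: {before:.1f} GB before, {after:.1f} GB after")
```

Both measurements imply roughly 26 GB read per token, agreeing to within 1%, so the throughput and bandwidth figures are internally consistent.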

Validation and Implementation

Before implementing the fix, the team binary-patched Intel's closed-source IPEX-LLM to run on the B70 GPU, whose PCI device ID it does not officially support. IPEX-LLM's optimized Q8_0 kernels achieved 61% bandwidth, confirming the problem was solvable. The open-source implementation in llama.cpp achieves 66% bandwidth.

The fix has been submitted as a pull request to the llama.cpp repository.

📖 Read the full source: r/LocalLLaMA

