llama.cpp Q8_0 quantization gets 3.1x speedup on Intel Arc GPUs with SYCL reorder fix

A fix to llama.cpp's SYCL backend delivers a significant speedup for Q8_0 quantized models running on Intel Arc GPUs. It addresses a memory access pattern problem that was limiting Q8_0 to about 21% of the GPU's theoretical memory bandwidth.
Performance Problem and Root Cause
On an Intel Arc Pro B70 GPU with 32GB of GDDR6 and 608 GB/s of memory bandwidth, Q8_0 models were running at only 4.88 tokens/second while Q4_K_M achieved 20.56 tokens/second. This roughly 4.2x performance gap was unexpected, since a Q8_0 model holds only about 1.7x as much data as its Q4_K_M counterpart.
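To see why the gap looked wrong, the quoted figures can be plugged into a simple estimate. The sketch below assumes token generation is bound by memory bandwidth, so throughput scales inversely with the bytes read per token. That scaling rule is an assumption made for illustration, not a measurement from the source.

```python
# Figures quoted above for the Intel Arc Pro B70.
Q4_K_M_TPS = 20.56  # tokens/second
Q8_0_TPS = 4.88     # tokens/second
DATA_RATIO = 1.7    # Q8_0 bytes / Q4_K_M bytes, as stated above


def expected_tps(reference_tps: float, data_ratio: float) -> float:
    """Throughput if speed scaled inversely with bytes read per token (assumed)."""
    return reference_tps / data_ratio


expected = expected_tps(Q4_K_M_TPS, DATA_RATIO)
observed_gap = Q4_K_M_TPS / Q8_0_TPS
shortfall = expected / Q8_0_TPS

print(f"expected Q8_0 throughput: {expected:.2f} t/s")
print(f"observed Q4_K_M/Q8_0 gap: {observed_gap:.2f}x")
print(f"Q8_0 shortfall vs. expectation: {shortfall:.2f}x")
```

Under that assumption Q8_0 should have landed near 12 tokens/second, so the observed 4.88 was about 2.5x slower than data volume alone would explain.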
After ruling out VRAM pressure, driver issues, and backend problems, the investigation traced the bottleneck to llama.cpp's SYCL kernel dispatch path. The SYCL backend includes a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This optimization was implemented for Q4_0, Q4_K, and Q6_K quantizations, but Q8_0 was never added to the reorder framework.
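The idea behind the reorder can be sketched in a few lines. A Q8_0 block stores one 16-bit float scale followed by 32 signed 8-bit weights. The example below repacks interleaved blocks into "all weights, then all scales" and checks that both layouts dequantize to the same values. The helper names and the exact ordering of the reordered buffer are assumptions for illustration; this is not llama.cpp's kernel code.

```python
import struct

QK8_0 = 32               # weights per Q8_0 block
BLOCK_BYTES = 2 + QK8_0  # fp16 scale + 32 int8 weights = 34 bytes


def pack_blocks(scales, quants):
    """Interleaved layout: [scale0][32 weights][scale1][32 weights]..."""
    out = bytearray()
    for d, qs in zip(scales, quants):
        out += struct.pack("<e", d) + struct.pack(f"<{QK8_0}b", *qs)
    return bytes(out)


def reorder(buf):
    """Hypothetical reordered layout: all weights first, then all scales."""
    n = len(buf) // BLOCK_BYTES
    qs = b"".join(buf[i * BLOCK_BYTES + 2:(i + 1) * BLOCK_BYTES] for i in range(n))
    ds = b"".join(buf[i * BLOCK_BYTES:i * BLOCK_BYTES + 2] for i in range(n))
    return qs + ds


def dequant_interleaved(buf):
    """Read scale and weights from the same 34-byte block."""
    out = []
    for i in range(len(buf) // BLOCK_BYTES):
        base = i * BLOCK_BYTES
        (d,) = struct.unpack_from("<e", buf, base)
        out += [d * q for q in struct.unpack_from(f"<{QK8_0}b", buf, base + 2)]
    return out


def dequant_reordered(buf):
    """Read weights from the contiguous front region, scales from the tail."""
    n = len(buf) // BLOCK_BYTES
    scale_base = n * QK8_0
    out = []
    for i in range(n):
        (d,) = struct.unpack_from("<e", buf, scale_base + 2 * i)
        out += [d * q for q in struct.unpack_from(f"<{QK8_0}b", buf, i * QK8_0)]
    return out


scales = [0.5, 0.25, 2.0]
quants = [[(7 * i + j) % 256 - 128 for j in range(QK8_0)] for i in range(3)]
interleaved = pack_blocks(scales, quants)
reordered = reorder(interleaved)
assert dequant_interleaved(interleaved) == dequant_reordered(reordered)
```

In the reordered buffer the weights of neighbouring blocks sit back to back, which is what lets adjacent GPU threads issue coalesced loads.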
Q8_0's 34-byte blocks (a 2-byte scale plus 32 one-byte weights, so not a power of two) made the non-reordered layout particularly inefficient for GPU cache performance.
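A rough way to picture the alignment cost is to count how often a block crosses a cache-line boundary. The sketch below assumes a 64-byte line, a value picked for illustration rather than a documented property of the B70, and compares 34-byte interleaved blocks with the 32-byte weight runs that remain once scales are stored separately.

```python
CACHE_LINE = 64   # bytes; an assumed line size for illustration
BLOCK_BYTES = 34  # Q8_0 block: 2-byte scale + 32 one-byte weights
QUANT_BYTES = 32  # weights alone, as stored after reordering


def straddle_fraction(stride, size, count=1024):
    """Fraction of `size`-byte reads, spaced `stride` bytes apart, that cross a line."""
    crossing = 0
    for i in range(count):
        first_line = (i * stride) // CACHE_LINE
        last_line = (i * stride + size - 1) // CACHE_LINE
        crossing += first_line != last_line
    return crossing / count


interleaved_frac = straddle_fraction(BLOCK_BYTES, BLOCK_BYTES)
separated_frac = straddle_fraction(QUANT_BYTES, QUANT_BYTES)

print(f"34-byte blocks crossing a line: {interleaved_frac:.0%}")
print(f"32-byte weight runs crossing a line: {separated_frac:.0%}")
```

Under these assumptions half of the 34-byte blocks span two lines, while the 32-byte weight runs never do.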
The Fix and Results
The solution involved approximately 200 lines of code extending the existing reorder framework to support Q8_0. The most critical bug came down to a single line: Q8_0 tensors weren't getting the "extra" struct allocated during buffer initialization, so the reorder flag was never set.
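The failure mode described here can be modelled with a toy example. Every name below (`Extra`, `Tensor`, `init_tensor`, `try_reorder`, `pick_kernel`) is hypothetical and chosen for illustration; none of it is llama.cpp's actual API. The point is only that when the per-tensor "extra" struct is missing, the reorder step has nowhere to record its flag and dispatch quietly falls back to the slower path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extra:
    """Hypothetical per-tensor side data holding the reorder flag."""
    reordered: bool = False


@dataclass
class Tensor:
    qtype: str
    extra: Optional[Extra] = None


def init_tensor(t, types_with_extra):
    # Only listed quantization types get the extra struct allocated.
    if t.qtype in types_with_extra:
        t.extra = Extra()
    return t


def try_reorder(t):
    # Without an extra struct there is nowhere to store the flag.
    if t.extra is not None:
        t.extra.reordered = True


def pick_kernel(t):
    if t.extra is not None and t.extra.reordered:
        return "reordered"
    return "generic"


def run(qtype, types_with_extra):
    t = init_tensor(Tensor(qtype), types_with_extra)
    try_reorder(t)
    return pick_kernel(t)


before_fix = {"Q4_0", "Q4_K", "Q6_K"}
after_fix = before_fix | {"Q8_0"}

print("Q8_0 before:", run("Q8_0", before_fix))
print("Q8_0 after: ", run("Q8_0", after_fix))
```

In this model nothing errors out when the struct is absent, which is consistent with a bug that shows up only as low throughput.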
Results on Qwen3.5-27B (Intel Arc Pro B70):
- Q8_0 before: 4.88 t/s (21% bandwidth)
- Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster
- Q4_K_M: 20.12 t/s (essentially unchanged from the 20.56 t/s measured earlier)
- Q6_K: 13.83 t/s (no reorder)
With this fix, Q8_0 now outperforms Q6_K (15.24 vs 13.83 tokens/second) while providing higher quality than lower-bit quantizations.
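The bandwidth percentages can be sanity-checked against the throughput figures. The sketch below assumes utilization is defined as bytes read per token times tokens per second, divided by the 608 GB/s peak. That definition is an assumption, since the source does not spell out how the percentages were computed. Inverting it gives the weight volume each measurement implies.

```python
BANDWIDTH_GBS = 608.0  # peak memory bandwidth quoted for the Arc Pro B70


def implied_gb_per_token(tps, utilization, bandwidth_gbs=BANDWIDTH_GBS):
    """GB read per token implied by a throughput and a bandwidth utilization."""
    return utilization * bandwidth_gbs / tps


gb_before = implied_gb_per_token(4.88, 0.21)
gb_after = implied_gb_per_token(15.24, 0.66)
speedup = 15.24 / 4.88

print(f"implied GB/token before: {gb_before:.1f}")
print(f"implied GB/token after:  {gb_after:.1f}")
print(f"speedup: {speedup:.2f}x")
```

Both measurements imply roughly 26 GB read per token, so the before and after percentages are consistent with each other, and the ratio of the two throughputs matches the headline 3.1x.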
Validation and Implementation
Before implementing the fix, the team binary-patched Intel's closed-source IPEX-LLM so that it would run on the B70, whose PCI device ID it does not officially support. IPEX-LLM's optimized Q8_0 kernels reached 61% of theoretical bandwidth, confirming the problem was solvable. The open-source implementation in llama.cpp reaches 66%.
The fix has been submitted as a pull request to the llama.cpp repository.
📖 Read the full source: r/LocalLLaMA