llama.cpp Offloads Matrix Multiplication to AMD XDNA2 NPU on Ryzen AI MAX 385

Custom Backend for AMD XDNA2 NPU Offload

A developer has created a custom llama.cpp backend that dispatches GEMM operations directly to the AMD XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo). This approach avoids iGPU usage and shared memory contention.

Hardware and Software Configuration

Model: Meta-Llama-3.1-8B-Instruct Q4_K_M

Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75

Performance Results

Vulkan prefill + NPU decode: 930 t/s prefill (pp512), 43.7 t/s decode (tg64), 41.5W avg power, 0.947 J/tok
Vulkan only: 833 t/s prefill, 41.6 t/s decode, 52.2W avg power, 1.3 J/tok
CPU only: 4.6 t/s prefill, 3.76 t/s decode

The NPU decode path saves approximately 10W versus Vulkan-only while matching (and slightly beating) decode throughput, as the iGPU remains free for other work.

Technical Stack

Kernels: mlir-aie xclbins (Xilinx/mlir-aie, Apache 2.0)
Runtime dispatch: XRT 2.21.75
Base: Fork of ggml-org/llama.cpp (MIT)
Kernel routing: 4 xclbin slots covering different K-dimension tiles with MIN_N/MAX_N routing to select the appropriate kernel at runtime

Performance Ceiling Investigation

The developer attempted to push beyond 43.7 t/s decode with several approaches:

Batch sweep N=1..64: No improvement (flat performance)
Int4 double-quant: Killed SNR (44.8 → 19.7 dB) - dead end
Cascade offload: Ruled out by AMD documentation
Speculative decoding with Llama-3.2-1B draft: 44% accept rate, 212 t/s draft, but zero effective gain

The lack of improvement from speculative decoding (which normally provides gains with a 44% accept rate) indicates the bottleneck is LPDDR5 bandwidth, not compute. The NPU is already hitting the memory wall, making 43.7 t/s the ceiling for this model on this hardware.

Project Links

GitHub: https://github.com/BrandedTamarasu-glitch/OllamaAMDNPU
Changelog: https://brandedtamarasu-glitch.github.io/OllamaAMDNPU/xdna-npu/

The project was built with Claude Sonnet 4.6 / Claude Code, disclosed for reproducibility purposes. The developer is seeking feedback from others running Strix Halo or Phoenix with the amdxdna driver to compare decode throughput on comparable quants and determine if other XDNA2 configurations encounter the same performance ceiling.

📖 Read the full source: r/LocalLLaMA