Custom llama.cpp Backend Offloads LLM Matrix Multiplication to AMD XDNA2 NPU on Ryzen AI MAX 385

✍️ OpenClawRadar · 📅 Published: March 26, 2026

Custom Backend for AMD XDNA2 NPU Offload

A developer has created a custom llama.cpp backend that dispatches GEMM operations directly to the AMD XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo). This approach avoids iGPU usage and shared memory contention.

Hardware and Software Configuration

Model: Meta-Llama-3.1-8B-Instruct Q4_K_M

Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75

Performance Results

  • Vulkan prefill + NPU decode: 930 t/s prefill (pp512), 43.7 t/s decode (tg64), 41.5W avg power, 0.947 J/tok
  • Vulkan only: 833 t/s prefill, 41.6 t/s decode, 52.2W avg power, 1.3 J/tok
  • CPU only: 4.6 t/s prefill, 3.76 t/s decode

The NPU decode path draws roughly 10 W less than Vulkan-only (41.5 W vs. 52.2 W) while matching, and slightly beating, its decode throughput (43.7 vs. 41.6 t/s). It also leaves the iGPU free for other work during decode.
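The energy figures can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes J/tok is simply average power divided by decode throughput; the function name is illustrative and not taken from the project. The result lands within rounding of the reported values (the small gap on the NPU figure is presumably due to unrounded measurements).

```python
def joules_per_token(avg_power_w: float, decode_tps: float) -> float:
    """Energy per generated token, assuming J/tok = average watts / decode tokens per second."""
    return avg_power_w / decode_tps


# Figures from the results list above.
npu_j_per_tok = joules_per_token(41.5, 43.7)      # Vulkan prefill + NPU decode
vulkan_j_per_tok = joules_per_token(52.2, 41.6)   # Vulkan only

# Relative energy saving per token of the NPU decode path.
saving = 1.0 - npu_j_per_tok / vulkan_j_per_tok

print(f"NPU decode:  {npu_j_per_tok:.3f} J/tok")
print(f"Vulkan only: {vulkan_j_per_tok:.3f} J/tok")
print(f"Saving:      {saving:.1%}")
```

Under that assumption, the NPU decode path uses roughly a quarter less energy per generated token than the Vulkan-only path.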

Technical Stack

  • Kernels: mlir-aie xclbins (Xilinx/mlir-aie, Apache 2.0)
  • Runtime dispatch: XRT 2.21.75
  • Base: Fork of ggml-org/llama.cpp (MIT)
  • Kernel routing: 4 xclbin slots covering different K-dimension tiles with MIN_N/MAX_N routing to select the appropriate kernel at runtime
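The kernel routing idea can be illustrated with a small lookup table. This is only a sketch under stated assumptions: the slot names, K values, and N ranges below are hypothetical and not taken from the project, which implements its routing inside the llama.cpp backend rather than in Python.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelSlot:
    name: str    # hypothetical label for a compiled xclbin
    k: int       # K dimension the kernel was built for
    min_n: int   # smallest N this slot accepts
    max_n: int   # largest N this slot accepts


# Hypothetical table of 4 slots; the real K tiles and N bounds are not given in the article.
SLOTS = (
    KernelSlot("k4096_small_n", 4096, 1, 8),
    KernelSlot("k4096_large_n", 4096, 9, 64),
    KernelSlot("k14336_small_n", 14336, 1, 8),
    KernelSlot("k14336_large_n", 14336, 9, 64),
)


def select_slot(k: int, n: int, slots: Sequence[KernelSlot] = SLOTS) -> Optional[KernelSlot]:
    """Return the first slot whose K matches and whose [min_n, max_n] range contains n."""
    for slot in slots:
        if slot.k == k and slot.min_n <= n <= slot.max_n:
            return slot
    return None  # no NPU kernel fits; the caller would fall back to another backend


decode_slot = select_slot(k=4096, n=1)
batch_slot = select_slot(k=14336, n=32)
unsupported = select_slot(k=11008, n=1)

print(decode_slot, batch_slot, unsupported, sep="\n")
```

Returning `None` for shapes that no slot covers keeps the fallback decision with the caller, which is one plausible way to mix NPU and non-NPU execution in the same graph.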

Performance Ceiling Investigation

The developer attempted to push beyond 43.7 t/s decode with several approaches:

  • Batch sweep N=1..64: No improvement (flat performance)
  • Int4 double-quant: Killed SNR (44.8 → 19.7 dB) - dead end
  • Cascade offload: Ruled out by AMD documentation
  • Speculative decoding with Llama-3.2-1B draft: 44% accept rate, 212 t/s draft, but zero effective gain

The developer reads the lack of improvement from speculative decoding, which would normally be expected to help at a 44% accept rate, as evidence that the bottleneck is LPDDR5 bandwidth rather than compute. On that reading the NPU is already hitting the memory wall, making 43.7 t/s the practical ceiling for this model on this hardware.
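A back-of-the-envelope check of the memory-wall argument: during decode, each generated token has to stream roughly the full set of weights from memory once. The sketch below rests on two assumptions that are not stated in the article: a weight footprint of about 4.9 GB for the Q4_K_M model, and a theoretical peak of 256 GB/s for a 256-bit LPDDR5X-8000 interface. It also ignores KV cache and activation traffic.

```python
# Assumption: approximate size of the Q4_K_M weights for an 8B model, in GB.
WEIGHTS_GB = 4.9
# Assumption: theoretical peak of a 256-bit LPDDR5X-8000 memory interface, in GB/s.
PEAK_BANDWIDTH_GBS = 256.0


def implied_bandwidth_gbs(weights_gb: float, tokens_per_s: float) -> float:
    """Bandwidth needed if every token reads all weights once."""
    return weights_gb * tokens_per_s


def bandwidth_bound_tps(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on decode rate if weight streaming is the only cost."""
    return bandwidth_gbs / weights_gb


implied = implied_bandwidth_gbs(WEIGHTS_GB, 43.7)   # measured decode rate from the article
upper_bound = bandwidth_bound_tps(PEAK_BANDWIDTH_GBS, WEIGHTS_GB)
utilization = implied / PEAK_BANDWIDTH_GBS

print(f"Implied bandwidth at 43.7 t/s: {implied:.0f} GB/s")
print(f"Bandwidth-bound ceiling:       {upper_bound:.1f} t/s")
print(f"Share of theoretical peak:     {utilization:.0%}")
```

Under these assumptions the measured decode rate already corresponds to a large share of theoretical peak bandwidth, which is at least consistent with the memory-bound interpretation. Different assumed values for model size or memory configuration would shift the numbers.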

Project Links

  • GitHub: https://github.com/BrandedTamarasu-glitch/OllamaAMDNPU
  • Changelog: https://brandedtamarasu-glitch.github.io/OllamaAMDNPU/xdna-npu/

The project was built with Claude Sonnet 4.6 / Claude Code, disclosed for reproducibility purposes. The developer is seeking feedback from others running Strix Halo or Phoenix with the amdxdna driver to compare decode throughput on comparable quants and determine if other XDNA2 configurations encounter the same performance ceiling.

📖 Read the full source: r/LocalLLaMA
