Custom llama.cpp Backend Offloads LLM Matrix Multiplication to AMD XDNA2 NPU on Ryzen AI MAX 385

Custom Backend for AMD XDNA2 NPU Offload
A developer has created a custom llama.cpp backend that dispatches GEMM operations directly to the AMD XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo). This approach avoids iGPU usage and shared memory contention.
Hardware and Software Configuration
Model: Meta-Llama-3.1-8B-Instruct Q4_K_M
Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75
Performance Results
- Vulkan prefill + NPU decode: 930 t/s prefill (pp512), 43.7 t/s decode (tg64), 41.5 W avg power, 0.947 J/tok
- Vulkan only: 833 t/s prefill, 41.6 t/s decode, 52.2 W avg power, 1.3 J/tok
- CPU only: 4.6 t/s prefill, 3.76 t/s decode
The NPU decode path draws about 10.7 W less than Vulkan-only (41.5 W vs. 52.2 W) while slightly exceeding its decode throughput (43.7 vs. 41.6 t/s), and it leaves the iGPU free for other work during decode.
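As a rough sanity check on the energy figures, joules per token can be estimated as average power divided by decode throughput. The sketch below assumes that is how the J/tok column was derived, which the post does not state. The function name `joules_per_token` is illustrative. The results (about 0.95 and 1.25 J/tok) land close to the reported 0.947 and 1.3, and the small gaps may come from rounding or a different measurement window.

```python
def joules_per_token(avg_power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts are joules per second, so W / (tok/s) = J/tok."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return avg_power_w / tokens_per_s


# (average power in W, decode throughput in t/s), taken from the results list above
runs = {
    "Vulkan prefill + NPU decode": (41.5, 43.7),
    "Vulkan only": (52.2, 41.6),
}

for name, (watts, tps) in runs.items():
    print(f"{name}: {joules_per_token(watts, tps):.2f} J/tok")
```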
Technical Stack
- Kernels: mlir-aie xclbins (Xilinx/mlir-aie, Apache 2.0)
- Runtime dispatch: XRT 2.21.75
- Base: Fork of ggml-org/llama.cpp (MIT)
- Kernel routing: 4 xclbin slots covering different K-dimension tiles with MIN_N/MAX_N routing to select the appropriate kernel at runtime
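The routing idea in the last item can be sketched as a small lookup: each slot advertises a K tile and an N range, and the dispatcher picks a slot whose constraints the incoming GEMM satisfies. This is only an illustration of the concept, not the project's code. The slot names, tile sizes, N ranges, and the "largest dividing tile wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelSlot:
    name: str    # hypothetical label for a loaded xclbin
    k_tile: int  # K-dimension tile the kernel was built for (assumed value)
    min_n: int   # smallest N this slot accepts (the MIN_N bound)
    max_n: int   # largest N this slot accepts (the MAX_N bound)

    def accepts(self, k: int, n: int) -> bool:
        # Assumption: a slot is usable when K is a whole number of tiles
        # and N falls inside the slot's [min_n, max_n] range.
        return k % self.k_tile == 0 and self.min_n <= n <= self.max_n


def select_slot(slots: Sequence[KernelSlot], k: int, n: int) -> Optional[KernelSlot]:
    """Return the matching slot with the largest K tile, or None to fall back."""
    candidates = [s for s in slots if s.accepts(k, n)]
    if not candidates:
        return None  # caller would route the op to another backend
    return max(candidates, key=lambda s: s.k_tile)


# Four illustrative slots; none of these numbers come from the project.
SLOTS = (
    KernelSlot("slot_a", k_tile=256, min_n=1, max_n=1),
    KernelSlot("slot_b", k_tile=256, min_n=2, max_n=64),
    KernelSlot("slot_c", k_tile=512, min_n=1, max_n=1),
    KernelSlot("slot_d", k_tile=512, min_n=2, max_n=64),
)

if __name__ == "__main__":
    for k, n in [(4096, 1), (768, 8), (4096, 128)]:
        slot = select_slot(SLOTS, k, n)
        print(k, n, slot.name if slot else "fallback")
```

Returning `None` rather than raising keeps the fallback decision with the caller, which fits a backend that only claims the operations it has a kernel for.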
Performance Ceiling Investigation
The developer attempted to push beyond 43.7 t/s decode with several approaches:
- Batch sweep N=1..64: No improvement (flat performance)
- Int4 double-quant: Killed SNR (44.8 → 19.7 dB) - dead end
- Cascade offload: Ruled out by AMD documentation
- Speculative decoding with Llama-3.2-1B draft: 44% accept rate, 212 t/s draft, but zero effective gain
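Speculative decoding produced no gain despite a 44% accept rate, which the developer takes as evidence that the bottleneck is LPDDR5 memory bandwidth rather than compute: each decoded token requires streaming the model weights from memory, so additional compute does not help. On this reading, the NPU is already at the memory wall, and 43.7 t/s is the practical ceiling for this model on this hardware.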
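The memory-wall argument can be checked with a back-of-envelope roofline: if every decoded token streams the full weight set once, decode throughput is bounded by bandwidth divided by weight size. The sketch below is hedged heavily, because neither the weight size nor the memory bandwidth appears in the post. The values of roughly 4.9 GB for the Q4_K_M file and 256 GB/s theoretical peak are assumptions to replace with measured numbers. The model also ignores KV-cache and activation traffic.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode t/s if each token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


def implied_bandwidth_gb_s(tokens_per_s: float, weights_gb: float) -> float:
    """Effective bandwidth a measured decode rate implies under the same model."""
    return tokens_per_s * weights_gb


# ASSUMPTIONS, not from the source: approximate file size and peak bandwidth.
ASSUMED_WEIGHTS_GB = 4.9
ASSUMED_PEAK_BW_GB_S = 256.0

# Measured decode throughput reported in the results above.
MEASURED_TPS = 43.7

ceiling = decode_ceiling_tps(ASSUMED_PEAK_BW_GB_S, ASSUMED_WEIGHTS_GB)
implied = implied_bandwidth_gb_s(MEASURED_TPS, ASSUMED_WEIGHTS_GB)

print(f"theoretical ceiling: {ceiling:.1f} t/s")
print(f"implied bandwidth:   {implied:.0f} GB/s "
      f"({implied / ASSUMED_PEAK_BW_GB_S:.0%} of assumed peak)")
```

If the assumed inputs are in the right range, the measured rate sits at a large fraction of the theoretical bound, which would be consistent with a bandwidth-limited decode path.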
Project Links
- GitHub: https://github.com/BrandedTamarasu-glitch/OllamaAMDNPU
- Changelog: https://brandedtamarasu-glitch.github.io/OllamaAMDNPU/xdna-npu/
The project was built with Claude Sonnet 4.6 / Claude Code, disclosed for reproducibility purposes. The developer is seeking feedback from others running Strix Halo or Phoenix with the amdxdna driver to compare decode throughput on comparable quants and determine if other XDNA2 configurations encounter the same performance ceiling.
📖 Read the full source: r/LocalLLaMA