hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

✍️ OpenClawRadar · 📅 Published: May 25, 2026 · 🔗 Source

hipEngine is a new ROCm-native inference engine for Qwen 3.6 MoE and dense models, from the developer behind FastDMS and ParoQuant. It is written in Python with hot paths in HIP/C++, and it builds on AMD-native libraries such as hipBLASLt, hipGraph, and AOTriton, with no PyTorch dependency in the hot path.

Target Hardware

  • gfx1100: Radeon RX 7900 XTX / Radeon Pro W7900 (RDNA3). Strix Halo is also supported.

Benchmarks vs llama.cpp

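On Qwen 3.6 35B MoE (using ParoQuant 4.68 bpw and GGUF Q4_K_S), hipEngine is reported to match or beat llama.cpp HIP and Vulkan at all tested context lengths (512 to 128K). Note that in the 512-token figures below, the hipEngine GGUF path trails llama.cpp HIP, so the claim holds most clearly for the PARO path. Prefill throughput with a 512-token prompt and 128 generated tokens: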

  • hipEngine PARO: 2718.497 tok/s
  • hipEngine GGUF Q4_K_S: 2258.847 tok/s
  • llama.cpp HIP: 2436.049 tok/s
  • llama.cpp Vulkan: 1816.927 tok/s

At 128K context, hipEngine PARO prefill reaches 1055 tok/s versus 710 tok/s for llama.cpp HIP, roughly 49% faster. Decode throughput is comparable, in the 60–127 tok/s range.
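
To make the relative differences easier to read, the quoted prefill figures can be turned into percentages against llama.cpp HIP. This is a small arithmetic sketch using only the numbers in this article; the helper name `percent_change` is made up for illustration and is not part of hipEngine or llama.cpp.

```python
# Prefill throughput figures quoted in this article (tok/s).
prefill_512 = {
    "hipEngine PARO": 2718.497,
    "hipEngine GGUF Q4_K_S": 2258.847,
    "llama.cpp HIP": 2436.049,
    "llama.cpp Vulkan": 1816.927,
}
prefill_128k = {"hipEngine PARO": 1055.0, "llama.cpp HIP": 710.0}


def percent_change(new: float, baseline: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# 512-token prompt: every backend compared against llama.cpp HIP.
baseline_512 = prefill_512["llama.cpp HIP"]
for name, tok_s in prefill_512.items():
    delta = percent_change(tok_s, baseline_512)
    print(f"{name:24s} {delta:+6.1f}% vs llama.cpp HIP")

# 128K context: hipEngine PARO against llama.cpp HIP.
gain_128k = percent_change(
    prefill_128k["hipEngine PARO"], prefill_128k["llama.cpp HIP"]
)
print(f"128K prefill: {gain_128k:+.1f}% vs llama.cpp HIP")
```

The gap between the PARO path and llama.cpp HIP is much larger at 128K than at 512 tokens, which is where the long-context claim comes from.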

Memory Efficiency

hipEngine uses a near-lossless INT8 KV cache with almost no speed penalty. According to the author, this allows running the full Qwen 3.6 256K context window in under 24 GB on a single 7900 XTX. Reported figures at 128K context:

  • 128K context, BF16 KV: sampled peak 21.04 GiB, prefill 1091.9 tok/s, decode 62.2 tok/s
  • 128K context, INT8 KV: sampled peak 19.80 GiB, prefill 1076.5 tok/s, decode 60.0 tok/s
  • Peak memory at 128K (hipEngine PARO): 22.122 GiB vs llama.cpp HIP 23.605 GiB
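
The trade-off in the list above can be quantified directly: how much memory the INT8 KV cache saves and what it costs in throughput. The sketch below only does arithmetic on the quoted 128K figures; the dictionary layout and variable names are illustrative assumptions, not hipEngine output.

```python
# 128K-context figures quoted in this article.
bf16 = {"peak_gib": 21.04, "prefill_tok_s": 1091.9, "decode_tok_s": 62.2}
int8 = {"peak_gib": 19.80, "prefill_tok_s": 1076.5, "decode_tok_s": 60.0}

# Memory saved by switching the KV cache from BF16 to INT8.
saved_gib = bf16["peak_gib"] - int8["peak_gib"]
saved_pct = saved_gib / bf16["peak_gib"] * 100.0

# Throughput lost, as a percentage of the BF16 figure.
prefill_cost_pct = (1.0 - int8["prefill_tok_s"] / bf16["prefill_tok_s"]) * 100.0
decode_cost_pct = (1.0 - int8["decode_tok_s"] / bf16["decode_tok_s"]) * 100.0

print(f"Peak memory saved: {saved_gib:.2f} GiB ({saved_pct:.1f}%)")
print(f"Prefill slowdown:  {prefill_cost_pct:.1f}%")
print(f"Decode slowdown:   {decode_cost_pct:.1f}%")
```

The saving is measured on sampled peak memory for the whole process, so it includes weights and activations, not just the KV cache.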

Features

  • AGPLv3 open source
  • ROCm-native, no PyTorch dependency in hot path
  • Uses hipBLASLt, hipGraph, AOTriton
  • ParoQuant ported to ROCm
  • INT8 KV cache (near-lossless, minimal speed impact)
  • Supports Qwen 3.6 MoE and dense models
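
For readers unfamiliar with what an INT8 KV cache involves, here is a minimal sketch of symmetric absmax quantization, one common way to store values in 8 bits. The article does not describe hipEngine's actual scheme, so this is an assumed stand-in for illustration only: the function names, the per-vector scale, and the sample values are all hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one vector to the range [-127, 127].

    Assumption: one scale per vector. Real engines may use per-head,
    per-channel, or per-block scales instead.
    """
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127.0 if absmax > 0.0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point values from INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical slice of a key or value vector.
kv_slice = [0.5, -1.27, 0.03, 1.0, -0.42]
codes, scale = quantize_int8(kv_slice)
restored = dequantize_int8(codes, scale)

# The round-trip error per element is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(kv_slice, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

Each stored value takes one byte instead of two for BF16, at the cost of a small bounded rounding error plus the storage for the scales.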

If you're running Qwen 3.6 on RDNA3 hardware, hipEngine is worth a look, especially for memory-constrained long-context use cases up to 256K.

📖 Read the full source: r/LocalLLaMA