hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

hipEngine is a new ROCm-native inference engine for Qwen 3.6 MoE and dense models, from the developer behind FastDMS and ParoQuant. It is Python-based with hot paths in HIP/C++, and it uses AMD-native libraries such as hipBLASLt, hipGraph, and AOTriton rather than depending heavily on PyTorch.
Target Hardware
gfx1100: Radeon RX 7900 XTX / Radeon Pro W7900 (RDNA3). Strix Halo is also supported.
Benchmarks vs llama.cpp
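On Qwen 3.6 35B MoE (hipEngine tested with both ParoQuant at 4.68 bpw and GGUF Q4_K_S), hipEngine is reported to match or beat llama.cpp HIP and Vulkan across the tested context lengths (512 to 128K). In the 512-token figures below, the ParoQuant path leads, while hipEngine's GGUF Q4_K_S path trails llama.cpp HIP. Prefill throughput with a 512-token prompt and 128 generated tokens: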
- hipEngine PARO: 2718.497 tok/s
- hipEngine GGUF Q4_K_S: 2258.847 tok/s
- llama.cpp HIP: 2436.049 tok/s
- llama.cpp Vulkan: 1816.927 tok/s
At 128K context, hipEngine PARO prefill reaches 1055 tok/s versus 710 tok/s for llama.cpp HIP, roughly 48% faster. Decode throughput is comparable, in the 60–127 tok/s range.
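To put the prefill figures on a common footing, here is a small sketch that expresses each one as a percentage relative to llama.cpp HIP. The numbers are copied from the list above; the dictionary layout and the `relative_gain` helper are illustrative names, not part of hipEngine or llama.cpp.

```python
# Prefill throughput (tok/s) quoted above: 512-token prompt, 128 generated tokens.
PREFILL_512 = {
    "hipEngine PARO": 2718.497,
    "hipEngine GGUF Q4_K_S": 2258.847,
    "llama.cpp HIP": 2436.049,
    "llama.cpp Vulkan": 1816.927,
}


def relative_gain(candidate: float, baseline: float) -> float:
    """Percentage by which candidate exceeds baseline (negative if slower)."""
    return (candidate / baseline - 1.0) * 100.0


baseline = PREFILL_512["llama.cpp HIP"]
gains = {name: relative_gain(rate, baseline) for name, rate in PREFILL_512.items()}

for name, gain in gains.items():
    print(f"{name:24s} {gain:+6.1f}% vs llama.cpp HIP")

# 128K-context prefill, hipEngine PARO vs llama.cpp HIP (rounded figures from the text).
gain_128k = relative_gain(1055.0, 710.0)
print(f"128K prefill gain: {gain_128k:.1f}%")
```

With these inputs, the ParoQuant path comes out a little under 12% ahead of llama.cpp HIP at 512 tokens, the GGUF Q4_K_S path about 7% behind, and the 128K gap works out to just under 49% from the rounded figures.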
Memory Efficiency
hipEngine uses a near-lossless INT8 KV cache with almost no speed penalty. According to the author, this allows running the full Qwen 3.6 256K context window in under 24 GB on a single 7900 XTX. Reported measurements at 128K context:
- 128K context, BF16 KV: sampled peak 21.04 GiB, prefill 1091.9 tok/s, decode 62.2 tok/s
- 128K context, INT8 KV: sampled peak 19.80 GiB, prefill 1076.5 tok/s, decode 60.0 tok/s
- Peak memory at 128K (hipEngine PARO): 22.122 GiB vs llama.cpp HIP 23.605 GiB
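A rough way to relate the 128K measurements to the 256K claim is to back out the KV cache size from the BF16 versus INT8 difference. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the whole difference in sampled peak memory comes from KV elements shrinking from 2 bytes to 1 byte, that KV memory grows linearly with context, and that everything else (weights, activations, quantization scales) stays fixed. All variable and function names are illustrative.

```python
# Sampled peak memory (GiB) at 128K context, as quoted above.
PEAK_BF16_GIB = 21.04
PEAK_INT8_GIB = 19.80
CONTEXT_TOKENS = 128 * 1024

# Assumption: halving the element size (2 bytes -> 1 byte) explains the whole saving,
# so the INT8 KV cache is about as large as the saving itself.
kv_int8_gib = PEAK_BF16_GIB - PEAK_INT8_GIB
kv_bf16_gib = 2 * kv_int8_gib
fixed_gib = PEAK_INT8_GIB - kv_int8_gib  # weights, activations, buffers (assumed constant)

# KV cost per token with 1-byte elements, in KiB.
kv_int8_kib_per_token = kv_int8_gib * 1024 * 1024 / CONTEXT_TOKENS


def estimated_peak_gib(context_tokens: int, bytes_per_element: int) -> float:
    """Linear extrapolation: fixed cost plus KV cost scaled by context and element size."""
    kv_gib = kv_int8_gib * bytes_per_element * (context_tokens / CONTEXT_TOKENS)
    return fixed_gib + kv_gib


print(f"Estimated INT8 KV at 128K: {kv_int8_gib:.2f} GiB")
print(f"Estimated INT8 KV per token: {kv_int8_kib_per_token:.2f} KiB")
print(f"Estimated peak at 256K, INT8 KV: {estimated_peak_gib(256 * 1024, 1):.2f} GiB")
print(f"Estimated peak at 256K, BF16 KV: {estimated_peak_gib(256 * 1024, 2):.2f} GiB")
```

Under these assumptions, a 256K context with INT8 KV lands at about the same peak as 128K with BF16 KV, roughly 21 GiB. Real usage may differ, since activation and scratch buffers can also grow with context length.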
Features
- AGPLv3 open source
- ROCm-native, no PyTorch dependency in hot path
- Uses hipBLASLt, hipGraph, AOTriton
- ParoQuant ported to ROCm
- INT8 KV cache (near-lossless, minimal speed impact)
- Supports Qwen 3.6 MoE and dense models
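To illustrate why an INT8 KV cache can be close to lossless, here is a minimal sketch of symmetric per-vector INT8 quantization in plain Python. This is a generic textbook scheme, not hipEngine's actual implementation: the source does not describe its scale granularity or rounding, and the function names and sample vector are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-vector quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale per vector; fall back to 1.0 for an all-zero vector.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from integer codes and the stored scale."""
    return [c * scale for c in codes]


# Hypothetical key vector for one attention head at one position.
key_vector = [0.12, -1.50, 0.73, 0.004, 2.10, -0.33]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print("codes:", codes)
print("scale:", scale)
print("max abs error:", max_error)
```

Each stored element takes 1 byte instead of 2 for BF16, plus one scale per vector, and the rounding error per element is bounded by half the scale. That is the basic trade that lets the cache shrink with little accuracy loss.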
If you're running Qwen 3.6 on RDNA3 hardware, hipEngine is worth a look, especially for memory-constrained 256K context use cases.
📖 Read the full source: r/LocalLLaMA