hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

✍️ OpenClawRadar | 📅 Published: May 25, 2026 | 🔗 Source

A new ROCm-native inference engine for Qwen 3.6 MoE and dense models has appeared: hipEngine, by the developer behind FastDMS and ParoQuant. It is Python-based with its hot paths written in HIP/C++, uses AMD-native libraries such as hipBLASLt, hipGraph, and AOTriton, and avoids a heavy PyTorch dependency.

Target Hardware

gfx1100: Radeon RX 7900 XTX / Radeon Pro W7900 (RDNA3). Strix Halo is also supported.

Benchmarks vs llama.cpp

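On Qwen 3.6 35B MoE (ParoQuant at 4.68 bpw and GGUF Q4_K_S), hipEngine is reported to match or beat llama.cpp HIP and Vulkan at all tested context lengths (512–128K). In the 512-token figures listed here, that holds for the PARO path; hipEngine's GGUF Q4_K_S path trails llama.cpp HIP while still beating Vulkan. Prefill throughput for a 512-token prompt with 128 generated tokens: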

hipEngine PARO: 2718.497 tok/s
hipEngine GGUF Q4_K_S: 2258.847 tok/s
llama.cpp HIP: 2436.049 tok/s
llama.cpp Vulkan: 1816.927 tok/s

At 128K context, hipEngine PARO prefill reaches 1055 tok/s vs. 710 tok/s for llama.cpp HIP, an improvement of roughly 48–49% (1055 / 710 ≈ 1.486). Decode throughput is comparable between the engines, in the 60–127 tok/s range.
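The relative differences can be checked directly from the quoted throughput figures. The snippet is a minimal sketch: the numbers are copied from this article, while the helper names (`relative_gain`, `gains_vs`) are made up for illustration and are not part of hipEngine or llama.cpp.

```python
# Prefill throughput (tok/s) copied from the figures quoted in this article.
PREFILL_512 = {
    "hipEngine PARO": 2718.497,
    "hipEngine GGUF Q4_K_S": 2258.847,
    "llama.cpp HIP": 2436.049,
    "llama.cpp Vulkan": 1816.927,
}
PREFILL_128K = {"hipEngine PARO": 1055.0, "llama.cpp HIP": 710.0}


def relative_gain(candidate: float, baseline: float) -> float:
    """Percentage by which candidate exceeds baseline (negative if slower)."""
    return (candidate / baseline - 1.0) * 100.0


def gains_vs(baseline_name: str, table: dict[str, float]) -> dict[str, float]:
    """Percentage gain of every other entry over the named baseline."""
    base = table[baseline_name]
    return {
        name: round(relative_gain(value, base), 1)
        for name, value in table.items()
        if name != baseline_name
    }


if __name__ == "__main__":
    for label, table in (("512", PREFILL_512), ("128K", PREFILL_128K)):
        for name, pct in gains_vs("llama.cpp HIP", table).items():
            print(f"{label} prompt, {name}: {pct:+.1f}% vs llama.cpp HIP")
```

With llama.cpp HIP as the baseline, the PARO path comes out ahead at both context lengths, while the GGUF Q4_K_S path is behind at the 512-token prompt.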

Memory Efficiency

hipEngine uses a near-lossless INT8 KV cache with almost no speed penalty. This is reported to allow running the full Qwen 3.6 256K context window in under 24 GB on a single 7900 XTX. The measurements quoted here are at 128K context:

128K context, BF16 KV: sampled peak 21.04 GiB, prefill 1091.9 tok/s, decode 62.2 tok/s
128K context, INT8 KV: sampled peak 19.80 GiB, prefill 1076.5 tok/s, decode 60.0 tok/s
Peak memory at 128K (hipEngine PARO): 22.122 GiB vs llama.cpp HIP 23.605 GiB
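To put the BF16 vs. INT8 trade-off in concrete terms, the sketch below computes the differences from the two 128K runs listed above. The figures come from this article; the `Run` class and helper functions are hypothetical names used only for this illustration. Note that the memory figure is the sampled peak for the whole process, not the size of the KV cache alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One 128K-context measurement as quoted in the article."""
    peak_gib: float      # sampled peak memory, GiB
    prefill_tps: float   # prefill throughput, tok/s
    decode_tps: float    # decode throughput, tok/s


BF16_KV = Run(peak_gib=21.04, prefill_tps=1091.9, decode_tps=62.2)
INT8_KV = Run(peak_gib=19.80, prefill_tps=1076.5, decode_tps=60.0)


def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


def summarize(base: Run, other: Run) -> dict[str, float]:
    """Memory saved and throughput change when moving from base to other."""
    return {
        "memory_saved_gib": round(base.peak_gib - other.peak_gib, 2),
        "prefill_change_pct": round(pct_change(other.prefill_tps, base.prefill_tps), 1),
        "decode_change_pct": round(pct_change(other.decode_tps, base.decode_tps), 1),
    }


if __name__ == "__main__":
    for key, value in summarize(BF16_KV, INT8_KV).items():
        print(f"{key}: {value}")
```

On these numbers, INT8 KV trades a low single-digit percentage of throughput for a bit over 1 GiB of peak memory at 128K context.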

Features

AGPLv3 open source
ROCm-native, no PyTorch dependency in hot path
Uses hipBLASLt, hipGraph, AOTriton
ParoQuant ported to ROCm
INT8 KV cache (near-lossless, minimal speed impact)
Supports Qwen 3.6 MoE and dense models
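For readers unfamiliar with why an INT8 KV cache can be close to lossless, here is a minimal sketch of symmetric absmax quantization applied to one row of values. This is a generic textbook scheme chosen for illustration; the article does not describe hipEngine's actual quantization layout, so the function names, the per-row granularity, and the sample data are all assumptions.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: returns (int8 codes, scale).

    Assumption: one scale per row; real engines may use per-head or
    per-block scales instead.
    """
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Clamp to the symmetric int8 range [-127, 127] after rounding.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_abs_error(values):
    """Largest round-trip error for one row; bounded by about scale / 2."""
    codes, scale = quantize_int8(values)
    restored = dequantize_int8(codes, scale)
    return max((abs(a - b) for a, b in zip(values, restored)), default=0.0)


if __name__ == "__main__":
    row = [0.5, -1.27, 0.013, 1.0]  # made-up sample values
    codes, scale = quantize_int8(row)
    print("codes:", codes)
    print("scale:", scale)
    print("max abs error:", max_abs_error(row))
```

Each stored value takes one byte instead of two for BF16, plus a small overhead for the scales, and the rounding error per element stays within about half a quantization step.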

If you're running Qwen 3.6 on RDNA3 hardware, hipEngine is worth a look — especially for memory-constrained 256K context use cases.

📖 Read the full source: r/LocalLLaMA
