Qwen3.5 Benchmarks: Apple Silicon vs ROCm vs Vulkan AMD

Hardware and Software Setup

The benchmark compared three systems: a MacBook Pro with Apple M5 Max (48GB unified memory), a Mac Studio with Apple M1 Max (64GB unified memory), and a Fedora 43 GPU server with Intel Core Ultra 7 265K processor and three AMD GPUs: Radeon Pro W7900 (48GB, RDNA 3), Radeon AI PRO R9700 (32GB, RDNA 4), and Radeon Pro W6800 (32GB, RDNA 2). The motherboard provided x8/x8/x4 electrical connections, with the W6800 on a chipset-connected x4 slot bottlenecked by the DMI link.

Inference Engines and Models

Apple systems used mlx-lm (versions 0.31.1 and 0.31.0). The Fedora server ran llama.cpp with both HIP/ROCm build (b5065) and AMDVLK Vulkan build (b5065). ROCm version was 7.2, AMDVLK version was 2025.Q2.1. All Fedora runs used a single GPU except the 122B model which used W7900 + R9700 with --split-mode layer.

Models tested were Qwen3.5-35B-A3B MoE (3B active params, mlx-community 4-bit or unsloth Q4_K_M), Qwen3.5-27B dense (27B params, mlx-community 4-bit or unsloth Q4_K_M), and Qwen3.5-122B-A10B MoE (10B active params, unsloth Q3_K_XL).

Benchmark Methodology

The benchmark reflected pharmacovigilance data analysis use cases: writing extraction scripts, reasoning about clinical data, generating regulatory narratives, and structured data extraction from clinical text. Prompts were domain-specific, not general-purpose LLM benchmarks.

Standard benchmark used 8K context with 7 prompts: 2 prompt-processing tests (short ~27 token and long ~2.9K token input with minimal output to isolate prefill speed) and 5 generation tasks (short coding, medium coding, math reasoning, regulatory safety narrative writing, structured AE extraction). Single-user, single-request, temperature 0.3, /no_think to disable thinking mode, no prompt caching between requests.

Context-scaling benchmark used the same model and GPU with progressively larger prompts (512 to 16K+ tokens) consisting of synthetic adverse event listings, with only 64 max output tokens to isolate how prompt processing and generation scale with input size.

Key Findings

The benchmark revealed interesting ROCm vs AMDVLK Vulkan findings, including context-scaling tests showing when each backend performs best. The source notes that most available comparisons don't help decide between configurations like an M5 Max laptop and a W7900 workstation, or whether ROCm is worth the setup hassle over Vulkan.

📖 Read the full source: r/LocalLLaMA