Benchmark Results: Qwen3.5 Models on Apple Silicon vs AMD GPUs with ROCm vs Vulkan

Hardware and Software Setup
The benchmark compared three systems: a MacBook Pro with an Apple M5 Max (48GB unified memory), a Mac Studio with an Apple M1 Max (64GB unified memory), and a Fedora 43 GPU server with an Intel Core Ultra 7 265K processor and three AMD GPUs: Radeon Pro W7900 (48GB, RDNA 3), Radeon AI PRO R9700 (32GB, RDNA 4), and Radeon Pro W6800 (32GB, RDNA 2). The motherboard provided x8/x8/x4 electrical connections; the W6800 sat in a chipset-connected x4 slot and was bottlenecked by the DMI link.
Inference Engines and Models
The Apple systems used mlx-lm (versions 0.31.1 and 0.31.0). The Fedora server ran llama.cpp with both a HIP/ROCm build (b5065) and an AMDVLK Vulkan build (b5065). The ROCm version was 7.2 and the AMDVLK version was 2025.Q2.1. All Fedora runs used a single GPU, except the 122B model, which was split across the W7900 and R9700 with --split-mode layer.
Models tested were Qwen3.5-35B-A3B MoE (3B active params, mlx-community 4-bit or unsloth Q4_K_M), Qwen3.5-27B dense (27B params, mlx-community 4-bit or unsloth Q4_K_M), and Qwen3.5-122B-A10B MoE (10B active params, unsloth Q3_K_XL).
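As a rough sanity check on which models fit on which GPU, the weight footprint can be estimated as total parameters times average bits per weight, divided by 8. The sketch below takes the total parameter counts from the model names; the bits-per-weight figures are assumptions for illustration, not values from the source, and the estimate ignores KV cache and activation memory.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# The bits-per-weight values below are ASSUMED averages, not figures from the source.

def weight_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = total_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

# (model, total params in billions from the model name, assumed bits per weight)
MODELS = [
    ("Qwen3.5-35B-A3B", 35, 4.5),
    ("Qwen3.5-27B", 27, 4.5),
    ("Qwen3.5-122B-A10B", 122, 3.5),
]

for name, params_b, bpw in MODELS:
    print(f"{name}: ~{weight_gb(params_b, bpw):.1f} GB of weights")
```

Under these assumed figures the 122B weights alone would exceed the 48GB of a single W7900, which would be consistent with that model being the only one run across two GPUs.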
Benchmark Methodology
The benchmark reflected pharmacovigilance data analysis use cases: writing extraction scripts, reasoning about clinical data, generating regulatory narratives, and structured data extraction from clinical text. Prompts were domain-specific, not general-purpose LLM benchmarks.
The standard benchmark used an 8K context with 7 prompts: 2 prompt-processing tests (a short ~27-token input and a long ~2.9K-token input, each with minimal output to isolate prefill speed) and 5 generation tasks (short coding, medium coding, math reasoning, regulatory safety narrative writing, and structured AE extraction). Runs were single-user and single-request at temperature 0.3, with /no_think to disable thinking mode and no prompt caching between requests.
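The two kinds of test map onto two throughput numbers: prefill tokens per second and generation tokens per second. A minimal sketch of how they could be derived from per-request timings follows. The `RunTiming` type and its field names are hypothetical, not taken from mlx-lm or llama.cpp, and the sketch approximates prefill time as time to first token.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timings for one request (hypothetical structure for illustration)."""
    prompt_tokens: int      # tokens in the input prompt
    output_tokens: int      # tokens generated
    prefill_seconds: float  # time to first output token (approximates prefill)
    total_seconds: float    # wall-clock time for the whole request


def prefill_tps(t: RunTiming) -> float:
    """Prompt-processing speed in tokens per second."""
    return t.prompt_tokens / t.prefill_seconds


def generation_tps(t: RunTiming) -> float:
    """Decode speed in tokens per second, excluding the prefill phase."""
    decode_seconds = t.total_seconds - t.prefill_seconds
    return t.output_tokens / decode_seconds


# Placeholder numbers only, NOT measurements from the benchmark.
example = RunTiming(prompt_tokens=2900, output_tokens=8,
                    prefill_seconds=2.0, total_seconds=2.5)
print(f"prefill: {prefill_tps(example):.1f} tok/s")
print(f"generation: {generation_tps(example):.1f} tok/s")
```

Keeping the output minimal on the prompt-processing tests means the total time is dominated by prefill, so the first number is measured with little noise from decoding.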
The context-scaling benchmark held the model and GPU fixed and used progressively larger prompts (512 to 16K+ tokens) of synthetic adverse event listings, with output capped at 64 tokens to isolate how prompt processing and generation scale with input size.
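One way such a prompt ladder could be produced is sketched below. Everything in it is an assumption for illustration: the listing format, the placeholder drug and event names, the intermediate sizes between 512 and 16K, and the crude 4-characters-per-token estimate (a real run would count tokens with the model's tokenizer).

```python
import random

DRUGS = ["DrugA", "DrugB", "DrugC"]                    # placeholder names
EVENTS = ["headache", "nausea", "rash", "dizziness"]   # placeholder events
CHARS_PER_TOKEN = 4  # crude assumption, not a tokenizer measurement


def synthetic_ae_listing(target_tokens: int, seed: int = 0) -> str:
    """Build a synthetic adverse event listing of roughly target_tokens tokens."""
    rng = random.Random(seed)  # seeded so every run sees the same prompts
    target_chars = target_tokens * CHARS_PER_TOKEN
    lines = []
    size = 0
    case_id = 1
    while size < target_chars:
        line = (f"Case {case_id:05d}: {rng.choice(DRUGS)}, "
                f"{rng.choice(EVENTS)}, onset day {rng.randint(1, 90)}")
        lines.append(line)
        size += len(line) + 1  # +1 for the newline separator
        case_id += 1
    return "\n".join(lines)


# Assumed doubling ladder; the source only states "512 to 16K+ tokens".
TARGETS = [512, 1024, 2048, 4096, 8192, 16384]
prompts = {n: synthetic_ae_listing(n) for n in TARGETS}

for n, text in prompts.items():
    print(f"target {n:>5} tokens -> {len(text)} characters")
```

Each prompt would then be sent with the output limit set to 64 tokens, so that differences between runs come from input size rather than from how much text the model generates.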
Key Findings
The benchmark compared ROCm against AMDVLK Vulkan on the same hardware, including context-scaling tests that show when each backend performs best. The source notes that most available comparisons do not help with choosing between configurations such as an M5 Max laptop and a W7900 workstation, or with deciding whether ROCm is worth the setup hassle over Vulkan.
📖 Read the full source: r/LocalLLaMA