Apple Silicon Benchmark: Qwen3-VL Performance on M3, M4, and M5 Max for Vision LLM Classification

Benchmark Setup and Hardware
A vision LLM classification pipeline was tested on technical drawings (PDFs at various megapixel resolutions) using LM Studio with the MLX backend and streaming enabled. Every machine ran the same 53-file test dataset with the same prompt. The task is classification: the model analyzes an image and returns a short structured JSON response (~300-400 tokens), so inference is heavily prefill-dominated and spends comparatively little time generating tokens.
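Because the comparison rests on time to first token (TTFT) versus total time, it helps to see how those two numbers can be taken from a streamed response. The sketch below is not the benchmark's actual harness, which the source does not show. `time_stream` and `fake_stream` are hypothetical names, and the fake stream merely imitates a long prefill followed by quick chunks; in practice the iterable would be whatever chunk stream the client library returns.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float   # seconds until the first chunk arrived
    total_s: float  # seconds until the stream was exhausted
    chunks: int     # number of streamed chunks (only a rough proxy for tokens)


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a stream of text chunks and record TTFT and total time."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock()  # timestamp of the first streamed chunk
        count += 1
    end = clock()
    # If nothing was streamed, report TTFT as the full duration.
    ttft = (first if first is not None else end) - start
    return StreamTiming(ttft, end - start, count)


def fake_stream(prefill_s: float, n_chunks: int, per_chunk_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: one long wait, then quick chunks."""
    time.sleep(prefill_s)
    for i in range(n_chunks):
        if i:
            time.sleep(per_chunk_s)
        yield "tok"


timing = time_stream(fake_stream(0.05, 5, 0.002))
print(f"TTFT share of total: {timing.ttft_s / timing.total_s:.0%}")
```

Passing the clock as a parameter keeps the helper easy to check with a deterministic fake clock instead of real sleeps.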
Hardware tested:
- M3 Max: 40 GPU cores, 48 GB RAM, 400 GB/s memory bandwidth
- M4 Max Studio: 40 GPU cores, 64 GB RAM, 546 GB/s memory bandwidth
- M5 Max: 40 GPU cores, 64 GB RAM, 614 GB/s memory bandwidth
Models Tested
- Qwen3-VL 8B: 8B parameters, 4-bit MLX quantization, ~5.8 GB on disk
- Qwen3.5 9B: 9B parameters (dense, hybrid attention), 4-bit MLX quantization, ~6.2 GB on disk
- Qwen3-VL 32B: 32B parameters, 4-bit MLX quantization, ~18 GB on disk
8B Model Results
Total time per image for Qwen3-VL 8B (4-bit):
- 4 MP: M3 Max 48GB: 16.5s, M4 Studio 64GB: 15.8s, M5 Max 64GB: 9.0s (M5 is 83% faster than M3)
- 5 MP: M3 Max: 20.3s, M4 Studio: 19.8s, M5 Max: 11.5s (77% faster)
- 6 MP: M3 Max: 24.1s, M4 Studio: 24.4s, M5 Max: 14.0s (72% faster)
- 7.5 MP: M4 Studio: 32.7s, M5 Max: 20.3s
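The M3 Max and M4 Studio are essentially tied on the 8B model: total inference times are within about 1-4% of each other (the M4 is even slightly slower at 6 MP), despite the M4 having 37% more memory bandwidth. The M5 Max is roughly 72-83% faster than both at 4-6 MP, and about 61% faster than the M4 Studio at 7.5 MP.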
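The "percent faster" figures quoted for the 8B model can be reproduced from the timings with a few lines of arithmetic. In this sketch "X% faster" is taken to mean the ratio of baseline time to candidate time, minus one; that reading is an assumption, but it matches the 83%, 77%, and 72% figures quoted above. The helper name `percent_faster` is illustrative.

```python
# Total seconds per image for Qwen3-VL 8B (4-bit), copied from the list above.
# The 7.5 MP row has no M3 Max value in the source, so it is simply omitted.
TIMES_8B = {
    4.0: {"M3 Max": 16.5, "M4 Max Studio": 15.8, "M5 Max": 9.0},
    5.0: {"M3 Max": 20.3, "M4 Max Studio": 19.8, "M5 Max": 11.5},
    6.0: {"M3 Max": 24.1, "M4 Max Studio": 24.4, "M5 Max": 14.0},
    7.5: {"M4 Max Studio": 32.7, "M5 Max": 20.3},
}


def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """How much faster the candidate is, as a throughput ratio minus one."""
    return (baseline_s / candidate_s - 1.0) * 100.0


for megapixels, row in TIMES_8B.items():
    for baseline in ("M3 Max", "M4 Max Studio"):
        if baseline in row:
            gain = percent_faster(row[baseline], row["M5 Max"])
            print(f"{megapixels} MP: M5 Max vs {baseline}: {gain:.0f}% faster")
```

The same helper applied to the M3 Max and M4 Studio columns shows how small the gap between those two machines is.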
Why M3 and M4 Have Similar Speed
Prefill (prompt processing) is compute-bound: it depends on GPU compute throughput rather than memory bandwidth. The M3 Max and M4 Max both have 40 GPU cores, and their prefill speed is nearly identical. For vision models, prefill dominates: TTFT (time to first token) is 70-85% of total inference time because the vision encoder does heavy compute work per image.
The M4 does show its bandwidth advantage in token generation: 76-80 T/s vs the M3's 60-64 T/s (about 25% faster), which is in the same direction as, though smaller than, the 37% bandwidth gap (546 vs 400 GB/s). However, for classification tasks with short outputs (~300-400 tokens), generation is only ~15% of total time, so the 25% generation speed advantage translates to just 3-5% end-to-end improvement.
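The step from a 25% faster generation stage to a low single-digit end-to-end gain is Amdahl's law. A minimal sketch, using the ~15% generation share quoted above and, as an upper bound, the 30% implied by TTFT being as low as 70% of total time (the function name is illustrative):

```python
def end_to_end_speedup(stage_fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one stage gets faster.

    stage_fraction: share of the original total time spent in that stage (0..1)
    stage_speedup:  how many times faster that stage becomes (1.25 = 25% faster)
    """
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)


for generation_share in (0.15, 0.30):
    gain = end_to_end_speedup(generation_share, 1.25) - 1.0
    print(f"generation = {generation_share:.0%} of total -> {gain:.1%} faster end-to-end")
```

The smaller the share of time spent generating, the less a faster generation stage can move the total.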
32B Model Results
Total time per image for Qwen3-VL 32B (4-bit):
- 2 MP: M3 Max 48GB: 47.6s, M4 Studio 64GB: 35.3s, M5 Max 64GB: 21.2s
- 4 MP: M3 Max: 63.2s, M4 Studio: 50.0s, M5 Max: 27.4s
- 5 MP: M3 Max: 72.9s, M4 Studio: 59.2s, M5 Max: 30.7s
- 6 MP: M3 Max: 85.3s, M4 Studio: 78.0s, M5 Max: 35.6s
For longer generation tasks like summarization, description, or code generation, the M4's bandwidth advantage would matter more than in this classification workload.
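To make that concrete, here is a rough model of total time as TTFT plus output tokens divided by generation speed. The generation speeds are midpoints of the ranges reported for the 8B model. The 12-second TTFT is a purely hypothetical value, assumed equal on both machines, and the output lengths are illustrative rather than measured.

```python
ASSUMED_TTFT_S = 12.0  # hypothetical prefill time, assumed identical on M3 and M4
M3_TOKENS_PER_S = 62.0  # midpoint of the reported 60-64 T/s
M4_TOKENS_PER_S = 78.0  # midpoint of the reported 76-80 T/s


def total_time_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Simple two-stage model: fixed prefill time plus linear generation time."""
    return ttft_s + output_tokens / tokens_per_s


def m4_gain(output_tokens: int, ttft_s: float = ASSUMED_TTFT_S) -> float:
    """Fractional end-to-end advantage of the M4 over the M3 under this model."""
    m3 = total_time_s(ttft_s, output_tokens, M3_TOKENS_PER_S)
    m4 = total_time_s(ttft_s, output_tokens, M4_TOKENS_PER_S)
    return m3 / m4 - 1.0


for output_tokens in (350, 1000, 4000):
    print(f"{output_tokens} output tokens: M4 is {m4_gain(output_tokens):.0%} faster end-to-end")
```

Under this model the M4's advantage grows with output length and approaches, but never exceeds, the ratio of the two generation speeds (78/62, about 26%).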
📖 Read the full source: r/LocalLLaMA