Benchmark: MLX vs Ollama Running Qwen3-Coder-Next 8-Bit on M5 Max MacBook Pro

✍️ OpenClawRadar📅 Published: April 16, 2026🔗 Source
Benchmark: MLX vs Ollama Running Qwen3-Coder-Next 8-Bit on M5 Max MacBook Pro
Ad

A benchmark was conducted comparing two local inference backends—MLX (Apple's native ML framework) and Ollama (llama.cpp-based)—running the same Qwen3-Coder-Next model in 8-bit quantization on Apple Silicon. The goal was to measure raw throughput (tokens per second), time to first token (TTFT), and overall coding capability across real-world programming tasks.

Methodology

The setup used:

  • MLX backend: mlx-lm v0.29.1 serving mlx-community/Qwen3-Coder-Next-8bit via its built-in OpenAI-compatible HTTP server on port 8080.
  • Ollama backend: Ollama serving qwen3-coder-next:Q8_0 via its OpenAI-compatible API on port 11434.

Both backends were accessed through the same Python benchmark harness using the OpenAI client library with streaming enabled. Each test was run 3 iterations per prompt, with results averaged and excluding the first iteration's TTFT for the initial cold-start prompt (model load).

Test Suite

Six prompts covered a spectrum of coding tasks:

  • Short Completion: Write a palindrome check function (150 max tokens)
  • Medium Generation: Implement an LRU cache class with type hints (500 max tokens)
  • Long Reasoning: Explain async/await vs threading with examples (1000 max tokens)
  • Debug Task: Find and fix bugs in merge sort + binary search (800 max tokens)
  • Complex Coding: Thread-safe bounded blocking queue with context manager (1000 max tokens)
  • Code Review: Review 3 functions for performance/correctness/style (1000 max tokens)
Ad

Results

Throughput (Tokens per Second) on M5 Max with 128GB RAM:

  • Short Completion: Ollama 32.51 tok/s, MLX 69.62 tok/s (MLX +114%)
  • Medium Generation: Ollama 35.97 tok/s, MLX 78.28 tok/s (MLX +118%)
  • Long Reasoning: Ollama 40.45 tok/s, MLX 78.29 tok/s (MLX +94%)
  • Debug Task: Ollama 37.06 tok/s, MLX 74.89 tok/s (MLX +102%)
  • Complex Coding: Ollama 35.84 tok/s, MLX 76.99 tok/s (MLX +115%)
  • Code Review: Ollama 39.00 tok/s, MLX 74.98 tok/s (MLX +92%)

Overall average: MLX achieved approximately 72 tokens per second, roughly double Ollama's throughput. Metrics measured included tokens/sec (output tokens generated per second, higher is better), TTFT (time from request sent to first token received, lower is better), total time (wall-clock time for full response, lower is better), and memory usage measured via psutil.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also