MLX vs Ollama: Qwen3-Coder-Next 8-Bit Benchmark on M5 Max MacBook Pro

A benchmark was conducted comparing two local inference backends—MLX (Apple's native ML framework) and Ollama (llama.cpp-based)—running the same Qwen3-Coder-Next model in 8-bit quantization on Apple Silicon. The goal was to measure raw throughput (tokens per second), time to first token (TTFT), and overall coding capability across real-world programming tasks.

Methodology

The setup used:

MLX backend: mlx-lm v0.29.1 serving mlx-community/Qwen3-Coder-Next-8bit via its built-in OpenAI-compatible HTTP server on port 8080.
Ollama backend: Ollama serving qwen3-coder-next:Q8_0 via its OpenAI-compatible API on port 11434.

Both backends were accessed through the same Python benchmark harness using the OpenAI client library with streaming enabled. Each test was run 3 iterations per prompt, with results averaged and excluding the first iteration's TTFT for the initial cold-start prompt (model load).

Test Suite

Six prompts covered a spectrum of coding tasks:

Short Completion: Write a palindrome check function (150 max tokens)
Medium Generation: Implement an LRU cache class with type hints (500 max tokens)
Long Reasoning: Explain async/await vs threading with examples (1000 max tokens)
Debug Task: Find and fix bugs in merge sort + binary search (800 max tokens)
Complex Coding: Thread-safe bounded blocking queue with context manager (1000 max tokens)
Code Review: Review 3 functions for performance/correctness/style (1000 max tokens)

Results

Throughput (Tokens per Second) on M5 Max with 128GB RAM:

Short Completion: Ollama 32.51 tok/s, MLX 69.62 tok/s (MLX +114%)
Medium Generation: Ollama 35.97 tok/s, MLX 78.28 tok/s (MLX +118%)
Long Reasoning: Ollama 40.45 tok/s, MLX 78.29 tok/s (MLX +94%)
Debug Task: Ollama 37.06 tok/s, MLX 74.89 tok/s (MLX +102%)
Complex Coding: Ollama 35.84 tok/s, MLX 76.99 tok/s (MLX +115%)
Code Review: Ollama 39.00 tok/s, MLX 74.98 tok/s (MLX +92%)

Overall average: MLX achieved approximately 72 tokens per second, roughly double Ollama's throughput. Metrics measured included tokens/sec (output tokens generated per second, higher is better), TTFT (time from request sent to first token received, lower is better), total time (wall-clock time for full response, lower is better), and memory usage measured via psutil.

📖 Read the full source: r/LocalLLaMA

Benchmark: MLX vs Ollama Running Qwen3-Coder-Next 8-Bit on M5 Max MacBook Pro

Methodology

Test Suite

Results

👀 See Also

TeenyApp lets Claude build and deploy full-stack websites from a single chat link

Meeting Summarization on a 6GB GPU: qwen3.5:0.8B Works at 57s, Granite 4 350M Hallucinates

AI Agent Session Center: 3D Dashboard for Monitoring Claude Code Sessions

Parallel Agent Orchestrator for Claude Code Using Git Worktrees