Benchmark: MLX vs Ollama Running Qwen3-Coder-Next 8-Bit on M5 Max MacBook Pro

A benchmark was conducted comparing two local inference backends—MLX (Apple's native ML framework) and Ollama (llama.cpp-based)—running the same Qwen3-Coder-Next model in 8-bit quantization on Apple Silicon. The goal was to measure raw throughput (tokens per second), time to first token (TTFT), and overall coding capability across real-world programming tasks.
Methodology
The setup used:
- MLX backend: mlx-lm v0.29.1 serving mlx-community/Qwen3-Coder-Next-8bit via its built-in OpenAI-compatible HTTP server on port 8080.
- Ollama backend: Ollama serving qwen3-coder-next:Q8_0 via its OpenAI-compatible API on port 11434.
Both backends were accessed through the same Python benchmark harness using the OpenAI client library with streaming enabled. Each test was run 3 iterations per prompt, with results averaged and excluding the first iteration's TTFT for the initial cold-start prompt (model load).
Test Suite
Six prompts covered a spectrum of coding tasks:
- Short Completion: Write a palindrome check function (150 max tokens)
- Medium Generation: Implement an LRU cache class with type hints (500 max tokens)
- Long Reasoning: Explain async/await vs threading with examples (1000 max tokens)
- Debug Task: Find and fix bugs in merge sort + binary search (800 max tokens)
- Complex Coding: Thread-safe bounded blocking queue with context manager (1000 max tokens)
- Code Review: Review 3 functions for performance/correctness/style (1000 max tokens)
Results
Throughput (Tokens per Second) on M5 Max with 128GB RAM:
- Short Completion: Ollama 32.51 tok/s, MLX 69.62 tok/s (MLX +114%)
- Medium Generation: Ollama 35.97 tok/s, MLX 78.28 tok/s (MLX +118%)
- Long Reasoning: Ollama 40.45 tok/s, MLX 78.29 tok/s (MLX +94%)
- Debug Task: Ollama 37.06 tok/s, MLX 74.89 tok/s (MLX +102%)
- Complex Coding: Ollama 35.84 tok/s, MLX 76.99 tok/s (MLX +115%)
- Code Review: Ollama 39.00 tok/s, MLX 74.98 tok/s (MLX +92%)
Overall average: MLX achieved approximately 72 tokens per second, roughly double Ollama's throughput. Metrics measured included tokens/sec (output tokens generated per second, higher is better), TTFT (time from request sent to first token received, lower is better), total time (wall-clock time for full response, lower is better), and memory usage measured via psutil.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Relay CLI tool saves Claude session context when rate limited
Relay is a Rust CLI tool that reads Claude's .jsonl session transcripts from disk and creates full snapshots of your session, including conversation, tool calls, todos, git state, and errors. It generates context prompts to resume sessions after rate limits reset.

agentcache: Python Library for Multi-Agent LLM Prefix Caching
agentcache is a Python library that enables multi-agent LLM frameworks to share cached prompt prefixes, achieving up to 76% cache hit rates and cutting inference time by more than half in tests with GPT-4o-mini.

Introducing cltree: A File Tree TUI for Claude Code
cltree is a split-pane TUI that displays your project file tree in real-time alongside Claude Code, showing the current working directory, hiding noise, and allowing all keystrokes to pass through.

PromoClock: Timezone Tracker for Claude's 2x Off-Peak Hours Built with Claude 4.6
A developer built PromoClock.co, a free tool that automatically converts Claude's "5-11am PT / 12-6pm GMT" 2x off-peak promo hours to local time, using Claude 4.6 to handle timezone logic, Next.js 15 setup, and UI design.