JANG Quantization Boosts MLX Performance on Large Models

Performance Gap Between MLX and GGUF Quantizations

The source reports a significant performance problem with standard MLX quantization on large language models. On the MMLU benchmark (200 questions), MiniMax-M2.5 quantized to 4-bit for MLX scored only 26.5% (53/200), while the same model quantized with the JANG_2S method scored 74% (148/200). JANG outperformed every MLX quantization level tested (2-bit, 3-bit, and 4-bit), all of which scored near random chance at approximately 25%.
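
To make "near random chance" concrete, the sketch below computes how likely each score would be from pure guessing. It assumes four answer choices per question (hence a 25% baseline) and independent questions; the helper name binomial_tail is made up for this illustration and is not part of the source's benchmark setup.

```python
from math import comb

def binomial_tail(k, n, p):
    """Exact probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N_QUESTIONS = 200
CHANCE = 0.25  # assumption: four answer choices per MMLU question

# Correct-answer counts quoted in the paragraph above
for label, correct in [("MLX 4-bit", 53), ("JANG_2S", 148)]:
    accuracy = correct / N_QUESTIONS
    p_value = binomial_tail(correct, N_QUESTIONS, CHANCE)
    print(f"{label}: {accuracy:.1%} correct, "
          f"P(at least this many by guessing) = {p_value:.3g}")
```

Under these assumptions, 53/200 is well within what guessing alone would produce, while 148/200 is not.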

Specific Benchmark Results

The per-subject MMLU breakdown shows JANG_2L consistently outperforming MLX 4-bit quantization:

Abstract Algebra: JANG_2L 10/20 vs MLX 4-bit 3/20
Astronomy: JANG_2L 20/20 vs MLX 4-bit 7/20
College CS: JANG_2L 13/20 vs MLX 4-bit 4/20
HS Biology: JANG_2L 18/20 vs MLX 4-bit 4/20
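
The four subjects listed can be tallied with a few lines of Python. This is only arithmetic on the numbers quoted above; the variable names are illustrative.

```python
QUESTIONS_PER_SUBJECT = 20

# (JANG_2L correct, MLX 4-bit correct), as listed above
results = {
    "Abstract Algebra": (10, 3),
    "Astronomy": (20, 7),
    "College CS": (13, 4),
    "HS Biology": (18, 4),
}

total = QUESTIONS_PER_SUBJECT * len(results)
jang_total = sum(jang for jang, _ in results.values())
mlx_total = sum(mlx for _, mlx in results.values())

for subject, (jang, mlx) in results.items():
    print(f"{subject}: JANG_2L {jang / QUESTIONS_PER_SUBJECT:.0%}, "
          f"MLX 4-bit {mlx / QUESTIONS_PER_SUBJECT:.0%}")
print(f"Combined: JANG_2L {jang_total}/{total}, MLX 4-bit {mlx_total}/{total}")
```

Across these four subjects JANG_2L answers 61 of 80 questions correctly, against 18 of 80 for MLX 4-bit.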

The root cause identified for poor MLX performance is that "MLX generates meta-commentary instead of direct answers on this model."
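
The source does not describe how the benchmark extracts answers, so the following is only a hypothetical sketch of why meta-commentary hurts a multiple-choice score: a grader that expects a leading answer letter finds nothing to score when the model talks about the question instead. The regex and the function name extract_choice are assumptions made for this illustration.

```python
import re

# Hypothetical pattern: an optional "Answer:" prefix, then a single letter A-D
ANSWER_RE = re.compile(
    r"^\s*(?:answer\s*[:\-]?\s*)?\(?([ABCD])\)?(?:[\s.:)]|$)",
    re.IGNORECASE,
)

def extract_choice(response):
    """Return the leading answer letter (A-D) or None if the reply has none."""
    match = ANSWER_RE.match(response)
    return match.group(1).upper() if match else None

print(extract_choice("B"))
print(extract_choice("Answer: C"))
print(extract_choice("The user is asking me a multiple-choice question."))
```

A reply that returns None here cannot be counted as correct, regardless of whether the model "knew" the answer.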

Model Size and Performance Comparisons

For the Qwen 3.5 122B model:

JANG_4K: 86% MMLU score, 69 GB size
MLX 4-bit: 85% MMLU score, 64 GB size
JANG_2S: 79% MMLU score, 38 GB size
MLX 2-bit: 56.5% MMLU score, 36 GB size
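
The trade-off in these figures is easier to see as differences. The sketch below pairs each JANG variant with the MLX quantization closest to it in size; that pairing is a choice made for this illustration, not something the source specifies.

```python
# (MMLU score in %, size in GB) for Qwen 3.5 122B, from the figures above
quants = {
    "JANG_4K": (86.0, 69),
    "MLX 4-bit": (85.0, 64),
    "JANG_2S": (79.0, 38),
    "MLX 2-bit": (56.5, 36),
}

def compare(a, b):
    """Return (score difference in points, size difference in GB) of a over b."""
    (score_a, size_a), (score_b, size_b) = quants[a], quants[b]
    return score_a - score_b, size_a - size_b

for jang, mlx in [("JANG_4K", "MLX 4-bit"), ("JANG_2S", "MLX 2-bit")]:
    d_score, d_size = compare(jang, mlx)
    print(f"{jang} vs {mlx}: {d_score:+.1f} MMLU points for {d_size:+d} GB")
```

At the 4-bit level the two are close (1 point apart, with JANG 5 GB larger); at the 2-bit level JANG_2S is 22.5 points ahead for 2 GB more.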

The author notes that "People trade the M chip speed for coherency, with no GGUF equivalent on MLX" and that "Qwen 3.5 on Macs when using GGUF is also 1/3rd slower than MLX."

MiniMax-M2.5 Code Generation Issue

From referenced benchmarks: "MiniMax-M2.5 can't code — 10% on HumanEval+ despite 87% tool calling and 80% reasoning. Something is off with its code generation format. Great for reasoning though."

Availability and Implementation

Currently available through:

MLX Studio: https://mlx.studio/ - includes a native JANG_Q inference engine
Repository: For self-installation and model quantization

The method allows running models like MiniMax-M2.5 at "2bit MLX equivalent while getting test results that just wasn't possible before on MLX."

📖 Read the full source: r/LocalLLaMA
