JANG Quantization Method Improves MLX Performance for Large Models

Performance Gap Between MLX and GGUF Quantizations
The source reports a significant accuracy problem with standard MLX quantization on some large language models. On a 200-question MMLU run, MiniMax-M2.5 quantized to 4-bit for MLX scored only 26.5% (53/200), while the same model quantized with the JANG_2S method scored 74% (148/200). JANG outperformed every MLX quantization level tested (2-bit, 3-bit, and 4-bit), all of which scored near the 25% random-chance baseline.
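To put these figures in context: MMLU questions are four-option multiple choice, so pure guessing yields 25% in expectation. The sketch below recomputes the two headline percentages and, as a rough sanity check, the binomial probability of reaching each score by guessing alone. The function names are illustrative and do not come from the source.

```python
from math import comb


def accuracy_pct(correct: int, total: int) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * correct / total


def p_at_least(k: int, n: int, p: float) -> float:
    """Binomial upper tail: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_QUESTIONS = 200
CHANCE = 0.25  # four answer options per MMLU question

mlx_4bit_pct = accuracy_pct(53, N_QUESTIONS)   # score reported for MLX 4-bit
jang_2s_pct = accuracy_pct(148, N_QUESTIONS)   # score reported for JANG_2S

# Probability of doing at least this well by random guessing alone.
mlx_by_chance = p_at_least(53, N_QUESTIONS, CHANCE)
jang_by_chance = p_at_least(148, N_QUESTIONS, CHANCE)

print(f"MLX 4-bit: {mlx_4bit_pct}%  P(guessing >= 53/200)  = {mlx_by_chance:.3f}")
print(f"JANG_2S:   {jang_2s_pct}%  P(guessing >= 148/200) = {jang_by_chance:.3e}")
```

A large tail probability for the MLX score means it is statistically hard to distinguish from guessing, which is consistent with the "near random chance" description above.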
Specific Benchmark Results
The per-subject MMLU breakdown shows JANG_2L consistently outperforming MLX 4-bit:
- Abstract Algebra: JANG_2L 10/20 vs MLX 4-bit 3/20
- Astronomy: JANG_2L 20/20 vs MLX 4-bit 7/20
- College CS: JANG_2L 13/20 vs MLX 4-bit 4/20
- HS Biology: JANG_2L 18/20 vs MLX 4-bit 4/20
The root cause identified for poor MLX performance is that "MLX generates meta-commentary instead of direct answers on this model."
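Summing the four subject rows listed above gives a quick aggregate over those 80 questions. This is a minimal sketch that only re-adds the numbers reported in the breakdown; the dictionary layout and variable names are assumptions made for illustration.

```python
QUESTIONS_PER_SUBJECT = 20

# subject -> (JANG_2L correct, MLX 4-bit correct), as listed in the breakdown
subject_scores = {
    "Abstract Algebra": (10, 3),
    "Astronomy": (20, 7),
    "College CS": (13, 4),
    "HS Biology": (18, 4),
}

total_questions = QUESTIONS_PER_SUBJECT * len(subject_scores)
jang_correct = sum(jang for jang, _ in subject_scores.values())
mlx_correct = sum(mlx for _, mlx in subject_scores.values())

jang_pct = 100.0 * jang_correct / total_questions
mlx_pct = 100.0 * mlx_correct / total_questions

print(f"JANG_2L:   {jang_correct}/{total_questions} = {jang_pct}%")
print(f"MLX 4-bit: {mlx_correct}/{total_questions} = {mlx_pct}%")
```

Across these four subjects the totals come to 61/80 (76.25%) for JANG_2L and 18/80 (22.5%) for MLX 4-bit.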
Model Size and Performance Comparisons
For the Qwen 3.5 122B model:
- JANG_4K: 86% MMLU score, 69 GB size
- MLX 4-bit: 85% MMLU score, 64 GB size
- JANG_2S: 79% MMLU score, 38 GB size
- MLX 2-bit: 56.5% MMLU score, 36 GB size
The author notes that "People trade the M chip speed for coherency, with no GGUF equivalent on MLX" and that "Qwen 3.5 on Macs when using GGUF is also 1/3rd slower than MLX."
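The Qwen 3.5 122B figures listed above can be read as a size-versus-accuracy trade-off. The sketch below compares each JANG variant with the MLX quantization of the same nominal bit width; the `Quant` class and `delta` helper are hypothetical names introduced here, and only the scores and sizes come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quant:
    name: str
    mmlu_pct: float  # reported MMLU score, in percent
    size_gb: float   # reported model size, in GB


QWEN_122B = {
    q.name: q
    for q in [
        Quant("JANG_4K", 86.0, 69.0),
        Quant("MLX 4-bit", 85.0, 64.0),
        Quant("JANG_2S", 79.0, 38.0),
        Quant("MLX 2-bit", 56.5, 36.0),
    ]
}


def delta(a: Quant, b: Quant) -> tuple[float, float]:
    """Return (MMLU points gained, extra GB) when choosing a over b."""
    return (a.mmlu_pct - b.mmlu_pct, a.size_gb - b.size_gb)


for jang, mlx in [("JANG_4K", "MLX 4-bit"), ("JANG_2S", "MLX 2-bit")]:
    points, extra_gb = delta(QWEN_122B[jang], QWEN_122B[mlx])
    print(f"{jang} vs {mlx}: {points:+.1f} MMLU points for {extra_gb:+.1f} GB")
```

On these numbers the gap is small at 4-bit (1 point for 5 GB more) and large at 2-bit (22.5 points for 2 GB more), so the benefit reported for this model is concentrated at the lowest bit width.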
MiniMax-M2.5 Code Generation Issue
From referenced benchmarks: "MiniMax-M2.5 can't code — 10% on HumanEval+ despite 87% tool calling and 80% reasoning. Something is off with its code generation format. Great for reasoning though."
Availability and Implementation
JANG is currently available through:
- MLX Studio: https://mlx.studio/ - includes a native JANG_Q inference engine
- Repository: For self-installation and model quantization
The method allows running models like MiniMax-M2.5 at "2bit MLX equivalent while getting test results that just wasn't possible before on MLX."
📖 Read the full source: r/LocalLLaMA