GLM 5 on Mac M3: Performance Observations for Agentic Coding

Performance Benchmarks and Limitations
A developer tested GLM 5 using MLX 4-bit quantization on a Mac M3 with 512GB RAM for agentic coding tasks. The model is described as "quite usable" when context stays below approximately 50,000 tokens, though it is significantly slower than API-based solutions like Claude, particularly during prompt processing.
Prompt processing degrades substantially once context exceeds 50k tokens. In one test processing 65k tokens, the first half completed in 8 minutes (about 67 tokens/second), while the second half took a further 18 minutes, for an overall rate of about 41 tokens/second. Token generation holds up better at larger context sizes, estimated at 12-20 tokens/second.
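The arithmetic behind those figures can be checked with a short sketch. It assumes "first half" means exactly half of the 65k tokens (the source does not give exact token counts), and the function and variable names are illustrative only. Under that assumption, the second half works out to roughly 30 tokens/second, less than half the rate of the first.

```python
def tokens_per_second(tokens: float, minutes: float) -> float:
    """Average throughput for a span of tokens processed in the given minutes."""
    return tokens / (minutes * 60)


# Figures reported in the test; the even split is an assumption.
total_tokens = 65_000
half_tokens = total_tokens / 2
first_half_minutes = 8
second_half_minutes = 18

first_rate = tokens_per_second(half_tokens, first_half_minutes)
second_rate = tokens_per_second(half_tokens, second_half_minutes)
overall_rate = tokens_per_second(
    total_tokens, first_half_minutes + second_half_minutes
)

print(f"first half:  {first_rate:.1f} tok/s")
print(f"second half: {second_rate:.1f} tok/s")
print(f"overall:     {overall_rate:.1f} tok/s")
```

The reported 67 and 41 tokens/second match these results when truncated to whole numbers.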
Workflow Observations
The user notes that Opencode (the agentic coding system) handles multi-file code generation efficiently once a plan is created, outputting "thousands of tokens of code across multiple files in just a few minutes with reasoning in between." Prompt processing typically takes "a couple minutes" to read a few hundred lines of code per file, with about 10 minutes total spread across planning sessions.
Compaction in Opencode "does take a while as it likes to basically just reprocess the whole context." With a 50k token context limit, compaction takes approximately 5 minutes.
Technical Setup and Future Expectations
The test was conducted using LM Studio, which may not provide the latest runtime optimizations. The user suggests that "MLX or even GGUF may get faster prompt processing as the runtimes are updated for GLM 5, but it will likely not get a TON faster than this."
The setup is not recommended for tasks requiring 70k+ tokens in context, both because of context size limitations and because of the "unbearable slowness" that sets in once prompt processing passes certain context-size thresholds.
📖 Read the full source: r/LocalLLaMA