Benching local Qwen 3.6 27B as a Codex validator co-agent

A developer on r/LocalLLaMA has been running a local Qwen model beside OpenAI's Codex as a validator and challenger, and built a small reproducible eval suite to quantify which GGUF quant profiles work best for this role. The workflow: Codex handles main repo work; local Qwen challenges the plan, checks for overbuilding, missed hard directives, UI/design issues, bad assumptions, and long-context misses. The author reviews each interaction before proceeding.
Eval suite setup
The suite tests Qwen 3.6 27B GGUF profiles through llama.cpp, including Bartowski and Unsloth variants at different context sizes and KV cache formats (q8, f16). The focus is on real-world failures: missed directives, bad challenge behavior, overbuilding, UI judgment, and long-context misses.
Key findings
- The top-performing profiles on this suite were:
bartowski-128k-f16,bartowski-128k-q8, andunsloth-128k-q8. All three tied on accuracy. - q8 KV cache showed no measured accuracy loss in this specific suite.
- Context size mattered more than f16-vs-q8 KV for this workflow. 65k profiles failed when the suite required >65k tokens.
unsloth-128k-f16loaded but hit memory/throughput pressure on long-context cases on an RTX 5090.
Practical observations
The author reports Qwen is extremely good at catching silent bypasses, overbuilding, and coding-to-completion shortcuts in Codex. For UI-related tasks, Qwen takes the lead in design while Codex implements. The roles reverse: Qwen challenges the plan, and the human reviews before each stage.
Resources
- Project page: https://robert896r1.github.io/qwen-realworld-accuracy-evals/
- Repo: https://github.com/robert896r1/qwen-realworld-accuracy-evals
📖 Read the full source: r/LocalLLaMA
👀 See Also

Conduid.com indexes 23,000+ MCP servers into searchable directory
Conduid.com aggregates MCP servers from 11 sources, deduplicates them, and provides search, categories, and trust scores based on GitHub activity, documentation quality, and maintenance signals.

cortex-engine MCP server adds persistent memory and multi-agent support
cortex-engine v0.4.0 is an open-source MCP server that gives AI agents persistent long-term memory with tools like observe(), query(), believe(), and dream(). It now supports multiple agents with isolated memory namespaces.

Alfred Beta Launches: Simplified OpenClaw Alternative for Non-Technical Users
Alfred is a new beta tool that provides approximately 70% of OpenClaw's functionality with significantly reduced complexity, featuring simple defaults for app connections, memory, usage modes, and infrastructure while allowing customization.

Claude Code Skill Refactors React Components Using 'Don't Make Me Think' Principles
A new Claude Code skill automatically refactors React components for usability based on Steve Krug's principles — cuts happy talk, surfaces primary CTAs, fixes empty/error states, and tightens labels.