Qwen 3.6 27B GGUF Benchmarks: Codex Validator Co-Agent

A developer on r/LocalLLaMA has been running a local Qwen model beside OpenAI's Codex as a validator and challenger, and built a small reproducible eval suite to quantify which GGUF quant profiles work best for this role. The workflow: Codex handles main repo work; local Qwen challenges the plan, checks for overbuilding, missed hard directives, UI/design issues, bad assumptions, and long-context misses. The author reviews each interaction before proceeding.

Eval suite setup

The suite tests Qwen 3.6 27B GGUF profiles through llama.cpp, including Bartowski and Unsloth variants at different context sizes and KV cache formats (q8, f16). The focus is on real-world failures: missed directives, bad challenge behavior, overbuilding, UI judgment, and long-context misses.

Key findings

The top-performing profiles on this suite were: bartowski-128k-f16, bartowski-128k-q8, and unsloth-128k-q8. All three tied on accuracy.
q8 KV cache showed no measured accuracy loss in this specific suite.
Context size mattered more than f16-vs-q8 KV for this workflow. 65k profiles failed when the suite required >65k tokens.
unsloth-128k-f16 loaded but hit memory/throughput pressure on long-context cases on an RTX 5090.

Practical observations

The author reports Qwen is extremely good at catching silent bypasses, overbuilding, and coding-to-completion shortcuts in Codex. For UI-related tasks, Qwen takes the lead in design while Codex implements. The roles reverse: Qwen challenges the plan, and the human reviews before each stage.