Local vs Cloud Models: Qwen, Gemma, Claude, Codex-Spark Tested

A Reddit user compared a locally run Qwen-3.6-27B (GGUF q4_k_m) against API-served models: Qwen-3.6-27B via OpenRouter, Gemma-4-31B via OpenRouter, Claude Haiku 4.5, and GPT-Codex-Spark. The task was to implement an autoresearch loop from a design document. It was deliberately hard, chosen to evaluate how cleanly each model fails rather than how often it succeeds.

Hardware Setup

CPU: Ryzen 7 7800X3D
RAM: 64 GB DDR5-6400
GPU: RTX 5080 (16 GB VRAM)
Local model: Qwen-3.6-27B q4_k_m (GGUF), quantized to fit in 16 GB of VRAM
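
As a rough check on the fit claim above, the weight footprint of a quantized model can be estimated from its parameter count and the average bits per weight. The sketch below is illustrative only: the 27e9 parameter count is read off the model name, and the figure of about 4.85 bits per weight for q4_k_m is an assumed approximation, not a number reported in the source. It counts weights only and ignores the KV cache and runtime overhead.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB (weights only)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumptions (not from the source): nominal parameter count taken from the
# model name, and an approximate average bits-per-weight for q4_k_m.
N_PARAMS = 27e9
BITS_PER_WEIGHT_Q4_K_M = 4.85
VRAM_GIB = 16

weights_gib = estimate_weight_gib(N_PARAMS, BITS_PER_WEIGHT_Q4_K_M)
headroom_gib = VRAM_GIB - weights_gib

print(f"estimated weights: {weights_gib:.1f} GiB")
print(f"headroom before KV cache and overhead: {headroom_gib:.1f} GiB")
```

Under these assumptions the weights alone take up most of a 16 GB card, so the usable context length would be constrained. Actual file sizes vary with the model architecture and the quantization mix.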

Results

Gemma-4-31B (API): Failed completely. It wrote a skeleton with mocked modules, no tests, and none of the expected package or config files (__init__.py, requirements.txt, pyproject.toml). Cost: $0.112, with 803k context tokens consumed and 21k generated.
Codex-Spark (API): Produced a beautiful folder structure and code, but the imports were hallucinated and there were no unit tests. Used 1% of the $100/mo Spark limits.
Claude Haiku 4.5 (API): Produced a detailed implementation that failed on correctness. (Further details truncated in source.)
Qwen-3.6-27B (local q4_k_m): Not explicitly scored, but the user notes that quantized inference degrades quality relative to the full-precision API version.
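
One of the failure modes above, hallucinated imports, can be screened for mechanically. The following is a minimal sketch and not the method used in the original post: the function name `unresolved_imports` is hypothetical, and the check only asks whether each imported top-level module can be located in the current Python environment. It does not verify that the specific names imported from a module exist, and it skips relative imports.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be found.

    Hypothetical helper for illustration; relative imports are skipped.
    """
    tree = ast.parse(source)
    roots: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                roots.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import such as `from . import x`
            if node.level == 0 and node.module:
                roots.add(node.module.split(".")[0])

    missing = []
    for name in sorted(roots):
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing


generated = """
import os
from pathlib import Path
import autoresearch_core_that_does_not_exist
from . import sibling
"""
print(unresolved_imports(generated))
```

Running a check like this over every generated file, in an environment with the project's declared dependencies installed, gives a quick signal of whether a polished-looking code base can even be imported.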

Context

The user argues that typical local-model evals use trivial tasks (e.g., Snake in HTML) that both local and frontier models complete successfully, which makes local models look better than they are. This test instead used a real work project with a design document, and only Codex-Spark produced fully written (but broken) code. The user's conclusion is that local models are not yet ready for complex code generation without substantial fixes to their output.

📖 Read the full source: r/LocalLLaMA
