Optimizing Qwen 3.6 27B/35B on RTX 3090: Flags, Quantization, and Auto-Routing

A developer running Qwen 3.6 models locally on an RTX 3090 (24GB VRAM) with a Ryzen 5700X, 64GB RAM, and Windows 11 is hitting performance and reliability issues. They are using llama-server with custom flags and are asking for advice on quant choice, throughput, and automatic model routing.
Commands and Quantizations
35B (UD Q4_K_M):
llama-server.exe -m "path\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
27B (UD Q4_K_XL):
llama-server.exe -m "path\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
Reported Issues
- 35B too slow – even simple iterative tasks feel unusable.
- 27B faster but unreliable – code output breaks; simple tasks can take 20–30 minutes.
- Manual model switching – must kill server, paste new command, reload model.
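One way to sanity-check the context size (-c) and cache type (-ctk, -ctv) in the two commands is to estimate how much memory the KV cache alone would need. The sketch below uses the standard formula for a transformer with full attention in every layer; models with hybrid or sliding-window attention will differ. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen 3.6 values, and should be replaced with the numbers from the GGUF metadata.

```python
# Bytes per element for two llama.cpp cache types: f16 is 2 bytes;
# q8_0 stores 32 int8 values plus one fp16 scale per block (34 bytes / 32).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate KV cache size in GiB, assuming full attention in every layer."""
    # Factor 2 covers the K and V tensors.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


if __name__ == "__main__":
    # HYPOTHETICAL architecture numbers, for illustration only.
    arch = dict(n_layers=48, n_kv_heads=4, head_dim=128)
    # Context sizes and cache types taken from the two commands above.
    for n_ctx, cache_type in [(131072, "f16"), (196608, "q8_0")]:
        gib = kv_cache_gib(n_ctx=n_ctx, cache_type=cache_type, **arch)
        print(f"-c {n_ctx} with {cache_type} cache: ~{gib:.1f} GiB")
```

Comparing the result plus the model file size against 24GB gives a rough idea of whether a given context setting leaves any headroom on the card.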
Specific Questions
- Are the flags suboptimal? (e.g., context size, batch size, cache type)
- Which quant / model gives best balance of speed and coding accuracy on 24GB VRAM?
- How to auto-switch models per request, or keep multiple models warm and route?
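For the routing question, one possible shape is a small proxy in front of several llama-server instances that picks an upstream from the "model" field of each OpenAI-style request. The sketch below shows only that routing decision. The model names, the second port (8082), and the idea of running one server per model are assumptions for illustration; the source does not say whether both models fit in 24GB VRAM at once, and the actual HTTP forwarding is left out.

```python
import json

# HYPOTHETICAL upstreams: one llama-server per model, each on its own port.
UPSTREAMS = {
    "qwen3.6-27b": "http://127.0.0.1:8081",
    "qwen3.6-35b-a3b": "http://127.0.0.1:8082",
}
DEFAULT_MODEL = "qwen3.6-27b"


def pick_upstream(body: bytes) -> str:
    """Return the base URL for the model named in a JSON request body.

    Falls back to the default model when the body is empty, is not valid
    JSON, is not a JSON object, or names an unknown model.
    """
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers JSONDecodeError and bad UTF-8
        payload = {}
    model = payload.get("model") if isinstance(payload, dict) else None
    return UPSTREAMS.get(model, UPSTREAMS[DEFAULT_MODEL])


if __name__ == "__main__":
    request = json.dumps({"model": "qwen3.6-35b-a3b", "messages": []}).encode()
    print(pick_upstream(request))
```

A client such as OpenCode would then point at the proxy and select a model by name per request, so nobody has to kill and restart a server by hand.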
Context
The user runs Hermes agent on a Raspberry Pi 5 for scraping and automation, and uses OpenCode/QwenCode for local coding. They want a setup that doesn't require manual server restarts.
📖 Read the full source: r/LocalLLaMA