Running Qwen3.6-35B-A3B-UD-Q5_K_XL Locally with VS Code Copilot on AMD R9700

A Reddit user reports great results running the Qwen3.6-35B-A3B-UD-Q5_K_XL GGUF model locally using llama.cpp with Vulkan on a single AMD R9700 GPU. The setup served as a drop-in replacement for GitHub Copilot in VS Code, generating a complete test website and Playwright test suite with minimal intervention.
llama.cpp Startup Command
/app/llama-server -m /models/Qwen3.6-35B-A3B-UD-Q5_K_XL/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
--ctx-size 262144 --threads 8 --threads-batch 8 \
--gpu-layers 99 --parallel 1 --flash-attn on \
--batch-size 2048 --ubatch-size 1024 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--cache-ram 12000 --ctx-checkpoints 50 \
--mmap --no-mmproj --kv-unified \
--reasoning off --reasoning-budget 0 --jinja \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 \
--repeat-penalty 1.0 --presence-penalty 0.0
Key parameters: 256K context window, 99 GPU layers for full offload, flash attention enabled, and sampling config taken from the Qwen3.6-35B-A3B Hugging Face page under "precise coding".
VS Code Integration
The user configured a custom chat model in chatLanguageModels.json pointing to the local llama.cpp server:
{
"name": "Sean Llama.cpp",
"vendor": "customoai",
"apiKey": "${input:chat.lm.secret.3c0c0f21}",
"models": [
{
"id": "Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
"name": "Qwen3.6-35B",
"url": "https://llm.home.arpa/v1/chat/completions",
"toolCalling": true,
"vision": false,
"maxInputTokens": 180000,
"maxOutputTokens": 10000,
"family": "Qwen3",
"inputTokenCost": 0.0001,
"outputTokenCost": 0.0001,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"repeat_penalty": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"systemMessage": "You are a precise coding assistant. Avoid repeating plans. Execute tasks directly. Do not restate intentions multiple times.",
"timeout": 600000,
"retry": { "enabled": true, "max_attempts": 2, "interval_ms": 1500 }
}
]
}
The model correctly responded to tool calling requests, allowing it to act as a Copilot replacement.
Real-World Test: Full Stack Generation
The user fed a detailed prompt (originally from ChatGPT) asking the model to build a "Bike Shop Service Tracker" — a local-first React + TypeScript app using localStorage. Requirements included a data model, seed data, filtering, sorting, and form validation. The model generated the entire website fully functional on the first run.
Next, they prompted it to generate a complete Playwright test suite. Only one test required a manual fix — otherwise the suite ran without errors. The user's conclusion: "I think I am done tweaking and testing models (until the next big release) and can get back to coding now."
Who It's For
Developers running local LLMs for coding assistance, especially those with AMD GPUs (Vulkan) who want a Copilot alternative with comparable quality.
📖 Read the full source: r/LocalLLaMA
👀 See Also

ClawWatcher Reaches 200 Users, Reports $28K+ in Collective OpenClaw API Savings
ClawWatcher, a tool that tracks OpenClaw API costs in real-time, has reached 200 users. According to its creator, users have collectively saved over $28,000 in API costs, with an average cost reduction of 45%.

Session Search: Local Full-Text Search for Claude Code and Codex Sessions, Now in Your Menu Bar
Session Search indexes local Claude Code and Codex transcripts using SQLite FTS, enabling deep full-text search across errors, commands, filenames, and decisions—accessible from the macOS menu bar with highlighted snippets.

Real-Time Desktop Overlay for Monitoring Claude Code Usage Limits
The open-source desktop overlay displays Claude Code usage limits in real-time, eliminating the need to repeatedly type '/usage'.

Jork Agentic Framework Built with Claude Ranks Top 10 in $4M Hackathon
A developer built an agentic framework called Jork using Claude and GLM models that ranked Top 10 among 2000+ applications in a $4 million hackathon. The framework autonomously developed tools including a Solana launchpad radar and a working word search game.