Running Qwen3.6-35B-A3B-UD-Q5_K_XL Locally with VS Code Copilot on AMD R9700

✍️ OpenClawRadar📅 Published: May 7, 2026🔗 Source
Running Qwen3.6-35B-A3B-UD-Q5_K_XL Locally with VS Code Copilot on AMD R9700
Ad

A Reddit user reports great results running the Qwen3.6-35B-A3B-UD-Q5_K_XL GGUF model locally using llama.cpp with Vulkan on a single AMD R9700 GPU. The setup served as a drop-in replacement for GitHub Copilot in VS Code, generating a complete test website and Playwright test suite with minimal intervention.

llama.cpp Startup Command

/app/llama-server -m /models/Qwen3.6-35B-A3B-UD-Q5_K_XL/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
  --ctx-size 262144 --threads 8 --threads-batch 8 \
  --gpu-layers 99 --parallel 1 --flash-attn on \
  --batch-size 2048 --ubatch-size 1024 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cache-ram 12000 --ctx-checkpoints 50 \
  --mmap --no-mmproj --kv-unified \
  --reasoning off --reasoning-budget 0 --jinja \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 \
  --repeat-penalty 1.0 --presence-penalty 0.0

Key parameters: 256K context window, 99 GPU layers for full offload, flash attention enabled, and sampling config taken from the Qwen3.6-35B-A3B Hugging Face page under "precise coding".

Ad

VS Code Integration

The user configured a custom chat model in chatLanguageModels.json pointing to the local llama.cpp server:

{
  "name": "Sean Llama.cpp",
  "vendor": "customoai",
  "apiKey": "${input:chat.lm.secret.3c0c0f21}",
  "models": [
    {
      "id": "Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
      "name": "Qwen3.6-35B",
      "url": "https://llm.home.arpa/v1/chat/completions",
      "toolCalling": true,
      "vision": false,
      "maxInputTokens": 180000,
      "maxOutputTokens": 10000,
      "family": "Qwen3",
      "inputTokenCost": 0.0001,
      "outputTokenCost": 0.0001,
      "temperature": 0.6,
      "top_p": 0.95,
      "top_k": 20,
      "repeat_penalty": 1,
      "presence_penalty": 0,
      "frequency_penalty": 0,
      "systemMessage": "You are a precise coding assistant. Avoid repeating plans. Execute tasks directly. Do not restate intentions multiple times.",
      "timeout": 600000,
      "retry": { "enabled": true, "max_attempts": 2, "interval_ms": 1500 }
    }
  ]
}

The model correctly responded to tool calling requests, allowing it to act as a Copilot replacement.

Real-World Test: Full Stack Generation

The user fed a detailed prompt (originally from ChatGPT) asking the model to build a "Bike Shop Service Tracker" — a local-first React + TypeScript app using localStorage. Requirements included a data model, seed data, filtering, sorting, and form validation. The model generated the entire website fully functional on the first run.

Next, they prompted it to generate a complete Playwright test suite. Only one test required a manual fix — otherwise the suite ran without errors. The user's conclusion: "I think I am done tweaking and testing models (until the next big release) and can get back to coding now."

Who It's For

Developers running local LLMs for coding assistance, especially those with AMD GPUs (Vulkan) who want a Copilot alternative with comparable quality.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also