Running Qwen3.6 27B and 35B on 6GB VRAM with ik_llama: Practical Configs and Benchmarks

✍️ OpenClawRadar📅 Published: May 17, 2026🔗 Source
Running Qwen3.6 27B and 35B on 6GB VRAM with ik_llama: Practical Configs and Benchmarks
Ad

A Reddit user reports successfully running Qwen3.6 27B and 35B A3B models on an old gaming laptop with an RTX 2060 Mobile (6 GB VRAM) and 32 GB RAM using ik_llama and llama.cpp. Key optimizations include double speculative decoding with MTP and ngram, --fit and --mtp-requantize-output-tensor, plus output tensor repacking. Below are the exact configs and observed speeds.

Config for Qwen3.6 27B (Q3_K_XL)

export GGML_CUDA_GRAPHS=1
./llama-server \
  -m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-27B-MTP-UD-Q3_K_XL.gguf \
  -c 16000 \
  -b 512 -ub 512 \
  --fit --fit-margin 3076 \
  -fa on \
  -np 1 \
  -ctk q4_0 -ctv q4_0 \
  --mtp-requantize-output-tensor q4_0 \
  -khad -vhad -rtr \
  --threads 6 --threads-batch 8 \
  --slot-save-path ./slots \
  --prompt-cache "prompt.cache" \
  --port 8888 --host 0.0.0.0 \
  --spec-stage ngram-mod:n_max=64,n_min=2,spec-ngram-size-n=16 \
  --spec-stage mtp:n_max=1,draft-p-min=0.0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --reasoning on
Ad

Config for Qwen3.6 35B A3B (IQ4_XS, Claude Opus Distill)

export GGML_CUDA_GRAPHS=1
./llama-server \
  -m /mnt/second-ssd/lib/llama.cpp/models/lordx64-Claude-4.7-Opus-Reasoning-Distilled-Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
  -c 80000 \
  -b 1024 -ub 1024 \
  --fit --fit-margin 2048 \
  -fa on \
  -np 1 \
  -ctk q8_0 -ctv q4_0 \
  --mtp-requantize-output-tensor q4_0 \
  -khad -vhad -rtr \
  --threads 6 --threads-batch 8 \
  --slot-save-path ./slots \
  --prompt-cache "prompt.cache" \
  --mlock --no-mmap \
  --port 8888 --host 0.0.0.0 \
  --spec-stage ngram-mod:n_max=64,n_min=2,spec-ngram-size-n=16 \
  --spec-stage mtp:n_max=3,draft-p-min=0.0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --reasoning on

Performance Numbers

  • 27B: prefill ~100 t/s, first token up to 4 t/s, ~1 t/s at 10k context
  • 35B A3B: prefill ~40 t/s, first token up to 15 t/s, constant ~11 t/s at 10k context

The user notes that 27B became usable for reasoning about files up to 1000 lines (taking minutes but useful), and the 35B Opus distill runs at a steady 11 t/s output. They use it to generate mermaid charts, images, markdown, and PDFs with little-coder or agentic coding workflows.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also