Running Qwen3.6 27B and 35B on 6GB VRAM with ik_llama: Practical Configs and Benchmarks

A Reddit user reports successfully running Qwen3.6 27B and 35B A3B models on an old gaming laptop with an RTX 2060 Mobile (6 GB VRAM) and 32 GB RAM using ik_llama and llama.cpp. Key optimizations include double speculative decoding with MTP and ngram, --fit and --mtp-requantize-output-tensor, plus output tensor repacking. Below are the exact configs and observed speeds.
Config for Qwen3.6 27B (Q3_K_XL)
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/Qwen3.6-27B-MTP-UD-Q3_K_XL.gguf \
-c 16000 \
-b 512 -ub 512 \
--fit --fit-margin 3076 \
-fa on \
-np 1 \
-ctk q4_0 -ctv q4_0 \
--mtp-requantize-output-tensor q4_0 \
-khad -vhad -rtr \
--threads 6 --threads-batch 8 \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--port 8888 --host 0.0.0.0 \
--spec-stage ngram-mod:n_max=64,n_min=2,spec-ngram-size-n=16 \
--spec-stage mtp:n_max=1,draft-p-min=0.0 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
Config for Qwen3.6 35B A3B (IQ4_XS, Claude Opus Distill)
export GGML_CUDA_GRAPHS=1
./llama-server \
-m /mnt/second-ssd/lib/llama.cpp/models/lordx64-Claude-4.7-Opus-Reasoning-Distilled-Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
-c 80000 \
-b 1024 -ub 1024 \
--fit --fit-margin 2048 \
-fa on \
-np 1 \
-ctk q8_0 -ctv q4_0 \
--mtp-requantize-output-tensor q4_0 \
-khad -vhad -rtr \
--threads 6 --threads-batch 8 \
--slot-save-path ./slots \
--prompt-cache "prompt.cache" \
--mlock --no-mmap \
--port 8888 --host 0.0.0.0 \
--spec-stage ngram-mod:n_max=64,n_min=2,spec-ngram-size-n=16 \
--spec-stage mtp:n_max=3,draft-p-min=0.0 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--reasoning on
Performance Numbers
- 27B: prefill ~100 t/s, first token up to 4 t/s, ~1 t/s at 10k context
- 35B A3B: prefill ~40 t/s, first token up to 15 t/s, constant ~11 t/s at 10k context
The user notes that 27B became usable for reasoning about files up to 1000 lines (taking minutes but useful), and the 35B Opus distill runs at a steady 11 t/s output. They use it to generate mermaid charts, images, markdown, and PDFs with little-coder or agentic coding workflows.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Fixing OpenClaw Agent Autonomy Issues: Skill Files, Tool Selection, and Cron Setup
A developer shares solutions for OpenClaw agents that stop working autonomously after initial setup. Key fixes include using external skill files instead of chat instructions, replacing browser tools with API-based tools or Puppeteer scripts, and properly configuring cron jobs.

VPS vs Dedicated Machine: Where to Run OpenClaw

Running OpenClaw Locally with Ollama to Avoid API Costs
A Reddit user shares their experience switching from API-based OpenClaw to running it locally with Ollama, eliminating API costs while maintaining workflows. They created a step-by-step installation video guide.

OpenClaw setup guide from Reddit analysis: hardware, cost, memory, and security practices
A Reddit user analyzed common OpenClaw mistakes and created a setup guide covering hardware requirements, cost optimization to $10/month, memory management using MEMORY.md files, and security practices to prevent prompt injection attacks.