Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks

✍️ OpenClawRadar📅 Published: May 10, 2026🔗 Source
Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks
Ad

A Reddit user has posted a detailed setup for running Qwen3.6-35B-A3B GGUF models with ~190k context on a laptop with 8GB VRAM (RTX 4060) and 32GB DDR5 RAM. They report 37-43 tok/s out of the box, with tweaks pushing to ~51 tok/s.

Hardware & Models

  • GPU: RTX 4060 8GB VRAM
  • RAM: 32GB DDR5 5600MHz
  • OS: Linux (performance noted as better than Windows)
  • Models tested (Q5 quant):
    • mudler/Qwen3.6-35B-A3B-APEX-GGUF – ~40 tok/s to 37 tok/s
    • hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF – ~43 tok/s to 37 tok/s

Key Configuration

Using a fork of llama.cpp with TurboQuant support (turboquant_plus), the user runs llama-server with the following flags:

--model "<path>" \
--host 0.0.0.0 \
--port 8085 \
--ctx-size 192640 \
--n-gpu-layers 430 \
--n-cpu-moe 35 \
--cache-type-k "turbo4" \
--cache-type-v "turbo4" \
--flash-attn on \
--batch-size 2048 \
--parallel 1 \
--no-mmap \
--mlock \
--ubatch-size 512 \
--threads 6 \
--cont-batching \
--timeout 300 \
--temp 0.2 \
--top-p 0.95 \
--min-p 0.05 \
--top-k 20 \
--metrics \
--chat-template-kwargs '{"preserve_thinking": true}'

To push speeds to ~51 tok/s, adjust three flags: --ctx-size 192640, --n-gpu-layers 430, --n-cpu-moe 35 (tweak slightly based on stability/memory).

Ad

Caveats

  • Q4 quant is noticeably worse for long-context reasoning vs Q5.
  • --no-mmap + --mlock reduces stuttering slowdowns.
  • TurboQuant KV cache is critical at high context sizes.
  • High RAM bandwidth (DDR5) is important for these speeds.
  • Linux outperforms Windows significantly for this workload.

Who This Is For

Developers running local LLMs with very long contexts (170k+ tokens) on consumer hardware, especially those with 8-12GB VRAM and fast system RAM.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also