Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks

✍️ OpenClawRadar | 📅 Published: May 10, 2026 | 🔗 Source


A Reddit user has posted a detailed setup for running Qwen3.6-35B-A3B GGUF models with ~190k context on a laptop with 8GB VRAM (RTX 4060) and 32GB DDR5 RAM. They report 37-43 tok/s out of the box, with tweaks pushing to ~51 tok/s.

Hardware & Models

GPU: RTX 4060 8GB VRAM
RAM: 32GB DDR5 5600MHz
OS: Linux (performance noted as better than Windows)
Models tested (Q5 quant):
- mudler/Qwen3.6-35B-A3B-APEX-GGUF: roughly 37-40 tok/s
- hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF: roughly 37-43 tok/s
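As a rough sanity check on these figures, the sketch below estimates the weight footprint and a memory-bandwidth ceiling on decode speed. It relies on assumptions the post does not state: that "35B-A3B" means about 35 billion total and 3 billion active parameters per token, that the Q5 quant averages about 5.5 bits per weight, and that the RAM runs in dual-channel mode with a 64-bit bus per channel. Treat the result as an order-of-magnitude estimate, not a measurement.

```python
# Back-of-envelope estimates. Every constant passed in below is an assumption,
# not a figure taken from the original post.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in bytes of n_params weights at an average bit width."""
    return n_params * bits_per_weight / 8


def ddr_bandwidth_bytes_per_s(mega_transfers: float, channels: int = 2,
                              bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth: transfers/s x bytes per transfer x channels."""
    return mega_transfers * 1e6 * bus_bytes * channels


def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth: float) -> float:
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    return bandwidth / weight_bytes(active_params, bits_per_weight)


if __name__ == "__main__":
    total = weight_bytes(35e9, 5.5)          # assumed total parameters and bit width
    bw = ddr_bandwidth_bytes_per_s(5600)     # DDR5-5600, assumed dual channel
    print(f"weights: {total / 1e9:.1f} GB")
    print(f"peak RAM bandwidth: {bw / 1e9:.1f} GB/s")
    print(f"decode ceiling: {decode_ceiling_tok_s(3e9, 5.5, bw):.1f} tok/s")
```

Under these assumptions the weights come to about 24 GB and the RAM-bound ceiling lands in the low 40s tok/s. Weights resident in VRAM are not limited by system RAM bandwidth, so the real limit can sit above this figure.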

Key Configuration

Using a fork of llama.cpp with TurboQuant support (turboquant_plus), the user runs llama-server with the following flags:

llama-server \
--model "<path>" \
--host 0.0.0.0 \
--port 8085 \
--ctx-size 192640 \
--n-gpu-layers 430 \
--n-cpu-moe 35 \
--cache-type-k "turbo4" \
--cache-type-v "turbo4" \
--flash-attn on \
--batch-size 2048 \
--parallel 1 \
--no-mmap \
--mlock \
--ubatch-size 512 \
--threads 6 \
--cont-batching \
--timeout 300 \
--temp 0.2 \
--top-p 0.95 \
--min-p 0.05 \
--top-k 20 \
--metrics \
--chat-template-kwargs '{"preserve_thinking": true}'
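Because the configuration passes --metrics, the server should expose a Prometheus-style /metrics endpoint on port 8085. The following is a minimal sketch for reading it from Python; the helper names are hypothetical, and the metric names the fork actually reports are not listed in the post, so the code makes no assumption about them.

```python
import urllib.request


def parse_prometheus(text: str) -> dict[str, float]:
    """Parse simple Prometheus samples ('name value') into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore anything that is not a plain numeric sample
    return samples


def fetch_metrics(base_url: str = "http://localhost:8085") -> dict[str, float]:
    """Fetch and parse /metrics from a running server started with --metrics."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=10) as resp:
        return parse_prometheus(resp.read().decode("utf-8"))
```

Calling fetch_metrics() before and after a long generation and comparing the returned values is one way to track throughput over time instead of relying on a single log line.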

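To push speeds to ~51 tok/s, the user suggests tuning three flags: --ctx-size, --n-gpu-layers, and --n-cpu-moe. The values shown above (192640, 430, and 35) are the starting point; adjust them slightly depending on stability and available memory.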
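One way to experiment with these three flags is to generate the command from a small helper, so each trial changes only the values under test. This is a sketch: build_command is a hypothetical helper, the model path is a placeholder, and the alternative --n-cpu-moe values in the loop are illustrative and not taken from the post.

```python
import shlex


def build_command(model_path: str, ctx_size: int = 192640,
                  n_gpu_layers: int = 430, n_cpu_moe: int = 35,
                  binary: str = "llama-server") -> list[str]:
    """Return the server argv, with the three tunable flags exposed as parameters."""
    return [
        binary,
        "--model", model_path,
        "--host", "0.0.0.0", "--port", "8085",
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--n-cpu-moe", str(n_cpu_moe),
        "--cache-type-k", "turbo4", "--cache-type-v", "turbo4",
        "--flash-attn", "on",
        "--batch-size", "2048", "--ubatch-size", "512",
        "--parallel", "1", "--threads", "6",
        "--no-mmap", "--mlock", "--cont-batching",
        "--timeout", "300", "--metrics",
        "--temp", "0.2", "--top-p", "0.95", "--min-p", "0.05", "--top-k", "20",
        "--chat-template-kwargs", '{"preserve_thinking": true}',
    ]


if __name__ == "__main__":
    # Print candidate commands; the values other than 35 are illustrative only.
    for moe in (35, 34, 33):
        print(shlex.join(build_command("/path/to/model.gguf", n_cpu_moe=moe)))
```

Lowering --n-cpu-moe keeps the expert weights of fewer layers on the CPU, which moves more work to the GPU at the cost of VRAM, so it makes sense to change one value at a time and watch for out-of-memory errors.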


Caveats

The Q4 quant is noticeably worse than Q5 for long-context reasoning.
Combining --no-mmap with --mlock reduces stuttering and slowdowns.
TurboQuant KV cache is critical at high context sizes.
High RAM bandwidth (DDR5) is important for these speeds.
Linux outperforms Windows significantly for this workload.
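To see why KV cache quantization matters at this context length, the sketch below estimates cache size from the usual formula: two tensors (K and V) per attention layer, each holding one vector per KV head per token. The architecture values are hypothetical placeholders, not the model's published configuration, and treating turbo4 as about 4 bits per element is an assumption that ignores any per-block overhead.

```python
def kv_cache_bytes(n_ctx: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_element: float) -> float:
    """Bytes for the K and V caches: 2 x layers x kv heads x head dim x tokens."""
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8


if __name__ == "__main__":
    # Hypothetical architecture values, chosen only to illustrate the scaling.
    shape = dict(n_ctx=192640, n_attn_layers=40, n_kv_heads=4, head_dim=128)
    for label, bits in (("16-bit", 16), ("~4-bit", 4)):
        gib = kv_cache_bytes(bits_per_element=bits, **shape) / 2**30
        print(f"{label}: {gib:.2f} GiB")
```

Whatever the real layer and head counts are, the cache grows linearly with context length and shrinks in proportion to the bit width, so moving from 16-bit to roughly 4-bit elements cuts it to about a quarter.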

Who This Is For

Developers running local LLMs with very long contexts (170k+ tokens) on consumer hardware, especially those with 8-12GB VRAM and fast system RAM.

📖 Read the full source: r/LocalLLaMA
