Running Qwen3.6-35B-A3B with ~190k Context on 8GB VRAM + 32GB RAM – Setup & Benchmarks

A Reddit user has posted a detailed setup for running Qwen3.6-35B-A3B GGUF models with ~190k context on a laptop with 8GB VRAM (RTX 4060) and 32GB DDR5 RAM. They report 37-43 tok/s out of the box, with tweaks pushing to ~51 tok/s.
Hardware & Models
- GPU: RTX 4060 8GB VRAM
- RAM: 32GB DDR5 5600MHz
- OS: Linux (performance noted as better than Windows)
- Models tested (Q5 quant):
mudler/Qwen3.6-35B-A3B-APEX-GGUF– ~40 tok/s to 37 tok/shesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF– ~43 tok/s to 37 tok/s
Key Configuration
Using a fork of llama.cpp with TurboQuant support (turboquant_plus), the user runs llama-server with the following flags:
--model "<path>" \
--host 0.0.0.0 \
--port 8085 \
--ctx-size 192640 \
--n-gpu-layers 430 \
--n-cpu-moe 35 \
--cache-type-k "turbo4" \
--cache-type-v "turbo4" \
--flash-attn on \
--batch-size 2048 \
--parallel 1 \
--no-mmap \
--mlock \
--ubatch-size 512 \
--threads 6 \
--cont-batching \
--timeout 300 \
--temp 0.2 \
--top-p 0.95 \
--min-p 0.05 \
--top-k 20 \
--metrics \
--chat-template-kwargs '{"preserve_thinking": true}'
To push speeds to ~51 tok/s, adjust three flags: --ctx-size 192640, --n-gpu-layers 430, --n-cpu-moe 35 (tweak slightly based on stability/memory).
Caveats
- Q4 quant is noticeably worse for long-context reasoning vs Q5.
--no-mmap+--mlockreduces stuttering slowdowns.- TurboQuant KV cache is critical at high context sizes.
- High RAM bandwidth (DDR5) is important for these speeds.
- Linux outperforms Windows significantly for this workload.
Who This Is For
Developers running local LLMs with very long contexts (170k+ tokens) on consumer hardware, especially those with 8-12GB VRAM and fast system RAM.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude Code Cheat Sheet with 140 Tips and LLMs.txt File
A GitHub repository contains a Claude Code cheat sheet with 140 tips organized into 14 sections, tagged by difficulty. The repository includes an llms.txt file that can be fed directly to Claude for learning or applying the tips.

V100 SXM2 NVLink Homelab Guide: Building 64GB Unified VRAM for ~$1,100
A comprehensive guide details how to build a V100 SXM2 homelab with 64GB of NVLink-unified VRAM for approximately $1,100 using reverse-engineered Chinese hardware, covering hardware sourcing, performance estimates, and software compatibility.

Building a Full BI System with Claude Code and Metabase for Under $50/month
A Reddit user built a complete BI system using Claude Code, BigQuery, and self-hosted Metabase — replacing $15k expert quotes with 3 days of work and $30/month in cloud costs.

Fix for 'VM Service Not Running' error in Cowork on Windows 11
A Reddit user shares a PowerShell command fix for the 'VM Service Not Running' error in Cowork when Hyper-V is installed but the hypervisor isn't launching at boot. The solution involves checking hypervisorlaunchtype and setting it to auto.