12GB VRAM Benchmarks: Running Qwen 3.6 and Gemma 4 Models on an RTX 4070 Super

✍️ OpenClawRadar | 📅 Published: April 30, 2026

A Reddit user has published speed benchmarks for running several large models, including MoE variants, on a 12 GB RTX 4070 Super (with a +10% overclock), paired with an AMD 9800X3D CPU and 64 GB of DDR5-6000 RAM. The user drives the display from the iGPU to save VRAM and notes a ~10% performance penalty otherwise. The setup uses CUDA 13.1 and the latest llama.cpp with the following shared runtime settings:

n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
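
As a rough illustration of how these settings could be handed to llama.cpp, the sketch below parses the `key = value` lines and turns each one into a long command-line option. It assumes every key maps to a same-named `--key` option and passes values through verbatim. The names `SETTINGS`, `parse_settings`, and `to_argv` are hypothetical, and whether a given build accepts `true` as the value for `flash-attn` is not confirmed by the source, so check `llama-server --help` before relying on it.

```python
# Shared settings exactly as listed in the article.
SETTINGS = """
n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
"""


def parse_settings(text):
    """Parse 'key = value' lines into a dict, skipping blank or malformed lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def to_argv(settings):
    """Render each setting as '--key value' (assumption: keys match option names)."""
    argv = []
    for key, value in settings.items():
        argv += [f"--{key}", value]
    return argv


if __name__ == "__main__":
    print(" ".join(to_argv(parse_settings(SETTINGS))))
```

Keeping the settings as data makes it easy to layer per-model values on top of the shared ones before building the argument list.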

Benchmark Results

The user tested four models using Unsloth GGUF quants, driving them from VS Code through the Cline and KiloCode extensions, and reported no tool-call issues. Results are given as token generation speed (tgs) and prompt processing speed (pps), both in tokens per second.

  • Qwen3.6-35B-A3B-GGUF Q6_K_XL: 40 tgs, 2100 pps
  • Qwen3.6-27B-IQ3_XXS: 16 tgs, 1000 pps
  • Gemma 4 26B-A4B-it-UD-Q8: 26 tgs, 2150 pps
  • Gemma-4-31B-it-IQ3_XXS: 13-16 tgs, 650 pps
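
To put these figures in practical terms, the sketch below estimates how long a single request might take from the reported speeds: prompt tokens divided by pps, plus output tokens divided by tgs. It assumes both speeds stay constant regardless of context depth, which is a simplification, and it uses the lower bound of the 13-16 tgs range for Gemma-4-31B. The `RESULTS` table and `estimate_seconds` are hypothetical names introduced for illustration.

```python
# (tgs, pps) pairs as reported in the article; 13 is the low end of "13-16".
RESULTS = {
    "Qwen3.6-35B-A3B-GGUF Q6_K_XL": (40, 2100),
    "Qwen3.6-27B-IQ3_XXS": (16, 1000),
    "Gemma 4 26B-A4B-it-UD-Q8": (26, 2150),
    "Gemma-4-31B-it-IQ3_XXS": (13, 650),
}


def estimate_seconds(prompt_tokens, output_tokens, tgs, pps):
    """Estimated wall time: prompt processing plus token generation.

    Assumes constant speeds, ignoring slowdown at deeper context.
    """
    return prompt_tokens / pps + output_tokens / tgs


if __name__ == "__main__":
    # Hypothetical request: 20,000 prompt tokens, 500 generated tokens.
    for name, (tgs, pps) in RESULTS.items():
        seconds = estimate_seconds(20_000, 500, tgs, pps)
        print(f"{name}: ~{seconds:.0f} s")
```

For agentic coding tools that send long prompts, the pps term often dominates, which is why the gap between 650 and 2150 pps matters as much as the generation speed.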

Notable Config Details

The user shared individual model configs with specific tuning. Key highlights:

  • For Qwen3.6-35B-A3B: n-cpu-moe = 35 (keeps the MoE expert weights of the first 35 layers on the CPU), cache-type-k = q8_0, cache-type-v = q8_0, swa-full = true, cache-reuse = 512, context size 131072, reasoning enabled with budget 8096.
  • For Gemma 4 26B: n-cpu-moe = 27, context 102400, fit = on with fit-target = 256 and fit-ctx = 32768.
  • For Gemma 4 31B: uses speculative decoding with ngram-mod (spec-type = ngram-mod), n-gpu-layers = 58 (partial GPU offload), cache-type-k = q4_0, no-kv-offload = true.
  • All models use flash-attn = true and no-mmproj-offload = true.
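
The cache-type settings above trade KV cache precision for memory. A minimal sketch of the arithmetic follows, assuming a plain full-attention cache where every layer stores K and V for every context position. In GGML, `f16` uses 2 bytes per element, while `q8_0` and `q4_0` pack 32 elements into 34 and 18 bytes respectively. The architecture numbers in the example (40 layers, 8 KV heads, head dimension 128) are hypothetical placeholders, not the real shapes of the models in the article, and sliding-window or hybrid attention layers would change the totals.

```python
# Bytes used by one 32-element block for each cache type.
BLOCK_ELEMS = 32
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Approximate KV cache size for a full-attention model.

    Assumes K and V each hold n_layers * n_kv_heads * head_dim * n_ctx elements.
    """
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    per_block = BYTES_PER_BLOCK[type_k] + BYTES_PER_BLOCK[type_v]
    return elems * per_block // BLOCK_ELEMS


if __name__ == "__main__":
    # Hypothetical architecture, used only to show the relative savings.
    shape = dict(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=65536)
    for kind in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(**shape, type_k=kind, type_v=kind)
        print(f"{kind}: {size / 2**30:.2f} GiB")
```

Under these assumptions, `q8_0` needs about 53% of the `f16` footprint and `q4_0` about 28%, which helps explain how long contexts can fit alongside the weights on a 12 GB card.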

The user's preferred model for web dev is Qwen3.6-35B-A3B, praising its quality with no tool call issues in VS Code extensions.

📖 Read the full source: r/LocalLLaMA
