12GB VRAM Benchmarks: Running Qwen 3.6 and Gemma 4 Models on an RTX 4070 Super

A Reddit user has published speed benchmarks for running several large models, including MoE variants, on a 12 GB RTX 4070 Super (with a 10% overclock), paired with an AMD 9800X3D CPU and 64 GB of DDR5-6000 RAM. The user drives the display from the iGPU to free up VRAM, noting a roughly 10% performance penalty otherwise. The setup uses CUDA 13.1 and the latest llama.cpp with the following shared settings:
n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
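For readers who launch llama.cpp from the command line rather than from a config file, the settings above can be turned into an argument list. The sketch below is only an illustration: it assumes each key maps directly to a `--key value` flag of `llama-server`, and `model.gguf` is a hypothetical placeholder path, not a file named in the post. How the `flash-attn` boolean is spelled on the command line may differ between llama.cpp builds, so check `llama-server --help` before relying on it.

```python
import shlex

# Shared settings copied from the list above.
# Assumption: every key mirrors a llama-server CLI flag of the same name.
SETTINGS = {
    "n-gpu-layers": "999",
    "threads": "8",
    "threads-batch": "16",
    "batch-size": "4096",
    "ubatch-size": "4096",
    "ctx-size": "65536",
    "flash-attn": "true",  # may need a different spelling on some builds
}


def to_cli_args(settings):
    """Flatten key/value settings into ['--key', 'value', ...]."""
    args = []
    for key, value in settings.items():
        args.extend([f"--{key}", str(value)])
    return args


def build_command(model_path, settings, binary="llama-server"):
    """Return the full command as a list suitable for subprocess.run."""
    return [binary, "--model", model_path, *to_cli_args(settings)]


if __name__ == "__main__":
    # "model.gguf" is a hypothetical path used only for illustration.
    print(shlex.join(build_command("model.gguf", SETTINGS)))
```

Building the command as a list rather than a single string avoids shell quoting problems if the model path contains spaces.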
Benchmark Results
The user tested four models via Unsloth GGUF quants in VS Code with Cline and KiloCode (no tool call issues). Results are reported as generation speed (tgs) and prompt processing speed (pps), both in tokens per second.
- Qwen3.6-35B-A3B-GGUF Q6_K_XL: 40 tgs, 2100 pps
- Qwen3.6-27B-IQ3_XXS: 16 tgs, 1000 pps
- Gemma 4 26B-A4B-it-UD-Q8: 26 tgs, 2150 pps
- Gemma-4-31B-it-IQ3_XXS: 13-16 tgs, 650 pps
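To put these figures in perspective, the two rates can be combined into a rough per-request latency: prompt tokens divided by pps, plus output tokens divided by tgs. The sketch below does that arithmetic for the four results. The 20,000-token prompt and 500-token reply are hypothetical values chosen for illustration, the low end of the 13-16 tgs range is used for Gemma 4 31B, and the estimate assumes constant speeds, ignoring prompt caching and any slowdown at deeper context.

```python
# (generation tok/s, prompt-processing tok/s) as reported in the list above.
RESULTS = {
    "Qwen3.6-35B-A3B Q6_K_XL": (40, 2100),
    "Qwen3.6-27B IQ3_XXS": (16, 1000),
    "Gemma 4 26B-A4B Q8": (26, 2150),
    "Gemma 4 31B IQ3_XXS": (13, 650),  # low end of the reported 13-16 tgs
}


def estimate_seconds(prompt_tokens, output_tokens, tgs, pps):
    """Rough latency: time to ingest the prompt plus time to generate output."""
    return prompt_tokens / pps + output_tokens / tgs


if __name__ == "__main__":
    # Hypothetical request size, not taken from the original post.
    prompt_tokens, output_tokens = 20_000, 500
    for name, (tgs, pps) in RESULTS.items():
        total = estimate_seconds(prompt_tokens, output_tokens, tgs, pps)
        print(f"{name}: ~{total:.1f} s")
```

For agentic coding tools that resend long contexts, the prompt-processing term can dominate, which is one reason the pps column matters as much as tgs.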
Notable Config Details
The user shared individual model configs with specific tuning. Key highlights:
- For Qwen3.6-35B-A3B: n-cpu-moe = 35 (keeps the MoE expert weights of the first 35 layers on the CPU), cache-type-k = q8_0, cache-type-v = q8_0, swa-full = true, cache-reuse = 512, context size 131072, reasoning enabled with budget 8096.
- For Gemma 4 26B: n-cpu-moe = 27, context 102400, fit = on with fit-target = 256 and fit-ctx = 32768.
- For Gemma 4 31B: uses speculative decoding with ngram-mod (spec-type = ngram-mod), n-gpu-layers = 58 (partial GPU offload), cache-type-k = q4_0, no-kv-offload = true.
- All models use flash-attn = true and no-mmproj-offload = true.
The user's preferred model for web dev is Qwen3.6-35B-A3B, praising its quality with no tool call issues in VS Code extensions.
📖 Read the full source: r/LocalLLaMA