12GB VRAM Benchmarks: Running Qwen 3.6 and Gemma 4 Models on an RTX 4070 Super

✍️ OpenClawRadar | 📅 Published: April 30, 2026 | 🔗 Source

A Reddit user has published speed benchmarks for running several large models, including MoE variants, on a 12 GB RTX 4070 Super (with a +10% overclock), paired with an AMD 9800X3D CPU and 64 GB of DDR5-6000 RAM. The display is driven by the iGPU to save VRAM; the user reports a roughly 10% performance penalty when the display runs on the RTX card instead. The setup uses CUDA 13.1 and the latest llama.cpp with the following base settings:

n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
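
As a rough illustration, settings in this key = value form can be turned into a command line. The sketch below assumes each key maps to a long option of the same name and passes every value through verbatim; the `settings_to_args` helper is hypothetical, and whether a given llama.cpp build accepts `true` as the value for `flash-attn` is not confirmed by the source.

```python
def settings_to_args(text):
    """Convert 'key = value' lines into a flat list of '--key', 'value' items."""
    args = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split("=", 1))
        args += [f"--{key}", value]
    return args


# Base settings as listed in the article.
BASE_SETTINGS = """
n-gpu-layers = 999
threads = 8
threads-batch = 16
batch-size = 4096
ubatch-size = 4096
ctx-size = 65536
flash-attn = true
"""

# Assumption: the server binary is named llama-server and is on PATH.
print(" ".join(["llama-server"] + settings_to_args(BASE_SETTINGS)))
```

A model path would still need to be appended, and the per-model overrides described later in the article would replace the matching base values.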

Benchmark Results

The user tested four models, all Unsloth GGUF quants, in VS Code with Cline and KiloCode and reported no tool-call issues. Results are given as token generation speed (tgs) and prompt processing speed (pps), both in tokens per second.

Qwen3.6-35B-A3B-GGUF Q6_K_XL: 40 tgs, 2100 pps
Qwen3.6-27B-IQ3_XXS: 16 tgs, 1000 pps
Gemma 4 26B-A4B-it-UD-Q8: 26 tgs, 2150 pps
Gemma-4-31B-it-IQ3_XXS: 13-16 tgs, 650 pps
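
To put these figures in practical terms, the sketch below estimates how long one agent turn might take for each model. The 20,000-token prompt and 500-token reply are hypothetical workload numbers, not from the source, and the calculation assumes both speeds stay constant regardless of how full the context is, which is a simplification.

```python
# (generation tok/s, prompt-processing tok/s) as reported in the article.
# The Gemma 31B generation figure uses the low end of its 13-16 range.
RESULTS = {
    "Qwen3.6-35B-A3B Q6_K_XL": (40, 2100),
    "Qwen3.6-27B IQ3_XXS": (16, 1000),
    "Gemma 4 26B-A4B Q8": (26, 2150),
    "Gemma 4 31B IQ3_XXS": (13, 650),
}


def turn_seconds(tgs, pps, prompt_tokens, output_tokens):
    """Time to ingest the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / pps + output_tokens / tgs


for name, (tgs, pps) in RESULTS.items():
    # Hypothetical workload: 20,000 prompt tokens, 500 generated tokens.
    secs = turn_seconds(tgs, pps, prompt_tokens=20_000, output_tokens=500)
    print(f"{name}: {secs:.1f} s")
```

With long coding-agent prompts, prompt processing speed accounts for a large share of each turn, which is why the pps column matters as much as tgs here.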

Notable Config Details

The user shared individual model configs with specific tuning. Key highlights:

For Qwen3.6-35B-A3B: n-cpu-moe = 35 (keeps the MoE expert weights of the first 35 layers on the CPU), cache-type-k = q8_0, cache-type-v = q8_0, swa-full = true, cache-reuse = 512, context size 131072, reasoning enabled with budget 8096.
For Gemma 4 26B: n-cpu-moe = 27, context 102400, fit = on with fit-target = 256 and fit-ctx = 32768.
For Gemma 4 31B: uses speculative decoding with ngram-mod (spec-type = ngram-mod), n-gpu-layers = 58 (partial GPU offload), cache-type-k = q4_0, no-kv-offload = true.
All models use flash-attn = true and no-mmproj-offload = true.
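
The cache-type-k and cache-type-v settings matter because the KV cache grows linearly with context length. The sketch below gives a back-of-the-envelope size estimate for the f16, q8_0, and q4_0 cache types. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real values for any model in this article, and the formula assumes every layer caches the full context, so models with sliding-window or non-attention layers would need less.

```python
# Bytes per cached value: f16 uses 2 bytes; q8_0 packs 32 values into
# 34 bytes; q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, type_k, type_v):
    """Approximate KV cache size in GiB for a full-attention model."""
    per_token = n_layers * n_kv_heads * head_dim  # values per token, for K or V
    bytes_per_pair = BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    return ctx * per_token * bytes_per_pair / 2**30


# HYPOTHETICAL architecture, chosen only to make the arithmetic easy to follow.
arch = dict(n_layers=40, n_kv_heads=4, head_dim=128)

for kind in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(ctx=131072, type_k=kind, type_v=kind, **arch)
    print(f"{kind}: {size:.2f} GiB")
```

Under these placeholder numbers, moving from f16 to q8_0 roughly halves the cache, which is the kind of saving that makes a 131072-token context plausible next to the model weights on a 12 GB card.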

For web development, the user prefers Qwen3.6-35B-A3B, praising its output quality and the absence of tool-call issues in VS Code extensions.

📖 Read the full source: r/LocalLLaMA
