Optimize Qwen3.5-9B on RTX 3070 Mobile: 50 tok/s Configs

Hardware and Software Setup

A developer documented their experience optimizing local inference on a laptop with an RTX 3070 Mobile GPU (8GB VRAM, effectively ~7.7GB usable). The system runs CachyOS (Arch-based Linux 6.19) with 32GB RAM and an Intel i7-10750H CPU. They used ik_llama.cpp (ikawrakow's optimized fork of llama.cpp) with the Qwen3.5-9B Q4_K_M model from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF.

Initial Configuration Issues

The initial naive configuration included several problems:

MoE-specific flags (--n-cpu-moe, -ger, -ser) were incorrectly applied to a non-MoE model (n_expert = 0)
--mlock was silently failing due to memory allocation limits (requires ulimit -l unlimited or limits.conf entry)
Batch size -b 4096 was consuming excessive VRAM (2004 MiB compute buffer), nearly 2GB on an 8GB card

This configuration produced ~47.8 t/s generation speed and ~82 t/s prompt evaluation with VRAM at ~97%.

Optimization Results

After fixing the configuration issues and adjusting batch sizes to -b 2048 -ub 512 (reducing compute buffer to 501 MiB), the developer tested different KV cache configurations:

Original (q4_0/q4_0, b4096): 47.8 t/s gen, 82.6 t/s prompt, ~97% VRAM
Fixed flags + b2048/ub512, q8_0K/q4_0V: 48.4 t/s gen, 189.9 t/s prompt, ~80% VRAM
q8_0K/q8_0V: 50.0 t/s gen, 213.0 t/s prompt, ~84% VRAM

The prompt evaluation speed increased dramatically from ~82 to ~213 t/s, primarily from reducing batch size to free up GPU memory. While generation speed showed minimal change (~2% difference between q4_0 and q8_0), the q8_0/q8_0 configuration produced noticeably more coherent and complete responses on longer outputs, worth the extra ~256 MiB VRAM usage.

Final Configuration

The optimized command for single-user local server use:

./build/bin/llama-server \
 -m ./models/Qwen3.5-9B.Q4_K_M.gguf \
 -ngl 999 \
 -fa on \
 -c 65536 \
 -b 2048 \
 -ub 512 \
 -ctk q8_0 \
 -ctv q8_0 \
 --threads 6 \
 --threads-batch 12

Open Questions and Future Testing

The developer identified several areas for further investigation:

GPU power limit tuning on mobile GPUs (potential to reduce TGP with minimal speed loss since inference is memory-bandwidth bound)
Other 8GB-compatible models with good coding or reasoning performance
Comparison of ik_llama.cpp vs mainline llama.cpp (ik-specific optimizations include fused ops and graph reuse)
Tips for hybrid SSM architecture (context shift warnings cause hard stops when context fills, no sliding window)

The testing used a prompt requesting implementation of a Rust Sieve of Eratosthenes program with algorithm explanation, complexity analysis, and example output for N=50.

📖 Read the full source: r/LocalLLaMA