Optimizing Qwen3.5-9B on RTX 3070 Mobile with ik_llama.cpp: Config Tweaks and Benchmarks

Hardware and Software Setup
A developer documented their experience optimizing local inference on a laptop with an RTX 3070 Mobile GPU (8GB VRAM, effectively ~7.7GB usable). The system runs CachyOS (Arch-based Linux 6.19) with 32GB RAM and an Intel i7-10750H CPU. They used ik_llama.cpp (ikawrakow's optimized fork of llama.cpp) with the Qwen3.5-9B Q4_K_M model from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF.
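The VRAM figures quoted in this write-up can be read from nvidia-smi. The sketch below is one way to turn its output into a usage percentage; the source does not say how the developer measured VRAM, and the helper name vram_percent is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "used, total" (MiB) CSV lines into a usage percentage.
# Lines without a positive total are ignored.
vram_percent() {
    awk -F', *' '$2 > 0 { printf "%.0f\n", 100 * $1 / $2 }'
}

# Only query the GPU if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total \
               --format=csv,noheader,nounits | vram_percent
fi
```

Running this while the server is loaded gives a number comparable to the VRAM percentages reported in the benchmarks.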
Initial Configuration Issues
The initial naive configuration included several problems:
- MoE-specific flags (--n-cpu-moe, -ger, -ser) were incorrectly applied to a non-MoE model (n_expert = 0)
- --mlock was silently failing due to memory allocation limits (requires ulimit -l unlimited or a limits.conf entry)
- Batch size -b 4096 was consuming excessive VRAM (2004 MiB compute buffer), nearly 2GB on an 8GB card
This configuration produced ~47.8 t/s generation speed and ~82 t/s prompt evaluation with VRAM at ~97%.
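Because the --mlock failure described above was silent, it can help to check the locked-memory limit before launching the server. This is a minimal sketch, not the developer's script: the helper memlock_sufficient is hypothetical, and the 6 GiB requirement is an assumed figure for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a locked-memory limit (as printed by
# `ulimit -l` in bash, i.e. KiB or the word "unlimited") covers the
# required size in KiB.
memlock_sufficient() {
    local limit="$1" required_kib="$2"
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$required_kib" ]
}

# Assumed figure for illustration only: about 6 GiB of data to lock.
required_kib=$((6 * 1024 * 1024))

if memlock_sufficient "$(ulimit -l)" "$required_kib"; then
    echo "memlock limit OK"
else
    echo "memlock limit too low: run 'ulimit -l unlimited' or add a memlock entry to limits.conf"
fi
```

A persistent fix is typically a memlock line in /etc/security/limits.conf (domain, type, item, value), which takes effect at the next login.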
Optimization Results
After fixing the configuration issues and adjusting batch sizes to -b 2048 -ub 512 (reducing compute buffer to 501 MiB), the developer tested different KV cache configurations:
- Original (q4_0/q4_0, b4096): 47.8 t/s gen, 82.6 t/s prompt, ~97% VRAM
- Fixed flags + b2048/ub512, q8_0K/q4_0V: 48.4 t/s gen, 189.9 t/s prompt, ~80% VRAM
- q8_0K/q8_0V: 50.0 t/s gen, 213.0 t/s prompt, ~84% VRAM
Prompt evaluation speed rose from ~82 to ~213 t/s, which the developer attributes primarily to the smaller batch size freeing up GPU memory. Generation speed changed little between the two KV cache settings (48.4 vs 50.0 t/s, about 3%), but the q8_0/q8_0 configuration produced noticeably more coherent and complete responses on longer outputs, which the developer considered worth the extra ~256 MiB of VRAM.
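The size of these changes can be checked directly from the reported figures. The sketch below only redoes the arithmetic on numbers quoted in this section; the helper names ratio and pct_change are hypothetical.

```shell
#!/usr/bin/env bash
# ratio A B: A divided by B, two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

# pct_change OLD NEW: percentage change from OLD to NEW, one decimal.
pct_change() { awk -v o="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * (n - o) / o }'; }

# Figures copied from the original and final runs listed above.
echo "prompt speedup (82.6 -> 213.0 t/s): $(ratio 213.0 82.6)x"
echo "generation change (47.8 -> 50.0 t/s): $(pct_change 47.8 50.0)%"
echo "compute buffer saved (2004 -> 501 MiB): $((2004 - 501)) MiB"
```

This prints a prompt speedup of 2.58x, a generation change of 4.6% between the original and final runs, and 1503 MiB of compute buffer saved.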
Final Configuration
The optimized command for single-user local server use:
./build/bin/llama-server \
  -m ./models/Qwen3.5-9B.Q4_K_M.gguf \
  -ngl 999 \
  -fa on \
  -c 65536 \
  -b 2048 \
  -ub 512 \
  -ctk q8_0 \
  -ctv q8_0 \
  --threads 6 \
  --threads-batch 12
Open Questions and Future Testing
The developer identified several areas for further investigation:
- GPU power limit tuning on mobile GPUs (potential to reduce TGP with minimal speed loss since inference is memory-bandwidth bound)
- Other 8GB-compatible models with good coding or reasoning performance
- Comparison of ik_llama.cpp vs mainline llama.cpp (ik-specific optimizations include fused ops and graph reuse)
- Tips for hybrid SSM architecture (context shift warnings cause hard stops when context fills, no sliding window)
The testing used a prompt requesting implementation of a Rust Sieve of Eratosthenes program with algorithm explanation, complexity analysis, and example output for N=50.
📖 Read the full source: r/LocalLLaMA