Optimizing Qwen3.5-9B on RTX 3070 Mobile with ik_llama.cpp: Config Tweaks and Benchmarks

Hardware and Software Setup
A developer documented their experience optimizing local inference on a laptop with an RTX 3070 Mobile GPU (8GB VRAM, effectively ~7.7GB usable). The system runs CachyOS (Arch-based Linux 6.19) with 32GB RAM and an Intel i7-10750H CPU. They used ik_llama.cpp (ikawrakow's optimized fork of llama.cpp) with the Qwen3.5-9B Q4_K_M model from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF.
Initial Configuration Issues
The initial naive configuration included several problems:
- MoE-specific flags (
--n-cpu-moe,-ger,-ser) were incorrectly applied to a non-MoE model (n_expert = 0) --mlockwas silently failing due to memory allocation limits (requiresulimit -l unlimitedor limits.conf entry)- Batch size
-b 4096was consuming excessive VRAM (2004 MiB compute buffer), nearly 2GB on an 8GB card
This configuration produced ~47.8 t/s generation speed and ~82 t/s prompt evaluation with VRAM at ~97%.
Optimization Results
After fixing the configuration issues and adjusting batch sizes to -b 2048 -ub 512 (reducing compute buffer to 501 MiB), the developer tested different KV cache configurations:
- Original (q4_0/q4_0, b4096): 47.8 t/s gen, 82.6 t/s prompt, ~97% VRAM
- Fixed flags + b2048/ub512, q8_0K/q4_0V: 48.4 t/s gen, 189.9 t/s prompt, ~80% VRAM
- q8_0K/q8_0V: 50.0 t/s gen, 213.0 t/s prompt, ~84% VRAM
The prompt evaluation speed increased dramatically from ~82 to ~213 t/s, primarily from reducing batch size to free up GPU memory. While generation speed showed minimal change (~2% difference between q4_0 and q8_0), the q8_0/q8_0 configuration produced noticeably more coherent and complete responses on longer outputs, worth the extra ~256 MiB VRAM usage.
Final Configuration
The optimized command for single-user local server use:
./build/bin/llama-server \
-m ./models/Qwen3.5-9B.Q4_K_M.gguf \
-ngl 999 \
-fa on \
-c 65536 \
-b 2048 \
-ub 512 \
-ctk q8_0 \
-ctv q8_0 \
--threads 6 \
--threads-batch 12Open Questions and Future Testing
The developer identified several areas for further investigation:
- GPU power limit tuning on mobile GPUs (potential to reduce TGP with minimal speed loss since inference is memory-bandwidth bound)
- Other 8GB-compatible models with good coding or reasoning performance
- Comparison of ik_llama.cpp vs mainline llama.cpp (ik-specific optimizations include fused ops and graph reuse)
- Tips for hybrid SSM architecture (context shift warnings cause hard stops when context fills, no sliding window)
The testing used a prompt requesting implementation of a Rust Sieve of Eratosthenes program with algorithm explanation, complexity analysis, and example output for N=50.
📖 Read the full source: r/LocalLLaMA
👀 See Also

RAG 챗봇 평가: 모델 스윕 + 검색 수정으로 비용 79% 절감 및 품질 19% 향상
한 개발자가 고객 지원 RAG 봇을 평가한 결과, 검색 설정 오류, 휴리스틱 평가자의 결함, 그리고 프로덕션 모델보다 성능이 뛰어난 더 저렴한 모델을 발견했습니다. 품질은 6.62에서 7.88로 향상되었고, 세션당 비용은 $0.002420에서 $0.000509로 감소했습니다.

OpenClaw 멀티 에이전트 플레이북: 5개/월당 7개의 격리된 에이전트
초점화된 메모리, 최소 권한, 스마트 모델 라우팅을 갖춘 전문화된 AI 에이전트 실행을 위한 완전한 아키텍처 가이드

오픈클로: 당신의 궁극적인 빠른 참조 치트시트
OpenClaw의 핵심을 파헤치는 유용한 레퍼런스 치트시트를 살펴보세요. AI 코딩 경험을 간소화하기 위한 핵심 기능과 기능들을 추출하세요.

클로드 코드 LSP 설정 가이드: 구조적 코드 이해
레딧 게시물은 구조적 코드 이해를 위해 텍스트 매칭 대신 Language Server Protocol을 사용하도록 Claude Code를 구성하는 방법을 설명하며, 정의로 이동, 참조 찾기, 호출 계층 구조 기능으로 쿼리 시간을 30-60초에서 ~50ms로 줄입니다.