Running OmniCoder-9B locally with llama.cpp: configuration details

Hardware and Model Setup
The setup uses mid-range hardware: AMD Ryzen 9 5900X CPU (12 cores; 12 threads used for inference), 62GB DDR4 RAM, NVIDIA RTX 3080 with 10GB VRAM, NVMe SSD, and Ubuntu 22.04 on a remote server.
The model is OmniCoder-9B, which Tesslate fine-tuned from Qwen3.5-9B on 425k+ coding agent trajectories. The setup uses the Q6_K quantization (6.85GB file size) with a 128K-token context window, sourced from HuggingFace.
llama.cpp Configuration
The model runs via llama.cpp server with these specific flags:
llama-server \
  --model /home/openclaw/models/omnicoder-9b/omnicoder-9b-q6_k.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  --threads 12 \
  --batch-size 128 \
  --flash-attn on \
  --temp 0.4 \
  --top-k 20 \
  --top-p 0.95 \
  --jinja \
  --reasoning-budget 0
Key parameters explained:
- --ctx-size 131072: 128K context window for large codebases
- --n-gpu-layers 99: Offload all layers to GPU
- --cache-type-k q8_0 --cache-type-v q4_0: Compressed KV cache to fit 128K context in 10GB VRAM
- --threads 12: Match physical cores (not hyperthreads)
- --flash-attn on: Faster attention computation
- --reasoning-budget 0: Disables chain-of-thought output in the reasoning_content field, making the model output code directly
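To give a rough sense of how much the quantized KV cache saves, the sketch below compares the q8_0/q4_0 combination against llama.cpp's default f16 cache. It relies on the ggml block layouts (32 values plus one fp16 scale per block) and assumes K and V hold the same number of elements; the function name kv_cache_ratio is illustrative only. It gives a relative size, not an absolute VRAM figure, since that depends on model architecture details not stated here.

```python
# Bits per element for ggml cache types. q8_0 and q4_0 store blocks of
# 32 values plus one fp16 scale: 34 bytes and 18 bytes per block.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,  # 8.5 bits per element
    "q4_0": 18 * 8 / 32,  # 4.5 bits per element
}


def kv_cache_ratio(cache_type_k: str, cache_type_v: str) -> float:
    """Return KV cache size relative to an f16/f16 cache.

    Assumes the K and V caches contain the same number of elements.
    """
    quantized = BITS_PER_ELEMENT[cache_type_k] + BITS_PER_ELEMENT[cache_type_v]
    baseline = 2 * BITS_PER_ELEMENT["f16"]
    return quantized / baseline


ratio = kv_cache_ratio("q8_0", "q4_0")
print(f"KV cache size relative to f16: {ratio:.3f}")
```

Under these assumptions the cache takes roughly 41% of the f16 size, which is consistent with a 128K context fitting next to a 6.85GB model file in 10GB of VRAM.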
Performance and Testing
Performance metrics: prompt evaluation at ~300 tokens/s, generation at ~80-90 tokens/s, VRAM usage ~8.5GB/10GB, latency 1-5 seconds for typical coding tasks.
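As a sanity check on the latency range, the reported throughput figures can be turned into a back-of-the-envelope estimate. This is a minimal sketch: the request sizes are hypothetical, 85 tokens/s is simply the midpoint of the reported generation range, and network and scheduling overhead are ignored.

```python
PROMPT_TOKENS_PER_S = 300.0  # reported prompt evaluation rate
GEN_TOKENS_PER_S = 85.0      # midpoint of the reported 80-90 tokens/s


def estimate_latency(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate end-to-end seconds from the two throughput figures alone."""
    prompt_time = prompt_tokens / PROMPT_TOKENS_PER_S
    generation_time = output_tokens / GEN_TOKENS_PER_S
    return prompt_time + generation_time


# Hypothetical request sizes, chosen only for illustration.
for prompt_tokens, output_tokens in [(150, 50), (600, 250)]:
    seconds = estimate_latency(prompt_tokens, output_tokens)
    print(f"{prompt_tokens} prompt + {output_tokens} output tokens: {seconds:.2f}s")
```

Both hypothetical requests land inside the reported 1-5 second range, with generation time dominating once the output grows past a few dozen tokens.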
The testing was conducted by Agent Zero, an autonomous agent framework using GLM-5 as its main brain. Agent Zero discovered the --reasoning-budget 0 flag, SSH'd into the remote server, updated the systemd service, created benchmark scripts from scratch, ran multiple benchmarks (HumanEval base, HumanEval Pro, MBPP, MultiPL-E), and iterated on prompt engineering.
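The benchmark scripts themselves are not shown in the source. A HumanEval-style check usually comes down to executing a generated solution against its unit tests and reporting the share of problems that pass. The following is a minimal sketch of that idea, not Agent Zero's actual script; the function names and the two sample problems are hypothetical, and a real harness would run untrusted model output in a sandbox with timeouts.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Run a generated solution and its asserts in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # defines the candidate function
        exec(test_src, namespace)       # asserts raise on failure
    except Exception:
        return False
    return True


def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems solved with a single attempt each."""
    return 100.0 * sum(results) / len(results)


# Two hypothetical problems: one correct completion, one buggy one.
problems = [
    ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5"),
    ("def double(x):\n    return x + 1\n", "assert double(4) == 8"),
]
results = [passes_tests(src, tests) for src, tests in problems]
print(f"pass@1: {pass_at_1(results):.1f}%")
```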
Benchmark Results
Benchmark results compared to official claims:
- HumanEval base: Official 92.7%, Run 1: 100%, Run 2: 95%, Run 3: 95%, Average: 96.7%
- HumanEval Pro: Official 70.1%, Run 1: 70%, Average: 70%
The average HumanEval base score of 96.7% exceeds the official 92.7% by about 4 points, while the single HumanEval Pro run (70%) is in line with the official 70.1%.
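The averages and the gaps to the official figures can be reproduced from the per-run scores listed above. This short sketch uses only the numbers from the results list.

```python
from statistics import mean

# Scores in percent, copied from the results list above.
official = {"HumanEval base": 92.7, "HumanEval Pro": 70.1}
runs = {"HumanEval base": [100.0, 95.0, 95.0], "HumanEval Pro": [70.0]}

for name, scores in runs.items():
    average = mean(scores)
    delta = average - official[name]
    print(f"{name}: average {average:.1f}%, {delta:+.1f} points vs official")
```

Note that HumanEval Pro has a single run, so its "average" is just that one score.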
📖 Read the full source: r/LocalLLaMA