Run OmniCoder-9B Locally: llama.cpp Config & 96.7% Score

Hardware and Model Setup

The setup uses mid-range hardware: AMD Ryzen 9 5900X CPU (12 threads used for inference), 62GB DDR4 RAM, NVIDIA RTX 3080 with 10GB VRAM, NVMe SSD, and Ubuntu 22.04 on a remote server.

The model is OmniCoder-9B, based on Qwen3.5-9B, fine-tuned on 425k+ coding agent trajectories by Tesslate. It uses Q6_K quantization (6.85GB file size) with 128K token context window, sourced from HuggingFace.

llama.cpp Configuration

The model runs via llama.cpp server with these specific flags:

llama-server \
--model /home/openclaw/models/omnicoder-9b/omnicoder-9b-q6_k.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 131072 \
--n-gpu-layers 99 \
--cache-type-k q8_0 \
--cache-type-v q4_0 \
--threads 12 \
--batch-size 128 \
--flash-attn on \
--temp 0.4 \
--top-k 20 \
--top-p 0.95 \
--jinja \
--reasoning-budget 0

Key parameters explained:

--ctx-size 131072: 128K context window for large codebases
--n-gpu-layers 99: Offload all layers to GPU
--cache-type-k q8_0 --cache-type-v q4_0: Compressed KV cache to fit 128K context in 10GB VRAM
--threads 12: Match physical cores (not hyperthreads)
--flash-attn on: Faster attention computation
--reasoning-budget 0: Disables chain-of-thought output in the reasoning_content field, making the model output code directly

Performance and Testing

Performance metrics: prompt evaluation at ~300 tokens/s, generation at ~80-90 tokens/s, VRAM usage ~8.5GB/10GB, latency 1-5 seconds for typical coding tasks.

The testing was conducted by Agent Zero, an autonomous agent framework using GLM-5 as its main brain. Agent Zero discovered the --reasoning-budget 0 flag, SSH'd into the remote server, updated the systemd service, created benchmark scripts from scratch, ran multiple benchmarks (HumanEval base, HumanEval Pro, MBPP, MultiPL-E), and iterated on prompt engineering.