Running OmniCoder-9B locally with llama.cpp configuration details

✍️ OpenClawRadar📅 Published: March 14, 2026🔗 Source
Running OmniCoder-9B locally with llama.cpp configuration details
Ad

Hardware and Model Setup

The setup uses mid-range hardware: AMD Ryzen 9 5900X CPU (12 threads used for inference), 62GB DDR4 RAM, NVIDIA RTX 3080 with 10GB VRAM, NVMe SSD, and Ubuntu 22.04 on a remote server.

The model is OmniCoder-9B, based on Qwen3.5-9B, fine-tuned on 425k+ coding agent trajectories by Tesslate. It uses Q6_K quantization (6.85GB file size) with 128K token context window, sourced from HuggingFace.

llama.cpp Configuration

The model runs via llama.cpp server with these specific flags:

llama-server \
--model /home/openclaw/models/omnicoder-9b/omnicoder-9b-q6_k.gguf \
--host 0.0.0.0 --port 8080 \
--ctx-size 131072 \
--n-gpu-layers 99 \
--cache-type-k q8_0 \
--cache-type-v q4_0 \
--threads 12 \
--batch-size 128 \
--flash-attn on \
--temp 0.4 \
--top-k 20 \
--top-p 0.95 \
--jinja \
--reasoning-budget 0

Key parameters explained:

  • --ctx-size 131072: 128K context window for large codebases
  • --n-gpu-layers 99: Offload all layers to GPU
  • --cache-type-k q8_0 --cache-type-v q4_0: Compressed KV cache to fit 128K context in 10GB VRAM
  • --threads 12: Match physical cores (not hyperthreads)
  • --flash-attn on: Faster attention computation
  • --reasoning-budget 0: Disables chain-of-thought output in the reasoning_content field, making the model output code directly
Ad

Performance and Testing

Performance metrics: prompt evaluation at ~300 tokens/s, generation at ~80-90 tokens/s, VRAM usage ~8.5GB/10GB, latency 1-5 seconds for typical coding tasks.

The testing was conducted by Agent Zero, an autonomous agent framework using GLM-5 as its main brain. Agent Zero discovered the --reasoning-budget 0 flag, SSH'd into the remote server, updated the systemd service, created benchmark scripts from scratch, ran multiple benchmarks (HumanEval base, HumanEval Pro, MBPP, MultiPL-E), and iterated on prompt engineering.

Benchmark Results

Benchmark results compared to official claims:

  • HumanEval base: Official 92.7%, Run 1: 100%, Run 2: 95%, Run 3: 95%, Average: 96.7%
  • HumanEval Pro: Official 70.1%, Run 1: 70%, Average: 70%

The average HumanEval base score of 96.7% exceeds the official 92.7%, while HumanEval Pro matches exactly at 70%.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also