Omnicoder-9B Performance Review: Speed vs. Tool Calling Issues

Technical Overview
Omnicoder-9B is a coding-specific model developed by Tesslate, based on the Qwen 3.5 architecture. It's fine-tuned on top of Qwen3.5 9B using outputs from multiple models including Opus 4.6, GPT 5.4, GPT 5.3 Codex, and Gemini 3.1 Pro.
Performance Characteristics
The model demonstrates strong performance on mid-tier hardware. With 12GB of VRAM, users report consistent token generation at 15 tokens/second even with context size set to 100k. Prompt processing is notably fast at approximately 265 tokens/second. The model runs without crashing systems or causing performance degradation.
Limitations and Issues
Despite the speed advantages, Omnicoder-9B shows several limitations in practical coding scenarios:
- Failed to generate a complete Super Mario clone in a standalone HTML file with a one-shot prompt
- Experienced tool calling failures with MCP servers, generating MCP errors during data fetching
- Issues executing write tool calls from Claude Code, though this may involve compatibility factors
IDE Integration Testing
Testing in development environments revealed mixed results:
- In LM Studio with Roo Code: Disconnections occurred as token size increased to 4k, though this appears to be an integration issue rather than model-specific
- The model successfully updated or wrote small scripts with token sizes between 2-3k
- API requests failed for tokens above 4k without error messages
- In Claude Code: Token generation felt slower compared to Roo Code, and the model failed to execute write tool calls after generating output
The user notes that Roo Code has been the most effective extension for local LLMs among Continue and other tested options.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude File History: VS Code Extension for Tracking Claude Code Sessions
A VS Code extension called Claude File History tracks every Claude Code session that touched your files, allowing you to find past conversations, preview what was discussed, and resume conversations with a double-click.

SpruceChat Runs 0.5B LLM On-Device on Miyoo Handhelds via llama.cpp
SpruceChat runs Qwen2.5-0.5B entirely on-device on handheld gaming devices using llama.cpp, with no cloud or WiFi required. On a Miyoo A30 (Cortex-A7 quad-core), it loads in ~60 seconds and generates at ~1-2 tokens/second.

PocketTeam: A Claude Code Pipeline with Hook-Based Safety and Learning Agents
PocketTeam is a Claude Code pipeline that implements 9 safety layers at the tool-call level to block dangerous operations like writes to .env or rm -rf commands. The system includes an Observer agent that analyzes completed tasks and writes structured learnings to improve future agent performance.

DecisionNode: CLI and MCP Server for Semantic Decision Storage
DecisionNode is a local-only CLI and MCP server that stores structured decisions as JSON, embeds them as vectors for semantic search, and makes them accessible across AI tools via MCP. It's MIT licensed and designed to work with Claude Code, Cursor, Windsurf, Antigravity, and other MCP clients.