Comparison of 8 AI Coding Models on Real-World TypeScript Feature Implementation

Real-World AI Coding Model Comparison
A developer conducted a practical comparison of 8 AI coding models by having them implement the same real-world feature in an existing TypeScript project. The goal was to move beyond synthetic benchmarks and see how models perform when working with actual codebases.
The Test Setup
The project used was OpenCode Telegram Bot, an open-source TypeScript bot built with the grammY framework that provides a Telegram interface to OpenCode. The bot has i18n support and existing test coverage.
The task was implementing a /rename command that renames the current working session. This feature touches all application layers and requires handling multiple edge cases. The original implementation had been reverted, providing a clean baseline for evaluation.
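To make the kind of edge cases concrete, here is a minimal sketch of the core decision logic such a command might need. It is not the bot's actual code: the `Session` shape, the `renameSession` function, the length limit, and the i18n message keys are all hypothetical, and the grammY wiring is deliberately left out so the logic stays self-contained.

```typescript
// Hypothetical sketch only: none of these names come from the real project.
interface Session {
  id: string;
  title: string;
}

// The result carries an i18n message key instead of display text,
// since the bot is described as having i18n support.
type RenameResult =
  | { kind: "renamed"; session: Session }
  | { kind: "error"; messageKey: string };

// Assumed limit, chosen only for illustration.
const MAX_TITLE_LENGTH = 100;

function renameSession(
  current: Session | undefined,
  rawInput: string,
): RenameResult {
  // Edge case: the user has no active session to rename.
  if (current === undefined) {
    return { kind: "error", messageKey: "rename.no_active_session" };
  }

  // Edge case: empty or whitespace-only title.
  const title = rawInput.trim();
  if (title.length === 0) {
    return { kind: "error", messageKey: "rename.empty_title" };
  }

  // Edge case: title longer than the assumed limit.
  if (title.length > MAX_TITLE_LENGTH) {
    return { kind: "error", messageKey: "rename.title_too_long" };
  }

  // Return a new object rather than mutating the caller's session.
  return { kind: "renamed", session: { id: current.id, title } };
}
```

Keeping this logic in a pure function, separate from the Telegram handler, is one way to make the edge cases unit-testable without a bot instance.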
Each model received the same prompt in two phases: first in planning mode (studying the codebase and forming an implementation plan), then in coding mode. All testing was done in OpenCode with reasoning ("thinking" mode) enabled.
Models Tested
- Claude 4.6 Sonnet ($3.00 input/$15.00 output per 1M tokens)
- Claude 4.6 Opus ($5.00/$25.00)
- GLM 5 ($1.00/$3.20)
- Kimi K2.5 ($0.60/$3.00)
- MiniMax M2.5 ($0.30/$1.20)
- GPT 5.3 Codex (high) ($1.75/$14.00)
- GPT 5.4 (high) ($2.50/$15.00)
- Gemini 3.1 Pro (high) ($2.00/$12.00)
Coding Index and Agentic Index data came from Artificial Analysis. All models were accessed through OpenCode Zen, a provider from the OpenCode team that tests models for compatibility with their tool.
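As a rough illustration of how the per-token prices above translate into a task cost, the sketch below sums input and output charges over a list of API calls. The prices are copied from the model list; the `Usage` shape, the `taskCost` function, and the token counts in the example are assumptions, and any cached-token discounts a provider may apply are ignored.

```typescript
// Prices in USD per 1M tokens, copied from the model list above.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const PRICING: Record<string, Pricing> = {
  "Claude 4.6 Sonnet": { inputPerM: 3.0, outputPerM: 15.0 },
  "Claude 4.6 Opus": { inputPerM: 5.0, outputPerM: 25.0 },
  "GLM 5": { inputPerM: 1.0, outputPerM: 3.2 },
  "Kimi K2.5": { inputPerM: 0.6, outputPerM: 3.0 },
  "MiniMax M2.5": { inputPerM: 0.3, outputPerM: 1.2 },
  "GPT 5.3 Codex (high)": { inputPerM: 1.75, outputPerM: 14.0 },
  "GPT 5.4 (high)": { inputPerM: 2.5, outputPerM: 15.0 },
  "Gemini 3.1 Pro (high)": { inputPerM: 2.0, outputPerM: 12.0 },
};

// Hypothetical shape for the token counts of a single API call.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Sums the cost of every call made during a task, sub-agent calls included.
function taskCost(model: string, calls: Usage[]): number {
  const price = PRICING[model];
  if (price === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  let total = 0;
  for (const call of calls) {
    total += (call.inputTokens / 1e6) * price.inputPerM;
    total += (call.outputTokens / 1e6) * price.outputPerM;
  }
  return total;
}

// Made-up token counts: one planning call followed by one coding call.
const exampleCost = taskCost("GLM 5", [
  { inputTokens: 120000, outputTokens: 8000 },
  { inputTokens: 450000, outputTokens: 30000 },
]);
console.log(`Estimated cost: $${exampleCost.toFixed(4)}`);
```

Because output tokens cost several times more than input tokens for every model listed, a model that reasons at length can end up more expensive than its input price suggests.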
Evaluation Methodology
Four metrics were used:
- API cost ($) - Total cost of all API calls during the task, including sub-agents
- Execution time (mm:ss) - Total model working time
- Implementation correctness (0-10) - How well the behavior matches requirements and edge cases
- Technical quality (0-10) - Engineering quality of the solution
For the correctness and quality scores, the original (reverted) /rename implementation served as the reference for deriving detailed evaluation criteria covering command integration, main flow, error handling, cancellation, i18n, documentation, architecture, state management, tests, and tech debt. Evaluation was performed by GPT 5.3 Codex against a structured rubric; scores from repeated runs varied by no more than ±0.5 points.
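The run-to-run check can be sketched as follows: average the scores from repeated evaluations and test whether every run stays within a tolerance of the mean. The `summarizeRuns` function and its return shape are hypothetical, and the source does not say how variance was actually measured; deviation from the mean is just one reasonable reading of "within ±0.5 points".

```typescript
// Hypothetical helper: summarizes repeated rubric scores on the 0-10 scale.
interface RunSummary {
  mean: number;
  maxDeviation: number; // largest absolute distance of any run from the mean
  stable: boolean; // true when every run is within the tolerance
}

function summarizeRuns(scores: number[], tolerance: number = 0.5): RunSummary {
  if (scores.length === 0) {
    throw new Error("At least one run is required");
  }
  for (const score of scores) {
    if (score < 0 || score > 10) {
      throw new RangeError(`Score out of the 0-10 range: ${score}`);
    }
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const maxDeviation = Math.max(...scores.map((s) => Math.abs(s - mean)));
  return { mean, maxDeviation, stable: maxDeviation <= tolerance };
}

// Made-up scores from three evaluation runs of the same implementation.
const summary = summarizeRuns([8, 8.5, 7.5]);
console.log(summary);
```

Reporting the mean together with the spread makes it easier to tell a real difference between two models from evaluator noise.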
Key Findings
GPT 5.4 (high) achieved the highest implementation correctness. The figures quoted alongside it, 57 and 69, appear to be Artificial Analysis index values rather than a correctness score, since correctness was rated on a 0-10 scale. GLM 5 offered a strong cost-performance ratio, pairing a price of $1.00/$3.20 per 1M tokens with a Coding Index of 53. The experiment suggests that inexpensive open-source models from China are approaching proprietary ones on practical coding tasks, though benchmarks alone don't tell the full story.
📖 Read the full source: r/LocalLLaMA