8 AI Coding Models Compared: TypeScript Feature Implementation

Real-World AI Coding Model Comparison

A developer conducted a practical comparison of 8 AI coding models by having them implement the same real-world feature in an existing TypeScript project. The goal was to move beyond synthetic benchmarks and see how models perform when working with actual codebases.

The Test Setup

The project used was OpenCode Telegram Bot, an open-source TypeScript bot built with the grammY framework that provides a Telegram interface to OpenCode capabilities. The bot has i18n support and existing test coverage.

The task was implementing a /rename command that renames the current working session. This feature touches all application layers and requires handling multiple edge cases. The original implementation had been reverted, providing a clean baseline for evaluation.
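To make the "edge cases" concrete, here is a minimal sketch of the decision logic a /rename flow might need. It is not the project's actual implementation: the Session shape, the /cancel keyword, the i18n message keys, and the specific edge cases (no active session, empty title, cancellation) are all assumptions chosen for illustration.

```typescript
// Hypothetical shapes; the real bot's session model is not described in the source.
interface Session {
  id: string;
  title: string;
}

type RenameResult =
  | { kind: "renamed"; title: string }
  | { kind: "cancelled" }
  | { kind: "error"; messageKey: string }; // messageKey is an assumed i18n key

// Decide what to do with the text a user sends after /rename.
// Kept as a pure function so it can be unit-tested without Telegram or grammY.
function resolveRename(session: Session | null, rawInput: string): RenameResult {
  // Assumed edge case: there is no working session to rename.
  if (session === null) {
    return { kind: "error", messageKey: "rename.no_session" };
  }

  const input = rawInput.trim();

  // Assumed edge case: the user backs out of the rename flow.
  if (input === "/cancel") {
    return { kind: "cancelled" };
  }

  // Assumed edge case: empty or whitespace-only title.
  if (input.length === 0) {
    return { kind: "error", messageKey: "rename.empty_title" };
  }

  return { kind: "renamed", title: input };
}
```

Keeping the decision separate from the command handler is one way such a feature can stay testable; whether the evaluated implementations did this is not stated in the source.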

Each model received the same prompt in two phases: first in planning mode (studying the codebase and forming an implementation plan), then in coding mode. All testing was done in OpenCode with "thinking" mode (reasoning) enabled.

Models Tested

Claude 4.6 Sonnet ($3.00 input/$15.00 output per 1M tokens)
Claude 4.6 Opus ($5.00/$25.00)
GLM 5 ($1.00/$3.20)
Kimi K2.5 ($0.60/$3.00)
MiniMax M2.5 ($0.30/$1.20)
GPT 5.3 Codex (high) ($1.75/$14.00)
GPT 5.4 (high) ($2.50/$15.00)
Gemini 3.1 Pro (high) ($2.00/$12.00)
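
As a rough illustration of how the listed prices translate into a per-task cost, the sketch below sums input and output token charges over a set of API calls. The prices are copied from the list above; the Usage shape and the token counts in the usage note are hypothetical, and the calculation ignores cached-token discounts or any other provider-specific billing rules.

```typescript
// USD per 1M tokens, as listed above.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const PRICING: Record<string, Pricing> = {
  "Claude 4.6 Sonnet": { inputPerM: 3.0, outputPerM: 15.0 },
  "Claude 4.6 Opus": { inputPerM: 5.0, outputPerM: 25.0 },
  "GLM 5": { inputPerM: 1.0, outputPerM: 3.2 },
  "Kimi K2.5": { inputPerM: 0.6, outputPerM: 3.0 },
  "MiniMax M2.5": { inputPerM: 0.3, outputPerM: 1.2 },
  "GPT 5.3 Codex (high)": { inputPerM: 1.75, outputPerM: 14.0 },
  "GPT 5.4 (high)": { inputPerM: 2.5, outputPerM: 15.0 },
  "Gemini 3.1 Pro (high)": { inputPerM: 2.0, outputPerM: 12.0 },
};

// Hypothetical per-call usage record (one entry per API call, sub-agents included).
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Total cost in USD for all calls made by one model during a task.
function apiCost(model: string, calls: Usage[]): number {
  const price = PRICING[model];
  if (price === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  let total = 0;
  for (const call of calls) {
    total += (call.inputTokens / 1_000_000) * price.inputPerM;
    total += (call.outputTokens / 1_000_000) * price.outputPerM;
  }
  return total;
}
```

For example, apiCost("GLM 5", [{ inputTokens: 1_000_000, outputTokens: 1_000_000 }]) comes to about $4.20, while the same made-up usage on Claude 4.6 Opus comes to about $30.00.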

Coding Index and Agentic Index data came from Artificial Analysis. All models were accessed through OpenCode Zen, a provider from the OpenCode team that tests models for compatibility with their tool.

Evaluation Methodology

Four metrics were used:

API cost ($) - Total cost of all API calls during the task, including sub-agents
Execution time (mm:ss) - Total model working time
Implementation correctness (0-10) - How well the behavior matches requirements and edge cases
Technical quality (0-10) - Engineering quality of the solution

For correctness and quality scores, the original (since reverted) /rename implementation was used to derive detailed evaluation criteria covering command integration, main flow, error handling, cancellation, i18n, documentation, architecture, state management, tests, and tech debt. Evaluation was performed by GPT 5.3 Codex against a structured rubric; repeated evaluation runs varied by no more than ±0.5 points.
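
The run-to-run stability check can be sketched as follows. This assumes that "within ±0.5 points" means each run's score deviates from the mean of the runs by at most 0.5; the source does not say exactly how the spread was measured, and the function name and sample scores are made up.

```typescript
// Summarize repeated rubric evaluations of the same implementation.
// `scores` holds one 0-10 score per evaluation run.
function summarizeRuns(scores: number[]): { mean: number; spread: number } {
  if (scores.length === 0) {
    throw new Error("At least one evaluation run is required");
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  // Largest absolute deviation of any single run from the mean.
  const spread = Math.max(...scores.map((s) => Math.abs(s - mean)));
  return { mean, spread };
}

// Assumed interpretation of the ±0.5 tolerance mentioned above.
function isStable(scores: number[], tolerance = 0.5): boolean {
  return summarizeRuns(scores).spread <= tolerance;
}
```

With the made-up scores [8, 8.5, 7.5], the mean is 8 and the spread is 0.5, so isStable returns true; adding a run that scored 6 would push the spread past the tolerance.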

Key Findings

The results showed GPT 5.4 (high) achieving the highest implementation correctness. The figures quoted alongside it, 57 and 69, are tied to the Agentic Index from Artificial Analysis and should not be read as scores on the 0-10 correctness scale defined above. GLM 5 demonstrated a strong cost-performance ratio at $1.00/$3.20 per 1M tokens with a Coding Index of 53. The experiment suggested that inexpensive open-source models from China are approaching proprietary ones in practical coding tasks, though benchmarks alone don't tell the full story.

📖 Read the full source: r/LocalLLaMA