TranslateGemma-12b: Human Review Flags 71% of Translations That Scored Well on Automated Metrics

A follow-up audit of TranslateGemma-12b subtitle translations suggests that automated metrics substantially understate real-world errors. The original benchmark showed the model beating frontier general-purpose models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) across 6 languages. To check that result, the team added human review.

Setup

21 English subtitle segments from one tutorial video
Translated by TranslateGemma-12b into 4 languages: ES, JA, TH, ZH-CN (Korean and Traditional Chinese, part of the original 6, were dropped)
84 translations total, preselected because they scored well on automated metrics
Every translation sent to human MQM (Multidimensional Quality Metrics) review

Results

Under the dashboard's own red-flag threshold (MX ≥ 5 OR CK < 0.70):

Auto-flagged: 1/84 (1.2%)
Human-flagged (any): 60/84 (71%)
Human-flagged (Major): 13/84 (15%)
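
The red-flag rule and the three rates above can be reproduced in a few lines. This is a minimal sketch, assuming MX is an error-style score where higher is worse and CK is a quality score (presumably COMETKiwi) where lower is worse. The function names are hypothetical; only the threshold and the counts come from the audit.

```python
def is_auto_flagged(mx: float, ck: float,
                    mx_max: float = 5.0, ck_min: float = 0.70) -> bool:
    """Dashboard rule as stated in the audit: flag when MX >= 5 OR CK < 0.70."""
    return mx >= mx_max or ck < ck_min


def rate(count: int, total: int) -> float:
    """Share of translations as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


TOTAL = 84  # 21 segments x 4 languages

# Counts reported in the audit.
auto_rate = rate(1, TOTAL)          # auto-flagged
human_any_rate = rate(60, TOTAL)    # human-flagged, any severity
human_major_rate = rate(13, TOTAL)  # human-flagged, Major

print(auto_rate, human_any_rate, human_major_rate)
```

Because the rule is an OR, a translation with a high CK score is still flagged if MX reaches the threshold, and vice versa.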

Per language:

ES: 0/21 auto, 11/21 human-flagged, 2/21 Major. Mostly tone inconsistencies (formal/informal switches); the easiest of the four languages.
JA: 0/21 auto, 17/21 human-flagged, 3/21 Major. A “fluent but wrong meaning” pattern: 10 of the 15 mistranslations in the dataset are in JA. A high mean COMETKiwi score (0.86) masked the errors. The same failure mode was seen in Claude Sonnet 4.6 on JA.
TH: 0/21 auto, 17/21 human-flagged, 5/21 Major. Over-production: 5 Accuracy/Addition errors (content inserted that is not in the source), plus punctuation errors from English-style periods.
ZH-CN: 1/21 auto (a Style error), 15/21 human-flagged, 3/21 Major. Issues include an omission of “store” that changes the meaning and inconsistent translation of “ticket” across segments.
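
The per-language figures can be cross-checked against the headline totals. The sketch below is illustrative: the `LangResult` type and variable names are hypothetical, while the counts are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangResult:
    auto: int          # flagged by the automated threshold
    human_any: int     # flagged by human reviewers, any severity
    human_major: int   # flagged by human reviewers as Major
    segments: int = 21


# Counts as reported per language in the audit.
RESULTS = {
    "ES": LangResult(0, 11, 2),
    "JA": LangResult(0, 17, 3),
    "TH": LangResult(0, 17, 5),
    "ZH-CN": LangResult(1, 15, 3),
}

totals = {
    "segments": sum(r.segments for r in RESULTS.values()),
    "auto": sum(r.auto for r in RESULTS.values()),
    "human_any": sum(r.human_any for r in RESULTS.values()),
    "human_major": sum(r.human_major for r in RESULTS.values()),
}

# Language with the highest share of Major errors.
worst_major = max(RESULTS, key=lambda lang: RESULTS[lang].human_major)

print(totals, worst_major)
```

The sums come out to 84 segments, 1 auto-flagged, 60 human-flagged, and 13 Major, which matches the overall results, and TH has the most Major errors.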

All 25 Accuracy-class errors (mistranslation, omission, addition, untranslated) fell in the metric-blind quadrant, meaning human reviewers flagged them but the automated threshold did not. The metrics caught zero accuracy errors.
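
One way to picture the quadrants is to cross the automated flag with the human flag. The following is a sketch under assumptions: only the "metric-blind" label comes from the audit, the other three labels are hypothetical names, and the sample records are made up for illustration rather than taken from the audit data.

```python
from collections import Counter


def quadrant(auto_flagged: bool, human_flagged: bool) -> str:
    """Place one translation in the auto-flag x human-flag grid."""
    if human_flagged and not auto_flagged:
        return "metric-blind"   # humans found an issue, metrics did not
    if human_flagged and auto_flagged:
        return "caught"         # hypothetical label: both agree there is an issue
    if auto_flagged:
        return "false-alarm"    # hypothetical label: metrics flagged, humans did not
    return "clean"              # hypothetical label: neither flagged


# Illustrative (auto_flagged, human_flagged) pairs, NOT rows from the audit.
records = [
    (False, True),
    (False, True),
    (True, True),
    (False, False),
]

counts = Counter(quadrant(auto, human) for auto, human in records)
print(counts["metric-blind"], counts["caught"], counts["clean"])
```

In these terms, the audit's finding is that every Accuracy-class error landed in the first branch.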

Takeaway

This is a small audit of one model on one content set, so the numbers are directional. But the pattern is clear: automated metrics alone miss the majority of real translation issues, especially accuracy errors. For production subtitle work, human review remains essential.

📖 Read the full source: r/LocalLLaMA
