TranslateGemma-12b: Human Review Flags 71% of Translations That Automated Metrics Passed

✍️ OpenClawRadar | 📅 Published: May 12, 2026

A follow-up audit of TranslateGemma-12b subtitle translations reveals that automated metrics significantly underestimate real-world errors. The original benchmark showed the model beating frontier general models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) across 6 languages. To verify, the team added human review.

Setup

  • 21 English subtitle segments from one tutorial video
  • TranslateGemma-12b translated into 4 languages: ES, JA, TH, ZH-CN (Korean and Traditional Chinese dropped)
  • 84 translations total, preselected as scoring well on automated metrics
  • Every translation sent to human MQM review

Results

Under the dashboard's own red-flag threshold (MX ≥ 5 OR CK < 0.70):

  • Auto-flagged: 1/84 (1.2%)
  • Human-flagged (any): 60/84 (71%)
  • Human-flagged (Major): 13/84 (15%)
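The red-flag rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name and the example scores are hypothetical, MX is assumed to be a higher-is-worse metric, and CK is assumed to be the COMETKiwi score mentioned in the per-language results. Only the two thresholds come from the article.

```python
# Thresholds as stated in the article: flag when MX >= 5 OR CK < 0.70.
MX_FLAG_AT_OR_ABOVE = 5.0
CK_FLAG_BELOW = 0.70


def is_auto_flagged(mx: float, ck: float) -> bool:
    """Return True when either metric crosses its red-flag threshold.

    Hypothetical helper: the dashboard's real implementation is not shown
    in the source, only the rule itself.
    """
    return mx >= MX_FLAG_AT_OR_ABOVE or ck < CK_FLAG_BELOW


# Hypothetical (mx, ck) pairs, invented for illustration only.
examples = {
    "fluent, high CK": (2.0, 0.86),  # passes both checks
    "MX at threshold": (5.0, 0.90),  # flagged by MX
    "low CK": (1.0, 0.69),           # flagged by CK
}

for name, (mx, ck) in examples.items():
    print(f"{name}: flagged={is_auto_flagged(mx, ck)}")
```

Because the rule is an OR over two score thresholds, a translation that reads fluently and scores well on both metrics passes regardless of whether its meaning is correct.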

Per language:

  • ES: 0/21 auto, 11/21 human-flagged, 2/21 Major. Mostly tone inconsistencies (formal/informal switches); the easiest of the four languages.
  • JA: 0/21 auto, 17/21 human-flagged, 3/21 Major. Shows a “fluent but wrong meaning” pattern and accounts for 10 of the 15 mistranslations in the dataset. A high mean COMETKiwi score (0.86) masked the errors. The same failure mode was seen in Claude Sonnet 4.6 on JA.
  • TH: 0/21 auto, 17/21 human-flagged, 5/21 Major. Over-production: 5 Accuracy/Addition errors (content inserted that is not in the source), plus punctuation errors from English-style periods.
  • ZH-CN: 1/21 auto (a Style error), 15/21 human-flagged, 3/21 Major. Issues include an omission of “store” that changed the meaning and inconsistent translation of “ticket” across segments.
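As a quick consistency check, the per-language counts can be summed and compared with the headline figures. The sketch below is illustrative: the dictionary layout and function names are assumptions, and only the counts themselves come from the article.

```python
# Per-language counts copied from the list above (21 segments per language).
SEGMENTS_PER_LANGUAGE = 21

counts = {
    "ES":    {"auto": 0, "human": 11, "major": 2},
    "JA":    {"auto": 0, "human": 17, "major": 3},
    "TH":    {"auto": 0, "human": 17, "major": 5},
    "ZH-CN": {"auto": 1, "human": 15, "major": 3},
}


def sum_counts(per_language: dict) -> dict:
    """Add up each flag category across all languages."""
    totals = {"auto": 0, "human": 0, "major": 0}
    for row in per_language.values():
        for key in totals:
            totals[key] += row[key]
    return totals


def percent(part: int, whole: int) -> float:
    """Share of `whole` expressed as a percentage, rounded to one decimal."""
    return round(100 * part / whole, 1)


total_translations = SEGMENTS_PER_LANGUAGE * len(counts)  # 84
totals = sum_counts(counts)

for key, value in totals.items():
    print(f"{key}: {value}/{total_translations} "
          f"({percent(value, total_translations)}%)")
```

The sums reproduce the aggregate results: 1, 60, and 13 flagged translations out of 84.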

Of the 25 Accuracy-class errors (mistranslation, omission, addition, untranslated), all fell in the metric-blind quadrant: human reviewers flagged them, but the automated threshold did not. The metrics caught zero accuracy errors.

Takeaway

This is a small audit of one model on one content set, so the numbers are directional. But the pattern is clear: in this sample, automated metrics alone missed the majority of real translation issues, and every accuracy error. For production subtitle work, human review remains essential.

📖 Read the full source: r/LocalLLaMA
