TranslateGemma-12b: Human Review Flags 71% of Translations, Automated Metrics Flag 1.2%

A follow-up audit of TranslateGemma-12b subtitle translations reveals that automated metrics significantly underestimate real-world errors. The original benchmark showed the model beating frontier general models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) across 6 languages. To verify, the team added human review.
Setup
- 21 English subtitle segments from one tutorial video
- TranslateGemma-12b translated into 4 languages: ES, JA, TH, ZH-CN (Korean and Traditional Chinese dropped)
- 84 translations total, preselected as scoring well on automated metrics
- Every translation sent to human MQM review
Results
Under the dashboard's own red-flag threshold (MX ≥ 5 OR CK < 0.70, with MX read as the MetricX score and CK as the COMETKiwi score):
- Auto-flagged: 1/84 (1.2%)
- Human-flagged (any): 60/84 (71%)
- Human-flagged (Major): 13/84 (15%)
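The red-flag rule above can be sketched as a simple predicate. This is a minimal illustration, not the dashboard's actual code: the function name `is_auto_flagged` and its parameter names are hypothetical, and it assumes MX is an error-style score (higher is worse) and CK a quality-style score (higher is better), which is the reading the threshold directions suggest.

```python
# Hypothetical sketch of the red-flag rule quoted above (MX >= 5 OR CK < 0.70).
# Assumption: MX is an error-style score (higher is worse) and CK is a
# quality-style score (higher is better). Names here are illustrative only.

MX_THRESHOLD = 5.0   # flag when MX is at or above this value
CK_THRESHOLD = 0.70  # flag when CK is strictly below this value


def is_auto_flagged(mx: float, ck: float) -> bool:
    """Return True if either metric crosses its red-flag threshold."""
    return mx >= MX_THRESHOLD or ck < CK_THRESHOLD


if __name__ == "__main__":
    # Illustrative scores only, not values taken from the audit.
    print(is_auto_flagged(mx=2.1, ck=0.86))  # neither threshold crossed
    print(is_auto_flagged(mx=5.0, ck=0.86))  # MX exactly at its threshold
    print(is_auto_flagged(mx=2.1, ck=0.65))  # CK below its threshold
```

Because the rule looks only at the two scores, any translation with a low MX and a high CK passes it, whatever a human reviewer might later find.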
Per language:
- ES: 0/21 auto, 11/21 human-flagged, 2/21 Major. Mostly tone inconsistencies (formal/informal switches); the easiest of the four languages.
- JA: 0/21 auto, 17/21 human-flagged, 3/21 Major. Shows a “fluent but wrong meaning” pattern and accounts for 10 of the 15 mistranslations in the dataset. A high COMETKiwi mean (0.86) masked the errors. The same failure mode was seen in Claude Sonnet 4.6 on JA.
- TH: 0/21 auto, 17/21 human-flagged, 5/21 Major. Over-production: 5 Accuracy/Addition errors (content inserted that is not in the source), plus punctuation errors from English-style periods.
- ZH-CN: 1/21 auto (a Style error), 15/21 human-flagged, 3/21 Major. Issues include an omission of “store” that changed the meaning and inconsistent translation of “ticket” across segments.
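As a quick consistency check, the per-language counts above can be summed to reproduce the headline totals. The sketch below only re-adds the figures already reported; the names `COUNTS`, `totals`, and `rate` are illustrative.

```python
# Per-language counts as reported above:
# (auto-flagged, human-flagged, human-flagged Major), each out of 21 segments.
COUNTS = {
    "ES": (0, 11, 2),
    "JA": (0, 17, 3),
    "TH": (0, 17, 5),
    "ZH-CN": (1, 15, 3),
}
SEGMENTS_PER_LANGUAGE = 21


def totals(counts):
    """Sum each column (auto, human, major) across all languages."""
    auto = sum(c[0] for c in counts.values())
    human = sum(c[1] for c in counts.values())
    major = sum(c[2] for c in counts.values())
    return auto, human, major


def rate(n, total):
    """Share of flagged translations as a percentage."""
    return 100.0 * n / total


total_translations = SEGMENTS_PER_LANGUAGE * len(COUNTS)
auto, human, major = totals(COUNTS)

for label, n in (("auto", auto), ("human", human), ("major", major)):
    print(f"{label}: {n}/{total_translations} ({rate(n, total_translations):.1f}%)")
```

The sums come out to 1, 60, and 13 out of 84, matching the overall figures.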
Of the 25 Accuracy-class errors (mistranslation, omission, addition, untranslated), all fell in the metric-blind quadrant: human reviewers flagged them, but the automated thresholds did not. The metrics caught zero accuracy errors.
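One way to picture the quadrant framing is a two-by-two split on whether the metrics flagged an item and whether a human did. The sketch below is an assumption about how such a split could be coded, not the audit's tooling: only the label "metric-blind" comes from the text, and the other three labels and all function names are hypothetical.

```python
from collections import Counter


def quadrant(auto_flagged: bool, human_flagged: bool) -> str:
    """Place one item in a 2x2 grid of automated vs. human flags."""
    if human_flagged and not auto_flagged:
        return "metric-blind"    # humans saw a problem, metrics did not
    if human_flagged and auto_flagged:
        return "caught-by-both"  # hypothetical label
    if auto_flagged:
        return "metric-only"     # hypothetical label
    return "clean"               # hypothetical label


def tally(pairs):
    """Count items per quadrant from (auto_flagged, human_flagged) pairs."""
    return Counter(quadrant(a, h) for a, h in pairs)


# The 25 accuracy-class errors as described: each was found by human review
# and none was auto-flagged.
accuracy_errors = [(False, True)] * 25
counts = tally(accuracy_errors)
print(counts["metric-blind"], counts["caught-by-both"])
```

A `Counter` returns 0 for quadrants with no items, so the empty "caught-by-both" cell shows up without special handling.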
Takeaway
This is a small audit of one model on one content set, so the numbers are directional. But the pattern is clear: automated metrics alone miss the majority of real translation issues, especially accuracy errors. For production subtitle work, human review remains essential.
📖 Read the full source: r/LocalLLaMA