Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann, Bernhard Stadler, Michael Färber

TL;DR
This study presents an automated, scalable approach to assess and improve the translation quality of benchmark datasets like EU20, combining structural audits, neural metrics, and LLM-based error analysis.
Contribution
The paper introduces a three-step automated quality assurance method for translated benchmarks: a structural corpus audit with targeted fixes, COMET-based quality profiling across translation services, and LLM-based span-level error analysis.
Findings
Datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level.
HellaSwag is the most affected dataset; ARC is comparatively clean.
Corrected datasets and code are publicly released.
Abstract
Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean).…
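To make step (i) concrete, the following is a minimal sketch of what a structural audit of a translated multiple-choice item could check. It is not the authors' implementation: the `Item` schema, the function names, and the specific checks are all assumptions chosen for illustration, and the paper's actual audit may cover different or additional properties.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice benchmark item (hypothetical schema)."""
    question: str
    choices: list[str]
    answer_index: int


def audit_pair(source: Item, translated: Item) -> list[str]:
    """Return the structural problems found in a translated item.

    The checks are illustrative assumptions, not the paper's list.
    """
    problems = []
    if not translated.question.strip():
        problems.append("empty_question")
    if len(translated.choices) != len(source.choices):
        problems.append("choice_count_mismatch")
    if any(not choice.strip() for choice in translated.choices):
        problems.append("empty_choice")
    if translated.answer_index != source.answer_index:
        problems.append("answer_index_changed")
    if not 0 <= translated.answer_index < len(translated.choices):
        problems.append("answer_index_out_of_range")
    return problems


def audit_corpus(pairs):
    """Map item position to its problems, keeping only flagged items."""
    report = {}
    for position, (source, translated) in enumerate(pairs):
        problems = audit_pair(source, translated)
        if problems:
            report[position] = problems
    return report
```

Running `audit_corpus` over aligned (source, translation) pairs yields a per-item report that could drive targeted fixes or retranslation. Only purely structural properties are checked here; translation quality itself is the subject of steps (ii) and (iii).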
