The Effect of Translationese in Machine Translation Test Sets
Mike Zhang, Antonio Toral

TL;DR
This paper investigates how translationese in machine translation test sets influences human evaluation scores and system rankings. It finds that translationese inflates scores, changes system rankings in some cases, and has a larger impact on translation directions where the attainable translation quality is lower.
Contribution
It provides an in-depth analysis of the effect of translationese in test data, covering 17 translation directions from the last three editions of WMT's news shared task, and shows its influence on human evaluation scores and system rankings.
Findings
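Translationese in test sets inflates human evaluation scores for MT systems.
In some cases, system rankings change because of translationese.
The impact of translationese on a translation direction is inversely correlated with the translation quality attainable by state-of-the-art MT systems for that direction.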
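The first two findings can be illustrated with a minimal sketch. It assumes each human judgment is tagged with whether its source sentence was originally written in the source language or is itself a translation (translationese). The system names, the scores, and the helper functions `mean_scores` and `ranking` are hypothetical toy values chosen for illustration; they are not results or code from the paper, and the paper's actual procedure may differ.

```python
from statistics import mean

# Hypothetical toy judgments: (system, source_is_original, human_score).
# source_is_original=True  -> source sentence was written in the source language
# source_is_original=False -> source sentence is itself a translation (translationese)
judgments = [
    ("sysA", True, 70.0), ("sysA", True, 74.0),
    ("sysA", False, 80.0), ("sysA", False, 84.0),
    ("sysB", True, 73.0), ("sysB", True, 75.0),
    ("sysB", False, 76.0), ("sysB", False, 78.0),
]

def mean_scores(judgments, keep):
    """Average score per system over the judgments selected by keep(is_original)."""
    by_system = {}
    for system, is_original, score in judgments:
        if keep(is_original):
            by_system.setdefault(system, []).append(score)
    return {system: mean(scores) for system, scores in by_system.items()}

def ranking(scores):
    """System names ordered from highest to lowest mean score."""
    return sorted(scores, key=scores.get, reverse=True)

# Scores on the full test set versus the original-source subset only.
full = mean_scores(judgments, lambda is_original: True)
original_only = mean_scores(judgments, lambda is_original: is_original)

# How much including translationese input raises each system's score.
inflation = {system: full[system] - original_only[system] for system in full}

print("full test set:", full, ranking(full))
print("original only:", original_only, ranking(original_only))
print("inflation:", inflation)
```

In this toy data, both systems score higher on the full test set than on the original-source subset, and the two systems swap places once the translationese portion is removed. Real evaluations would also need significance testing before concluding that a ranking has changed.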
Abstract
The effect of translationese has been studied in the field of machine translation (MT), mostly with respect to training data. We study in depth the effect of translationese on test data, using the test sets from the last three editions of WMT's news shared task, containing 17 translation directions. We show evidence that (i) the use of translationese in test sets results in inflated human evaluation scores for MT systems; (ii) in some cases system rankings do change; and (iii) the impact translationese has on a translation direction is inversely correlated with the translation quality attainable by state-of-the-art MT systems for that direction.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
