Translationese in Machine Translation Evaluation
Yvette Graham, Barry Haddow, Philipp Koehn

TL;DR
This paper examines how translationese affects machine translation evaluation accuracy, highlights issues with past assessments, and offers guidelines and statistical analyses to improve future evaluation reliability.
Contribution
It provides a detailed analysis of translationese effects, re-evaluates past human-parity claims, and offers a comprehensive checklist for more reliable future MT evaluations.
Findings
Translationese can bias MT evaluation results.
Past human-parity claims may be unreliable due to statistical issues, including the statistical power of the evaluations.
A checklist is proposed to improve future MT evaluation practices.
Abstract
The term translationese has been used to describe the presence of unusual features in translated text. In this paper, we provide a detailed analysis of the adverse effects of translationese on machine translation evaluation results. Our analysis provides evidence of differences between text originally written in a given language and translated text, and these differences can negatively affect the accuracy of machine translation evaluations. For this reason we recommend that reverse-created test data be omitted from future machine translation test sets. In addition, we provide a re-evaluation of a past high-profile machine translation evaluation claiming human-parity of MT, as well as an analysis of the subsequent re-evaluations of it. We find potential ways of improving the reliability of all three past evaluations. One important issue not previously considered is the statistical power of…
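The recommendation to omit reverse-created test data can be illustrated with a minimal sketch. It assumes each test segment carries a label for the language in which the text was originally written; the `Segment` class, its `orig_lang` field, and the `forward_only` helper are hypothetical names introduced here for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    source: str      # text given to the MT system
    reference: str   # human reference translation
    orig_lang: str   # language the text was originally written in (assumed label)


def forward_only(segments, source_lang):
    """Keep only segments whose source side was originally written in source_lang.

    Segments with any other orig_lang are treated as reverse-created: their
    source side is itself a translation and may contain translationese.
    """
    return [s for s in segments if s.orig_lang == source_lang]


# Toy German-to-English test set (illustrative data only).
test_set = [
    Segment("Das ist ein Test.", "This is a test.", "de"),
    # Originally written in English, then translated into German: reverse-created.
    Segment("Das Wetter ist schön.", "The weather is nice.", "en"),
]

kept = forward_only(test_set, source_lang="de")
```

In this sketch only the first segment is kept, since its source side is original German rather than a translation from English.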
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
