Reassessing Claims of Human Parity and Super-Human Performance in Machine Translation at WMT 2019
Antonio Toral

TL;DR
This paper reassesses the claims of human parity and super-human performance in machine translation made at WMT 2019, and finds that all but one (human parity for English-to-German) should be refuted once limitations of the original human evaluation are taken into account.
Contribution
The study identifies key issues in previous human evaluation methods and provides a revised assessment that challenges earlier claims of human parity and super-human translation performance.
Findings
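All claims of human parity and super-human performance made at WMT 2019 are refuted, except the claim of human parity for English-to-German.
The original human evaluation has three potential issues: limited intersentential context, limited translation proficiency of the evaluators, and the use of a reference translation.
These conclusions come from a modified evaluation designed to account for the three issues; the paper also offers recommendations and open questions for future assessments of human parity.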
Abstract
We reassess the claims of human parity and super-human performance made at the news shared task of WMT 2019 for three translation directions: English-to-German, English-to-Russian and German-to-English. First we identify three potential issues in the human evaluation of that shared task: (i) the limited amount of intersentential context available, (ii) the limited translation proficiency of the evaluators and (iii) the use of a reference translation. We then conduct a modified evaluation taking these issues into account. Our results indicate that all the claims of human parity and super-human performance made at WMT 2019 should be refuted, except the claim of human parity for English-to-German. Based on our findings, we put forward a set of recommendations and open questions for future assessments of human parity in machine translation.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection
