TL;DR
This study introduces an error-based human evaluation methodology for machine translation, built on the MQM framework. Its system rankings differ substantially from those of WMT crowd workers, and it shows that automatic embedding-based metrics can outperform crowd-worker evaluations.
Contribution
The paper presents the largest MQM-based human evaluation study for machine translation to date, covering top WMT 2020 systems in two language pairs, and highlights the importance of document context and explicit error analysis in evaluation procedures.
Findings
Professional translators using MQM rank the systems substantially differently from WMT crowd workers
Automatic embedding-based metrics can outperform human crowd-worker evaluations
The annotated corpus is publicly available for future research
Abstract
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers,…
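To make the idea of error-based scoring concrete, here is a minimal sketch of how span-level error annotations might be aggregated into per-system scores and a ranking. The severity weights, category labels, system names, and function names are all hypothetical and chosen only for illustration; they are not taken from the paper, which should be consulted for the actual scoring scheme.

```python
# Hypothetical severity weights, chosen only for illustration.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}


def segment_penalty(errors):
    """Sum the weights of all errors annotated in one segment."""
    return sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)


def rank_systems(annotations):
    """Rank systems by mean penalty per segment (lower is better)."""
    means = {
        system: sum(segment_penalty(errors) for errors in segments) / len(segments)
        for system, segments in annotations.items()
    }
    return sorted(means.items(), key=lambda item: item[1])


# Toy data: each system maps to a list of segments, and each segment
# holds a list of (category, severity) error annotations.
toy_annotations = {
    "system_a": [[("accuracy/mistranslation", "major")], []],
    "system_b": [[("fluency/grammar", "minor")], [("style/awkward", "minor")]],
}

ranking = rank_systems(toy_annotations)
for system, mean_penalty in ranking:
    print(f"{system}: {mean_penalty:.2f}")
```

In this toy setup a single major error outweighs two minor ones, so system_b ranks ahead of system_a even though it has more annotated errors. This is the kind of distinction that a single holistic rating per segment cannot expose.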
