Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

Markus Freitag; George Foster; David Grangier; Viresh Ratnakar; Qijun Tan; Wolfgang Macherey

arXiv:2104.14478·cs.CL·April 27, 2022

TL;DR

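This study introduces an error-based human evaluation methodology for machine translation built on the Multidimensional Quality Metrics (MQM) framework. Professional translators using it rank systems substantially differently from WMT crowd workers, and some automatic embedding-based metrics outperform the crowd-worker evaluations.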

Contribution

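The paper presents the largest MQM-based human evaluation study for machine translation to date, covering top systems from the WMT 2020 shared task in two language pairs, and highlights the importance of document context and explicit error analysis in evaluation procedures.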

Findings

01 Different system rankings from crowd workers and professional translators

02 Automatic embedding-based metrics can outperform human crowd evaluations

03 Publicly available corpus for future research

Abstract

Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers,…
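To make the idea of an evaluation "grounded in explicit error analysis" concrete, here is a minimal Python sketch of how annotated errors might be turned into system-level scores and a ranking. The severity weights, category names, system names, and function names are illustrative assumptions; none of them come from the text above, and the paper's actual weighting scheme may differ.

```python
# Hypothetical severity weights, chosen only for illustration.
# They are NOT taken from the abstract above.
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}


def segment_score(errors):
    """Sum the weights of all errors annotated in one segment (lower is better)."""
    return sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)


def system_scores(annotations):
    """Average the segment scores for each system.

    annotations maps a system name to a non-empty list of segments;
    each segment is a list of (category, severity) tuples.
    """
    return {
        system: sum(segment_score(seg) for seg in segments) / len(segments)
        for system, segments in annotations.items()
    }


def rank_systems(annotations):
    """Return system names ordered from best (lowest score) to worst."""
    scores = system_scores(annotations)
    return sorted(scores, key=scores.get)


# Toy annotations with made-up systems and error labels.
annotations = {
    "system_a": [
        [("accuracy/mistranslation", "major")],
        [],
        [("fluency/grammar", "minor")],
    ],
    "system_b": [
        [("fluency/punctuation", "minor")],
        [],
        [],
    ],
}

print(system_scores(annotations))
print(rank_systems(annotations))
```

Because every score is derived from explicit error annotations, two rankings that disagree can be traced back to the specific errors behind each score, which a single holistic rating per segment does not allow.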
