Enhancing Human Evaluation in Machine Translation with Comparative Judgment
Yixiao Song, Parker Riley, Daniel Deutsch, Markus Freitag

TL;DR
This paper investigates comparative judgment as a way to improve the consistency and efficiency of human evaluation in machine translation. It shows that side-by-side (pairwise) annotation yields higher inter-annotator agreement and more consistent error marking than traditional point-wise annotation.
Contribution
The paper evaluates three annotation setups: point-wise MQM, side-by-side (SxS) MQM, and SxS relative ranking (RR). It shows that pairwise comparative judgments improve inter-annotator agreement and error marking consistency in MT evaluation.
Findings
SxS settings achieve higher inter-annotator agreement than MQM
SxS MQM improves inter-translation error marking consistency over MQM by 38.5% on average for explicitly compared MT systems
SxS RR provides a more efficient evaluation alternative
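As an illustration of what "inter-annotator agreement" can mean for side-by-side preference labels, the sketch below computes Cohen's kappa for two annotators who each pick the better of two translations per segment. This is a minimal sketch under stated assumptions: this summary does not say which agreement statistic the paper uses, so Cohen's kappa is a commonly used stand-in, and the label values ("A", "B", "tie") and the example data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: one categorical label per item per annotator, e.g. which
    of two translations is better. Not necessarily the paper's metric.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently at their own marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical preference labels for six source segments.
annotator_1 = ["A", "A", "B", "tie", "B", "A"]
annotator_2 = ["A", "B", "B", "tie", "B", "A"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates. With more than two annotators, a multi-rater statistic such as Krippendorff's alpha would be a more natural choice.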
Abstract
Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version, SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation for two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. Key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) SxS MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for…
Taxonomy
Topics: Natural Language Processing Techniques
