Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Nitika Mathur; Timothy Baldwin; Trevor Cohn

arXiv:2006.06264·cs.CL·June 15, 2020

TL;DR

This paper reexamines how automatic machine translation metrics are themselves evaluated. It shows that current methods for judging metrics are highly sensitive to outliers in the translations used for assessment, and it proposes a pairwise system ranking method that thresholds metric improvements against human judgements.

Contribution

It highlights the limitations of current evaluation protocols and introduces a pairwise ranking approach to improve the reliability of automatic metric assessments.

Findings

01

Current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers.

02

This sensitivity often leads to falsely confident conclusions about a metric's efficacy.

03

A pairwise system ranking method thresholds performance improvement under a metric against human judgements, quantifying the type I and type II errors incurred.
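
The outlier sensitivity named in the findings can be illustrated with a small sketch. It assumes metrics are judged by Pearson correlation between system-level metric scores and human scores; all numbers below are invented for illustration and are not taken from the paper. One system that is far worse than the rest can make the correlation look near-perfect even when the metric fails to rank the remaining, closely matched systems.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical system-level scores: five competitive systems plus one
# very weak system (the last entry), which acts as an outlier.
human_scores = [0.10, 0.12, 0.11, 0.13, 0.12, -1.50]
metric_scores = [30.0, 29.0, 31.0, 29.5, 30.5, 10.0]

r_all = pearson(human_scores, metric_scores)
# Drop the outlier by hand; how outliers are detected is left open here.
r_without_outlier = pearson(human_scores[:-1], metric_scores[:-1])

print(f"with outlier:    r = {r_all:.3f}")
print(f"without outlier: r = {r_without_outlier:.3f}")
```

With these made-up numbers the correlation is strongly positive when the weak system is included and negative once it is removed, which is the kind of falsely confident conclusion the findings describe.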

Abstract

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric…
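
The type I and type II error definitions in the abstract can be made concrete with a minimal sketch. It assumes each system pair is summarised by a metric score difference and a flag saying whether human evaluation found a significant difference. The function name `count_errors`, the data, and the candidate thresholds are hypothetical, and the authors' actual procedure may differ.

```python
from typing import List, Tuple

# Each pair: (metric score difference, human difference is significant?)
Pair = Tuple[float, bool]

def count_errors(pairs: List[Pair], threshold: float) -> Tuple[int, int]:
    """Count errors when accepting pairs whose metric difference >= threshold.

    Type I:  accepted by the metric, but the human difference is insignificant.
    Type II: rejected by the metric, but the human difference is significant.
    """
    type_1 = sum(1 for delta, significant in pairs
                 if delta >= threshold and not significant)
    type_2 = sum(1 for delta, significant in pairs
                 if delta < threshold and significant)
    return type_1, type_2

# Invented system pairs, for illustration only.
pairs = [(0.3, False), (0.8, False), (1.2, True), (2.5, True),
         (0.6, True), (3.1, True), (1.5, False)]

for threshold in (0.5, 1.0, 2.0):
    type_1, type_2 = count_errors(pairs, threshold)
    print(f"threshold={threshold}: type I={type_1}, type II={type_2}")
```

On this toy data, raising the threshold lowers the number of type I errors and raises the number of type II errors, so choosing a threshold means choosing which kind of error to tolerate.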

Code & Models

Repositories

nitikam/tangled (Official)
