To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Tom Kocmi; Christian Federmann; Roman Grundkiewicz; Marcin Junczys-Dowmunt; Hitokazu Matsushita; Arul Menezes

arXiv:2107.10821·cs.CL·September 15, 2021·82 cites

TL;DR

This paper evaluates the reliability of automatic metrics for machine translation by comparing them against a large set of human judgments, revealing their strengths and limitations across languages and domains.

Contribution

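It evaluates automatic metrics on what the authors describe as, to the best of their knowledge, the largest collection of human judgments reported in the literature, and systematically assesses how accurately each metric predicts human rankings of system pairs.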

Findings

01. Metrics vary in accuracy across language pairs.

02. BLEU's exclusive use hindered model development.

03. Some metrics closely align with human judgments.

Abstract

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community's choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics' correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on, to the best of our knowledge, the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the…
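The pairwise-ranking accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading: for every pair of systems, a metric counts as correct when the sign of its score difference matches the sign of the human score difference. The function name `pairwise_accuracy`, the system names, the scores, and the choice to skip human ties are all hypothetical and made up for illustration; none of them come from the paper.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs that the metric ranks the same way as humans.

    `human` and `metric` map system names to system-level scores,
    with higher meaning better in both cases.
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(human), 2):
        human_delta = human[a] - human[b]
        metric_delta = metric[a] - metric[b]
        if human_delta == 0:
            continue  # assumption: pairs tied under human judgement are skipped
        total += 1
        if human_delta * metric_delta > 0:
            agree += 1  # metric and humans prefer the same system
    return agree / total if total else float("nan")


# Made-up scores for three hypothetical systems.
human = {"sysA": 0.30, "sysB": 0.10, "sysC": -0.20}
metric = {"sysA": 31.0, "sysB": 32.5, "sysC": 27.0}

accuracy = pairwise_accuracy(human, metric)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy data the metric disagrees with humans only on the sysA/sysB pair, so it agrees on two of the three pairs. A metric that is flat across a pair (zero difference) is counted as wrong here, which is a design choice of the sketch rather than something stated in the abstract.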

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research