To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, Arul Menezes

TL;DR
This paper evaluates the reliability of automatic metrics for machine translation by comparing them against a large set of human judgments, revealing their strengths and limitations across languages and domains.
Contribution
It draws on what the authors describe as, to the best of their knowledge, the largest collection of human judgments reported in the literature, and systematically assesses how accurately automatic metrics predict human rankings of system pairs.
Findings
Metrics vary in accuracy across language pairs.
BLEU's exclusive use hindered model development.
Some metrics agree closely with human judgments when ranking pairs of systems.
Abstract
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the…
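The pairwise ranking accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes one reading of the idea: a metric "agrees" with humans on a system pair when its score difference has the same sign as the human score difference, and accuracy is the share of pairs with agreement. The function name, the handling of ties, and all scores below are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Share of system pairs that the metric orders the same way as humans.

    `human` and `metric` map a system name to a system-level score,
    where a higher score means a better system.
    """
    systems = sorted(set(human) & set(metric))
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        human_delta = human[a] - human[b]
        metric_delta = metric[a] - metric[b]
        if human_delta == 0:
            # Assumption: pairs with no human preference are skipped.
            continue
        total += 1
        # Same sign means the metric picks the same winner as humans.
        # A metric tie (delta of zero) counts as a disagreement here.
        if human_delta * metric_delta > 0:
            agree += 1
    if total == 0:
        raise ValueError("no comparable system pairs")
    return agree / total


# Hypothetical scores, invented for illustration only.
human_scores = {"sys_a": 78.0, "sys_b": 74.5, "sys_c": 70.1}
metric_scores = {"sys_a": 31.2, "sys_b": 32.0, "sys_c": 28.4}

accuracy = pairwise_accuracy(human_scores, metric_scores)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy example the metric disagrees with humans on one of the three pairs (sys_a vs. sys_b) and agrees on the other two. A real evaluation would also need to decide how to treat differences that are not statistically significant, which this sketch ignores.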
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
