Can Automatic Metrics Assess High-Quality Translations?
Sweta Agrawal, António Farinhas, Ricardo Rei, André F.T. Martins

TL;DR
This paper critically examines the limitations of current automatic translation evaluation metrics, revealing their insensitivity to nuanced quality differences and proposing a focus on detecting high-quality translations aligned with human judgments.
Contribution
It demonstrates that existing metrics poorly distinguish subtle quality differences and emphasizes the need to improve automatic evaluation for high-quality translation detection.
Findings
Current metrics are insensitive to nuanced translation quality differences.
Metrics often over- or underestimate translation quality.
High-quality translation detection remains a significant challenge.
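To make the detection finding concrete, here is a minimal sketch of how a metric could be scored as a binary detector of high-quality translations. It is an illustration under stated assumptions, not the paper's protocol: the function name `detection_scores`, the threshold of 0.90, the toy scores, and the gold labels are all hypothetical, and the gold labels merely stand in for a human judgment of "high quality".

```python
def detection_scores(metric_scores, is_high_quality, threshold):
    """Precision, recall and F1 of a thresholded metric against gold labels.

    metric_scores   : list of floats, higher means better (assumed scale)
    is_high_quality : list of bools, hypothetical human gold labels
    threshold       : metric score at or above which we predict "high quality"
    """
    predicted = [score >= threshold for score in metric_scores]

    # Count agreement and disagreement with the gold labels.
    tp = sum(p and g for p, g in zip(predicted, is_high_quality))
    fp = sum(p and not g for p, g in zip(predicted, is_high_quality))
    fn = sum((not p) and g for p, g in zip(predicted, is_high_quality))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy data (hypothetical): five translations with metric scores and gold labels.
toy_metric = [0.95, 0.91, 0.89, 0.70, 0.93]
toy_gold = [True, False, True, False, True]

precision, recall, f1 = detection_scores(toy_metric, toy_gold, threshold=0.90)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the metric overestimates one translation (a false positive) and underestimates another (a false negative), which is the kind of error the findings above describe.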
Abstract
Automatic metrics for evaluating translation quality are typically validated by measuring how well they correlate with human assessments. However, correlation methods tend to capture only the ability of metrics to differentiate between good and bad source-translation pairs, overlooking their reliability in distinguishing alternative translations for the same source. In this paper, we confirm that this is indeed the case by showing that current metrics are insensitive to nuanced differences in translation quality. This effect is most pronounced when the quality is high and the variance among alternatives is low. Given this finding, we shift towards detecting high-quality correct translations, an important problem in practical decision-making scenarios where a binary check of correctness is prioritized over a nuanced evaluation of quality. Using the MQM framework as the gold standard, we…
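The distinction the abstract draws, between correlating over all source-translation pairs and distinguishing alternatives for the same source, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record layout `(source_id, metric_score, human_score)`, the helper names, and the toy numbers (where a higher human score is taken to mean a better translation). It is not the authors' evaluation code.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either side has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def global_correlation(records):
    """Correlate metric and human scores over all pairs, ignoring the source."""
    return pearson([m for _, m, _ in records], [h for _, _, h in records])


def per_source_correlation(records):
    """Average the correlation computed separately within each source."""
    groups = defaultdict(list)
    for source_id, metric, human in records:
        groups[source_id].append((metric, human))
    values = []
    for items in groups.values():
        r = pearson([m for m, _ in items], [h for _, h in items])
        if r is not None:  # skip groups where scores do not vary
            values.append(r)
    return sum(values) / len(values) if values else None


# Toy data (hypothetical): source "A" has uniformly good translations,
# source "B" uniformly poor ones. Within each source the metric ranks
# the alternatives in the opposite order to the human scores.
records = [
    ("A", 0.90, 95), ("A", 0.88, 98), ("A", 0.92, 92),
    ("B", 0.30, 40), ("B", 0.28, 43), ("B", 0.32, 37),
]

print("global:", global_correlation(records))          # strongly positive
print("per-source:", per_source_correlation(records))  # negative
```

The global figure looks strong because the metric separates the easy case (good versus bad pairs), while the per-source figure shows it failing to rank alternatives for the same source, which is the gap the abstract points to.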
Taxonomy
Topics: Natural Language Processing Techniques
