TL;DR
This paper introduces an uncertainty-aware approach to machine translation evaluation that provides confidence intervals for quality scores, improving trustworthiness and usefulness in flagging critical translation errors.
Contribution
It combines the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to produce MT quality scores with confidence intervals rather than bare point estimates.
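To make the idea concrete, here is a minimal sketch of how Monte Carlo dropout can turn a point estimate into a score with an interval. It does not use COMET itself: the toy linear scorer, the function names (`stochastic_score`, `mc_dropout_interval`), the dropout rate, the number of passes, and the Gaussian-style interval are all illustrative assumptions, not the paper's implementation. The same aggregation step could be applied to scores from independently trained models in a deep ensemble.

```python
import random
import statistics


def stochastic_score(features, weights, drop_prob, rng):
    """One forward pass of a toy linear scorer with dropout kept active.

    Hypothetical stand-in for a neural metric; not the COMET model.
    """
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            # Inverted dropout: rescale kept inputs so the expected score is unchanged.
            total += (x / keep) * w
    return total


def mc_dropout_interval(features, weights, drop_prob=0.1, passes=100, z=1.96, seed=0):
    """Run several stochastic passes and summarize them as mean, std, and interval.

    The interval mean +/- z * std assumes the scores are roughly Gaussian.
    """
    rng = random.Random(seed)
    scores = [stochastic_score(features, weights, drop_prob, rng) for _ in range(passes)]
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    return mean, std, (mean - z * std, mean + z * std)


if __name__ == "__main__":
    mean, std, (low, high) = mc_dropout_interval([0.2, 0.5, 0.3], [1.0, 0.5, -0.2])
    print(f"score={mean:.3f} std={std:.3f} interval=({low:.3f}, {high:.3f})")
```

A wider interval signals that the stochastic passes disagree, which is the information a single point estimate cannot convey.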
Findings
Uncertainty-aware metrics outperform point estimates in reliability.
Confidence intervals help identify potentially critical translation errors.
The method is effective across multiple language pairs and datasets.
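One way a predicted score and its uncertainty could be used to flag potentially critical translations is sketched below. This is an illustration under stated assumptions, not the paper's procedure: it treats each prediction as a Gaussian with a given mean and standard deviation, and the names `prob_below` and `flag_critical`, the quality threshold, and the probability cutoff are all hypothetical.

```python
from statistics import NormalDist


def prob_below(mean, std, threshold):
    """Probability that the true quality is below `threshold`,
    assuming a Gaussian predictive distribution."""
    if std <= 0.0:
        # Degenerate case: no spread, so the answer is all-or-nothing.
        return 1.0 if mean < threshold else 0.0
    return NormalDist(mu=mean, sigma=std).cdf(threshold)


def flag_critical(predictions, threshold, min_prob=0.5):
    """Return indices of segments whose chance of falling below the
    quality threshold is at least `min_prob`.

    `predictions` is a list of (mean, std) pairs, one per segment.
    """
    return [
        i
        for i, (mean, std) in enumerate(predictions)
        if prob_below(mean, std, threshold) >= min_prob
    ]


if __name__ == "__main__":
    segments = [(0.8, 0.05), (0.3, 0.2)]
    print(flag_critical(segments, threshold=0.4))
```

Lowering `min_prob` makes the flagging more conservative: segments with a high mean but a wide interval start to be flagged as well, which a point estimate alone would not allow.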
Abstract
Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on noisy, biased and scarce human judgements, often resulting in unreliable quality predictions. In this paper, we introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality. We combine the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to obtain quality scores along with confidence intervals. We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task, augmented with MQM annotations. We experiment with varying numbers of references and further discuss the usefulness of…
Taxonomy
Methods: Dropout · Monte Carlo Dropout
