Pitfalls and Outlooks in Using COMET
Vilém Zouhar, Pinzhen Chen, Tsz Kin Lam, Nikita Moghe, Barry Haddow

TL;DR
This paper examines the pitfalls of using the COMET metric in machine translation, highlighting technical, data-related, and reporting issues, and proposes solutions to improve its reliability and comparability.
Contribution
It identifies key pitfalls in COMET's usage and reporting, and introduces sacreCOMET to standardize configurations and enhance metric reliability.
Findings
COMET scores vary with the software version and compute precision used.
Data issues such as empty content, language mismatch, and translationese at test time affect COMET reliability.
Standardized reporting can improve comparability across studies.
Abstract
The COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable between papers or even technical setups and we put forward our perspective on fixing each issue. Furthermore, we release…
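One of the data pitfalls named in the abstract, empty content at test time, can be screened for before any scoring takes place. The following is a minimal sketch of such a check, not the authors' tooling: the `Segment` type and the `find_empty_segments` helper are hypothetical names introduced here for illustration, and the sketch assumes each test item is a source, machine translation, and reference triple.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical container for one test item (not part of any COMET API)."""
    src: str
    mt: str
    ref: str


def find_empty_segments(segments: List[Segment]) -> List[int]:
    """Return indices of segments where any field is empty or whitespace-only."""
    return [
        i for i, seg in enumerate(segments)
        if not all(field.strip() for field in seg)
    ]


# Illustrative data only; the second and third items have an empty translation.
segments = [
    Segment("Guten Morgen.", "Good morning.", "Good morning."),
    Segment("Wie geht's?", "", "How are you?"),
    Segment("Danke.", "   ", "Thanks."),
]
empty_idx = find_empty_segments(segments)
print(empty_idx)  # [1, 2]
```

How flagged segments should be handled (excluded, assigned a fixed score, or reported separately) is a reporting decision that this sketch deliberately leaves open.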
Taxonomy
Topics: Distributed and Parallel Computing Systems · Embedded Systems Design Techniques
Methods: Sparse Evolutionary Training
