Assessing Reference-Free Peer Evaluation for Machine Translation
Sweta Agrawal, George Foster, Markus Freitag, Colin Cherry

TL;DR
This paper studies reference-free machine translation evaluation based on the probabilities of a large multilingual model, showing that scaling the model up matches the performance of BLEU and that the approach holds up across domains and system qualities.
Contribution
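It builds on a recently proposed reference-free metric that scores translations with the probabilities of a large multilingual model. The paper experiments with modifications to that model, shows that scaling it up matches BLEU's performance, and analyzes potential weaknesses of the approach across domains and system qualities.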
Findings
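Scaling up the multilingual model improves its accuracy as a metric.
The approach proves surprisingly robust and is likely to perform reasonably across a broad range of domains and system qualities.
With sufficient scale, the metric matches the performance of BLEU.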
Abstract
Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large, multilingual model can achieve state of the art results when used as a reference-free metric. We experiment with various modifications to this model and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities.
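As a rough illustration of what using model probabilities as a reference-free metric can look like, the sketch below scores each hypothesis by the average log-probability its tokens receive given the source, then averages over segments to get a system-level score. The abstract does not specify the aggregation, so the length normalization, the function names, and the `toy_logprobs` stand-in for a real multilingual model are all assumptions made for this example, not the authors' implementation.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: given (source, hypothesis), return the log-probability
# the model assigns to each hypothesis token, conditioned on the source.
TokenLogProbFn = Callable[[str, str], Sequence[float]]


def segment_score(token_logprobs: Sequence[float]) -> float:
    """Average token log-probability of one hypothesis (assumed aggregation)."""
    if not token_logprobs:
        raise ValueError("hypothesis has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def system_score(
    sources: Sequence[str],
    hypotheses: Sequence[str],
    logprob_fn: TokenLogProbFn,
) -> float:
    """Mean segment score over a test set; no reference translations are used."""
    if not sources or len(sources) != len(hypotheses):
        raise ValueError("need equally many, and at least one, sources and hypotheses")
    scores = [segment_score(logprob_fn(s, h)) for s, h in zip(sources, hypotheses)]
    return sum(scores) / len(scores)


def toy_logprobs(source: str, hypothesis: str) -> Sequence[float]:
    """Toy stand-in for a real model: tokens that also occur in the source
    get probability 0.9, all other tokens get probability 0.1."""
    src_tokens = set(source.lower().split())
    return [
        math.log(0.9) if tok in src_tokens else math.log(0.1)
        for tok in hypothesis.lower().split()
    ]
```

In practice `logprob_fn` would wrap a real multilingual translation model; two systems can then be ranked by comparing their `system_score` values on the same sources, with the higher (less negative) score indicating the output the model finds more probable.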
