Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Tom Kocmi, Christian Federmann

TL;DR
This paper introduces GEMBA, a GPT-based metric for translation quality assessment that achieves state-of-the-art accuracy, works with or without reference translations, and demonstrates the potential of large language models for this task.
Contribution
The paper presents GEMBA, a GPT-based evaluation metric for translation quality that reaches state-of-the-art accuracy when compared against results from the WMT22 Metrics shared task, in both reference-based and reference-free modes.
Findings
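GEMBA achieves state-of-the-art accuracy on the WMT22 Metrics shared task when compared to MQM-based human labels.
The method works only with GPT 3.5 and larger models.
Results hold at the system level for all three WMT22 Metrics shared task language pairs: English into German, English into Russian, and Chinese into English.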
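To make "accuracy on the system level" concrete, here is a minimal sketch of one common way to compute it: pairwise accuracy, the share of system pairs that a metric ranks in the same order as human judgments. This page does not spell out the formula, so treat the definition as an assumption. The function name, the system names, and all scores are hypothetical. Both score sets are assumed to be "higher is better", and ties count as disagreement.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs ordered the same way by metric and humans.

    Both arguments map a system name to a single system-level score.
    Assumption: higher is better for both score sets.
    """
    systems = sorted(metric_scores)
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        # Agreement means both differences have the same sign;
        # a tie on either side is counted as a disagreement.
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / total if total else 0.0


# Made-up scores for three hypothetical systems (not results from the paper).
metric = {"sysA": 82.0, "sysB": 75.5, "sysC": 79.0}
human = {"sysA": 0.9, "sysB": 0.4, "sysC": 0.3}
print(pairwise_accuracy(metric, human))
```

In this toy example the metric and the human scores agree on two of the three system pairs and disagree on the sysB versus sysC pair.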
Abstract
We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate nine versions of GPT models, including ChatGPT and GPT-4. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality…
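The two modes described in the abstract (with and without a reference) can be illustrated with a small zero-shot prompt builder. This is a sketch under stated assumptions, not the paper's template: the prompt wording, the 0 to 100 scale, and the helper names `build_prompt` and `parse_score` are all hypothetical. The call to the language model is deliberately left out.

```python
import re
from typing import Optional


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Build a zero-shot scoring prompt (hypothetical wording).

    If `reference` is given, the prompt is reference-based;
    otherwise it is reference-free.
    """
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang} "
        "on a scale from 0 to 100, where 0 means no meaning preserved "
        "and 100 means a perfect translation.",
        "",
        f"{src_lang} source: {source}",
    ]
    if reference is not None:
        lines.append(f"{tgt_lang} human reference: {reference}")
    lines.append(f"{tgt_lang} translation: {hypothesis}")
    lines.append("Score:")
    return "\n".join(lines)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from a model reply; None if absent or out of range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0 <= value <= 100 else None


# Reference-free mode: only source and translation are shown to the model.
print(build_prompt("Guten Morgen", "Good morning", "German", "English"))
# Reference-based mode: a human reference translation is added.
print(build_prompt("Guten Morgen", "Good morning", "German", "English",
                   reference="Good morning."))
```

A reply that contains no number, or one outside the assumed range, is returned as `None` so that the caller can decide how to handle unusable answers.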
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Adam · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Dropout
