What do Large Language Models Need for Machine Translation Evaluation?

Shenbin Qian; Archchana Sindhujan; Minnie Kabra; Diptesh Kanojia; Constantin Orăsan; Tharindu Ranasinghe; Frédéric Blain

arXiv:2410.03278 · cs.CL · October 10, 2024

TL;DR

This paper investigates the requirements and effectiveness of large language models in evaluating machine translation quality, emphasizing the importance of reference translations and prompting techniques across various languages and model sizes.

Contribution

It provides a comprehensive analysis of LLM-based MT evaluation, highlighting the role of reference data and prompting methods, and offers publicly available resources for reproducibility.

Findings

01

Reference translations significantly improve evaluation accuracy.

02

Larger models benefit more from Chain of Thought prompting.

03

LLMs often do not produce numerical scores, raising reliability concerns.
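The third finding suggests that any LLM-based scoring pipeline has to handle responses that contain no usable number. The following is a minimal sketch of one way to do that, not the authors' implementation: the helper names `extract_score` and `numeric_response_rate` are hypothetical, and the 0 to 100 score range is an assumption made for illustration.

```python
import re
from typing import List, Optional

# Matches integers and decimals, e.g. "85" or "72.5" (hypothetical pattern).
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_score(response: str, low: float = 0.0, high: float = 100.0) -> Optional[float]:
    """Return the first number in `response` that lies in [low, high], else None.

    The default range of 0 to 100 is an assumption, not taken from the paper.
    """
    for match in _NUMBER.finditer(response):
        value = float(match.group())
        if low <= value <= high:
            return value
    return None  # the model did not produce a usable numerical score


def numeric_response_rate(responses: List[str]) -> float:
    """Fraction of responses from which a score could be extracted."""
    if not responses:
        return 0.0
    parsed = [extract_score(r) for r in responses]
    return sum(score is not None for score in parsed) / len(responses)
```

Tracking the rate of unparseable responses alongside the scores themselves makes it visible how often a model fails to return a number at all.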

Abstract

Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT…
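To make the idea of varying the translation information concrete, here is a hedged sketch of a prompt builder that includes the source, reference, annotated errors and guidelines only when they are supplied. The function `build_prompt`, its parameter names, the prompt wording and the 0 to 100 scale are all assumptions for illustration; they are not the templates used in the paper.

```python
from typing import List, Optional


def build_prompt(
    source: str,
    translation: str,
    reference: Optional[str] = None,
    errors: Optional[List[str]] = None,
    guidelines: Optional[str] = None,
    chain_of_thought: bool = False,
) -> str:
    """Assemble an evaluation prompt from whichever components are available.

    Hypothetical template: wording and score scale are illustrative only.
    """
    parts = []
    if guidelines:
        parts.append(f"Annotation guidelines:\n{guidelines}")
    parts.append(f"Source: {source}")
    parts.append(f"Translation: {translation}")
    if reference:
        parts.append(f"Reference: {reference}")
    if errors:
        parts.append("Annotated errors:\n" + "\n".join(f"- {e}" for e in errors))
    if chain_of_thought:
        # CoT variant: ask for reasoning before the final score.
        parts.append(
            "Think step by step about the translation's quality, "
            "then give a score from 0 to 100."
        )
    else:
        # Zero-shot variant: ask for the score directly.
        parts.append("Give a score from 0 to 100.")
    return "\n\n".join(parts)
```

For example, `build_prompt("Guten Morgen", "Good morning", reference="Good morning")` yields a reference-based zero-shot prompt, while omitting `reference` yields a reference-free one. Few-shot prompting could be layered on top by prepending worked examples, which this sketch leaves out.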

Code & Models

Repositories

surrey-nlp/LLM4MT_eval (official)

Videos

What do large language models need for machine translation evaluation? (Underline)

Taxonomy

Topics: Natural Language Processing Techniques