Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets

Nikita Moghe; Arnisa Fazla; Chantal Amrhein; Tom Kocmi; Mark Steedman; Alexandra Birch; Rico Sennrich; Liane Guillou

arXiv:2401.16313·cs.CL·January 30, 2024·2 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces ACES, a comprehensive challenge set for evaluating machine translation metrics across 146 language pairs and 68 error types, revealing limitations of current metrics and LLM-based evaluators.

Contribution

The paper presents ACES, a large-scale, diverse benchmark for analyzing MT metric behaviour across phenomena and languages, and provides insights and recommendations for improving evaluation methods.

Findings

01

Metrics struggle with certain error types.

02

LLMs do not reliably evaluate translation quality.

03

Most metrics ignore source sentence information.

Abstract

Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from simple alterations at the word/character level to more complex errors based on discourse and real-world knowledge. We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks. We benchmark metric performance, assess their incremental performance over successive…
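
As a rough illustration of how a contrastive challenge set can be scored, the sketch below checks, for each example, whether a metric rates the good translation above the incorrect one, and summarises the outcome as (concordant - discordant) / total. The field names, the tie handling (ties count as discordant), and the toy `token_overlap` metric are assumptions made for this illustration; they are not taken from the ACES release or from any metric submitted to WMT.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ContrastiveExample:
    # Hypothetical field names, chosen for this sketch only.
    source: str
    reference: str
    good_translation: str
    incorrect_translation: str


def contrastive_score(
    examples: Iterable[ContrastiveExample],
    metric: Callable[[str, str], float],
) -> float:
    """Return (concordant - discordant) / total over all examples.

    An example is concordant when the metric scores the good translation
    strictly higher than the incorrect one. Ties are treated as discordant
    here, which is an assumption of this sketch.
    """
    concordant = 0
    discordant = 0
    for ex in examples:
        good = metric(ex.good_translation, ex.reference)
        bad = metric(ex.incorrect_translation, ex.reference)
        if good > bad:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no examples to score")
    return (concordant - discordant) / total


def token_overlap(hypothesis: str, reference: str) -> float:
    """Toy placeholder metric: fraction of reference word types in the hypothesis."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    return len(hyp & ref) / len(ref) if ref else 0.0


examples = [
    ContrastiveExample(
        source="Die Katze sass auf der Matte",
        reference="the cat sat on the mat",
        good_translation="the cat sat on a mat",
        incorrect_translation="the dog sat on the mat",
    ),
    ContrastiveExample(
        source="Er ging nicht",
        reference="he did not go",
        good_translation="he didn't go",
        incorrect_translation="he did go",
    ),
]

score = contrastive_score(examples, token_overlap)
```

The second example shows why surface overlap is a weak signal: dropping the negation leaves more reference words intact than the correct contraction does, so the toy metric prefers the incorrect translation.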

Code & Models

Repositories

edinburghnlp/aces
Official

Datasets

nikitam/ACES
dataset · 233 downloads

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus · Balanced Selection