Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets
Nikita Moghe, Arnisa Fazla, Chantal Amrhein, Tom Kocmi, Mark Steedman, Alexandra Birch, Rico Sennrich, Liane Guillou

TL;DR
This paper introduces ACES, a comprehensive challenge set for evaluating machine translation metrics across 146 language pairs and 68 error types, revealing limitations of current metrics and LLM-based evaluators.
Contribution
The paper presents ACES, a large-scale, diverse benchmark for analyzing MT metric behaviour across phenomena and languages, and provides insights and recommendations for improving evaluation methods.
Findings
Metrics struggle with certain error types.
LLMs do not reliably evaluate translation quality.
Most metrics ignore source sentence information.
Abstract
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from simple alterations at the word/character level to more complex errors based on discourse and real-world knowledge. We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks. We benchmark metric performance, assess their incremental performance over successive…
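To make the contrastive setup concrete, here is a minimal sketch of how a metric could be scored on one phenomenon of such a challenge set. It assumes each example pairs a good translation with an incorrect one, and that a metric passes an example only when it scores the good translation strictly higher. The Kendall's tau-like statistic, the treatment of ties as failures, the `ContrastiveExample` fields, the toy `token_overlap` metric, and the example sentences are all assumptions made for illustration; none of them is taken from the text above.

```python
from typing import Callable, Iterable, NamedTuple


class ContrastiveExample(NamedTuple):
    """One challenge item (hypothetical field names)."""
    source: str
    reference: str
    good_translation: str
    incorrect_translation: str


# Hypothetical metric signature: (source, hypothesis, reference) -> score.
Metric = Callable[[str, str, str], float]


def tau_like(metric: Metric, examples: Iterable[ContrastiveExample]) -> float:
    """Return (concordant - discordant) / (concordant + discordant).

    An example is concordant when the metric scores the good translation
    strictly higher than the incorrect one. Ties count as discordant here,
    which is an assumption of this sketch.
    """
    concordant = 0
    discordant = 0
    for ex in examples:
        good = metric(ex.source, ex.good_translation, ex.reference)
        bad = metric(ex.source, ex.incorrect_translation, ex.reference)
        if good > bad:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no examples to evaluate")
    return (concordant - discordant) / total


def token_overlap(source: str, hypothesis: str, reference: str) -> float:
    """Toy surface metric: fraction of reference word types found in the
    hypothesis. The source sentence is ignored entirely."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    return len(ref & hyp) / len(ref) if ref else 0.0


# Invented toy items, not drawn from the actual dataset.
examples = [
    ContrastiveExample(
        source="Das Treffen beginnt um neun",
        reference="the meeting starts at nine",
        good_translation="the meeting begins at nine",
        incorrect_translation="the meeting starts at five",
    ),
    ContrastiveExample(
        source="Sie hat den Vertrag nicht unterschrieben",
        reference="she did not sign the contract",
        good_translation="she did not sign the contract",
        incorrect_translation="she did sign the contract",
    ),
]

score = tau_like(token_overlap, examples)
```

On these two toy items the overlap metric ties on the first (the paraphrase and the wrong number each miss one reference word) and passes the second, so the statistic comes out at 0.0. A score of 1.0 would mean the metric always prefers the good translation, and -1.0 that it never does.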
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Focus · Balanced Selection
