XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Jingxuan Liu, Zhi Qu, Jin Tei, Hidetaka Kamigaito, Lemao Liu, Taro Watanabe

TL;DR
XQ-MEval is a new dataset designed to benchmark translation metrics across multiple languages, revealing cross-lingual scoring biases and proposing normalization strategies to improve evaluation fairness.
Contribution
The paper introduces XQ-MEval, a semi-automatically constructed dataset for cross-lingual translation metric benchmarking, provides empirical evidence of cross-lingual scoring bias, and proposes a normalization method to counter it.
Findings
Identifies cross-lingual scoring bias in translation metrics.
Demonstrates inconsistency between metric scores and human judgment.
Proposes normalization to improve multilingual metric evaluation.
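To make the bias concrete, here is a minimal sketch of how averaging raw metric scores across languages can rank systems differently from averaging normalized scores. It assumes per-language z-score standardization, which is only one plausible normalization and is not confirmed to be the strategy used in the paper. The function names, system names, and toy scores are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_by_language(scores):
    """Standardize each language's scores with that language's own mean and std.

    Assumption: z-scoring stands in for the paper's normalization, which is
    not specified in this summary.
    """
    normalized = {}
    for lang, per_system in scores.items():
        values = list(per_system.values())
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # avoid dividing by zero when all scores match
        normalized[lang] = {s: (v - mu) / sigma for s, v in per_system.items()}
    return normalized


def average_across_languages(scores):
    """Average each system's score over all languages (the common practice)."""
    systems = next(iter(scores.values())).keys()
    return {s: mean(scores[lang][s] for lang in scores) for s in systems}


# Hypothetical metric scores: the metric spreads "de" scores widely but
# compresses "ja" and "zh" scores into a narrow band.
raw = {
    "de": {"sys_a": 0.90, "sys_b": 0.70, "sys_c": 0.80},
    "ja": {"sys_a": 0.50, "sys_b": 0.54, "sys_c": 0.52},
    "zh": {"sys_a": 0.60, "sys_b": 0.64, "sys_c": 0.62},
}

raw_avg = average_across_languages(raw)
norm_avg = average_across_languages(zscore_by_language(raw))

best_raw = max(raw_avg, key=raw_avg.get)
best_norm = max(norm_avg, key=norm_avg.get)
print("best by raw average:", best_raw)
print("best by normalized average:", best_norm)
```

In this toy data, sys_b is ahead in two of the three languages, yet the raw average favors sys_a because the wide score range in "de" dominates the mean. Standardizing per language puts each direction on a comparable scale before averaging.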
Abstract
Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are…
