XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

Jingxuan Liu; Zhi Qu; Jin Tei; Hidetaka Kamigaito; Lemao Liu; Taro Watanabe

arXiv:2604.14934 · cs.CL · April 21, 2026

TL;DR

XQ-MEval is a dataset for benchmarking translation metrics across nine translation directions. Experiments on it reveal cross-lingual scoring bias in existing metrics, and the authors propose normalization strategies to make multilingual evaluation fairer.

Contribution

The paper introduces XQ-MEval, a semi-automatically constructed dataset for cross-lingual translation metric benchmarking, and provides empirical evidence of scoring bias and a normalization method.

Findings

01

Identifies cross-lingual scoring bias in translation metrics.

02

Demonstrates inconsistency between metric scores and human judgment.

03

Proposes normalization to improve multilingual metric evaluation.
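
To make the third finding concrete, here is a minimal sketch of one way per-language normalization could work. It assumes a plain z-score within each language, which is an illustrative choice and not necessarily the strategy the paper proposes; the function name `normalized_system_averages` and the toy scores are hypothetical.

```python
from statistics import mean, pstdev


def normalized_system_averages(scores):
    """Average per-language z-scores for each system.

    `scores` maps system name -> {language: metric score}.
    ASSUMPTION: z-score normalization per language is used here only as an
    illustration; the paper's own normalization may differ.
    """
    # Collect every language that appears for any system.
    langs = sorted({lang for per_lang in scores.values() for lang in per_lang})

    # Per-language mean and population standard deviation across systems.
    stats = {}
    for lang in langs:
        values = [per_lang[lang] for per_lang in scores.values() if lang in per_lang]
        stats[lang] = (mean(values), pstdev(values))

    # Rescale each score within its language, then average across languages.
    result = {}
    for system, per_lang in scores.items():
        z_scores = []
        for lang, score in per_lang.items():
            mu, sigma = stats[lang]
            # A language where all systems tie carries no ranking information.
            z_scores.append(0.0 if sigma == 0 else (score - mu) / sigma)
        result[system] = mean(z_scores)
    return result


# Hypothetical toy scores: the metric scores "de" on a higher range than "ja".
toy_scores = {
    "system_A": {"de": 0.90, "ja": 0.40},
    "system_B": {"de": 0.80, "ja": 0.60},
}
raw_averages = {name: mean(per_lang.values()) for name, per_lang in toy_scores.items()}
normalized_averages = normalized_system_averages(toy_scores)
```

In this toy case the raw average favors `system_B` only because the two languages are scored on different ranges; after rescaling each language separately, the two systems come out even, since each wins one language by the same relative margin.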

Abstract

Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are…

Code & Models

Datasets

naist-nlp/XQ-MEval
dataset · 647 dl