Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation

Xiang Geng; Zhejian Lai; Jiajun Chen; Hao Yang; Shujian Huang

arXiv:2502.19941·cs.CL·June 19, 2025

TL;DR

This paper introduces DCSQE, a framework that reduces distribution shift in synthetic data for machine translation quality estimation, improving model performance by enhancing translation diversity and label quality.

Contribution

DCSQE employs constrained beam search and distinct generation models to produce synthetic translations that are closer to real ones, and uses references to guide both the generation and annotation processes.

Findings

01. DCSQE outperforms state-of-the-art (SOTA) baselines such as CometKiwi.

02. The framework improves results in both supervised and unsupervised QE settings.

03. Higher-quality synthetic data leads to better QE model performance.

Abstract

Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to the data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. DCSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the…
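To make the idea of reference-guided annotation concrete, the following is a minimal sketch of how a reference can be used to derive pseudo labels for a pseudo translation. It is not DCSQE's annotator: it assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` as a stand-in alignment, tagging hypothesis tokens that align to the reference as OK and all others as BAD. The function name `tag_against_reference` and the `bad_ratio` sentence score are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher


def tag_against_reference(hypothesis: str, reference: str):
    """Toy reference-guided labeling (an illustrative assumption, not DCSQE).

    Returns one OK/BAD tag per hypothesis token and the fraction of BAD tokens.
    """
    hyp = hypothesis.split()  # assumption: whitespace tokenization
    ref = reference.split()

    # Start pessimistic: every hypothesis token is BAD until it is aligned.
    tags = ["BAD"] * len(hyp)

    # Align reference tokens (a) with hypothesis tokens (b).
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.b is the start index in the hypothesis, block.size the run length.
        for j in range(block.b, block.b + block.size):
            tags[j] = "OK"

    # Sentence-level pseudo score: share of hypothesis tokens tagged BAD.
    bad_ratio = tags.count("BAD") / len(hyp) if hyp else 0.0
    return tags, bad_ratio


if __name__ == "__main__":
    tags, score = tag_against_reference(
        "the cat sat on a mat",
        "the cat sat on the mat",
    )
    print(tags, round(score, 3))
```

Labels produced this way depend entirely on surface overlap with a single reference, which is one way pseudo labels can drift from human judgments; the abstract identifies that kind of mismatch as a source of distribution shift.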

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Topic Modeling

Methods: ALIGN