Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation
Xiang Geng, Zhejian Lai, Jiajun Chen, Hao Yang, Shujian Huang

TL;DR
This paper introduces DCSQE, a framework that reduces distribution shift in synthetic data for machine translation quality estimation, improving model performance by enhancing translation diversity and label quality.
Contribution
DCSQE uses constrained beam search and distinct generation models to bring pseudo translations closer to real ones, and uses references to guide both the generation and annotation processes.
Findings
DCSQE outperforms state-of-the-art baselines such as CometKiwi.
The framework improves results in both supervised and unsupervised QE settings.
Enhanced synthetic data quality leads to better model performance.
Abstract
Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as reward models for the translation task. Because QE data is scarce, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or as pseudo labels that do not align with human preferences. To tackle this issue, we introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. DCSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the…
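The abstract is truncated before it describes the annotation step, so the sketch below is not DCSQE's procedure. It is only a generic illustration of what reference-guided token labeling can look like: each token of a pseudo translation is tagged OK if it falls inside a block that aligns with the reference, and BAD otherwise. The function names `tag_tokens` and `sentence_score`, the whitespace tokenization, and the use of Python's `difflib` for alignment are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def tag_tokens(pseudo: str, reference: str) -> List[Tuple[str, str]]:
    """Tag each pseudo-translation token OK or BAD against a reference.

    Hypothetical helper: a token is OK only if it lies inside a block
    that SequenceMatcher aligns with the reference.
    """
    hyp = pseudo.split()      # assumption: whitespace tokenization
    ref = reference.split()
    tags = ["BAD"] * len(hyp)
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.a is the start index in hyp, block.size the match length
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return list(zip(hyp, tags))


def sentence_score(tagged: List[Tuple[str, str]]) -> float:
    """Fraction of OK tokens; an empty translation scores 1.0 by convention."""
    if not tagged:
        return 1.0
    return sum(tag == "OK" for _, tag in tagged) / len(tagged)


if __name__ == "__main__":
    tagged = tag_tokens("the cat sat on a mat", "the cat sat on the mat")
    print(tagged)                  # only "a" is tagged BAD
    print(sentence_score(tagged))  # 5 of 6 tokens are OK
```

Labels produced by plain string alignment like this are one place where pseudo labels can drift from human judgments, since a valid paraphrase would be tagged BAD; that mismatch is the kind of label-side distribution shift the abstract refers to.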
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Topic Modeling
Methods: ALIGN
