On the Language Coverage Bias for Neural Machine Translation
Shuo Wang, Zhaopeng Tu, Zhixing Tan, Shuming Shi, Maosong Sun, Yang Liu

TL;DR
This paper analyzes language coverage bias in neural machine translation, shows that training on only the source-original data can match the performance of training on the full data, and proposes methods to mitigate the bias that improve translation quality across multiple tasks.
Contribution
It provides a comprehensive analysis of language coverage bias and introduces simple approaches to mitigate it, enhancing NMT performance on several benchmarks.
Findings
Training on only source-original data achieves results comparable to training on the full data.
Explicitly distinguishing data origins improves translation performance.
Mitigating language coverage bias benefits both back- and forward-translation methods.
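As an illustration of what "explicitly distinguishing data origins" could look like in a preprocessing step, here is a minimal Python sketch. It assumes each training pair already carries a flag saying which language it was originally written in, and it marks the origin by prepending a tag token to the source sentence. The tag strings, the `SentencePair` type, and the function names are all hypothetical and chosen for illustration; this summary does not specify the exact mechanism the authors use.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tag tokens (an assumption for this sketch, not taken from the paper).
SRC_ORIG_TAG = "<src_orig>"
TGT_ORIG_TAG = "<tgt_orig>"


@dataclass
class SentencePair:
    source: str
    target: str
    # True if the pair was originally written in the source language,
    # False if it was originally written in the target language.
    source_original: bool


def tag_pair(pair: SentencePair) -> SentencePair:
    """Return a copy of the pair with an origin tag prepended to the source side."""
    tag = SRC_ORIG_TAG if pair.source_original else TGT_ORIG_TAG
    return SentencePair(f"{tag} {pair.source}", pair.target, pair.source_original)


def tag_corpus(pairs: List[SentencePair]) -> List[SentencePair]:
    """Tag every pair in the corpus with its origin."""
    return [tag_pair(p) for p in pairs]


def source_original_only(pairs: List[SentencePair]) -> List[SentencePair]:
    """Keep only source-original pairs, mirroring the source-original-only setting."""
    return [p for p in pairs if p.source_original]


if __name__ == "__main__":
    corpus = [
        SentencePair("Guten Tag", "Good day", source_original=True),
        SentencePair("Danke", "Thanks", source_original=False),
    ]
    for pair in tag_corpus(corpus):
        print(pair.source, "|||", pair.target)
```

In a setup like this, the origin tag used at inference time is a separate design choice; since test inputs are written in the source language, using the source-original tag there would be the natural assumption.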
Abstract
Language coverage bias, which indicates the content-dependent differences between sentence pairs originating from the source and target languages, is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice. By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data, and find that using only the source-original data achieves comparable performance with using full training data. Based on these observations, we further propose two simple and effective approaches to alleviate the language coverage bias problem through explicitly distinguishing between the source- and target-original training data, which consistently improve the performance over strong baselines on six WMT20 translation tasks. Complementary to the translationese effect, language coverage…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
