On Using Monolingual Corpora in Neural Machine Translation
Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, Yoshua Bengio

TL;DR
This paper explores leveraging monolingual corpora to improve neural machine translation, demonstrating significant BLEU score gains across low-resource and high-resource language pairs.
Contribution
It introduces a method for incorporating monolingual data into neural machine translation, improving performance beyond what parallel data alone provides.
Findings
Up to 1.96 BLEU improvement on Turkish-English
1.59 BLEU gain on Chinese-English chat translation
Additional improvements on Czech-English and German-English
Abstract
Recent work on end-to-end neural network-based architectures for machine translation has shown promising results for En-Fr and En-De translation. Arguably, one of the major factors behind this success has been the availability of high quality parallel corpora. In this work, we investigate how to leverage abundant monolingual corpora for neural machine translation. Compared to a phrase-based and hierarchical baseline, we obtain up to 1.96 BLEU improvement on the low-resource language pair Turkish-English, and 1.59 BLEU on the focused domain task of Chinese-English chat messages. While our method was initially targeted toward such tasks with less parallel data, we show that it also extends to high resource languages such as Cs-En and De-En where we obtain an improvement of 0.39 and 0.47 BLEU scores over the neural machine translation baselines, respectively.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
