TL;DR
This paper introduces RE-LM, a method that reuses a language model pretrained on a high-resource language, fine-tunes it on both languages, and extends its vocabulary to cover the low-resource language. Used to initialize an unsupervised neural machine translation system, it improves on XLM by more than 8.3 BLEU points on English-Macedonian and English-Albanian.
Contribution
The paper proposes a novel vocabulary extension method and a fine-tuning approach to effectively reuse pretrained LMs for unsupervised NMT involving low-resource languages.
Findings
RE-LM outperforms XLM in English-Macedonian and English-Albanian translation tasks.
Gains more than 8.3 BLEU points over XLM in all four translation directions (En-Mk, Mk-En, En-Sq, Sq-En).
Effective vocabulary extension is key to reusing pretrained LMs for low-resource languages.
Abstract
Using a language model (LM) pretrained on two languages with large monolingual data in order to initialize an unsupervised neural machine translation (UNMT) system yields state-of-the-art results. When limited data is available for one language, however, this method leads to poor translations. We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. To reuse the pretrained LM, we have to modify its predefined vocabulary, to account for the new language. We therefore propose a novel vocabulary extension method. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq), yielding more than +8.3 BLEU points for all four translation directions.
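The abstract states that the pretrained LM's vocabulary must be modified to account for the new language, but it does not spell out the procedure. The sketch below illustrates one generic way to extend a vocabulary and its embedding table. It is an assumption for illustration, not the authors' RE-LM method: the function name `extend_vocab`, the example tokens, and the Gaussian initialization with standard deviation 0.02 are all hypothetical choices.

```python
import random


def extend_vocab(vocab, embeddings, new_tokens, dim, seed=0):
    """Append unseen tokens to a vocabulary and add one new embedding row each.

    vocab:      dict mapping token -> index (the pretrained LM vocabulary)
    embeddings: list of rows (lists of floats), one row per vocabulary entry
    new_tokens: subword tokens observed in the new language
    dim:        embedding dimension, assumed to match the existing rows
    """
    rng = random.Random(seed)
    vocab = dict(vocab)                          # copy; leave the input untouched
    embeddings = [row[:] for row in embeddings]  # copy the pretrained rows
    for tok in new_tokens:
        if tok in vocab:
            continue                             # keep pretrained index and row
        vocab[tok] = len(vocab)                  # next free index
        # Assumed initialization: small random values for the new row.
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


# Hypothetical pretrained vocabulary and 2-dimensional embeddings.
pretrained_vocab = {"<unk>": 0, "the": 1, "trans@@": 2}
pretrained_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Hypothetical subword tokens from the low-resource language.
new_vocab, new_emb = extend_vocab(
    pretrained_vocab, pretrained_emb, ["the", "шко@@", "ла"], dim=2
)
```

In this sketch, tokens already known to the pretrained LM keep their indices and embeddings, so the knowledge learned on the high-resource language is preserved, while only the rows for new tokens need to be learned during fine-tuning on both languages.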
