Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
Zihan Liu, Genta Indra Winata, Pascale Fung

TL;DR
This paper introduces a continual pre-training method for mBART that enhances neural machine translation for extremely low-resource and unseen languages by using mixed-language text constructed from monolingual corpora.
Contribution
It proposes a novel continual pre-training framework that adapts mBART to unseen languages using noisy mixed-language data, improving translation performance in low-resource scenarios.
Findings
Consistently improves translation quality over baselines.
Enhances performance on both unseen and seen language pairs.
Effective for extremely low-resource language translation.
Abstract
The data scarcity in low-resource languages has become a bottleneck to building robust neural machine translation systems. Fine-tuning a multilingual pre-trained model (e.g., mBART (Liu et al., 2020)) on the translation task is a good approach for low-resource languages; however, its performance will be greatly limited when there are unseen languages in the translation pairs. In this paper, we present a continual pre-training (CPT) framework on mBART to effectively adapt it to unseen languages. We first construct noisy mixed-language text from the monolingual corpus of the target language in the translation pair to cover both the source and target languages, and then, we continue pre-training mBART to reconstruct the original monolingual text. Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline, as well as other strong baselines,…
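The data-construction step described in the abstract can be illustrated with a minimal sketch. It assumes the mixed-language text is produced by swapping some target-language words for source-language entries from a word-level bilingual lexicon; the abstract does not specify the mechanism, so this is an assumption. The names `make_mixed_language`, `build_cpt_pair`, `lexicon`, and `replace_prob` are hypothetical and are not taken from the paper or from any mBART codebase. The sketch only builds (noisy input, original text) pairs; feeding them to a sequence-to-sequence model for reconstruction is left out.

```python
import random


def make_mixed_language(tokens, lexicon, replace_prob=0.3, rng=None):
    """Replace some tokens with a lexicon entry from the other language.

    tokens       : list of words from a monolingual target-language sentence
    lexicon      : dict mapping a target-language word to a list of
                   source-language candidates (hypothetical resource)
    replace_prob : chance of swapping a word that has a lexicon entry
                   (assumed hyperparameter, not from the paper)
    """
    rng = rng or random.Random()
    mixed = []
    for tok in tokens:
        candidates = lexicon.get(tok)
        if candidates and rng.random() < replace_prob:
            # Swap in one candidate translation, chosen at random.
            mixed.append(rng.choice(candidates))
        else:
            # Keep the original word when there is no entry or no swap.
            mixed.append(tok)
    return mixed


def build_cpt_pair(sentence, lexicon, replace_prob=0.3, rng=None):
    """Return (noisy mixed-language input, original text) for one sentence.

    The model would be trained to reconstruct the second element
    from the first during continual pre-training.
    """
    tokens = sentence.split()
    noisy = make_mixed_language(tokens, lexicon, replace_prob, rng)
    return " ".join(noisy), " ".join(tokens)
```

With a toy lexicon such as `{"b": ["X"]}`, calling `build_cpt_pair("a b c", lexicon, replace_prob=1.0)` swaps every word that has an entry and leaves the rest untouched, so the reconstruction target stays the original sentence.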
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: mBART
