TL;DR
This paper investigates pretraining strategies combining monolingual and parallel data to improve low-resource machine translation, specifically focusing on Lingala, and demonstrates that multi-language pretraining enhances translation quality.
Contribution
It adapts the pretraining approach of Reid and Artetxe (2021), originally designed for high-resource languages, to Lingala, pretraining on both monolingual and parallel data.
Findings
Pretraining on multiple languages improves translation quality.
Using both monolingual and parallel data yields better results.
The approach bridges performance gaps for low-resource languages.
Abstract
This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is developed specifically for Lingala, an under-resourced African language. It builds on the pretraining approach introduced by Reid and Artetxe (2021), which was originally designed for high-resource languages. Through a series of experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights…
