Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

Idriss Nguepi Nguefack; Mara Finkelstein; Toadoum Sari Sakayo

arXiv:2510.25116 · cs.CL · October 30, 2025

TL;DR

This paper investigates pretraining strategies that combine monolingual and parallel data to improve low-resource machine translation, focusing on Lingala, and reports that multi-language pretraining improves translation quality.

Contribution

It extends the pretraining approach of Reid and Artetxe (2021), originally designed for high-resource languages, to Lingala, using both monolingual and parallel data during pretraining.

Findings

01. Pretraining on multiple languages improves translation quality.

02. Using both monolingual and parallel data during pretraining further improves translation quality.

03. The approach bridges performance gaps for low-resource languages.

Abstract

This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is developed specifically for Lingala, an under-resourced African language. It builds upon the pretraining approach introduced by Reid and Artetxe (2021), which was originally designed for high-resource languages. Through a series of experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights…
