Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation

Surangika Ranathunga; Shravan Nayak; Shih-Ting Cindy Huang; Yanke Mao; Tong Su; Yun-Hsiang Ray Chan; Songchen Yuan; Anthony Rinaldi; Annie En-Shiun Lee

arXiv:2412.19522 · cs.CL · March 27, 2026

TL;DR

This paper evaluates how parallel data from auxiliary domains can improve domain-specific translation for low-resource languages in multilingual sequence-to-sequence language models, either through fine-tuning or through further pre-training, and recommends strategies for using such data when in-domain data is limited.

Contribution

The paper systematically assesses fine-tuning and further pre-training with auxiliary-domain parallel data for low-resource languages, and provides practical strategies for domain adaptation in NMT.

Findings

01. Fine-tuning with auxiliary data improves translation quality.
02. Domain divergence impacts model performance significantly.
03. Recommended strategies enhance low-resource language translation.

Abstract

Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver expected results when both the amount of parallel data for a language and the language's representation in the model are limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.
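
The abstract does not say how domain divergence is measured. As a rough illustration only, the sketch below assumes one common choice: the Jensen-Shannon divergence between the unigram word distributions of a target-domain corpus and a candidate auxiliary corpus. The function names (`unigram_distribution`, `js_divergence`) and the toy sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_distribution(sentences):
    """Return a word -> relative-frequency mapping for a list of sentences."""
    counts = Counter(token for s in sentences for token in s.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 when the two vocabularies do not overlap at all.
    """
    vocab = set(p) | set(q)
    divergence = 0.0
    for word in vocab:
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


# Toy, made-up corpora (assumption for illustration; not data from the paper).
target_domain = ["the patient received the vaccine", "the clinic opens today"]
auxiliary_a = ["the patient visited the clinic", "the vaccine is free today"]
auxiliary_b = ["parliament passed a budget bill", "ministers debated tax reform"]

p_target = unigram_distribution(target_domain)
div_a = js_divergence(p_target, unigram_distribution(auxiliary_a))
div_b = js_divergence(p_target, unigram_distribution(auxiliary_b))
```

Under this assumed measure, `auxiliary_a` shares vocabulary with the target domain and scores a lower divergence than `auxiliary_b`, which shares none. Whether lower divergence translates into better fine-tuning or further pre-training results is the empirical question the paper investigates.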

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling