Targeted Multilingual Adaptation for Low-resource Language Families

C.M. Downey; Terra Blevins; Dhwani Serai; Dwija Parikh; Shane Steinert-Threlkeld

arXiv:2405.12413 · cs.CL · May 22, 2024 · 1 citation

TL;DR

This paper shows that targeted multilingual adaptation, in which a pre-trained model (XLM-R) is adapted to a family of closely related languages (here, Uralic), significantly improves performance on low-resource languages over mono- and multilingual baselines, and it derives best practices for this kind of adaptation.

Contribution

The paper systematically studies how to adapt a pre-trained model to a language family and shows that targeted multilingual training is effective for low-resource languages.

Findings

01. Adapted models significantly outperform mono- and multilingual baselines on two downstream tasks.

02. Adapted vocabulary size has limited impact on low-resource language performance.

03. Aggressive up-sampling benefits low-resource languages with minimal impact on high-resource languages.
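
The up-sampling finding can be made concrete with a small sketch. One common scheme in multilingual pre-training draws each language with probability proportional to its corpus size raised to an exponent alpha, so that a smaller alpha gives low-resource languages a larger share of training batches. Whether this is the exact scheme used in the paper is an assumption; the function name `sampling_probs`, the language labels, and the corpus sizes below are all hypothetical and chosen only for illustration.

```python
def sampling_probs(sizes, alpha):
    """Return per-language sampling probabilities p_i proportional to n_i ** alpha.

    sizes: dict mapping a language label to its corpus size (hypothetical units).
    alpha: exponent in [0, 1]; 1.0 samples proportionally to corpus size,
           0.0 samples every language uniformly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if not sizes or any(n <= 0 for n in sizes.values()):
        raise ValueError("sizes must be non-empty with positive values")
    # Exponentiate each corpus size, then normalize so the values sum to 1.
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical corpus sizes: one high-resource and one low-resource language.
corpus_sizes = {"high": 1_000_000, "low": 10_000}

proportional = sampling_probs(corpus_sizes, alpha=1.0)  # no up-sampling
upsampled = sampling_probs(corpus_sizes, alpha=0.5)     # moderate up-sampling
uniform = sampling_probs(corpus_sizes, alpha=0.0)       # most aggressive setting
```

With these illustrative sizes, lowering alpha from 1.0 to 0.5 raises the low-resource language's share from roughly 1% to roughly 9% of samples, while the high-resource language still receives the large majority.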

Abstract

The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for…

Taxonomy

Topics: Second Language Learning and Teaching · Language Development and Disorders

Methods: XLM-R