When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

TL;DR
This study systematically evaluates how multilingual pre-training affects language modeling performance across more than 250 languages. Moderate amounts of added multilingual data help low-resource languages, but performance degrades for high-resource languages and as the added multilingual dataset grows, an effect attributed to limited model capacity.
Contribution
The paper provides the first large-scale empirical analysis of multilingual language models across diverse languages, highlighting the 'curse of multilinguality' and suggesting that more targeted models may be a better approach than massively multilingual ones.
Findings
Multilingual data improves low-resource language performance up to a point.
High-resource languages perform worse in multilingual pre-training scenarios.
Adding more multilingual data can hurt performance due to limited model capacity.
Abstract
Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with…
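The abstract describes the benefit of added multilingual data as being similar to a relative increase in low-resource dataset size. A minimal sketch of how such an equivalence could be computed is shown below, assuming a monolingual curve of loss versus token count and linear interpolation in log10(tokens). The function names (`equivalent_tokens`, `relative_gain`) and all numbers are hypothetical illustrations, not the authors' code, data, or necessarily their estimation method.

```python
import math


def equivalent_tokens(mono_curve, observed_loss):
    """Token count at which a monolingual curve would reach observed_loss.

    mono_curve: list of (tokens, loss) pairs sorted by increasing tokens,
    with loss non-increasing. Interpolates linearly in log10(tokens).
    This interpolation scheme is an assumption made for illustration.
    """
    for (t0, l0), (t1, l1) in zip(mono_curve, mono_curve[1:]):
        if l1 <= observed_loss <= l0:
            if l0 == l1:
                # Flat segment: any point matches, so return the smaller count.
                return float(t0)
            frac = (l0 - observed_loss) / (l0 - l1)
            log_t = math.log10(t0) + frac * (math.log10(t1) - math.log10(t0))
            return 10 ** log_t
    raise ValueError("observed_loss is outside the range covered by mono_curve")


def relative_gain(mono_curve, mono_tokens, multilingual_loss):
    """Relative dataset-size increase that would match the multilingual loss."""
    return equivalent_tokens(mono_curve, multilingual_loss) / mono_tokens - 1.0


# Hypothetical monolingual curve: loss 5.0 at 1M tokens, 4.0 at 10M tokens.
curve = [(1_000_000, 5.0), (10_000_000, 4.0)]

# Hypothetical multilingual model: 1M target-language tokens, loss 4.9.
gain = relative_gain(curve, mono_tokens=1_000_000, multilingual_loss=4.9)
print(f"equivalent dataset-size increase: {gain:.1%}")
```

With these made-up numbers, the multilingual model behaves like a monolingual model trained on roughly a quarter more data. A negative result from `relative_gain` would correspond to the degradation reported for high-resource languages.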
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
