What Drives Performance in Multilingual Language Models?
Sina Bagheri Nezhad, Ameeta Agrawal

TL;DR
This paper analyzes the factors that affect multilingual language model performance across diverse languages. Pretraining data size matters most for languages seen during pretraining, while script and language family matter most for unseen languages.
Contribution
The paper analyzes what influences MLLM performance across 6 models and the 204 languages of SIB-200, covering pretraining data size, general resource availability, script, and language family, and offers insights for future model development.
Findings
Pretraining data size is the most influential factor for seen languages.
Script type and language family are crucial for unseen languages.
Model size and architecture have minimal impact on which factors matter most for performance.
Abstract
This study investigates the factors influencing the performance of multilingual large language models (MLLMs) across diverse languages. We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset encompassing 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of factors such as pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. However, interestingly, script type and language family are crucial for UNSEEN languages,…
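To make the idea of ranking factors with a decision tree more concrete, here is a minimal sketch, not the authors' implementation. It scores each factor by the best variance reduction a single binary split on that factor achieves, which is the criterion a regression tree uses at each node. The field names (`script`, `family`, `data_size`, `score`), the helper names, and every value in `ROWS` are hypothetical and invented for illustration only; they are not results from the paper.

```python
from statistics import pvariance

# Hypothetical toy records, one per language. All values are invented
# for illustration and are NOT taken from the paper or from SIB-200.
ROWS = [
    {"script": "Latn", "family": "Indo-European", "data_size": 900.0, "score": 0.90},
    {"script": "Latn", "family": "Indo-European", "data_size": 800.0, "score": 0.88},
    {"script": "Cyrl", "family": "Uralic", "data_size": 700.0, "score": 0.86},
    {"script": "Arab", "family": "Afro-Asiatic", "data_size": 20.0, "score": 0.40},
    {"script": "Latn", "family": "Niger-Congo", "data_size": 10.0, "score": 0.38},
    {"script": "Cyrl", "family": "Turkic", "data_size": 5.0, "score": 0.36},
]


def variance_reduction(rows, feature, target="score"):
    """Largest drop in target variance from one binary split on `feature`.

    Numeric features are split by threshold (<= value vs > value);
    categorical features are split one-vs-rest (== value vs != value).
    """
    ys = [r[target] for r in rows]
    parent = pvariance(ys)
    best = 0.0
    for value in {r[feature] for r in rows}:
        if isinstance(value, (int, float)):
            left = [r[target] for r in rows if r[feature] <= value]
            right = [r[target] for r in rows if r[feature] > value]
        else:
            left = [r[target] for r in rows if r[feature] == value]
            right = [r[target] for r in rows if r[feature] != value]
        if not left or not right:
            continue  # a split must put at least one row on each side
        # Size-weighted average of the two child variances.
        child = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(ys)
        best = max(best, parent - child)
    return best


def rank_features(rows, features, target="score"):
    """Return (feature, variance_reduction) pairs, most influential first."""
    scored = [(f, variance_reduction(rows, f, target)) for f in features]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, gain in rank_features(ROWS, ["script", "family", "data_size"]):
        print(f"{name}: {gain:.4f}")
```

In this toy setup the scores were constructed to track `data_size`, so that factor ranks first. Running the same ranking separately on SEEN and UNSEEN subsets would mirror the scenario split described in the abstract, though a full decision tree considers deeper splits than this single-level sketch.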
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
