What Drives Performance in Multilingual Language Models?

Sina Bagheri Nezhad; Ameeta Agrawal

arXiv:2404.19159 · cs.CL · December 10, 2024

TL;DR

This paper analyzes the factors that affect multilingual language model performance across diverse languages. It finds that pretraining data size matters most for languages seen during pretraining, while script type and language family matter most for unseen languages.

Contribution

The paper provides a comprehensive analysis of the factors that influence MLLM performance across many languages, emphasizing the roles of pretraining data size, script type, and language family, and offers insights for future model development.

Findings

01

Pretraining data size is the most influential factor for seen languages.

02

Script type and language family are crucial for unseen languages.

03

Model size and architecture have little effect on which factors matter most for performance.

Abstract

This study investigates the factors influencing the performance of multilingual large language models (MLLMs) across diverse languages. We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset encompassing 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of factors such as pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. However, interestingly, script type and language family are crucial for UNSEEN languages,…
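The abstract's decision tree analysis can be illustrated with a minimal sketch. It is not the authors' code: it only scores each factor by the variance reduction of its best single split, the criterion a regression tree uses to choose a split, rather than fitting a full tree. The rows are invented toy values, and the field names (`data_pct`, `script`, `family`, `acc`) are hypothetical.

```python
from statistics import pvariance

# Toy, made-up rows (NOT results from the paper); one row per language.
# Hypothetical fields: pretraining data share (%), script, family, accuracy.
ROWS = [
    {"data_pct": 30.0, "script": "Latn", "family": "Indo-European", "acc": 0.9},
    {"data_pct": 20.0, "script": "Cyrl", "family": "Indo-European", "acc": 0.9},
    {"data_pct": 10.0, "script": "Latn", "family": "Turkic", "acc": 0.9},
    {"data_pct": 0.5, "script": "Cyrl", "family": "Turkic", "acc": 0.5},
    {"data_pct": 0.2, "script": "Latn", "family": "Indo-European", "acc": 0.5},
    {"data_pct": 0.1, "script": "Cyrl", "family": "Turkic", "acc": 0.5},
]


def variance_reduction(rows, in_left, target="acc"):
    """Drop in target variance when rows are split by the predicate in_left."""
    left = [r[target] for r in rows if in_left(r)]
    right = [r[target] for r in rows if not in_left(r)]
    if not left or not right:
        return 0.0  # a split that leaves one side empty explains nothing
    total = [r[target] for r in rows]
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(total)
    return pvariance(total) - weighted


def best_split_gain(rows, feature, target="acc"):
    """Best single-split variance reduction achievable with one feature."""
    values = sorted({r[feature] for r in rows})
    if all(isinstance(v, (int, float)) for v in values):
        # Numeric feature: try thresholds midway between distinct values.
        tests = [lambda r, t=(a + b) / 2: r[feature] <= t
                 for a, b in zip(values, values[1:])]
    else:
        # Categorical feature: try one-vs-rest splits.
        tests = [lambda r, v=v: r[feature] == v for v in values]
    return max((variance_reduction(rows, t, target) for t in tests), default=0.0)


def rank_features(rows, features, target="acc"):
    """Sort features from most to least informative at the first split."""
    gains = {f: best_split_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_features(ROWS, ["data_pct", "script", "family"])
for name, gain in ranking:
    print(f"{name}: {gain:.4f}")
```

On this toy data the data-size field ranks first by construction, mirroring the SEEN-language finding only as an illustration. A real analysis would fit full trees on the per-language results and inspect their feature importances.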

Code & Models

Repositories

portnlp/mllms_performance (Official)

Videos

What Drives Performance in Multilingual Language Models? · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling