Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models
Sina Bagheri Nezhad, Ameeta Agrawal, Rhitabrat Pokharel

TL;DR
This study identifies key factors such as token and country similarity, alongside data and model size, that significantly influence multilingual language model performance across diverse languages.
Contribution
The paper reveals additional critical factors affecting MLLM effectiveness beyond data quantity, emphasizing token and country similarity through extensive analysis.
Findings
Token similarity enhances cross-lingual transfer.
Country similarity impacts model performance.
Pre-train data and model size are important, but they are not the sole factors.
Abstract
Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity…
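The abstract's analysis approach (fit a regression model on per-language features, then attribute its predictions with SHAP values) can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature names, the linear weights, and the tiny background dataset below are all hypothetical, and the code computes exact Shapley values by enumerating feature coalitions rather than calling the SHAP library.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names, chosen to mirror the factors named in the abstract.
FEATURES = ["pretrain_data_pct", "model_size", "token_similarity", "country_similarity"]

# Hypothetical weights standing in for a fitted regression model.
WEIGHTS = [0.4, 0.2, 0.3, 0.1]


def predict(x):
    """Toy linear regressor: predicted task score for one language."""
    return sum(w * v for w, v in zip(WEIGHTS, x))


def shapley_values(model, x, background):
    """Exact Shapley values for one instance x.

    The value of a coalition S is the average prediction when the features
    in S are fixed to x and the rest are taken from each background row.
    """
    n = len(x)

    def value(subset):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Synthetic, made-up feature rows (one per language); not data from the paper.
background = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.1, 0.2],
    [0.9, 0.4, 0.5, 0.6],
]
x = [0.8, 0.5, 0.9, 0.3]

phi = shapley_values(predict, x, background)
for name, contribution in zip(FEATURES, phi):
    print(f"{name}: {contribution:+.3f}")
```

The attributions sum to the gap between the prediction for `x` and the average prediction over the background rows, which is the additivity property that makes per-feature contributions comparable across languages. Exact enumeration is only practical for a handful of features; with a larger feature set an approximate explainer would be needed.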
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus · Shapley Additive Explanations
