Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models

Sina Bagheri Nezhad; Ameeta Agrawal; Rhitabrat Pokharel

arXiv:2412.12500 · cs.CL · December 18, 2024 · 2 cites

TL;DR

This study identifies key factors such as token and country similarity, alongside data and model size, that significantly influence multilingual language model performance across diverse languages.

Contribution

Using regression models and SHAP values across 204 languages on SIB-200 (classification) and Flores-200 (machine translation), the paper shows that token similarity and country similarity influence MLLM performance in addition to pre-train data and model size.

Findings

1. Token similarity enhances cross-lingual transfer.
2. Country similarity impacts model performance.
3. Pre-train data and model size are important but not the sole factors.

Abstract

Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity…
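To make the "regression models and SHAP values" step concrete, here is a minimal sketch of exact Shapley-value attribution for a regression model. It is an illustration under stated assumptions, not the authors' pipeline: the feature names only echo the factors named in the abstract, the weights, bias, and input values are invented, and `predict` and `shapley_values` are hypothetical helpers written for this example.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names echoing the abstract. The weights and bias are
# invented for illustration and are NOT results from the paper.
FEATURES = ["pretrain_data_pct", "model_size", "token_similarity", "country_similarity"]
WEIGHTS = [0.5, 0.25, 1.0, 0.75]
BIAS = 0.125


def predict(row):
    """Toy linear regressor standing in for a fitted performance model."""
    return BIAS + sum(w * v for w, v in zip(WEIGHTS, row))


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline row.

    Features outside a coalition are set to their baseline value. The cost
    grows as 2^n, so this only suits a handful of features.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                values[i] += weight * (model(with_i) - model(without_i))
    return values


# Made-up feature rows: one language to explain and a reference point.
x = [0.2, 0.6, 0.8, 0.4]
baseline = [0.1, 0.5, 0.3, 0.3]

phi = shapley_values(predict, x, baseline)
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.4f}")
```

For a linear model like this toy one, each attribution reduces to the weight times the feature's deviation from the baseline, and the attributions sum to `predict(x) - predict(baseline)`. Exact enumeration is only practical for a few features; SHAP implementations generally rely on approximations or model-specific algorithms instead.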

Code & Models

Repositories

PortNLP/SHAP-MLLM-Analysis (official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus · Shapley Additive Explanations