Understanding the effects of language-specific class imbalance in multilingual fine-tuning
Vincent Jung, Lonneke van der Plas

TL;DR
This paper investigates how language-specific class imbalance in multilingual datasets affects fine-tuned transformer models, revealing negative impacts on performance and latent space separation, and proposes a class weighting method to mitigate these issues.
Contribution
It identifies the detrimental effects of language-specific class imbalance (an uneven distribution of labels across languages) in multilingual fine-tuning and introduces a language-specific class weighting technique to mitigate them.
Findings
Language-specific class imbalance worsens model performance and increases the separation of languages in the latent space.
Language-specific class weights mitigate negative effects.
Models tend to rely on language separation rather than informative features.
Abstract
We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.
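The abstract describes calculating class weights separately for each language but does not give the formula. The following is a minimal sketch of that idea, assuming the common "balanced" heuristic (number of samples divided by number of classes times class count) applied within each language rather than over the whole dataset. The function names, the toy data, and the choice of heuristic are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def language_specific_class_weights(labels, languages):
    """Return {(language, label): weight}.

    Assumption: the 'balanced' heuristic n_lang / (n_classes * n_lang_label),
    computed separately inside each language. The paper may use a different
    weighting formula.
    """
    pair_counts = Counter(zip(languages, labels))      # samples per (language, label)
    lang_counts = Counter(languages)                   # samples per language
    classes_per_lang = Counter(lang for lang, _ in pair_counts)  # labels seen per language
    return {
        (lang, label): lang_counts[lang] / (classes_per_lang[lang] * count)
        for (lang, label), count in pair_counts.items()
    }


def weighted_mean_loss(losses, labels, languages, weights):
    """Average per-example losses, scaling each by its (language, label) weight."""
    total = sum(
        weights[(lang, label)] * loss
        for loss, label, lang in zip(losses, labels, languages)
    )
    return total / len(losses)


# Toy dataset (hypothetical): English is mostly label 1, German is mostly label 0.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
languages = ["en"] * 4 + ["de"] * 4
weights = language_specific_class_weights(labels, languages)

# With a constant per-example loss, the weighted mean stays at that constant,
# because the weights within each language sum to that language's sample count.
example_loss = weighted_mean_loss([1.0] * 8, labels, languages, weights)
```

In this toy data the overall label distribution is balanced (four examples per label), so dataset-level class weights would leave every example unchanged, while the per-language weights up-weight the minority label within each language. A label that never occurs in a given language receives no weight entry in this sketch.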
Taxonomy
Topics: Linguistic Variation and Morphology · Linguistics, Language Diversity, and Identity · Stuttering Research and Treatment
