How do languages influence each other? Studying cross-lingual data sharing during LM fine-tuning
Rochelle Choenni, Dan Garrette, Ekaterina Shutova

TL;DR
This paper investigates how multilingual language models draw on training data from other languages during fine-tuning. It shows that the models rely on data from multiple languages from the early stages of fine-tuning, and that fine-tuning languages influence performance on a given test language in different ways.
Contribution
It applies a training data attribution method (TracIn) to analyze cross-lingual sharing at the data level, extending earlier work that studied sharing only at the level of model parameters.
Findings
MLLMs rely on multiple languages early in fine-tuning
Cross-lingual data sharing increases as training progresses
Different fine-tuning languages can reinforce or complement test language knowledge
Abstract
Multilingual large language models (MLLMs) are jointly trained on data from many different languages such that representation of individual languages can benefit from other languages' data. Impressive performance on zero-shot cross-lingual transfer shows that these models are capable of exploiting data from other languages. Yet, it remains unclear to what extent, and under which conditions, languages rely on each other's data. In this study, we use TracIn (Pruthi et al., 2020), a training data attribution (TDA) method, to retrieve the most influential training samples seen during multilingual fine-tuning for a particular test language. This allows us to analyse cross-lingual sharing mechanisms of MLLMs from a new perspective. While previous work studied cross-lingual sharing at the level of model parameters, we present the first approach to study cross-lingual sharing at the data level.…
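To make the attribution idea concrete, here is a minimal sketch of a checkpoint-based TracIn score: the influence of a training sample on a test sample is approximated by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients. The toy logistic-regression model, the language labels, the checkpoint weights, and the helper names (`loss_grad`, `tracin_score`, `top_k_language_share`) are all assumptions made for illustration; they are not the paper's model, data, or code, and the paper may use a different variant of the score.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(w, x, y):
    """Gradient of the logistic loss for one example (x, y) w.r.t. weights w."""
    return (sigmoid(x @ w) - y) * x

def tracin_score(checkpoints, lrs, x_train, y_train, x_test, y_test):
    """Sum over checkpoints of lr * <grad(train), grad(test)>."""
    return sum(
        lr * float(loss_grad(w, x_train, y_train) @ loss_grad(w, x_test, y_test))
        for w, lr in zip(checkpoints, lrs)
    )

def top_k_language_share(scores, langs, k):
    """Fraction of each language among the k highest-scoring training samples."""
    top = np.argsort(scores)[::-1][:k]
    names, counts = np.unique(np.asarray(langs)[top], return_counts=True)
    return {str(n): int(c) / k for n, c in zip(names, counts)}

# Hypothetical toy data: six training samples tagged with a language.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.array([1, 0, 1, 0, 1, 0], dtype=float)
langs = ["en", "en", "de", "de", "fr", "fr"]

# Hypothetical checkpoints (weights saved during training) and their learning rates.
checkpoints = [np.zeros(3), np.full(3, 0.1)]
lrs = [0.1, 0.1]

# Use the first training sample as the "test" sample, so its own score is a self-influence.
x_test, y_test = X[0], y[0]
scores = np.array([
    tracin_score(checkpoints, lrs, X[i], y[i], x_test, y_test)
    for i in range(len(X))
])
share = top_k_language_share(scores, langs, k=3)
print(share)
```

Positive scores mark training samples whose gradient steps would reduce the test loss, and negative scores mark samples that would increase it. Counting languages among the top-ranked samples, as `top_k_language_share` does, is one simple way to summarize how much a test language draws on other languages' data.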
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
Methods: Test
