Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models
Terra Blevins, Hila Gonen, Luke Zettlemoyer

TL;DR
This paper investigates the dynamics of multilingual pretraining in models like XLM-R, revealing how in-language and cross-lingual abilities develop at different stages and layers during training.
Contribution
It analyzes pretraining dynamics by probing checkpoints taken throughout XLM-R pretraining, showing when in-language linguistic skills and cross-lingual transfer abilities emerge and how they shift across layers over time.
Findings
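High in-language performance emerges early in pretraining, with lower-level linguistic skills acquired before more complex ones.
The point in pretraining at which cross-lingual transfer emerges differs across language pairs.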
Final layer performance degrades over time, with linguistic knowledge moving to lower layers.
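The layer-wise finding above can be made concrete with a small sketch. It assumes probe accuracies have already been collected into a grid indexed by checkpoint and layer; the function name `summarize_probe_grid` and the numbers in `toy_acc` are hypothetical and invented for illustration, not results from the paper.

```python
def summarize_probe_grid(acc):
    """Summarize a grid where acc[c][l] is probe accuracy at checkpoint c, layer l.

    The last index of each row is treated as the final model layer.
    Returns the best-performing layer index per checkpoint, and how far the
    final layer's accuracy at the last checkpoint sits below its own peak.
    """
    best_layers = [max(range(len(row)), key=row.__getitem__) for row in acc]
    final_layer_curve = [row[-1] for row in acc]
    final_drop = max(final_layer_curve) - final_layer_curve[-1]
    return best_layers, final_drop


# Toy numbers, invented for illustration only; NOT taken from the paper.
toy_acc = [
    [0.40, 0.55, 0.60, 0.70],  # early checkpoint
    [0.50, 0.70, 0.80, 0.85],  # middle checkpoint
    [0.55, 0.80, 0.88, 0.75],  # late checkpoint
]

best_layers, final_drop = summarize_probe_grid(toy_acc)
print(best_layers, round(final_drop, 2))
```

In this toy grid the best layer moves from the final layer to the one below it at the last checkpoint, which is the kind of pattern the finding describes.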
Abstract
The emergent cross-lingual transfer seen in multilingual pretrained models has sparked significant interest in studying their behavior. However, because these analyses have focused on fully trained multilingual models, little is known about the dynamics of the multilingual pretraining process. We investigate when these models acquire their in-language and cross-lingual abilities by probing checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks. Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones. In contrast, the point in pretraining when the model learns to transfer cross-lingually differs across language pairs. Interestingly, we also observe that, across many languages and tasks, the final model layer exhibits significant performance degradation over time,…
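As a rough illustration of what probing a checkpoint involves, the sketch below fits a linear probe on frozen features and reports held-out accuracy. It makes several assumptions: the features are synthetic stand-ins rather than XLM-R activations, the probe is a ridge-regression classifier chosen for brevity (this summary does not specify the paper's probe architecture or tasks), and the names `fit_linear_probe`, `probe_accuracy`, and `fake_features` are hypothetical.

```python
import numpy as np


def fit_linear_probe(train_x, train_y, num_classes, l2=1e-3):
    """Fit a ridge-regression probe onto one-hot labels (closed form)."""
    x = np.hstack([train_x, np.ones((train_x.shape[0], 1))])  # add bias column
    targets = np.eye(num_classes)[train_y]
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_accuracy(weights, test_x, test_y):
    """Accuracy of the probe's argmax prediction on held-out examples."""
    x = np.hstack([test_x, np.ones((test_x.shape[0], 1))])
    preds = (x @ weights).argmax(axis=1)
    return float((preds == test_y).mean())


def fake_features(labels, centers, rng):
    """Synthetic stand-in for frozen layer activations: class center plus noise."""
    return centers[labels] + rng.normal(size=(labels.size, centers.shape[1]))


rng = np.random.default_rng(0)
num_classes, dim = 3, 16
train_y = rng.integers(num_classes, size=300)
test_y = rng.integers(num_classes, size=100)

# Two hypothetical "layers": one encoding nothing about the labels,
# one encoding them strongly. A real study would loop over checkpoints and layers.
results = {}
for name, separation in [("weak", 0.0), ("strong", 5.0)]:
    centers = separation * rng.normal(size=(num_classes, dim))
    weights = fit_linear_probe(fake_features(train_y, centers, rng), train_y, num_classes)
    results[name] = probe_accuracy(weights, fake_features(test_y, centers, rng), test_y)

print(results)
```

Because the probe is kept deliberately simple and the encoder is frozen, differences in accuracy can be attributed mostly to what the representations already encode, which is the usual motivation for this kind of analysis.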
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: XLM-R
