First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
Benjamin Muller, Yanai Elazar, Benoît Sagot, Djamé Seddah

TL;DR
This paper investigates how multilingual BERT achieves zero-shot cross-lingual transfer. It shows that the model can be viewed as a multilingual encoder followed by a task-specific, language-agnostic predictor, with the encoder being the component that matters for transfer.
Contribution
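It introduces a novel layer ablation technique and uses it, together with analyses of the model's internal representations, to show that the encoder is crucial for cross-lingual transfer, while the predictor can be reinitialized during fine-tuning with little effect on that transfer.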
Findings
The encoder remains stable during fine-tuning and is crucial for transfer.
The task predictor can be reinitialized during fine-tuning with little impact on cross-lingual transfer.
Multilingual BERT can be viewed as two stacked sub-networks: a multilingual encoder followed by a task-specific, language-agnostic predictor.
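As a rough illustration of what reinitializing one sub-network means, the sketch below treats a model as a stack of 12 layers (the depth of multilingual BERT) and redraws the weights of a chosen block of layers while leaving the rest untouched. It is a toy stand-in, not the authors' implementation: the helper names `make_model` and `reinit_layers`, the layer boundary `split = 8`, and the standard deviation `0.02` are all assumptions made for the example, and each "layer" is just a short list of floats.

```python
import random


def make_model(num_layers=12, width=4, seed=0):
    """Toy stand-in for a pretrained model: one small weight vector per layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_layers)]


def reinit_layers(layers, indices, seed=1, std=0.02):
    """Return a copy of `layers` in which the layers at `indices` are redrawn at random.

    Layers outside `indices` are copied unchanged, so the pretrained
    weights of the other sub-network are preserved.
    """
    rng = random.Random(seed)
    targets = set(indices)
    return [
        [rng.gauss(0.0, std) for _ in layer] if i in targets else list(layer)
        for i, layer in enumerate(layers)
    ]


pretrained = make_model()
split = 8  # hypothetical boundary between "encoder" and "predictor" layers

# Keep the lower (encoder) layers, reinitialize the upper (predictor) layers.
keep_encoder = reinit_layers(pretrained, range(split, len(pretrained)))

# The opposite ablation: keep the upper layers, reinitialize the lower ones.
keep_predictor = reinit_layers(pretrained, range(0, split))
```

In an experiment along the lines the summary describes, each ablated model would then be fine-tuned on the source language and evaluated on a different target language; comparing the two ablations indicates which block of layers the transfer depends on.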
Abstract
Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with…
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Language Development and Disorders
Methods: Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Attention Is All You Need · Residual Connection · Dense Connections · Adam · Linear Warmup With Linear Decay
