First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT

Benjamin Muller; Yanai Elazar; Benoît Sagot; Djamé Seddah

arXiv:2101.11109·cs.CL·January 28, 2021

TL;DR

This paper investigates how multilingual BERT achieves zero-shot cross-lingual transfer. It shows that the model can be viewed as a multilingual encoder followed by a task-specific, language-agnostic predictor, and that the encoder is the component that matters for transfer.

Contribution

It introduces a novel layer ablation technique and shows that the encoder is crucial for cross-lingual transfer, while the predictor has little importance for the transfer and can be reinitialized during fine-tuning.

Findings

01

The encoder remains mostly unchanged during fine-tuning and is crucial for cross-lingual transfer.

02

The task predictor has little importance for cross-lingual transfer and can be reinitialized during fine-tuning.

03

Multilingual BERT can be viewed as two stacked sub-networks: a multilingual encoder followed by a task-specific, language-agnostic predictor.

Abstract

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with…
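One way to picture the layer ablation idea described in the abstract is to replace the pretrained weights of a chosen block of layers with fresh random values before fine-tuning, then compare cross-lingual scores against the intact model. The sketch below is only a toy illustration of that bookkeeping in plain Python, not the authors' implementation (see the repository listed under Code & Models for that). The names `make_model` and `reinitialize_layers`, the 12-layer stack, the split at layer 8, and the 0.02 standard deviation are all assumptions made for the example.

```python
import copy
import random


def make_model(num_layers=12, width=4, seed=0):
    """Toy stand-in for a pretrained layer stack: one weight vector per layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_layers)]


def reinitialize_layers(pretrained, layer_ids, seed=1, std=0.02):
    """Return a copy of `pretrained` with the selected layers redrawn at random.

    The pretrained weights of every other layer are kept as they are.
    The value of `std` is an assumption chosen for illustration.
    """
    rng = random.Random(seed)
    model = copy.deepcopy(pretrained)
    for i in layer_ids:
        model[i] = [rng.gauss(0.0, std) for _ in model[i]]
    return model


pretrained = make_model()

# Hypothetical split: treat layers 0-7 as the "encoder" block and
# layers 8-11 as the "predictor" block. The real boundary is an
# empirical question and is not taken from the source.
ablated_predictor = reinitialize_layers(pretrained, range(8, 12))
ablated_encoder = reinitialize_layers(pretrained, range(0, 8))

# Layers that still carry their pretrained weights in each variant.
kept_after_predictor_ablation = [
    i for i in range(12) if ablated_predictor[i] == pretrained[i]
]
kept_after_encoder_ablation = [
    i for i in range(12) if ablated_encoder[i] == pretrained[i]
]
```

In an actual experiment, each variant would then be fine-tuned on the source language and evaluated on the target languages; the abstract's claim corresponds to the predictor-ablated variant losing little transfer performance while the encoder-ablated variant loses much more.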

Code & Models

Repositories

benjamin-mlr/first-align-then-predict
PyTorch · Official

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Language Development and Disorders

Methods: Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Attention Is All You Need · Residual Connection · Dense Connections · Adam · Linear Warmup With Linear Decay