On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning
Marc Tanti, Lonneke van der Plas, Claudia Borg, Albert Gatt

TL;DR
This paper investigates how fine-tuning affects multilingual BERT's language-specific and language-neutral knowledge, showing that fine-tuning reorganizes representations to favor language-independent features at the cost of language-specific ones.
Contribution
It provides a detailed analysis of the impact of fine-tuning on mBERT's language representations and explores methods to unlearn language-specific features.
Findings
Fine-tuning reduces mBERT's ability to cluster by language.
Language identification accuracy drops after fine-tuning.
Unlearning methods (gradient reversal and iterative adversarial learning) do not improve the language-independent component beyond what fine-tuning already achieves.
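Gradient reversal, one of the unlearning methods named in the abstract, can be illustrated with a minimal framework-free sketch. This is not the authors' implementation. The class name `GradientReversal`, the scaling factor `lam`, and the toy numbers in `demo` are all assumptions made for illustration. The idea shown is only the core mechanism: the layer passes features through unchanged in the forward pass, then flips the sign of the gradient coming back from a language classifier, so the encoder is pushed to make language harder to predict.

```python
class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Features reach the language classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # Gradients flowing back to the encoder are negated and scaled by lam.
        return [-self.lam * g for g in grad_output]


def demo():
    # Toy values only; they do not come from the paper.
    layer = GradientReversal(lam=0.5)
    features = [0.2, -1.0, 3.0]
    passed_through = layer.forward(features)

    # Pretend this is the gradient of the language-classification loss
    # with respect to the layer's output.
    grad_from_language_classifier = [1.0, -2.0, 4.0]
    grad_to_encoder = layer.backward(grad_from_language_classifier)
    return passed_through, grad_to_encoder


if __name__ == "__main__":
    passed_through, grad_to_encoder = demo()
    print(passed_through)
    print(grad_to_encoder)
```

In an automatic-differentiation framework the same behaviour would normally be written as a custom backward rule rather than by hand. The explicit `forward` and `backward` methods here only make the sign flip visible.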
Abstract
Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks -- POS tagging and natural language inference -- which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on 'unlearning' language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · mBERT · Layer Normalization · Linear Warmup With Linear Decay · Weight Decay · Adam · Residual Connection · Multi-Head Attention
