On the Prunability of Attention Heads in Multilingual BERT
Aakriti Budhraja, Madhura Pande, Pratyush Kumar, Mitesh M. Khapra

TL;DR
This paper uses pruning to study the robustness and layer-wise importance of multilingual BERT (mBERT). On four GLUE tasks, mBERT is about as robust to pruning as monolingual BERT, but it is less robust on the crosslingual task XNLI, and the importance of each encoder layer depends on language family and pre-training corpus size.
Contribution
A systematic, pruning-based analysis of the robustness and layer-wise importance of mBERT, showing that layer importance is language-dependent and identifying where mBERT and monolingual BERT differ.
Findings
Pruning causes almost identical relative accuracy drops in mBERT and BERT on four GLUE tasks.
Pruning causes larger accuracy drops on the crosslingual task XNLI, indicating lower robustness in crosslingual transfer.
Layer importance varies with language family and pre-training corpus size.
Abstract
Large multilingual models, such as mBERT, have shown promise in crosslingual transfer. In this work, we employ pruning to quantify the robustness and interpret layer-wise importance of mBERT. On four GLUE tasks, the relative drops in accuracy due to pruning are almost identical for mBERT and BERT, suggesting that the reduced attention capacity of the multilingual models does not affect robustness to pruning. For the crosslingual task XNLI, we report higher drops in accuracy with pruning, indicating lower robustness in crosslingual transfer. Also, the importance of the encoder layers sensitively depends on the language family and the pre-training corpus size. The top layers, which are relatively more influenced by fine-tuning, encode important information for languages similar to English (SVO) while the bottom layers, which are relatively less influenced by fine-tuning, are…
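The kind of measurement the abstract describes (prune attention heads, then compare accuracy before and after) can be sketched as follows. This is a minimal illustration and not the authors' procedure: the text above does not say how heads were selected or how pruning was implemented, so random selection and output masking are assumptions. The function names are hypothetical. The 12 x 12 layout matches the base-size BERT and mBERT architectures.

```python
import numpy as np

NUM_LAYERS = 12  # encoder layers in base-size BERT / mBERT
NUM_HEADS = 12   # attention heads per layer


def random_head_mask(num_layers, num_heads, prune_fraction, rng):
    """Return a (num_layers, num_heads) mask: 1.0 = keep head, 0.0 = pruned.

    Assumption: heads are chosen uniformly at random over the whole model.
    """
    total = num_layers * num_heads
    num_pruned = int(round(prune_fraction * total))
    mask = np.ones(total, dtype=np.float32)
    pruned = rng.choice(total, size=num_pruned, replace=False)
    mask[pruned] = 0.0
    return mask.reshape(num_layers, num_heads)


def apply_head_mask(head_outputs, layer_mask):
    """Zero the outputs of pruned heads in one layer.

    head_outputs: (batch, heads, seq_len, head_dim)
    layer_mask:   (heads,)
    """
    return head_outputs * layer_mask[None, :, None, None]


def relative_drop(baseline_acc, pruned_acc):
    """Relative accuracy drop, in percent of the unpruned accuracy."""
    return 100.0 * (baseline_acc - pruned_acc) / baseline_acc


rng = np.random.default_rng(0)
mask = random_head_mask(NUM_LAYERS, NUM_HEADS, prune_fraction=0.5, rng=rng)

# Toy activations for a single layer, standing in for real per-head outputs.
toy_outputs = np.ones((2, NUM_HEADS, 4, 8), dtype=np.float32)
masked_outputs = apply_head_mask(toy_outputs, mask[0])

# Hypothetical accuracies for illustration only, not results from the paper.
example_drop = relative_drop(80.0, 72.0)
```

Reporting the drop relative to each model's own unpruned accuracy is what makes mBERT and BERT comparable even when their baselines differ.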
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Attention Is All You Need · Pruning · Linear Layer · mBERT · Attention Dropout · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Softmax
