Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks
Weicheng Ma, Kai Zhang, Renze Lou, Lili Wang, Soroush Vosoughi

TL;DR
This study investigates the importance of attention heads in Transformer models for cross-lingual and multi-lingual tasks. It finds that pruning heads can improve performance and that the heads to prune can be identified by ranking them with gradient-based methods.
Contribution
It demonstrates that pruning attention heads in multi-lingual Transformers can enhance performance and introduces a gradient-based ranking method to identify heads to prune.
Findings
Pruning attention heads generally improves cross-lingual and multi-lingual task performance.
Gradient-based ranking effectively identifies heads for pruning.
Results are consistent across mBERT and XLM-R models on multiple languages.
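As a rough illustration of what gradient-based ranking can look like, here is a minimal sketch on a toy model. It assumes each head's importance is the mean absolute gradient of the loss with respect to a per-head mask, which is a common formulation; the summary above does not spell out the paper's exact scoring, so the toy model, the loss, and the names head_importance and rank_heads are all hypothetical.

```python
from typing import List

def head_importance(contribs: List[List[float]], targets: List[float]) -> List[float]:
    """Mean |dL/dm_h| per head for a toy model (an assumption, not the paper's code).

    Toy model: y_hat = sum_h m_h * contribs[i][h], with every mask m_h = 1.
    Toy loss:  L = 0.5 * (y_hat - y)^2, so dL/dm_h = (y_hat - y) * contribs[i][h].
    """
    n_heads = len(contribs[0])
    scores = [0.0] * n_heads
    for row, y in zip(contribs, targets):
        y_hat = sum(row)              # forward pass with all heads active
        err = y_hat - y               # dL/dy_hat
        for h, c in enumerate(row):
            scores[h] += abs(err * c) # accumulate |dL/dm_h|
    return [s / len(contribs) for s in scores]

def rank_heads(scores: List[float]) -> List[int]:
    """Head indices sorted from least to most important (first = prune first)."""
    return sorted(range(len(scores)), key=lambda h: scores[h])

# Hypothetical per-example head contributions (2 examples, 3 heads).
contribs = [[1.0, 0.0, 0.5],
            [2.0, 0.1, -0.5]]
targets = [1.0, 2.0]

scores = head_importance(contribs, targets)
ranking = rank_heads(scores)
print(ranking)
```

In a real Transformer the mask gradient would come from automatic differentiation rather than a closed form, but the ranking step stays the same: heads with the smallest scores are the first pruning candidates.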
Abstract
This paper studies the relative importance of attention heads in Transformer-based models to aid their interpretability in cross-lingual and multi-lingual tasks. Prior research has found that only a few attention heads are important in each mono-lingual Natural Language Processing (NLP) task and pruning the remaining heads leads to comparable or improved performance of the model. However, the impact of pruning attention heads is not yet clear in cross-lingual and multi-lingual tasks. Through extensive experiments, we show that (1) pruning a number of attention heads in a multi-lingual Transformer-based model has, in general, positive effects on its performance in cross-lingual and multi-lingual tasks and (2) the attention heads to be pruned can be ranked using gradients and identified with a few trial experiments. Our experiments focus on sequence labeling tasks, with potential…
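The abstract's point that the heads to prune can be "identified with a few trial experiments" might be pictured as the small search below. This is only a sketch under assumptions: the ranking is taken as given, choose_prune_count and toy_dev_score are hypothetical names, and the toy scoring function stands in for a real dev-set evaluation of a pruned model.

```python
from typing import Callable, List, Set, Tuple

def choose_prune_count(
    ranking: List[int],
    evaluate: Callable[[Set[int]], float],
    candidates: List[int],
) -> Tuple[int, float]:
    """Try pruning the k lowest-ranked heads for a few values of k.

    Returns the k with the best score, falling back to k = 0 (no pruning)
    when no candidate beats the unpruned model.
    """
    best_k, best_score = 0, evaluate(set())
    for k in candidates:
        score = evaluate(set(ranking[:k]))  # prune the k least important heads
        if score > best_score:              # strict '>' keeps the smaller k on ties
            best_k, best_score = k, score
    return best_k, best_score

def toy_dev_score(pruned: Set[int]) -> float:
    """Hypothetical dev metric: heads 1 and 2 hurt slightly, heads 0 and 3 help."""
    score = 0.80
    for h in pruned:
        score += 0.01 if h in (1, 2) else -0.05
    return score

ranking = [1, 2, 3, 0]  # least important first (assumed, e.g. from gradient scores)
best_k, best_score = choose_prune_count(ranking, toy_dev_score, candidates=[1, 2, 3])
print(best_k, round(best_score, 2))
```

Keeping the candidate list short is the design choice here: each trial means one evaluation of a pruned model, so only a handful of values of k are tested rather than every possible subset of heads.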
Taxonomy
Methods: Attention Is All You Need · Pruning · XLM-R · Linear Layer · Multi-Head Attention · WordPiece · Softmax · Residual Connection · Attention Dropout · Layer Normalization
