Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

Weicheng Ma; Kai Zhang; Renze Lou; Lili Wang; Soroush Vosoughi

arXiv:2108.08375·cs.CL·August 20, 2021

TL;DR

This study investigates the importance of attention heads in multi-lingual Transformer models on cross-lingual and multi-lingual tasks. It finds that pruning heads can improve performance and that the heads to prune can be identified by ranking them with gradients.

Contribution

The paper shows that pruning attention heads in multi-lingual Transformers can improve performance, and that ranking heads by their gradients identifies which ones to prune.

Findings

01

Pruning attention heads generally improves cross-lingual and multi-lingual task performance.

02

Gradient-based ranking effectively identifies heads for pruning.

03

Results are consistent across mBERT and XLM-R models on multiple languages.

Abstract

This paper studies the relative importance of attention heads in Transformer-based models to aid their interpretability in cross-lingual and multi-lingual tasks. Prior research has found that only a few attention heads are important in each mono-lingual Natural Language Processing (NLP) task and pruning the remaining heads leads to comparable or improved performance of the model. However, the impact of pruning attention heads is not yet clear in cross-lingual and multi-lingual tasks. Through extensive experiments, we show that (1) pruning a number of attention heads in a multi-lingual Transformer-based model has, in general, positive effects on its performance in cross-lingual and multi-lingual tasks and (2) the attention heads to be pruned can be ranked using gradients and identified with a few trial experiments. Our experiments focus on sequence labeling tasks, with potential…
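The idea of ranking heads by gradients and then pruning the lowest-ranked ones can be illustrated with a minimal sketch. This is not the authors' implementation and does not use mBERT or XLM-R: it assumes a toy linear model in which each "head" contributes `weight * head_output`, attaches a mask value to each head, and scores a head by the mean absolute gradient of a squared-error loss with respect to its mask. All names (`WEIGHTS`, `mask_gradients`, `rank_heads`, `prune`) and the synthetic data are hypothetical.

```python
import random

# Hypothetical toy setup: 4 "heads", each contributing weight * head_output.
WEIGHTS = [1.0, 0.05, 0.8, 0.02]


def mask_gradients(head_outputs, target, mask):
    """Gradient of the squared error with respect to each head's mask value."""
    pred = sum(m * w * h for m, w, h in zip(mask, WEIGHTS, head_outputs))
    err = pred - target
    # d/d mask_i of (pred - target)^2 = 2 * err * w_i * h_i
    return [2.0 * err * w * h for w, h in zip(WEIGHTS, head_outputs)]


def rank_heads(examples):
    """Return head indices sorted from least to most important, plus scores."""
    n = len(WEIGHTS)
    full_mask = [1.0] * n  # gradients are taken with every head active
    scores = [0.0] * n
    for head_outputs, target in examples:
        for i, g in enumerate(mask_gradients(head_outputs, target, full_mask)):
            scores[i] += abs(g) / len(examples)
    order = sorted(range(n), key=lambda i: scores[i])
    return order, scores


def prune(order, k):
    """Build a 0/1 mask that switches off the k lowest-ranked heads."""
    mask = [1.0] * len(order)
    for i in order[:k]:
        mask[i] = 0.0
    return mask


# Synthetic data (an assumption): the target depends only on heads 0 and 2.
rng = random.Random(0)
examples = []
for _ in range(50):
    h = [rng.uniform(0.5, 1.5) for _ in WEIGHTS]
    examples.append((h, 1.2 * h[0] + 1.0 * h[2]))

order, scores = rank_heads(examples)
mask = prune(order, k=2)
print("least to most important:", order)
print("mask after pruning 2 heads:", mask)
```

The number of heads to prune, `k`, is fixed by hand here. In a real setting it would have to be chosen empirically, for example by trying several values and comparing validation performance, which is one way to read the abstract's "a few trial experiments".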

Taxonomy

Methods: Attention Is All You Need · Pruning · XLM-R · Linear Layer · Multi-Head Attention · WordPiece · Softmax · Residual Connection · Attention Dropout · Layer Normalization