Finding the Pillars of Strength for Multi-Head Attention
Jinjie Ni, Rui Mao, Zonglin Yang, Han Lei, Erik Cambria

TL;DR
This paper introduces Grouped Head Attention, which pairs a self-supervised group constraint with a voting-based pruning procedure (Voting-to-Stay) to reduce redundancy in Multi-Head Attention (MHA), yielding more efficient transformers with improved performance.
Contribution
It proposes a novel grouping and pruning approach for MHA that enhances efficiency and effectiveness by focusing on essential, distinctive features.
Findings
Significant performance improvements on three tasks.
Substantial parameter reduction in transformers.
Effective removal of redundant attention heads.
Abstract
Recent studies have revealed some issues of Multi-Head Attention (MHA), e.g., redundancy and over-parameterization. Specifically, the heads of MHA were originally designed to attend to information from different representation subspaces, whereas prior studies found that some attention heads likely learn similar features and can be pruned without harming performance. Inspired by minimum-redundancy feature selection, we assume that focusing on the most representative and distinctive features with minimum resources can mitigate the above issues and lead to more effective and efficient MHAs. In particular, we propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads, where each group focuses on an essential but distinctive feature subset. We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a…
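To make the notion of redundant heads concrete, here is a minimal sketch, not the paper's method. It replaces the self-supervised group constraint and the Voting-to-Stay procedure with a much simpler stand-in: heads are grouped greedily by the cosine similarity of their flattened attention maps, and one representative per group is kept. The function names, the `threshold` parameter, and the toy attention maps are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_heads(head_maps, threshold=0.9):
    """Greedily place each head in the first group whose representative it resembles.

    head_maps: list of flattened attention maps, one per head (hypothetical input).
    Returns a list of groups, each a list of head indices; the first index in a
    group serves as that group's representative.
    """
    groups = []
    for idx, vec in enumerate(head_maps):
        for group in groups:
            if cosine(head_maps[group[0]], vec) >= threshold:
                group.append(idx)
                break
        else:
            # No existing group is similar enough, so this head starts a new one.
            groups.append([idx])
    return groups


def heads_to_keep(groups):
    """Keep one representative per group; the remaining heads count as redundant."""
    return [group[0] for group in groups]


# Toy data (assumed): heads 0 and 2 attend almost identically, head 1 differs.
toy_maps = [
    [0.70, 0.20, 0.10, 0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80, 0.30, 0.30, 0.40],
    [0.68, 0.22, 0.10, 0.12, 0.78, 0.10],
]
groups = group_heads(toy_maps)
kept = heads_to_keep(groups)
print(groups, kept)
```

In this toy case heads 0 and 2 land in the same group, so only heads 0 and 1 would be kept. The approach described in the abstract differs in an important way: it learns the grouping during training and decides which heads stay by voting, rather than applying a fixed similarity threshold after the fact.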
Taxonomy
Topics: Brain Tumor Detection and Classification · EEG and Brain-Computer Interfaces · Multimodal Machine Learning Applications
Methods: Softmax · Linear Layer
