Finding the Pillars of Strength for Multi-Head Attention

Jinjie Ni; Rui Mao; Zonglin Yang; Han Lei; Erik Cambria

arXiv:2305.14380 · cs.LG · October 17, 2023 · 1 cites

TL;DR

This paper introduces Grouped Head Attention with a self-supervised grouping and a voting-based pruning method to reduce redundancy in Multi-Head Attention, leading to more efficient transformers with improved performance.

Contribution

It proposes a novel grouping and pruning approach for MHA that enhances efficiency and effectiveness by focusing on essential, distinctive features.

Findings

01 Significant performance improvements on three tasks.

02 Substantial parameter reduction in transformers.

03 Effective removal of redundant attention heads.

Abstract

Recent studies have revealed some issues of Multi-Head Attention (MHA), e.g., redundancy and over-parameterization. Specifically, the heads of MHA were originally designed to attend to information from different representation subspaces, whereas prior studies found that some attention heads likely learn similar features and can be pruned without harming performance. Inspired by minimum-redundancy feature selection, we assume that focusing on the most representative and distinctive features with minimum resources can mitigate the above issues and lead to more effective and efficient MHAs. In particular, we propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads, where each group focuses on an essential but distinctive feature subset. We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a…
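
The abstract is truncated here and does not spell out how the group constraint or Voting-to-Stay work, so the following is only a rough illustration of the general idea of grouping similar heads and keeping one per group by vote. It is not the authors' method. The function names (`group_heads`, `vote_to_keep`), the cosine-similarity threshold, the greedy grouping rule, and the per-batch importance scores are all assumptions made for this sketch.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_heads(head_features, threshold=0.9):
    """Greedily place each head in the first group whose first member is similar.

    Hypothetical stand-in for a learned, self-supervised grouping.
    """
    groups = []
    for idx, feat in enumerate(head_features):
        for group in groups:
            if cosine(feat, head_features[group[0]]) >= threshold:
                group.append(idx)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append([idx])
    return groups


def vote_to_keep(groups, batch_scores):
    """Keep one head per group: the head that the most batches rank highest.

    batch_scores[b][h] is an assumed importance score of head h on batch b.
    """
    kept = []
    for group in groups:
        votes = Counter(
            max(group, key=lambda h: scores[h]) for scores in batch_scores
        )
        kept.append(votes.most_common(1)[0][0])
    return kept


# Toy data: four heads, where heads 0/1 and heads 2/3 learn similar features.
head_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
# Assumed importance of each head on three batches.
batch_scores = [
    [0.2, 0.8, 0.5, 0.4],
    [0.3, 0.7, 0.6, 0.1],
    [0.9, 0.1, 0.2, 0.3],
]

groups = group_heads(head_features)
kept = vote_to_keep(groups, batch_scores)
print(groups)  # [[0, 1], [2, 3]]
print(kept)    # [1, 2]
```

On this toy input, the two pairs of near-duplicate heads collapse into two groups, and one head survives per group. A real implementation would operate on learned attention features inside the training loop rather than on hand-written vectors.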

Taxonomy

Topics: Brain Tumor Detection and Classification · EEG and Brain-Computer Interfaces · Multimodal Machine Learning Applications

Methods: Softmax · Linear Layer