Interactive Multi-Head Self-Attention with Linear Complexity
Hankyul Kang, Ming-Hsuan Yang, Jongbin Ryu

TL;DR
This paper introduces an efficient multi-head self-attention method that captures cross-head interactions with linear complexity by decomposing the attention operation into query- and key-less components, and it reports better results than existing efficient attention methods.
Contribution
It proposes a decomposition-based approach to enable cross-head interactions in self-attention with linear complexity, enhancing information flow and model efficiency.
Findings
Outperforms existing efficient attention methods.
Achieves better results on benchmark tasks.
Reduces computational complexity significantly.
Abstract
We propose an efficient interactive method for multi-head self-attention via decomposition. For existing methods using multi-head self-attention, the attention operation of each head is computed independently. However, we show that the interactions between cross-heads of the attention matrix enhance the information flow of the attention operation. Considering that the attention matrix of each head can be seen as a feature of networks, it is beneficial to establish connectivity between them to capture interactions better. However, a straightforward approach to capture the interactions between the cross-heads is computationally prohibitive as the complexity grows substantially with the high dimension of an attention matrix. In this work, we propose an effective method to decompose the attention operation into query- and key-less components. This will result in a more manageable size for…
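To make the cost argument in the abstract concrete, here is a minimal NumPy sketch of standard multi-head self-attention, where each head is computed independently, plus a naive way to let heads interact by mixing their full attention maps. This is not the paper's decomposition method, whose details are cut off in the abstract above. The function name `multi_head_attention`, the `head_mix` parameter, and all shapes are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, num_heads, head_mix=None):
    """Standard multi-head self-attention (illustrative sketch).

    x: (N, D) token features; w_q, w_k, w_v: (D, D) projections.
    head_mix: optional (H, H) matrix that mixes attention maps across
    heads. This naive mixing is a hypothetical stand-in for cross-head
    interaction, not the method proposed in the paper.
    """
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # (N, D) -> (H, N, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Each head builds its own (N, N) attention map, independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, N, N)
    attn = softmax(scores, axis=-1)

    if head_mix is not None:
        # Naive cross-head interaction: every output map is a weighted
        # sum of all H input maps. This touches H * H * N * N entries,
        # so the cost grows quadratically with sequence length N.
        attn = np.einsum("gh,hnm->gnm", head_mix, attn)

    out = attn @ v  # (H, N, d_head)
    return out.transpose(1, 0, 2).reshape(n, d), attn


rng = np.random.default_rng(0)
n, d, h = 6, 8, 2
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

# Independent heads (the baseline the abstract describes).
out, attn = multi_head_attention(x, w_q, w_k, w_v, h)

# Naive cross-head mixing with an averaging matrix (assumed for the demo).
mix = np.full((h, h), 1.0 / h)
out_mixed, attn_mixed = multi_head_attention(x, w_q, w_k, w_v, h, head_mix=mix)
```

The mixing step operates directly on the (H, N, N) attention tensor, which is why the abstract calls a straightforward approach computationally prohibitive: the tensor itself is quadratic in the number of tokens before any interaction is applied.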
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Data Stream Mining Techniques
