Exploring Attention Map Reuse for Efficient Transformer Neural Networks
Kyuhong Shim, Jungwook Choi, Wonyong Sung

TL;DR
This paper investigates attention map reuse in Transformer models, demonstrating its effectiveness in accelerating inference and reducing computational costs for long sequences on CPU and GPU platforms.
Contribution
It provides a comprehensive analysis of attention map reuse, comparing it with other self-attention (SA) compression techniques and breaking down its advantages for long-sequence processing.
Findings
Attention map reuse significantly speeds up inference.
It reduces memory and computation costs for long sequences.
It is effective on both CPU and GPU platforms.
Abstract
Transformer-based deep neural networks have achieved great success in various sequence applications due to their powerful ability to model long-range dependency. The key module of Transformer is self-attention (SA), which extracts features from the entire sequence regardless of the distance between positions. Although SA helps Transformer perform particularly well on long-range tasks, SA requires computation and memory that grow quadratically with the input sequence length. Recently, attention map reuse, which groups multiple SA layers to share one attention map, has been proposed and has achieved significant speedup for speech recognition models. In this paper, we provide a comprehensive study on attention map reuse focusing on its ability to accelerate inference. We compare the method with other SA compression techniques and conduct a breakdown analysis of its advantages for a long sequence.…
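The idea of grouping SA layers around one shared attention map can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, omits layer normalization and feed-forward blocks, and the names `attention_map` and `run_group_with_reuse` are hypothetical. It only shows that the (T, T) map, the step whose cost is quadratic in sequence length T, is computed once per group, while each layer in the group still computes its own value projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    """Compute a (T, T) attention map; this is the quadratic-cost step."""
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

def run_group_with_reuse(x, w_q, w_k, value_weights):
    """Hypothetical group of SA layers that share one attention map.

    The map is computed once from the group's input; every layer in the
    group then applies that same map to its own value projection.
    """
    attn = attention_map(x, w_q, w_k)      # computed once per group
    for w_v in value_weights:              # one value projection per layer
        x = x + attn @ (x @ w_v)           # residual connection (assumed)
    return x, attn

# Toy usage: sequence length T=6, model width d=4, three layers in a group.
rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.standard_normal((T, d))
w_q = rng.standard_normal((d, d))
w_k = rng.standard_normal((d, d))
value_weights = [rng.standard_normal((d, d)) for _ in range(3)]
out, attn = run_group_with_reuse(x, w_q, w_k, value_weights)
```

In this sketch a group of three layers performs one query-key product and one softmax instead of three, which is where the savings for long sequences would come from; how layers are actually grouped in the paper is not specified in the text above.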
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections
