Exploring Attention Map Reuse for Efficient Transformer Neural Networks

Kyuhong Shim; Jungwook Choi; Wonyong Sung

arXiv:2301.12444 · cs.AI · January 31, 2023 · 1 citation

Open Access

TL;DR

This paper investigates attention map reuse in Transformer models, demonstrating its effectiveness in accelerating inference and reducing computational costs for long sequences on CPU and GPU platforms.

Contribution

It provides a comprehensive analysis of attention map reuse, comparing it with other SA compression techniques and highlighting its advantages for long sequence processing.

Findings

01

Attention map reuse significantly speeds up inference.

02

It reduces memory and computation costs for long sequences.

03

It is effective on both CPU and GPU platforms.

Abstract

Transformer-based deep neural networks have achieved great success in various sequence applications due to their powerful ability to model long-range dependencies. The key module of the Transformer is self-attention (SA), which extracts features from the entire sequence regardless of the distance between positions. Although SA helps the Transformer perform particularly well on long-range tasks, its computation and memory costs grow quadratically with the input sequence length. Recently, attention map reuse, which groups multiple SA layers to share one attention map, has been proposed and has achieved significant speedup for speech recognition models. In this paper, we provide a comprehensive study of attention map reuse, focusing on its ability to accelerate inference. We compare the method with other SA compression techniques and conduct a breakdown analysis of its advantages for a long sequence.…
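The idea of grouping SA layers to share one attention map can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits layer normalization and feed-forward blocks, and uses hypothetical names (`attention_map`, `run_stack`, `reuse_group`) that do not come from the paper. The first layer of each group computes the softmax map from its queries and keys; the remaining layers in the group compute only their values and apply the shared map, so the quadratic-cost step runs once per group.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_map(x, wq, wk):
    """Compute the T x T map softmax(Q K^T / sqrt(d)); this is the quadratic step."""
    q = matmul(x, wq)
    k = matmul(x, wk)
    d = len(wq[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)


def run_stack(x, layers, reuse_group=1):
    """Run simplified SA layers; every `reuse_group` consecutive layers share one map.

    Returns the output sequence and the number of attention maps computed.
    `reuse_group` is a hypothetical parameter name; reuse_group=1 means no reuse.
    """
    maps_computed = 0
    attn = None
    for i, layer in enumerate(layers):
        if i % reuse_group == 0:
            # First layer of a group: compute a fresh map from its own Q and K.
            attn = attention_map(x, layer["wq"], layer["wk"])
            maps_computed += 1
        # Every layer still computes its own values and applies the current map.
        v = matmul(x, layer["wv"])
        ctx = matmul(attn, v)
        # Residual connection (layer norm and feed-forward omitted for brevity).
        x = [[a + b for a, b in zip(rx, rc)] for rx, rc in zip(x, ctx)]
    return x, maps_computed


def rand_matrix(rows, cols, rng):
    """Small random matrix used only for this demonstration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D = 6, 4  # sequence length and model width (arbitrary demo values)
x = rand_matrix(T, D, rng)
layers = [
    {"wq": rand_matrix(D, D, rng), "wk": rand_matrix(D, D, rng), "wv": rand_matrix(D, D, rng)}
    for _ in range(4)
]

_, n_full = run_stack(x, layers, reuse_group=1)
_, n_shared = run_stack(x, layers, reuse_group=2)
print(n_full, n_shared)  # prints "4 2"
```

In this sketch, grouping layers in pairs halves the number of T x T maps that are computed, and the layers that reuse a map skip their query and key projections entirely. Whether accuracy holds under such sharing is an empirical question that the sketch does not address.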

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections