Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization
Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

TL;DR
The paper introduces the momentum transformer, which enhances linear attention mechanisms in transformers by incorporating momentum, leading to improved accuracy and efficiency in sequence modeling tasks without increasing computational complexity.
Contribution
It proposes a novel momentum-based approach for linear transformers, with an adaptive strategy for momentum calculation, significantly improving performance over existing linear attention methods.
Findings
Outperforms existing linear transformers in accuracy and efficiency
Effective in both autoregressive and non-autoregressive tasks
Reduces training time while maintaining linear complexity
Abstract
Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Efficient transformers, which leverage techniques including sparse and linear attention and hashing tricks, have been proposed to reduce this quadratic complexity, but they significantly degrade accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for…
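The core idea in the abstract, treating the running state of causal linear attention as a sequence of gradient-descent-like steps and adding momentum to it, can be illustrated with a minimal sketch. This is not the authors' implementation. The feature map (elu(x) + 1), the heavy-ball form of the update, the unchanged normalizer, and the names `feature_map`, `causal_linear_attention`, `beta`, and `gamma` are all assumptions chosen for illustration. The sketch uses a fixed momentum value and does not cover the adaptive momentum strategy or the residual-connection momentum mentioned in the abstract.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map often used for linear attention
    # (assumed here; the summary does not say which map the paper uses).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def causal_linear_attention(q, k, v, beta=0.0, gamma=1.0):
    """Causal linear attention with an optional heavy-ball momentum term.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    beta is the momentum coefficient and gamma the step size (both
    hypothetical parameter names). beta=0 and gamma=1 give plain causal
    linear attention.
    """
    n, d = q.shape
    d_v = v.shape[1]
    phi_q, phi_k = feature_map(q), feature_map(k)

    state = np.zeros((d, d_v))     # running sum of phi(k_j) v_j^T
    momentum = np.zeros((d, d_v))  # heavy-ball buffer over the state updates
    norm = np.zeros(d)             # running sum of phi(k_j) for normalization
    out = np.empty((n, d_v))

    for i in range(n):
        # Plain linear attention would do: state += outer(phi_k[i], v[i]).
        # With momentum, the update is accumulated in a buffer first.
        momentum = beta * momentum + np.outer(phi_k[i], v[i])
        state = state + gamma * momentum
        norm = norm + phi_k[i]
        out[i] = phi_q[i] @ state / (phi_q[i] @ norm)
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 3))

plain = causal_linear_attention(q, k, v)                 # no momentum
with_momentum = causal_linear_attention(q, k, v, beta=0.5)
```

Each position costs a constant amount of work in the sequence length, so the loop stays linear overall; the momentum buffer only adds one extra (d, d_v) array to the state.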
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
