Long-Short Transformer: Efficient Transformers for Language and Vision
Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

TL;DR
The paper introduces Long-Short Transformer, an efficient self-attention mechanism that scales linearly for long sequences in language and vision tasks, outperforming state-of-the-art models in speed and accuracy.
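To make the linear-scaling claim concrete, the short sketch below counts query-key score computations for full self-attention and for a long-short style layer. It is only back-of-the-envelope arithmetic under stated assumptions: each query is assumed to see a symmetric local window plus a fixed number of projected long-range keys, and the function names and the values `window=128` and `rank=256` are hypothetical, not taken from the paper.

```python
def full_attention_scores(n: int) -> int:
    """Query-key scores for standard self-attention: every token attends to every token."""
    return n * n


def long_short_attention_scores(n: int, window: int, rank: int) -> int:
    """Upper bound on scores when each query sees a local window plus `rank` projected keys.

    Assumption: a symmetric window of `window` tokens on each side (2 * window + 1 keys),
    and a long-range branch compressed to `rank` keys regardless of n.
    """
    return n * (2 * window + 1 + rank)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to illustrate the trend.
    for n in (1_024, 8_192, 65_536):
        full = full_attention_scores(n)
        ls = long_short_attention_scores(n, window=128, rank=256)
        print(f"n={n:>6}  full={full:>13,}  long-short={ls:>11,}  ratio={full / ls:.1f}x")
```

Because the per-query cost in the second function does not depend on `n`, doubling the sequence length doubles the total, whereas the full-attention count quadruples.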
Contribution
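It proposes a dual attention mechanism that combines long-range attention via dynamic projection with short-term local attention, together with a dual normalization strategy that corrects the scale mismatch between the two. The design applies to both autoregressive and bidirectional models without additional complexity.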
Findings
Achieves 0.97 BPC on enwik8 with half the parameters of previous models.
Outperforms state-of-the-art models on the Long Range Arena benchmark.
Attains 84.1% Top-1 accuracy on ImageNet with a moderate-sized model.
Abstract
Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our…
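The combination of long-range and short-term attention described in the abstract can be sketched roughly as follows. This is a minimal single-head, bidirectional illustration in NumPy, not the authors' implementation. Several details are assumptions made for the sketch: the dynamic projection is modeled as a softmax over sequence positions of `k @ w_proj`, the dual normalization is modeled as a parameter-free layer norm applied separately to the local and projected keys and values, and all names (`long_short_attention`, `w_proj`, `window`) are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    """Parameter-free layer norm over the feature dimension (assumed form)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def long_short_attention(q, k, v, w_proj, window):
    """Sketch: q, k, v have shape (n, d); w_proj has shape (d, r) with r << n."""
    n, d = q.shape

    # Long-range branch: an input-dependent ("dynamic") projection that
    # compresses the n keys/values into r summary rows (assumed form).
    p = softmax(k @ w_proj, axis=0)      # (n, r), normalized over positions
    k_long = layer_norm(p.T @ k)         # (r, d)
    v_long = layer_norm(p.T @ v)         # (r, d)

    # Short-term branch: normalize local keys/values separately so both
    # branches have comparable scale before they share one softmax.
    k_local = layer_norm(k)
    v_local = layer_norm(v)

    out = np.empty_like(q)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = np.concatenate([k_local[lo:hi], k_long], axis=0)
        vals = np.concatenate([v_local[lo:hi], v_long], axis=0)
        scores = keys @ q[i] / np.sqrt(d)            # (hi - lo + r,)
        out[i] = softmax(scores, axis=0) @ vals      # (d,)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, r = 16, 8, 4
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    w_proj = rng.standard_normal((d, r))
    print(long_short_attention(q, k, v, w_proj, window=2).shape)
```

Each query attends to at most `2 * window + 1` local positions plus `r` projected summaries, so the work per query is independent of the sequence length. An autoregressive variant would additionally need causal handling of both branches, which this sketch omits.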
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout · Layer Normalization · Byte Pair Encoding · Label Smoothing
