Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever

TL;DR
This paper introduces Sparse Transformers with efficient attention mechanisms, enabling modeling of very long sequences across various modalities, achieving state-of-the-art results and demonstrating global coherence in generated samples.
Contribution
The paper presents novel sparse attention architectures and training techniques that significantly reduce computational complexity, allowing modeling of sequences tens of thousands of steps long.
Findings
Achieved state-of-the-art density modeling on Enwik8, CIFAR-10, and ImageNet-64.
Generated globally coherent and diverse long sequences.
Showed that self-attention can, in principle, model sequences of length one million or more.
Abstract
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention…
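To make the complexity claim in the abstract concrete, here is a minimal sketch of one factorized pattern, assuming a "strided" layout in which one head attends to a local window of recent positions and a second head attends to every stride-th earlier position. The function names (`strided_pattern`, `count_connections`) and the choice of stride close to the square root of n are illustrative assumptions, not the authors' code or kernels. The sketch only counts allowed query-key pairs; it does not compute attention.

```python
import math


def strided_pattern(n, stride):
    """For each query position i, return the key positions each head may see.

    Assumed layout (illustrative): the 'local' head sees the previous
    `stride` positions plus i itself; the 'strided' head sees every earlier
    position j with (i - j) divisible by `stride`. Both are causal (j <= i).
    """
    local = [set(range(max(0, i - stride), i + 1)) for i in range(n)]
    strided = [{j for j in range(i + 1) if (i - j) % stride == 0}
               for i in range(n)]
    return local, strided


def count_connections(n):
    """Compare allowed query-key pairs: sparse (both heads) vs. dense causal."""
    stride = max(1, math.isqrt(n))  # stride near sqrt(n) is an assumption
    local, strided = strided_pattern(n, stride)
    sparse = sum(len(a | b) for a, b in zip(local, strided))
    dense = n * (n + 1) // 2  # full causal attention
    return sparse, dense


if __name__ == "__main__":
    for n in (16, 256, 1024):
        sparse, dense = count_connections(n)
        print(f"n={n}: sparse={sparse}, dense={dense}")
```

With the stride near the square root of n, each position keeps on the order of √n connections per head, so the total grows roughly as n√n rather than n². Under this layout any earlier position is still reachable in two hops: a strided step followed by a local step.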
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Cosine Annealing · Multi-Head Attention · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Weight Decay · Dense Connections · Adam
