Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever

arXiv:1904.10509·cs.LG·April 25, 2019·489 cites

TL;DR

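This paper introduces Sparse Transformers, which use sparse factorizations of the attention matrix to model sequences tens of thousands of timesteps long. The same architecture is applied to images, audio, and text from raw bytes, setting a new state of the art for density modeling on Enwik8, CIFAR-10, and ImageNet-64 and producing unconditional samples with global coherence.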

Contribution

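The paper presents sparse factorizations of the attention matrix that reduce the time and memory cost of self-attention from quadratic in the sequence length to $O(n \sqrt{n})$. It pairs them with an architecture and initialization for training deeper networks, recomputation of attention matrices to save memory, and fast attention kernels, allowing modeling of sequences tens of thousands of timesteps long with hundreds of layers.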

Findings

01. Achieved state-of-the-art density modeling on Enwik8, CIFAR-10, and ImageNet-64.

02. Generated unconditional samples that show global coherence and great diversity.

03. Showed that it is possible in principle to use self-attention to model sequences of length one million or more.

Abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention…
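
To make the cost reduction in the abstract concrete, the sketch below builds one possible pair of sparse causal attention masks and counts their connections against dense causal attention. It is an illustration, not the authors' implementation: the two patterns (a local window and a strided pattern with stride near $\sqrt{n}$) are an assumption about what a factorization of this kind can look like, and the names `strided_masks` and `count_connections` are hypothetical.

```python
import math
import numpy as np


def strided_masks(n, stride=None):
    """Return two boolean causal masks of shape (n, n): (local, strided).

    Assumption for illustration: the stride defaults to floor(sqrt(n)),
    so each row of each mask has at most about sqrt(n) allowed keys.
    """
    if stride is None:
        stride = max(1, math.isqrt(n))
    i = np.arange(n)[:, None]  # query positions (rows)
    j = np.arange(n)[None, :]  # key positions (columns)
    causal = j <= i            # a position may only attend to itself and the past
    # Local pattern: the most recent `stride` positions, including itself.
    local = causal & (i - j < stride)
    # Strided pattern: every `stride`-th earlier position, including itself.
    strided = causal & ((i - j) % stride == 0)
    return local, strided


def count_connections(n):
    """Total number of allowed (query, key) pairs across both sparse masks."""
    local, strided = strided_masks(n)
    return int(local.sum() + strided.sum())


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        dense = n * (n + 1) // 2  # connections in dense causal attention
        sparse = count_connections(n)
        print(f"n={n}: dense={dense}, sparse={sparse}, ratio={sparse / dense:.3f}")
```

Each row in either mask has at most about $\sqrt{n}$ allowed keys, so the total work is on the order of $n \sqrt{n}$ rather than $n^2$, and the ratio printed by the script shrinks as `n` grows.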

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Cosine Annealing · Multi-Head Attention · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Weight Decay · Dense Connections · Adam