Compressive Transformers for Long-Range Sequence Modelling
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap

TL;DR
The paper introduces the Compressive Transformer, a novel sequence model that compresses past memories to effectively handle long-range dependencies, achieving state-of-the-art results in language modeling and other tasks.
Contribution
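It adds a compressed store of past memories to an attentive sequence model, improving long-range sequence modelling across language, speech, and reinforcement learning, and it proposes PG-19, a new open-vocabulary language modelling benchmark derived from books.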
Findings
Achieves 17.1 perplexity on WikiText-103
Achieves 0.97 bits per character on Enwik8
Effective in high-frequency speech modeling and reinforcement learning tasks
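For readers less familiar with the two metrics above, the following is a minimal sketch of how perplexity and bits per character relate to an average negative log-likelihood. It assumes the losses are measured in nats; the function names are illustrative and do not come from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity: exponential of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def bits_per_character(char_nlls):
    """Bits per character: mean per-character negative log-likelihood, converted from nats to bits."""
    return sum(char_nlls) / len(char_nlls) / math.log(2)

# Average losses implied by the reported figures (assuming nats):
wikitext_nats_per_token = math.log(17.1)      # roughly 2.84 nats per token
enwik8_nats_per_char = 0.97 * math.log(2)     # roughly 0.67 nats per character
```

Lower is better for both metrics. Perplexity is computed per token and bits per character per character, so the two numbers are not directly comparable with each other.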
Abstract
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
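The idea of compressing past memories can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes a fixed-size first-in-first-out memory whose evicted entries are mean-pooled in groups of `rate` and appended to a second, fixed-size compressed memory. Mean pooling is only one possible compression function, and all names and sizes here (`update_memories`, `mem_size`, `cmem_size`, `rate`) are hypothetical.

```python
def mean_pool(vectors, rate):
    """Average consecutive groups of `rate` vectors; a trailing partial group is dropped."""
    pooled = []
    for start in range(0, len(vectors) - rate + 1, rate):
        group = vectors[start:start + rate]
        pooled.append([sum(column) / rate for column in zip(*group)])
    return pooled

def update_memories(memory, compressed, new_states, mem_size, cmem_size, rate):
    """Append new states to memory; compress whatever overflows instead of discarding it."""
    memory = memory + new_states
    overflow = max(0, len(memory) - mem_size)
    evicted, memory = memory[:overflow], memory[overflow:]
    # Assumption: evicted states are compressed by mean pooling.
    compressed = compressed + mean_pool(evicted, rate)
    # Keep only the most recent `cmem_size` compressed entries.
    compressed = compressed[max(0, len(compressed) - cmem_size):]
    return memory, compressed

# Toy run: three segments of four 1-dimensional states each.
memory, compressed = [], []
for step in range(3):
    new_states = [[float(4 * step + i)] for i in range(1, 5)]
    memory, compressed = update_memories(
        memory, compressed, new_states, mem_size=4, cmem_size=2, rate=2
    )
# memory now holds states 9..12; compressed holds [[5.5], [7.5]]
```

In this toy run, the attention context at the last step would span four uncompressed states plus two compressed entries that summarise four older states, which is how compression extends the temporal range for a given memory budget.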
Peer Reviews
No public reviews on file for this paper yet.
Code & Models
- 🤗 BSC-LT/salamandra-7b-instruct (model) · 81k dl · ♡ 77
- 🤗 deepnet/ShortGpt (model)
- 🤗 BSC-LT/salamandra-7b (model) · 355 dl · ♡ 29
- 🤗 BSC-LT/salamandra-2b (model) · 1.3k dl · ♡ 25
- 🤗 BSC-LT/salamandra-2b-instruct (model) · 6.3k dl · ♡ 27
- 🤗 robbiemu/salamandra-2b-instruct (model) · 92 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-instruct-gguf (model) · 141 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-7b-gguf (model) · 73 dl
- 🤗 robbiemu/salamandra-2b (model) · 111 dl
- 🤗 RichardErkhov/BSC-LT_-_salamandra-2b-instruct-gguf (model) · 356 dl
Videos
No videos yet.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Cosine Annealing · Gradient Clipping · Adaptive Softmax · Variational Dropout · Transformer-XL · Linear Warmup With Cosine Annealing · Adaptive Input Representations · Compressed Memory · Compressive Transformer
