Do Transformers Need Deep Long-Range Memory

Jack W. Rae; Ali Razavi

arXiv:2007.03356 · cs.LG · July 8, 2020 · 1 citation

TL;DR

This paper investigates whether Transformer models need a long-range memory at every layer. It shows that comparable performance can be achieved with 6X fewer long-range memories, and better performance by limiting the attention range in lower layers.

Contribution

The study shows that a Transformer-XL with 6X fewer long-range memories performs comparably, and that limiting the attention range in lower layers improves performance, challenging the assumption that a long-range memory is needed at every layer.

Findings

01 Comparable performance with 6X fewer long-range memories

02 Better performance by limiting attention range in lower layers

03 Long-range memory at every layer may not be necessary for effective Transformers

Abstract

Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which renders its state thousands of times larger than that of its RNN predecessors. However, it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6X fewer long-range memories and better performance can be obtained by limiting the range of attention in lower layers of the network.
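The idea of limiting the range of attention in lower layers can be illustrated with a small sketch. This is not the authors' implementation and it does not model the Transformer-XL memory mechanism itself; it only builds per-layer causal attention masks in which lower layers see a short window and the top layers see a long one. The function names (`banded_causal_mask`, `layer_ranges`) and all numeric values are hypothetical, chosen only for illustration.

```python
def banded_causal_mask(seq_len, max_range):
    """Build a seq_len x seq_len boolean attention mask.

    mask[i][j] is True when query position i may attend to key position j:
    j must not lie in the future (j <= i) and must be within max_range
    positions of i (i - j < max_range).
    """
    return [[(j <= i) and (i - j < max_range) for j in range(seq_len)]
            for i in range(seq_len)]


def layer_ranges(num_layers, short_range, long_range, num_long_layers):
    """Hypothetical schedule: the lower layers get short_range and the
    top num_long_layers layers get long_range."""
    num_short = num_layers - num_long_layers
    return [short_range] * num_short + [long_range] * num_long_layers


# Assumed toy configuration: 6 layers, only the top one attends far back.
ranges = layer_ranges(num_layers=6, short_range=2, long_range=8, num_long_layers=1)
masks = [banded_causal_mask(seq_len=8, max_range=r) for r in ranges]

for layer, r in enumerate(ranges):
    print(f"layer {layer}: attention range {r}")
```

In this toy setup only one layer in six keeps the long window, which loosely mirrors the "6X fewer long-range memories" setting; the configurations actually studied are described in the paper, not on this page.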

Code & Models

Repositories

lucidrains/memorizing-transformers-pytorch
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Adam · Multi-Head Attention · Layer Normalization · Residual Connection · Attention Is All You Need · Linear Warmup With Cosine Annealing