Sub-Linear Memory: How to Make Performers SLiM
Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller

TL;DR
This paper analyzes Performer-based linear self-attention mechanisms in Transformers, revealing a flexible time-memory tradeoff that enables training with sublinear memory, facilitating deployment on resource-constrained devices.
Contribution
It provides a thorough complexity analysis of linear self-attention, demonstrating that training can be performed with sublinear memory at the cost of increased time, enabling low-memory training and fine-tuning.
Findings
Performers can be trained with memory that is sublinear in the input sequence length.
Forward and backward passes remain exact under this scheme, with no approximations introduced.
The approach enables training on low-resource devices.
Abstract
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring $O(L^2)$ in serial time and memory as functions of input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer…
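As a rough illustration of the time-memory tradeoff described in the abstract, the sketch below computes causal linear attention in chunks, carrying only a fixed-size running state between chunks. It is not the paper's algorithm: it covers the forward pass only, it assumes the causal (prefix-sum) form of linear attention, and the names `q_feat`, `k_feat`, and `chunk_size` are hypothetical, with random positive arrays standing in for Performer feature maps.

```python
import numpy as np

def causal_linear_attention_full(q_feat, k_feat, v):
    """Reference: materializes the L x L causal score matrix (quadratic memory)."""
    scores = np.tril(q_feat @ k_feat.T)            # (L, L), keeps only j <= i
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_chunked(q_feat, k_feat, v, chunk_size):
    """Same result, processing chunk_size positions at a time."""
    L, M = q_feat.shape
    d = v.shape[1]
    state = np.zeros((M, d))   # running sum of outer(k_feat[j], v[j]) over past chunks
    norm = np.zeros(M)         # running sum of k_feat[j] over past chunks
    out = np.empty((L, d))     # kept in full here only so results can be compared
    for start in range(0, L, chunk_size):
        stop = min(start + chunk_size, L)
        q_c, k_c, v_c = q_feat[start:stop], k_feat[start:stop], v[start:stop]
        local = np.tril(q_c @ k_c.T)               # (C, C) causal scores inside the chunk
        numer = q_c @ state + local @ v_c          # past chunks + current chunk
        denom = q_c @ norm + local.sum(axis=1)
        out[start:stop] = numer / denom[:, None]
        state += k_c.T @ v_c                       # fold the chunk into the running state
        norm += k_c.sum(axis=0)
    return out

# Assumed toy inputs: strictly positive features keep the normalizer non-zero.
rng = np.random.default_rng(0)
L, M, d = 64, 8, 4
q_feat = rng.random((L, M)) + 0.1
k_feat = rng.random((L, M)) + 0.1
v = rng.standard_normal((L, d))

full = causal_linear_attention_full(q_feat, k_feat, v)
chunked = causal_linear_attention_chunked(q_feat, k_feat, v, chunk_size=16)
print(np.allclose(full, chunked))
```

In this sketch the state carried across chunks has size M x d regardless of L, while the per-chunk working memory grows with `chunk_size`. Smaller chunks lower peak memory but require more sequential steps, which is the general shape of the tradeoff the abstract refers to.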
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · Byte Pair Encoding · Softmax · Dropout · Attention Is All You Need · Label Smoothing
