Sub-Linear Memory: How to Make Performers SLiM

Valerii Likhosherstov; Krzysztof Choromanski; Jared Davis; Xingyou Song; Adrian Weller

arXiv:2012.11346·cs.LG·December 22, 2020·5 cites

TL;DR

This paper analyzes Performer-based linear self-attention mechanisms in Transformers, revealing a flexible time-memory tradeoff that enables training with sublinear memory, facilitating deployment on resource-constrained devices.

Contribution

The paper provides a thorough complexity analysis of linear self-attention, demonstrating that training can be performed with memory sublinear in the input length $L$ at the cost of increased time, enabling low-memory training and fine-tuning.

Findings

01. Performers can be trained with memory that is sublinear in the input length $L$.

02. Forward and backward passes are computed with no approximations.

03. The approach enables training on low-resource devices.

Abstract

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring $O(L^{2})$ in serial time and memory as functions of input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer…
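
To make the memory claim more concrete, the following is a minimal NumPy sketch of the forward pass of causal linear attention. It is not the authors' code. The function names, the `chunk_size` parameter, and the use of arbitrary positive feature matrices in place of actual Performer random features are all assumptions made for illustration. The sketch only shows that processing the sequence chunk by chunk, while carrying a small running prefix-sum state, reproduces the full computation exactly; the backward-pass scheme analyzed in the paper is not reproduced here.

```python
import numpy as np


def causal_linear_attention_full(q_feat, k_feat, v):
    """Reference version: materializes all L prefix sums at once.

    q_feat, k_feat: (L, M) positive feature maps of queries and keys.
    v:              (L, d) values.
    """
    # kv[i] = sum_{j <= i} outer(k_feat[j], v[j]), shape (L, M, d)
    kv = np.cumsum(np.einsum("lm,ld->lmd", k_feat, v), axis=0)
    # z[i] = sum_{j <= i} k_feat[j], shape (L, M)
    z = np.cumsum(k_feat, axis=0)
    numerator = np.einsum("lm,lmd->ld", q_feat, kv)
    denominator = np.einsum("lm,lm->l", q_feat, z)
    return numerator / denominator[:, None]


def causal_linear_attention_chunked(q_feat, k_feat, v, chunk_size):
    """Same result, but only one chunk of prefix sums is alive at a time.

    Working memory for the prefix sums scales with chunk_size, not L;
    more (sequential) chunks means more serial steps.
    """
    L, M = q_feat.shape
    d = v.shape[1]
    state_kv = np.zeros((M, d))  # running sum of outer(k_feat[j], v[j])
    state_z = np.zeros(M)        # running sum of k_feat[j]
    out = np.empty((L, d))       # output buffer only
    for start in range(0, L, chunk_size):
        stop = min(start + chunk_size, L)
        qc, kc, vc = q_feat[start:stop], k_feat[start:stop], v[start:stop]
        # Prefix sums inside the chunk, offset by the carried state.
        kv = state_kv + np.cumsum(np.einsum("lm,ld->lmd", kc, vc), axis=0)
        z = state_z + np.cumsum(kc, axis=0)
        numerator = np.einsum("lm,lmd->ld", qc, kv)
        denominator = np.einsum("lm,lm->l", qc, z)
        out[start:stop] = numerator / denominator[:, None]
        # Carry only the last prefix sum into the next chunk.
        state_kv, state_z = kv[-1], z[-1]
    return out
```

A smaller `chunk_size` keeps fewer prefix sums in memory at once but requires more sequential steps, which is the kind of time-memory tradeoff the abstract describes. With `chunk_size` equal to the sequence length the chunked function does the same work as the reference version.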

Code & Models

Videos

Sub-Linear Memory: How to Make Performers SLiM · slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · Byte Pair Encoding · Softmax · Dropout · Attention Is All You Need · Label Smoothing