Inverted Activations: Reducing Memory Footprint in Neural Network Training
Georgii Novikov, Ivan Oseledets

TL;DR
This paper introduces a memory-efficient method for neural network training that saves the outputs of pointwise nonlinearities instead of their inputs and applies inverse functions during backpropagation. This significantly reduces memory use, especially in transformer models.
Contribution
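The paper presents an approach to reducing the memory footprint of neural network training: pointwise nonlinearity layers store their output tensor rather than their input tensor, and the backward pass uses an approximate inverse of the nonlinearity. The approach is compatible with existing frameworks.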
Findings
Memory usage is significantly reduced in experiments.
Training accuracy and speed are unaffected.
Applicable to transformer architectures like GPT and BERT.
Abstract
The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT,…
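The idea in the abstract (save the output of a pointwise nonlinearity, then recover what the backward pass needs by inverting it) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: softplus is used here only because it has a closed-form inverse, and the function names (`softplus_forward`, `softplus_inverse`, `softplus_backward`) are hypothetical. The paper itself describes approximating inverse nonlinearities, which this sketch does not attempt.

```python
import math

def softplus_forward(x):
    """Forward pass of y = log(1 + exp(x)).

    Returns the output and the values kept for the backward pass.
    Only the output y is kept, not the input x.
    """
    y = [math.log1p(math.exp(v)) for v in x]
    saved = y  # the same values the next layer would keep as its input
    return y, saved

def softplus_inverse(y):
    """Recover x from y = log(1 + exp(x)); valid for y > 0."""
    return [math.log(math.expm1(v)) for v in y]

def softplus_backward(saved_y, grad_out):
    """Backward pass that uses only the saved output."""
    x = softplus_inverse(saved_y)
    # d/dx softplus(x) = sigmoid(x)
    return [g / (1.0 + math.exp(-v)) for g, v in zip(grad_out, x)]

def softplus_backward_reference(x, grad_out):
    """Conventional backward pass that uses the saved input."""
    return [g / (1.0 + math.exp(-v)) for g, v in zip(grad_out, x)]

# Illustrative values only.
x = [-2.0, -0.5, 0.0, 1.5, 3.0]
grad_out = [1.0, 0.5, -1.0, 2.0, 0.25]

y, saved = softplus_forward(x)
grad_from_output = softplus_backward(saved, grad_out)
grad_from_input = softplus_backward_reference(x, grad_out)

# Largest disagreement between the two backward passes.
max_diff = max(abs(a - b) for a, b in zip(grad_from_output, grad_from_input))
```

Because the next layer would keep `y` as its own input anyway, the nonlinearity adds no extra saved tensor in this scheme. The trade-off is the extra inverse evaluation in the backward pass.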
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Byte Pair Encoding · Discriminative Fine-Tuning · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Linear Warmup With Cosine Annealing
