Inverted Activations: Reducing Memory Footprint in Neural Network Training

Georgii Novikov; Ivan Oseledets

arXiv:2407.15545 · cs.LG · October 8, 2024

TL;DR

This paper introduces a memory-efficient method for neural network training that saves the outputs of activation layers instead of their inputs and uses inverse functions during backpropagation. This significantly reduces memory use, especially in transformer models.

Contribution

The paper presents a novel approach that reduces the memory footprint of neural network training by storing output tensors and approximating the inverse nonlinearities. The approach is compatible with existing frameworks.

Findings

01 Memory usage is significantly reduced in experiments.

02 Training accuracy and speed are unaffected.

03 Applicable to transformer architectures like GPT and BERT.

Abstract

The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT,…
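
The idea of saving the output tensor instead of the input can be illustrated with a small sketch. The example below is not the paper's implementation: it uses a sigmoid as a stand-in nonlinearity because its derivative has an exact closed form in terms of the output, whereas the paper approximates inverse nonlinearities. The class name `InvertedSigmoid` and the helper `recover_input` are hypothetical names chosen for illustration, and plain Python lists stand in for tensors.

```python
import math


def sigmoid(x):
    """Pointwise sigmoid, used here as a stand-in nonlinearity (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def recover_input(y):
    """Inverse of the sigmoid (logit): maps a saved output back to its input."""
    return math.log(y / (1.0 - y))


class InvertedSigmoid:
    """Hypothetical pointwise layer that keeps only its output for backward."""

    def forward(self, xs):
        ys = [sigmoid(x) for x in xs]
        # Only the output is stored. The next layer would save this same
        # tensor as its input, so one tensor is kept between layers, not two.
        self.saved_output = ys
        return ys

    def backward(self, grad_ys):
        # sigmoid'(x) written as a function of y = sigmoid(x): y * (1 - y).
        return [g * y * (1.0 - y) for g, y in zip(grad_ys, self.saved_output)]


layer = InvertedSigmoid()
xs = [-2.0, 0.0, 1.5]
ys = layer.forward(xs)
grads = layer.backward([1.0, 1.0, 1.0])
```

In a real framework the same pattern would live in a custom backward rule for the activation. For nonlinearities without a simple closed-form inverse, the derivative as a function of the output would have to be approximated, which is the part the abstract attributes to the paper's method.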

Code & Models

Repositories

pglolo/optiacts (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Byte Pair Encoding · Discriminative Fine-Tuning · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections · Linear Warmup With Cosine Annealing