Less Memory Means smaller GPUs: Backpropagation with Compressed Activations

Daniel Barley; Holger Fröning

arXiv:2409.11902·cs.LG·September 19, 2024

TL;DR

This paper proposes compressing activation maps during backpropagation in neural network training to reduce memory usage, enabling training on smaller GPUs without sacrificing accuracy, though with longer training times.

Contribution

It introduces a method that uses pooling to compress the activations stored for the backward pass, decreasing the memory footprint of DNN training.

Findings

01 Achieved a 29% reduction in peak memory consumption

02 Maintained prediction accuracy with compressed activations

03 Longer training schedule required due to compression

Abstract

The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements. Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators, such as GPUs or TPUs. Alongside the vast number of floating-point operations, the memory footprint of DNNs is also exploding. In contrast, GPU architectures are notoriously short on memory. Even comparatively small architectures like some EfficientNet variants cannot be trained on a single consumer-grade GPU at reasonable mini-batch sizes. During training, intermediate input activations have to be stored until backpropagation for gradient calculation. These make up the vast majority of the memory footprint. In this work we therefore consider compressing activation maps for the backward pass using pooling, which can reduce…
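The idea in the abstract, storing a pooled copy of the input activation for the backward pass while the forward pass still uses the full-resolution input, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a 1-D dense layer, average pooling over non-overlapping windows, and nearest-neighbour upsampling before the weight gradient is formed. The names `avg_pool_1d`, `upsample_1d`, and `PooledLinear` are hypothetical and chosen only for illustration.

```python
def avg_pool_1d(x, k):
    """Average non-overlapping windows of size k."""
    if len(x) % k != 0:
        raise ValueError("len(x) must be divisible by k")
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def upsample_1d(x, k):
    """Nearest-neighbour upsampling: repeat each value k times."""
    return [v for v in x for _ in range(k)]


class PooledLinear:
    """Toy layer y = W x that keeps only a pooled copy of x for backward.

    Assumption for illustration: the weight gradient is computed from the
    upsampled pooled input, so it is an approximation of the exact gradient.
    """

    def __init__(self, weight, k):
        self.weight = weight  # list of rows, shape (n_out, n_in)
        self.k = k            # pooling window; stored size shrinks by 1/k
        self.saved = None

    def forward(self, x):
        # The forward pass uses the full-resolution input,
        # so the layer output itself is unchanged by compression.
        self.saved = avg_pool_1d(x, self.k)
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def backward(self, grad_y):
        # Approximate input reconstructed from the compressed copy.
        x_hat = upsample_1d(self.saved, self.k)
        # Weight gradient: outer product of grad_y and the approximate input.
        grad_w = [[g * v for v in x_hat] for g in grad_y]
        # Input gradient depends only on the weights, so it stays exact.
        n_in = len(self.weight[0])
        grad_x = [
            sum(self.weight[j][i] * grad_y[j] for j in range(len(grad_y)))
            for i in range(n_in)
        ]
        return grad_w, grad_x
```

With a window of `k = 2`, the layer keeps half as many stored values per input. Only the weight gradient is affected by the approximation in this sketch; how that trade-off plays out for real convolutional networks is what the paper evaluates.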

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Medical Image Segmentation Techniques

Methods: Depthwise Convolution · Kaiming Initialization · Pointwise Convolution · Depthwise Separable Convolution · Sigmoid Activation · Batch Normalization · Max Pooling · Convolution