Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

Georgii Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Dimitrov, and Ivan Oseledets

arXiv:2202.00441 · cs.LG · February 4, 2022 · 5 cites

TL;DR

This paper introduces a method that quantizes the gradients of activation functions to a few bits per element, reducing the memory footprint of neural network training while maintaining convergence.

Contribution

It presents a systematic approach for optimal low-bit quantization of activation gradients using dynamic programming, applicable to various nonlinearities and compatible with existing training pipelines.
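The dynamic-programming step can be illustrated with a minimal sketch. It assumes the derivative is sampled on a uniform grid and that the error is plain squared error with equal weight per sample; the weighting used in the paper may differ. GELU is chosen only as an example, and the names `gelu_grad` and `optimal_piecewise_constant` are hypothetical, not taken from the authors' code.

```python
import math

def gelu_grad(x):
    # Derivative of GELU(x) = x * Phi(x), i.e. Phi(x) + x * phi(x).
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def optimal_piecewise_constant(values, num_levels):
    """Split `values` (ordered by input x) into `num_levels` contiguous
    segments, minimizing total squared error to each segment's mean."""
    n = len(values)
    # Prefix sums make the squared error of any segment an O(1) lookup.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        # Squared error of values[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal error covering values[:j] with exactly k segments.
    best = [[inf] * (n + 1) for _ in range(num_levels + 1)]
    cut = [[0] * (n + 1) for _ in range(num_levels + 1)]
    best[0][0] = 0.0
    for k in range(1, num_levels + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    cut[k][j] = i

    # Walk the stored cut points back from the end to recover boundaries.
    bounds = [n]
    j = n
    for k in range(num_levels, 0, -1):
        j = cut[k][j]
        bounds.append(j)
    bounds.reverse()
    levels = [(s1[b] - s1[a]) / (b - a) for a, b in zip(bounds, bounds[1:])]
    return bounds, levels, best[num_levels][n]

# Assumed setup: 200 grid points on [-4, 4] and a 2-bit budget (4 levels).
xs = [-4.0 + 8.0 * i / 199 for i in range(200)]
values = [gelu_grad(x) for x in xs]
bounds, levels, err = optimal_piecewise_constant(values, 2 ** 2)
```

Here `bounds` holds the grid indices where one constant level ends and the next begins, and `levels` holds the constant derivative value used on each interval. The triple loop costs O(k n^2), which is acceptable because the table is computed once per activation function rather than during training.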

Findings

01. Memory footprint is significantly reduced.

02. Training convergence is maintained.

03. Applicable to all popular nonlinearities.

Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute an optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm…
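To make the memory argument concrete, here is a minimal sketch of how a pointwise activation could keep only a few-bit code per element for the backward pass instead of the full input. The cut points and levels in `BOUNDARIES` and `LEVELS` are hypothetical values picked for illustration, not the optimized tables from the paper, and all function names are assumptions. With a 2-bit code in place of a 32-bit float, the saved tensor needs 16 times fewer bits.

```python
import bisect
import math

# Hypothetical 2-bit table for illustration; NOT the paper's optimized values.
BOUNDARIES = [-1.0, 0.0, 1.0]        # 3 cut points define 4 intervals
LEVELS = [-0.05, 0.25, 0.75, 1.05]   # one derivative value per interval

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def forward(xs):
    """Return the outputs plus one interval index (0..3) per input.
    Only the indices are needed later, so the inputs can be discarded."""
    ys = [gelu(x) for x in xs]
    codes = [bisect.bisect_right(BOUNDARIES, x) for x in xs]
    return ys, codes

def pack_2bit(codes):
    """Pack four 2-bit codes into each byte."""
    out = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        out[i // 4] |= c << (2 * (i % 4))
    return bytes(out)

def unpack_2bit(packed, n):
    """Recover the first n 2-bit codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 3 for i in range(n)]

def backward(grad_out, packed, n):
    """Approximate grad_in = grad_out * f'(x) using the stored codes:
    the derivative is replaced by the constant level of its interval."""
    codes = unpack_2bit(packed, n)
    return [g * LEVELS[c] for g, c in zip(grad_out, codes)]

xs = [-2.0, -0.5, 0.5, 2.0, 0.0]
ys, codes = forward(xs)
packed = pack_2bit(codes)            # 5 elements fit in 2 bytes
grad_in = backward([1.0] * len(xs), packed, len(xs))
```

The forward output is exact; only the derivative used in the backward pass is approximated, which is why the choice of boundaries and levels matters.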

Code & Models

Videos

Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning