Low-Memory Neural Network Training: A Technical Report

Nimit S. Sohoni; Christopher R. Aberger; Megan Leszczynski; Jian Zhang; Christopher Ré

arXiv:1904.10631 · cs.LG · April 12, 2022 · 38 cites

Open Access

TL;DR

This paper investigates the actual memory needs for training neural networks and evaluates techniques like sparsity, low precision, microbatching, and gradient checkpointing to significantly reduce memory usage with minimal impact on model quality.

Contribution

It provides a comprehensive analysis of memory reduction techniques for training neural networks and demonstrates their effectiveness through extensive experiments on benchmark models.

Findings

01

Up to 60.7x memory reduction for WideResNet training with 0.4% accuracy loss.

02

Up to 8.7x memory reduction for Transformer training with 0.15 BLEU score drop.

03

Tradeoffs between memory savings, accuracy, and computation are characterized.

Abstract

Memory is increasingly often the bottleneck when training neural network models. Despite this, techniques to lower the overall memory requirements of training have been less widely studied compared to the extensive literature on reducing the memory requirements of inference. In this paper we study a fundamental question: How much memory is actually needed to train a neural network? To answer this question, we profile the overall memory usage of training on two representative deep learning benchmarks -- the WideResNet model for image classification and the DynamicConv Transformer model for machine translation -- and comprehensively evaluate four standard techniques for reducing the training memory requirements: (1) imposing sparsity on the model, (2) using low precision, (3) microbatching, and (4) gradient checkpointing. We explore how each of these techniques in isolation affects both…
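Of the four techniques the abstract lists, microbatching is the easiest to show in a few lines. The sketch below is an illustration, not the authors' implementation: it uses a toy one-parameter least-squares model, and the names `grad_mse` and `microbatched_grad` and all data values are hypothetical. It shows that accumulating gradients over small microbatches reproduces the full-batch gradient, so only one microbatch of activations would need to be held in memory at a time.

```python
# Illustrative sketch only: microbatching (gradient accumulation) on a toy
# one-parameter least-squares model. All names and data here are hypothetical
# and are not taken from the paper.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w over the given examples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def microbatched_grad(w, xs, ys, micro_size):
    """Accumulate the full-batch gradient one microbatch at a time."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_size):
        mx = xs[start:start + micro_size]
        my = ys[start:start + micro_size]
        # Weight each microbatch mean by its share of the full batch, so that
        # an uneven final microbatch is still counted correctly.
        total += grad_mse(w, mx, my) * len(mx) / n
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
w = 0.5

full = grad_mse(w, xs, ys)                          # whole batch at once
micro = microbatched_grad(w, xs, ys, micro_size=2)  # three microbatches of two
print(abs(full - micro) < 1e-9)
```

The equivalence holds here because the loss is a plain mean over examples. Layers whose outputs depend on batch statistics, such as batch normalization, would not be exactly equivalent under this scheme.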

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dynamic Convolution · Average Pooling · Convolution · Batch Normalization · Global Average Pooling · Kaiming Initialization · Wide Residual Block