Low-Memory Neural Network Training: A Technical Report
Nimit S. Sohoni, Christopher R. Aberger, Megan Leszczynski, Jian Zhang, and Christopher Ré

TL;DR
This paper investigates the actual memory needs for training neural networks and evaluates techniques like sparsity, low precision, microbatching, and gradient checkpointing to significantly reduce memory usage with minimal impact on model quality.
Contribution
It provides a comprehensive analysis of memory reduction techniques for training neural networks and demonstrates their effectiveness through extensive experiments on benchmark models.
Findings
Up to 60.7x memory reduction for WideResNet training with 0.4% accuracy loss.
Up to 8.7x memory reduction for Transformer training with 0.15 BLEU score drop.
Tradeoffs between memory savings, accuracy, and computation are characterized.
Abstract
Memory is increasingly often the bottleneck when training neural network models. Despite this, techniques to lower the overall memory requirements of training have been less widely studied compared to the extensive literature on reducing the memory requirements of inference. In this paper we study a fundamental question: How much memory is actually needed to train a neural network? To answer this question, we profile the overall memory usage of training on two representative deep learning benchmarks -- the WideResNet model for image classification and the DynamicConv Transformer model for machine translation -- and comprehensively evaluate four standard techniques for reducing the training memory requirements: (1) imposing sparsity on the model, (2) using low precision, (3) microbatching, and (4) gradient checkpointing. We explore how each of these techniques in isolation affects both…
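Of the four techniques listed in the abstract, microbatching is the easiest to illustrate in a few lines. The sketch below is not the paper's implementation: it is a minimal pure-Python example, assuming a toy scalar linear model with mean squared error, and the names grad_mse and microbatched_grad are hypothetical. It shows that gradients accumulated over small microbatches match the full-batch gradient, while only one microbatch of examples needs to be processed at a time.

```python
# Illustrative sketch of microbatching (gradient accumulation); not the paper's code.
# Assumed toy model: y_hat = w * x + b, trained with mean squared error.

def grad_mse(w, b, xs, ys):
    """Return (dL/dw, dL/db) of the mean squared error over the given examples."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def microbatched_grad(w, b, xs, ys, micro_size):
    """Accumulate the full-batch gradient one microbatch at a time."""
    total = len(xs)
    acc_w, acc_b = 0.0, 0.0
    for start in range(0, total, micro_size):
        mx = xs[start:start + micro_size]
        my = ys[start:start + micro_size]
        gw, gb = grad_mse(w, b, mx, my)
        # Weight each microbatch mean by its share of the full batch,
        # so an uneven final microbatch is still handled correctly.
        acc_w += gw * len(mx) / total
        acc_b += gb * len(mx) / total
    return acc_w, acc_b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

full = grad_mse(0.5, 0.0, xs, ys)
micro_even = microbatched_grad(0.5, 0.0, xs, ys, micro_size=2)
micro_uneven = microbatched_grad(0.5, 0.0, xs, ys, micro_size=4)
```

In a real network, peak activation memory scales with the microbatch size rather than the full batch size, at the cost of more sequential passes. The exact equivalence shown here holds only when the loss is a plain average over examples; layers whose statistics depend on the batch, such as batch normalization, behave differently under microbatching.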
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dynamic Convolution · Average Pooling · Convolution · Batch Normalization · Global Average Pooling · Kaiming Initialization · Wide Residual Block
