Revisiting Small Batch Training for Deep Neural Networks
Dominic Masters, Carlo Luschi

TL;DR
This paper investigates the effects of mini-batch size on deep neural network training, revealing that smaller batches often yield better stability and performance, challenging the trend of using very large mini-batches.
Contribution
The study compares test performance across mini-batch sizes experimentally, using a learning rate that keeps the average weight update per gradient calculation constant, and highlights the advantages of small batches for stability and generalization.
Findings
Small mini-batches (2-32) outperform large ones in stability and test performance.
Increasing mini-batch size narrows the range of stable learning rates.
Mini-batches with thousands of samples train less stably than small ones.
Abstract
Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance and allows a significantly smaller memory footprint, which might also be exploited to improve machine throughput. In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. We adopt a learning rate that corresponds to a constant average weight update per gradient calculation (i.e., per unit cost of computation), and point out that this results in a variance of the weight updates that increases linearly with the mini-batch size m. The collected experimental results for the CIFAR-10,…
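The learning-rate argument in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes per-example gradients are i.i.d. scalar Gaussian draws, and the function name `simulate_update_stats` and all parameter values are hypothetical. Under that assumption, scaling the learning rate linearly with the mini-batch size keeps the mean weight update per gradient calculation constant, while the variance of each weight update grows roughly in proportion to the mini-batch size.

```python
import numpy as np


def simulate_update_stats(batch_size, base_lr=0.01, grad_mean=1.0,
                          grad_std=2.0, n_trials=20000, seed=0):
    """Estimate statistics of one SGD weight update for a given batch size.

    Assumption (illustrative only): per-example gradients are i.i.d.
    scalar draws from N(grad_mean, grad_std**2).
    """
    rng = np.random.default_rng(seed)
    # One row per simulated SGD step, one column per example in the batch.
    grads = rng.normal(grad_mean, grad_std, size=(n_trials, batch_size))

    # Linear scaling: the learning rate grows with the batch size, so the
    # update equals -base_lr times the SUM of per-example gradients.
    lr = base_lr * batch_size
    updates = -lr * grads.mean(axis=1)

    mean_update_per_gradient = updates.mean() / batch_size
    update_variance = updates.var()
    return mean_update_per_gradient, update_variance


if __name__ == "__main__":
    for m in (2, 8, 32, 128):
        mean_per_grad, var = simulate_update_stats(m)
        print(f"m={m:4d}  mean update per gradient={mean_per_grad:+.5f}  "
              f"update variance={var:.5f}")
```

In this toy setting the mean update per gradient stays near `-base_lr * grad_mean` for every batch size, and the update variance is close to `base_lr**2 * batch_size * grad_std**2`. Real networks have correlated, high-dimensional gradients, so this shows only the scaling behaviour, not the paper's experiments.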
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
