Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent
Hikaru Umeda, Hideaki Iiduka

TL;DR
This paper provides theoretical analysis and numerical evidence that increasing both the batch size and the learning rate accelerates mini-batch stochastic gradient descent in training deep neural networks, in particular under schedulers that pair an increasing batch size with an increasing or warm-up learning rate.
Contribution
It introduces and analyzes schedulers that combine an increasing batch size with an increasing or warm-up learning rate, and shows that they reduce the full gradient norm of the empirical loss faster than schedulers with a constant batch size or a purely decaying learning rate.
Findings
Schedulers with increasing batch size and learning rate accelerate convergence.
Increasing batch size and learning rate together reduces the full gradient norm faster.
Schedulers with warm-up or increasing learning rate outperform constant or decaying schedules.
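One common intuition for why a growing batch size helps, offered here as background and not as the paper's proof, is that the variance of a mini-batch gradient estimate shrinks as the batch grows, which leaves room for a larger learning rate. The sketch below checks this empirically on a toy one-dimensional least-squares loss; the data, the loss, and the helper names `minibatch_grad` and `grad_variance` are illustrative assumptions, not taken from the paper.

```python
import random
import statistics

# Toy setup (assumption): per-sample loss f_i(w) = 0.5 * (w - x_i)^2,
# so the per-sample gradient at w is simply (w - x_i).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
w = 1.0  # fixed point at which gradients are evaluated


def minibatch_grad(w, data, b, rng):
    """Average gradient over a mini-batch of size b drawn without replacement."""
    batch = rng.sample(data, b)
    return sum(w - x for x in batch) / b


def grad_variance(b, trials=2000):
    """Empirical variance of the mini-batch gradient estimate for batch size b."""
    grads = [minibatch_grad(w, data, b, rng) for _ in range(trials)]
    return statistics.pvariance(grads)


if __name__ == "__main__":
    for b in (4, 16, 64):
        print(f"batch size {b:3d}: gradient variance ~ {grad_variance(b):.4f}")
```

In this toy setting the printed variance should fall roughly in proportion to 1/b, which is the effect an increasing batch size exploits.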
Abstract
The performance of mini-batch stochastic gradient descent (SGD) strongly depends on setting the batch size and learning rate to minimize the empirical loss in training the deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results of supporting analyses showing that…
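The four schedulers named in the abstract can be sketched as plain functions of the epoch. The abstract does not give the concrete schedule shapes, so everything specific below is an assumption chosen for illustration: the step-wise batch doubling, the cosine decay, the geometric learning-rate growth with a cap, the linear warm-up, and all default constants and function names.

```python
import math

# Hypothetical defaults; none of these values come from the paper.
def batch_size(epoch, increasing, b0=8, factor=2, every=20):
    """Constant batch size, or one multiplied by `factor` every `every` epochs."""
    if not increasing:
        return b0
    return b0 * factor ** (epoch // every)


def learning_rate(epoch, kind, lr0=0.1, lr_max=0.5, total=100,
                  every=20, growth=1.5, warmup=10):
    """Learning rate at `epoch` for one of three illustrative shapes."""
    if kind == "decay":
        # Cosine decay from lr0 toward 0 over `total` epochs (assumed shape).
        return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total))
    if kind == "increase":
        # Step-wise geometric growth, capped at lr_max (assumed shape).
        return min(lr_max, lr0 * growth ** (epoch // every))
    if kind == "warmup_decay":
        # Linear warm-up to lr_max, then cosine decay (assumed shape).
        if epoch < warmup:
            return lr0 + (lr_max - lr0) * epoch / warmup
        progress = (epoch - warmup) / (total - warmup)
        return 0.5 * lr_max * (1 + math.cos(math.pi * progress))
    raise ValueError(f"unknown learning-rate kind: {kind}")


# Labels (i)-(iv) follow the abstract: (batch size increasing?, learning-rate shape).
SCHEDULERS = {
    "i": (False, "decay"),
    "ii": (True, "decay"),
    "iii": (True, "increase"),
    "iv": (True, "warmup_decay"),
}


def schedule(name, epoch):
    """Return (batch_size, learning_rate) for scheduler `name` at `epoch`."""
    increasing, kind = SCHEDULERS[name]
    return batch_size(epoch, increasing), learning_rate(epoch, kind)


if __name__ == "__main__":
    for name in SCHEDULERS:
        print(name, [schedule(name, e) for e in (0, 20, 40, 80)])
```

In a training loop, `schedule(name, epoch)` would be called once per epoch to rebuild the data loader with the new batch size and to set the optimizer's learning rate.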
Taxonomy
Topics: Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques · Neural Networks and Applications
Methods: Stochastic Gradient Descent
