Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity
Hikaru Umeda, Hideaki Iiduka

TL;DR
This paper proposes an adaptive scheduling strategy for mini-batch SGD that dynamically adjusts batch size and learning rate based on theoretical insights, leading to faster convergence in training neural networks.
Contribution
The paper introduces an adaptive joint scheduler for SGD that uses the critical batch size concept to improve convergence speed.
Findings
The adaptive joint scheduler converges faster than the existing schedulers it was compared against.
Adjusting the batch size and learning rate according to the observed decay of the full gradient norm improves training efficiency.
Theoretical analysis supports the effectiveness of the proposed strategy.
Abstract
The convergence behavior of mini-batch stochastic gradient descent (SGD) is highly sensitive to the batch size and learning rate settings. Recent theoretical studies have identified the existence of a critical batch size that minimizes stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations required to reach a stationary point of the empirical loss function in a deep neural network. An adaptive scheduling strategy is introduced to accelerate SGD that leverages theoretical findings on the critical batch size. The batch size and learning rate are adjusted on the basis of the observed decay in the full gradient norm during training. Experiments using an adaptive joint scheduler based on this strategy demonstrated improved convergence speed compared with that of existing schedulers.
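The abstract describes the mechanism only at a high level: batch size and learning rate change in response to the observed decay of the full gradient norm. A minimal Python sketch of one way such a joint scheduler could be organized follows. It is not the authors' algorithm. The class name `JointScheduler`, the threshold rule, and every numeric factor are hypothetical choices made for illustration, including the decision to raise the learning rate together with the batch size.

```python
from dataclasses import dataclass


@dataclass
class JointScheduler:
    """Hypothetical joint batch-size / learning-rate scheduler.

    Illustrative only: the update rule and all default values are
    assumptions, not taken from the paper.
    """
    batch_size: int = 32
    lr: float = 0.1
    threshold: float = 1.0        # assumed target for the full gradient norm
    batch_growth: float = 2.0     # assumed multiplicative batch-size factor
    lr_growth: float = 1.5        # assumed multiplicative learning-rate factor
    threshold_decay: float = 0.5  # assumed shrink factor for the next target
    max_batch_size: int = 4096    # assumed cap on the batch size

    def step(self, full_grad_norm: float) -> bool:
        """Advance to the next stage once the norm reaches the target.

        Returns True if the batch size and learning rate were changed.
        """
        if full_grad_norm > self.threshold:
            return False  # norm has not decayed enough; keep current settings
        grown = int(self.batch_size * self.batch_growth)
        self.batch_size = min(grown, self.max_batch_size)
        self.lr *= self.lr_growth
        self.threshold *= self.threshold_decay
        return True


if __name__ == "__main__":
    sched = JointScheduler()
    # Made-up gradient norms standing in for values measured during training.
    for norm in [2.0, 1.2, 0.9, 0.7, 0.4, 0.3, 0.2]:
        changed = sched.step(norm)
        print(f"norm={norm:.2f} changed={changed} "
              f"batch_size={sched.batch_size} lr={sched.lr:.4f}")
```

In a real training loop, `full_grad_norm` would come from a periodic evaluation of the gradient over the whole training set, which is expensive, so how often to measure it is a further design choice this sketch leaves open.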
