Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity
Hikaru Umeda, Hideaki Iiduka

TL;DR
This paper derives optimal schedules for increasing batch size and learning rate in SGD to minimize stochastic first-order oracle complexity, improving training efficiency for large deep learning models.
Contribution
It provides the first theoretical derivation of optimal growth schedules for batch size and learning rate in SGD based on SFO complexity, with validated practical guidelines.
Findings
Optimal growth schedules reduce SFO complexity.
Schedules improve training efficiency for large-batch deep learning.
Validated through extensive experiments.
Abstract
The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic gradient methods. However, naive scheduling of these hyperparameters can degrade optimization efficiency and compromise generalization. Motivated by recent theoretical insights, we investigated how the batch size and learning rate should be increased during training to balance efficiency and convergence. We analyzed this problem on the basis of stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations needed to reach an ε-approximate stationary point of the empirical loss. We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity and…
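To make the quantity in the abstract concrete, the sketch below counts gradient evaluations under a growing batch size: each SGD step with batch size b costs b gradient evaluations, so the total over a run is the sum of the per-step batch sizes. The geometric (multiply every fixed number of steps) schedule, the function names `growth_schedule` and `sfo_count`, and all numeric values are illustrative assumptions; they are not the optimal schedules derived in the paper, which the truncated abstract does not state.

```python
# Illustrative sketch only: a geometric growth schedule is ASSUMED here,
# and the function names and constants are hypothetical, not from the paper.

def growth_schedule(initial, factor, interval, num_steps, cap=None):
    """Return the per-step value when `initial` is multiplied by `factor`
    once every `interval` steps, optionally clipped at `cap`."""
    values = []
    for step in range(num_steps):
        value = initial * factor ** (step // interval)
        if cap is not None:
            value = min(value, cap)
        values.append(value)
    return values


def sfo_count(batch_sizes):
    """Total gradient evaluations for a run: one per sample per step."""
    return sum(batch_sizes)


num_steps = 400

# Batch size doubles every 100 steps: 8, 16, 32, 64 (assumed values).
batch_sizes = growth_schedule(initial=8, factor=2, interval=100, num_steps=num_steps)

# Learning rate grows alongside the batch size (assumed factor of 1.5).
learning_rates = growth_schedule(initial=0.1, factor=1.5, interval=100, num_steps=num_steps)

growing_cost = sfo_count(batch_sizes)       # 100 * (8 + 16 + 32 + 64)
constant_cost = sfo_count([64] * num_steps)  # fixed batch size of 64 throughout

print(f"growing batch:  {growing_cost} gradient evaluations")
print(f"constant batch: {constant_cost} gradient evaluations")
```

This only compares the cost of a fixed number of steps. SFO complexity as defined above also depends on how many steps each schedule needs to reach an ε-approximate stationary point, which this sketch does not model.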
