Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling
Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, Dian Jiao, Weiyan Wang, Chengjun Liu, Zheng Fang, Jinbao Xue, Yangyu Tao, Bin Cui, Di Wang

TL;DR
This paper investigates the relationship between the optimal learning rate and the batch size for Adam-style optimizers. It reveals a surge phenomenon in which the optimal learning rate first increases and then decreases as the batch size grows, and supports this with theory and experiments.
Contribution
The paper introduces a new scaling law for Adam-style optimizers in which the optimal learning rate surges (rises, then falls) as the batch size increases, backed by a theoretical proof and extensive experiments.
Findings
The optimal learning rate first rises and then falls as the batch size increases.
The surge peak shifts toward larger batch sizes over training.
Experimental results verify the proposed scaling law.
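The rise-then-fall shape described in the findings above can be pictured with a toy curve. This is only a sketch: the functional form is an illustrative assumption chosen because it peaks at a finite batch size, the names `toy_optimal_lr`, `lr_peak`, and `b_peak` are hypothetical, and this summary does not state the paper's actual scaling law. The linear rule is included for contrast with the behavior the abstract attributes to SGD-style optimizers.

```python
import math


def toy_optimal_lr(batch_size, lr_peak=1e-3, b_peak=256.0):
    """Illustrative (assumed) curve: rises up to b_peak, then falls.

    lr_peak and b_peak are hypothetical knobs, not values from the paper.
    """
    ratio = math.sqrt(batch_size / b_peak)
    # The denominator is 1 at batch_size == b_peak and grows on either side.
    return lr_peak / (0.5 * (ratio + 1.0 / ratio))


def linear_rule_lr(batch_size, base_lr=1e-4, base_batch=32.0):
    """Linear scaling rule commonly used for SGD-style optimizers."""
    return base_lr * batch_size / base_batch


if __name__ == "__main__":
    for b in (16, 64, 256, 1024, 4096):
        print(b, toy_optimal_lr(b), linear_rule_lr(b))
```

Moving `b_peak` to a larger value mimics the second finding, where the peak shifts toward larger batch sizes as training progresses.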
Abstract
In current deep learning tasks, Adam-style optimizers such as Adam, Adagrad, RMSProp, Adafactor, and Lion have been widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers, and they require careful tuning to enable effective convergence. Previous research has shown that, for SGD-style optimizers, the optimal learning rate increases linearly with batch size or follows similar rules. However, this conclusion does not apply to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we raise the scaling law between batch sizes and…
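The abstract's remark that these optimizers update parameters using the sign of the gradient can be made concrete with a small sketch. It is a simplification under stated assumptions: `sign_step` is a hypothetical pure sign update, not the full Adam rule (which also uses moment estimates), and both function names are illustrative.

```python
def sgd_step(params, grads, lr):
    """Plain SGD: the step size scales with the gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]


def sign_step(params, grads, lr):
    """Sign-based update (simplified stand-in for Adam-style behavior).

    Every coordinate moves by exactly lr, regardless of gradient magnitude.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    return [p - lr * sign(g) for p, g in zip(params, grads)]


if __name__ == "__main__":
    params = [1.0, -2.0]
    grads = [0.5, -4.0]
    scaled = [10 * g for g in grads]  # same direction, 10x larger
    print("sgd :", sgd_step(params, grads, 0.1), sgd_step(params, scaled, 0.1))
    print("sign:", sign_step(params, grads, 0.1), sign_step(params, scaled, 0.1))
```

Rescaling the gradient changes the SGD step but leaves the sign-based step untouched, which is one reason a learning-rate rule tuned for SGD-style optimizers need not carry over.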
Taxonomy
Topics: Educational Technology and Assessment · Metaheuristic Optimization Algorithms Research · Image Processing Techniques and Applications
Methods: Evolved Sign Momentum · Adafactor · RMSProp · Stochastic Gradient Descent · Adam
