Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit
Oleg Filatov, Jan Ebert, Jiangtao Wang, Stefan Kesselheim

TL;DR
This paper investigates how the optimal learning rate and batch size for large language model pretraining scale as the amount of training data grows toward the infinite data limit, and how they depend on the token budget and the critical batch size.
Contribution
It uncovers the dependence of optimal learning rate and batch size on pretraining token budget and critical batch size, extending scaling rules to the infinite data limit.
Findings
Optimal $\eta$ and $B$ depend on token budget $T$ and critical batch size $B_\text{crit}$.
Critical batch size $B_\text{crit}$ scales proportionally with $T$.
Sensitivity of loss to learning rate decreases with increasing $T$.
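The proportionality $B_\text{crit} \propto T$ stated above can be illustrated with a minimal sketch. It assumes a single measured reference point and extrapolates linearly in the token budget. The function name and the numeric values are hypothetical and do not come from the paper; the constant of proportionality would have to be measured for a given setup.

```python
def extrapolate_critical_batch_size(b_crit_ref: float,
                                    tokens_ref: float,
                                    tokens_new: float) -> float:
    """Extrapolate the critical batch size assuming B_crit is proportional to T.

    b_crit_ref : critical batch size measured at the reference token budget
    tokens_ref : reference token budget T_ref
    tokens_new : token budget T at which B_crit is wanted
    """
    if b_crit_ref <= 0 or tokens_ref <= 0 or tokens_new <= 0:
        raise ValueError("all inputs must be positive")
    # Linear scaling: B_crit(T) = B_crit(T_ref) * T / T_ref
    return b_crit_ref * (tokens_new / tokens_ref)


# Hypothetical reference point (illustrative numbers only, not from the paper).
b_crit = extrapolate_critical_batch_size(b_crit_ref=256, tokens_ref=1e9, tokens_new=4e9)
print(b_crit)
```

Under this assumption, quadrupling the token budget quadruples the critical batch size, which is consistent with the finding that a fixed batch size becomes suboptimal as $T$ grows.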
Abstract
One of the main challenges in optimal scaling of large language models (LLMs) is the prohibitive cost of hyperparameter tuning, particularly learning rate $\eta$ and batch size $B$. While techniques like $\mu$P (Yang et al., 2022) provide scaling rules for optimal $\eta$ transfer in the infinite model size limit, the optimal scaling behavior in the infinite data size limit remains unknown. We fill in this gap by observing for the first time an intricate dependence of optimal $\eta$ scaling on the pretraining token budget $T$, $B$ and its relation to the critical batch size $B_\text{crit}$, which we measure to evolve as $B_\text{crit} \propto T$. Furthermore, we show that the optimal batch size is positively correlated with $B_\text{crit}$: keeping it fixed becomes suboptimal over time even if learning rate is scaled optimally. Surprisingly, our results demonstrate that the…
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Advanced Thermodynamics and Statistical Mechanics · Distributed Sensor Networks and Detection Algorithms
