Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

Oleg Filatov; Jan Ebert; Jiangtao Wang; Stefan Kesselheim

arXiv:2410.05838 · cs.LG · January 10, 2025

Open Access

TL;DR

This paper investigates how the optimal learning rate and batch size scale in large language model pretraining as the data size approaches infinity, finding that both depend on the pretraining token budget and the critical batch size.

Contribution

It uncovers the dependence of optimal learning rate and batch size on pretraining token budget and critical batch size, extending scaling rules to the infinite data limit.

Findings

01

Optimal $η$ and $B$ depend on token budget $T$ and critical batch size $B_\text{crit}$.

02

Critical batch size $B_\text{crit}$ scales proportionally with $T$.

03

Sensitivity of loss to learning rate decreases with increasing $T$.

Abstract

One of the main challenges in optimal scaling of large language models (LLMs) is the prohibitive cost of hyperparameter tuning, particularly learning rate $η$ and batch size $B$. While techniques like $μ$P (Yang et al., 2022) provide scaling rules for optimal $η$ transfer in the infinite model size limit, the optimal scaling behavior in the infinite data size limit remains unknown. We fill in this gap by observing for the first time an intricate dependence of optimal $η$ scaling on the pretraining token budget $T$, $B$, and its relation to the critical batch size $B_\text{crit}$, which we measure to evolve as $B_\text{crit} \propto T$. Furthermore, we show that the optimal batch size is positively correlated with $B_\text{crit}$: keeping it fixed becomes suboptimal over time even if learning rate is scaled optimally. Surprisingly, our results demonstrate that the…
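To make the relation $B_\text{crit} \propto T$ concrete, the following minimal sketch shows one common way a proportionality claim of this kind could be checked: fit a straight line in log-log space and read off the exponent. This is not the authors' procedure. The function name `fit_power_law`, the constant `prefactor`, and all numeric values are hypothetical, and the data are synthetic rather than measurements from the paper.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = c * x**alpha by least squares on log(y) = alpha*log(x) + log(c)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    alpha, log_c = np.polyfit(np.log(x), np.log(y), deg=1)
    return alpha, np.exp(log_c)


# Synthetic, purely illustrative inputs (NOT results from the paper).
token_budgets = np.array([1e9, 2e9, 4e9, 8e9, 16e9])  # hypothetical values of T
prefactor = 2.5e-4                                     # hypothetical constant
critical_batch = prefactor * token_budgets             # exact B_crit = c * T

alpha, c = fit_power_law(token_budgets, critical_batch)
print(f"fitted exponent alpha = {alpha:.3f}")
```

On exactly linear synthetic data the fitted exponent comes out at 1 up to floating-point error. With real measurements, an exponent near 1 would be consistent with proportional scaling, while a clearly different value would indicate some other power law.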

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Advanced Thermodynamics and Statistical Mechanics · Distributed Sensor Networks and Detection Algorithms