Scaling Optimal LR Across Token Horizons
Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song

TL;DR
This paper investigates how the optimal learning rate for training large language models varies with token horizon, demonstrating that it follows a predictable scaling law and providing practical transfer rules.
Contribution
It introduces the first large-scale empirical study on hyperparameter transfer across dataset size (token horizon), revealing a scaling law for optimal learning rate.
Findings
Optimal LR decreases with longer token horizons.
Optimal LR follows a predictable scaling law.
Applying the scaling law improves training efficiency.
Abstract
State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this we conduct a large-scale empirical study on how optimal learning rate (LR) depends on token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates smaller LR. Second, we demonstrate that the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via…
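The idea of estimating the optimal LR at a long horizon from shorter ones can be sketched as a log-log fit. This is a minimal sketch, not the authors' procedure: it assumes the scaling law is a pure power law, LR*(D) = B * D^(-beta), and the function names (`fit_lr_power_law`, `predict_optimal_lr`) and all data points are hypothetical and synthetic, not results from the paper.

```python
import numpy as np


def fit_lr_power_law(token_horizons, optimal_lrs):
    """Fit LR*(D) = B * D**(-beta) by least squares in log-log space.

    Assumption: a pure power law; the paper's exact functional form
    is not given in this summary.
    """
    log_d = np.log(np.asarray(token_horizons, dtype=float))
    log_lr = np.log(np.asarray(optimal_lrs, dtype=float))
    # A power law is a straight line in log-log space:
    # log LR* = log B - beta * log D
    slope, intercept = np.polyfit(log_d, log_lr, deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_optimal_lr(B, beta, token_horizon):
    """Extrapolate the fitted law to a new (typically longer) horizon."""
    return B * float(token_horizon) ** (-beta)


# Synthetic short-horizon "sweep results", generated from a known law
# purely to exercise the code. These are NOT measurements.
horizons = np.array([25e9, 50e9, 100e9, 200e9])
true_B, true_beta = 1.0, 0.3  # arbitrary illustrative values
best_lrs = true_B * horizons ** (-true_beta)

B, beta = fit_lr_power_law(horizons, best_lrs)
lr_400b = predict_optimal_lr(B, beta, 400e9)
print(f"fitted beta = {beta:.3f}, extrapolated LR at 400B tokens = {lr_400b:.3e}")
```

In practice the inputs would be the best LR found by a sweep at each short horizon. Real sweep results are noisy, so the fitted exponent carries uncertainty that this sketch does not estimate.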
Peer Reviews
Decision: ICLR 2025 Poster
The paper has significant strong points. 1. Experimental Scale. The authors had the capability to run large-scale experiments (billion scale). This makes the results very reliable in the specific experimental setting adopted here (i.e. architecture, model size, optimizer setting). 2. The research question of how the learning rate should scale up is very important in practical settings and it is not covered in either theoretical or empirical research. Thus, the experimental findings are both n
The paper has some weaknesses, mainly due to the depth of the investigation that is performed. More concretely (in order of importance): 1. It is unclear whether the observed inverse relationship between the number of tokens used for pretraining and optimal learning rate arises because the model is trained for a larger number of time steps, or because the network has processed more data. This is quite a fundamental experiment, and it’s unclear what the view taken by the author
As described, the problem studied by the paper is important and the paper's findings give an important starting point for predicting optimal LR.
There are some other hyperparameters which strongly interact with learning rate such as weight decay and warmup. The paper does not explore the interaction between these factors and learning rate.
The paper is clear and well written. It delivers what it promises. This is an impactful research area. Efficient methods or scaling laws for finding the best step sizes help to improve overall training efficiency and performance.
The paper provides no analysis or in-depth intuition on why optimal LR scales with exponent -0.32 ≈ -1/3. The scale of the experiments is rather small for a fully empirical paper. Unless there is an analytical explanation or intuition provided for the observed patterns, the trends might be unreliable for larger networks and longer horizons. In the same vein, would the scaling law remain the same for other transformers like Llama and Mistral? It would also be interesting to study how optimal LR sca
