Optimization Hyper-parameter Laws for Large Language Models
Xingyu Xie, Kuangyu Ding, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei

TL;DR
This paper introduces Opt-Laws, a framework that predicts the final training loss of large language models from the learning-rate schedule, model size, and data size, and uses those predictions to guide hyper-parameter schedule selection.
Contribution
Opt-Laws provides a theoretically grounded, interpretable method to predict training outcomes and select hyper-parameter schedules for large language models.
Findings
Achieves a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations
Correctly identifies best schedule family in all out-of-family tests
Detects training divergence with F1 score of 0.92
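For readers unfamiliar with the two metrics named above, here is a minimal Python sketch of how a Top-k hit rate and an F1 score are commonly computed. It is not the authors' evaluation code. It assumes a "hit" means the schedule with the lowest true final loss appears among the k schedules with the lowest predicted loss, which may differ from the paper's exact definition. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def top_k_hit(predicted: Sequence[float], actual: Sequence[float], k: int = 2) -> bool:
    """True if the schedule with the lowest actual loss is among the
    k schedules with the lowest predicted loss (assumed definition)."""
    best_actual = min(range(len(actual)), key=actual.__getitem__)
    top_k_predicted = sorted(range(len(predicted)), key=predicted.__getitem__)[:k]
    return best_actual in top_k_predicted


def top_k_hit_rate(configs: Sequence[Tuple[Sequence[float], Sequence[float]]], k: int = 2) -> float:
    """Fraction of configurations for which top_k_hit holds."""
    hits = [top_k_hit(pred, act, k) for pred, act in configs]
    return sum(hits) / len(hits)


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for binary labels
    (here: True means the run diverged)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): each entry is (predicted losses, actual losses)
# for three candidate schedules under one configuration.
toy_configs = [
    ([2.31, 2.28, 2.40], [2.30, 2.29, 2.41]),  # best actual schedule is in the predicted top 2
    ([2.10, 2.25, 2.20], [2.22, 2.15, 2.18]),  # best actual schedule is missed
]
hit_rate = top_k_hit_rate(toy_configs, k=2)

# Toy divergence labels (made up): predicted vs. observed divergence.
toy_f1 = f1_score([True, True, False, False], [True, False, True, False])
```

On the toy data, one of the two configurations is a hit, and the divergence labels give one true positive, one false positive, and one false negative.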
Abstract
Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations,…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
