Optimization Hyper-parameter Laws for Large Language Models

Xingyu Xie; Kuangyu Ding; Shuicheng Yan; Kim-Chuan Toh; Tianwen Wei

arXiv:2409.04777 · cs.LG · May 21, 2026

TL;DR

This paper introduces Opt-Laws, a framework that predicts the final training loss of large language models from the learning-rate schedule, model size, and data size, so that hyper-parameter schedules can be pre-selected from small-scale experiments.

Contribution

Opt-Laws provides a theoretically grounded, interpretable method to predict training outcomes and select hyper-parameter schedules for large language models.

Findings

01. Achieves 94% Top-2 hit rate in schedule candidate identification

02. Correctly identifies best schedule family in all out-of-family tests

03. Detects training divergence with F1 score of 0.92

Abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations,…
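The excerpt above does not spell out how the Top-2 hit rate is computed. As a rough illustration only, the sketch below assumes one common reading: a trial counts as a hit when the schedule with the lowest actual final loss is among the two schedules with the lowest predicted loss. The function names `top_k_hit` and `top_k_hit_rate` and the example numbers are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def top_k_hit(predicted_loss: Sequence[float],
              actual_loss: Sequence[float],
              k: int = 2) -> bool:
    """Return True if the truly best schedule is among the k best predicted.

    Assumed definition (not taken from the paper): "best" means lowest loss,
    and index i in both sequences refers to the same candidate schedule.
    """
    if len(predicted_loss) != len(actual_loss) or len(predicted_loss) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Index of the schedule with the lowest observed final loss.
    best_actual = min(range(len(actual_loss)), key=actual_loss.__getitem__)
    # Schedule indices ordered from lowest to highest predicted loss.
    ranked = sorted(range(len(predicted_loss)), key=predicted_loss.__getitem__)
    return best_actual in ranked[:k]


def top_k_hit_rate(trials: Sequence[Tuple[Sequence[float], Sequence[float]]],
                   k: int = 2) -> float:
    """Fraction of trials (predicted, actual) that count as a top-k hit."""
    if len(trials) == 0:
        raise ValueError("need at least one trial")
    hits = [top_k_hit(pred, act, k) for pred, act in trials]
    return sum(hits) / len(hits)


# Made-up numbers: two held-out configurations, three candidate schedules each.
example_trials = [
    ([2.10, 2.00, 2.30], [2.05, 2.10, 2.40]),  # true best is ranked 2nd: hit
    ([3.00, 2.50, 2.40], [2.20, 2.60, 2.70]),  # true best is ranked 3rd: miss
]
example_rate = top_k_hit_rate(example_trials, k=2)
```

Under this assumed definition, a 94% Top-2 hit rate would mean that in 94% of held-out configurations the truly best schedule was one of the two candidates the predictor ranked highest.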

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques