Rethinking Language Model Scaling under Transferable Hypersphere Optimization
Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen

TL;DR
This paper introduces HyperP, a hypersphere optimization framework for large language models that enables stable, efficient transfer of hyperparameters across scales and architectures, improving training stability and performance.
Contribution
HyperP is presented as the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer, with the aim of more stable and efficient scaling.
Findings
A single base learning rate transfers across scales, improving compute efficiency by 1.58×.
HyperP maintains bounded instability indicators during training at various scales.
SqrtGate improves MoE gating stability and expert balance under hypersphere constraints.
Abstract
Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-μP remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32…
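The claim that weight decay is a first-order no-op on the Frobenius sphere can be checked numerically with a minimal sketch. The setup below is an assumption for illustration, not the paper's implementation: the update is a random matrix standing in for an optimizer step (it is not Muon), weight decay is applied in decoupled form, and the helper names `project_to_sphere`, `sphere_step`, and `weight_decay_gap` are hypothetical. The reasoning is that W - lr * (U + wd * W) = (1 - lr * wd) * W - lr * U, so after rescaling back to the sphere the decay term only changes the effective step size from lr to lr / (1 - lr * wd), a difference of order lr².

```python
import numpy as np

def project_to_sphere(w, radius):
    """Rescale w so that its Frobenius norm equals radius."""
    return w * (radius / np.linalg.norm(w))

def sphere_step(w, update, lr, weight_decay, radius):
    """One illustrative step: decoupled weight decay plus update, then re-projection."""
    stepped = w - lr * (update + weight_decay * w)
    return project_to_sphere(stepped, radius)

def weight_decay_gap(lr, weight_decay=0.1, seed=0):
    """Frobenius distance between the projected steps with and without weight decay."""
    rng = np.random.default_rng(seed)
    radius = 1.0
    w = project_to_sphere(rng.standard_normal((8, 8)), radius)
    update = rng.standard_normal((8, 8))  # random stand-in for an optimizer update (assumption)
    with_wd = sphere_step(w, update, lr, weight_decay, radius)
    without_wd = sphere_step(w, update, lr, 0.0, radius)
    return np.linalg.norm(with_wd - without_wd)

gap_large = weight_decay_gap(lr=1e-2)
gap_small = weight_decay_gap(lr=1e-3)
ratio = gap_large / gap_small
print(f"gap at lr=1e-2: {gap_large:.3e}")
print(f"gap at lr=1e-3: {gap_small:.3e}")
print(f"ratio: {ratio:.1f}")
```

If the effect of weight decay is second order in the learning rate, shrinking the learning rate by 10x should shrink the gap by roughly 100x rather than 10x, which is what the printed ratio is meant to probe.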
