Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Liliang Ren; Yang Liu; Yelong Shen; Weizhu Chen

arXiv:2603.28743 · cs.LG · April 7, 2026

TL;DR

This paper introduces HyperP, a hypersphere optimization framework for large language models that enables stable, efficient transfer of hyperparameters across scales and architectures, improving training stability and performance.

Contribution

HyperP is the first framework to transfer optimal learning rates across model width, depth, training tokens, and MoE granularity under the Frobenius-sphere constraint with the Muon optimizer, with the aim of more stable and compute-efficient scaling.

Findings

01

A single base learning rate transfers across scales, improving compute efficiency by 1.58×.

02

HyperP maintains bounded instability indicators during training at various scales.

03

SqrtGate improves MoE gating stability and expert balance under hypersphere constraints.

Abstract

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-μP remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32…
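The abstract's claim that weight decay is a first-order no-op on the Frobenius sphere can be illustrated numerically. The sketch below is not the paper's proof or its training code: it assumes decoupled weight decay, a simple rescale-to-radius projection after each step, and a random matrix standing in for the optimizer update (Muon's orthogonalized update is not reproduced here). The names `project_to_sphere` and `sphere_step` are hypothetical. Under these assumptions, a step with weight decay `wd` lands on exactly the same point as a step without decay at the slightly larger learning rate `lr / (1 - lr * wd)`, so the effect of decay is of order `lr**2 * wd`.

```python
import numpy as np

def project_to_sphere(w, radius):
    """Rescale w so that its Frobenius norm equals radius."""
    return w * (radius / np.linalg.norm(w))

def sphere_step(w, update, lr, weight_decay, radius):
    """Apply an update with decoupled weight decay, then re-project (assumed scheme)."""
    stepped = w - lr * (update + weight_decay * w)
    return project_to_sphere(stepped, radius)

rng = np.random.default_rng(0)
radius = 1.0
w = project_to_sphere(rng.standard_normal((8, 8)), radius)
update = rng.standard_normal((8, 8))  # stand-in for an optimizer update, not Muon
lr, wd = 1e-3, 0.1

with_decay = sphere_step(w, update, lr, wd, radius)
without_decay = sphere_step(w, update, lr, 0.0, radius)
# Same step, no decay, learning rate rescaled by 1 / (1 - lr * wd).
rescaled_lr = sphere_step(w, update, lr / (1 - lr * wd), 0.0, radius)

step_size = np.linalg.norm(without_decay - w)
gap_no_decay = np.linalg.norm(with_decay - without_decay)
gap_rescaled = np.linalg.norm(with_decay - rescaled_lr)

print(f"step size:                    {step_size:.3e}")
print(f"decay vs. no decay:           {gap_no_decay:.3e}")
print(f"decay vs. rescaled-lr, no wd: {gap_rescaled:.3e}")
```

In this toy setting the gap between the decayed and undecayed steps is several orders of magnitude smaller than the step itself, and the gap to the rescaled-learning-rate step is at floating-point precision. The radial shrinkage from decay is undone by the projection, leaving only a small change in effective learning rate.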

Code & Models

Repositories

microsoft/ArchScale
github

Datasets

jsun/Prolong_64K_v2_Llama2_Tokenizer
dataset · 56 dl
