Controlled LLM Training on Spectral Sphere
Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo

TL;DR
This paper introduces the Spectral Sphere Optimizer (SSO), an optimizer for large language models that enforces module-wise spectral constraints on both weights and their updates to improve training stability and convergence.
Contribution
The paper presents SSO, a spectral sphere-based optimizer fully aligned with Maximal Update Parametrization, enabling stable large-scale training of diverse models.
Findings
SSO outperforms AdamW and Muon in large-scale pretraining.
SSO improves model stability and activation bounds.
SSO improves load balancing and reduces outliers in MoE models.
Abstract
Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (μP) provides a theoretical safeguard for width-invariant activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully μP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and…
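To make the idea of constraining both weights and updates concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: it swaps the derived steepest descent direction for a simpler scheme, namely a Muon-style orthogonalized update followed by rescaling the weight back to a fixed spectral norm. The function names, the learning rate, and the radius sqrt(d_out / d_in) (the scaling commonly associated with μP's spectral condition) are assumptions made for illustration.

```python
import numpy as np


def spectral_norm(w):
    """Largest singular value of a 2-D weight matrix."""
    return float(np.linalg.norm(w, ord=2))


def msign(g):
    """Orthogonalize g: keep its singular vectors, set singular values to 1."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def project_to_sphere(w, radius):
    """Rescale w so that its spectral norm equals `radius`."""
    return w * (radius / spectral_norm(w))


def constrained_step(w, grad, lr, radius):
    """One illustrative step (hypothetical helper, not the paper's update).

    The update has spectral norm `radius`, and the new weight is rescaled
    back onto the sphere so that it cannot drift.
    """
    update = radius * msign(grad)
    return project_to_sphere(w - lr * update, radius)


rng = np.random.default_rng(0)
d_out, d_in = 64, 256
radius = np.sqrt(d_out / d_in)  # assumed target spectral norm

w = project_to_sphere(rng.standard_normal((d_out, d_in)), radius)
grad = rng.standard_normal((d_out, d_in))  # stand-in for a real gradient
w_next = constrained_step(w, grad, lr=0.1, radius=radius)
```

In this sketch the weight's spectral norm is the same before and after the step, which is the "weights do not drift" property the abstract contrasts with Muon. A plain rescale is only one possible way to return to the sphere; the paper instead derives the steepest descent direction on the sphere itself.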
Taxonomy
Topics: Muon and positron interactions and applications · Computational Physics and Python Applications · Machine Learning in Materials Science
