Demystifying Manifold Constraints in LLM Pre-training
Kang An, Jiaxiang Li, Donald Goldfarb, Shiqian Ma

TL;DR
This paper clarifies how explicit manifold constraints in LLM pre-training stabilize training and improve performance, using a novel Riemannian optimizer called MACRO.
Contribution
It introduces MACRO, a provably convergent Riemannian optimizer that disentangles manifold constraints from heuristic normalization techniques in LLM training.
Findings
Manifold constraints independently stabilize activation scales.
MACRO achieves competitive performance with theoretical guarantees.
Constraints enforce stable rotational equilibrium in weights.
Abstract
The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent, single-loop optimization framework, our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce…
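To make the idea of a manifold-constrained update concrete, the following is a minimal sketch in Python with NumPy. It is not the MACRO algorithm, whose details are not given in this summary. It assumes, for illustration only, that the constraint is a fixed Frobenius norm on a weight matrix and that "msign" denotes the polar factor U V^T of a matrix, as in Muon-style optimizers. The names `msign`, `sphere_step`, `lr`, and `radius` are hypothetical.

```python
import numpy as np

def msign(m):
    """Polar factor U @ V^T of m via a thin SVD (assumed meaning of 'msign')."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def sphere_step(w, grad, lr, radius):
    """One illustrative constrained step on the sphere ||w||_F = radius.

    This is a generic project / step / retract scheme, not the paper's method.
    """
    # Project the gradient onto the tangent space by removing its component along w.
    tangent = grad - (np.sum(grad * w) / np.sum(w * w)) * w
    # Orthogonalize the update direction (all singular values become 1).
    direction = msign(tangent)
    # Take the step, then retract onto the sphere by rescaling.
    w_new = w - lr * direction
    return w_new * (radius / np.linalg.norm(w_new))

rng = np.random.default_rng(0)
radius = 2.0
w = rng.standard_normal((4, 3))
w *= radius / np.linalg.norm(w)  # start on the constraint set

for _ in range(5):
    grad = rng.standard_normal(w.shape)  # stand-in for a real gradient
    w = sphere_step(w, grad, lr=0.1, radius=radius)

final_norm = float(np.linalg.norm(w))

# With ||w||_F fixed, the output scale is bounded: ||w @ x|| <= radius * ||x||.
x = rng.standard_normal(3)
activation_ratio = float(np.linalg.norm(w @ x) / np.linalg.norm(x))
```

Under these assumptions the weight norm stays at `radius` after every step, so the ratio of output norm to input norm cannot exceed `radius`. This is one simple way a hard constraint can bound forward activation scales without relying on a normalization layer or weight decay.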
