Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
Jinghui Yuan, Jiaxuan Zou, Shuo Wang, Yong Liu, Feiping Nie

TL;DR
Nora is a scalable, efficient optimizer for training large language models that unifies stability, speed, and preconditioning through orthogonal row alignment and norm stabilization.
Contribution
Nora introduces a novel optimizer that combines stability, efficiency, and structured preconditioning with a simple implementation and theoretical scalability guarantees.
Findings
Nora achieves stability by stabilizing weight norms and angular velocities.
Nora approximates structured preconditioning with linear computational complexity.
Preliminary experiments show Nora is effective for large-scale training.
Abstract
Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs); however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs, as Muon does, or allowing radial jitters that compromise stability, as RMNP does. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the…
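To make the terms "radial" and "weight norm" in the abstract concrete, here is a minimal Python sketch of a generic row-wise step. It is not Nora's update rule: the abstract is truncated before it names the projection target, and the choices below (removing each momentum row's component along its weight row, then rescaling the row back to its previous norm) are assumptions made purely for illustration. The names `remove_radial` and `step_row` are hypothetical, and every weight row is assumed to be nonzero.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def remove_radial(w_row, m_row):
    """Return m_row minus its component along w_row (the radial part).

    Assumes w_row is nonzero. The result is orthogonal to w_row.
    """
    coef = dot(w_row, m_row) / dot(w_row, w_row)
    return [m - coef * w for w, m in zip(w_row, m_row)]

def step_row(w_row, m_row, lr):
    """Hypothetical update (an assumption, not the paper's rule):
    take a purely tangential step, then rescale to the old row norm."""
    old_norm = math.sqrt(dot(w_row, w_row))
    tangential = remove_radial(w_row, m_row)
    new_row = [w - lr * t for w, t in zip(w_row, tangential)]
    new_norm = math.sqrt(dot(new_row, new_row))
    return [old_norm * x / new_norm for x in new_row]

# Toy 2x2 weight matrix and momentum buffer, processed row by row.
W = [[3.0, 4.0], [1.0, 0.0]]
M = [[1.0, 1.0], [0.5, 2.0]]
W_next = [step_row(w, m, lr=0.1) for w, m in zip(W, M)]
```

Each row is handled independently with a fixed number of inner products, so the cost of this sketch is linear in the number of matrix entries. A tangential step alone would still lengthen the row slightly (by the Pythagorean theorem), which is why the sketch rescales afterwards to hold the norm fixed.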
