OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
Yuxuan Lou, Yang You

TL;DR
OrScale extends Muon's orthogonalized updates with layer-wise trust ratios, providing convergence guarantees and improved empirical performance on image and language models.
Contribution
Introduces OrScale, a trust-ratio extension of Muon that stabilizes layer updates and provides theoretical guarantees and empirical improvements.
Findings
OrScale achieves state-of-the-art results on the CIFAR-10/DavidNet benchmark.
OrScale-LM outperforms Muon+Moonlight and AdamW on language model pre-training.
Theoretically, OrScale guarantees nonconvex convergence with layer-adaptive descent.
Abstract
Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex…
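The denominator rule in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the applied direction is the orthogonalized momentum plus a coupled weight-decay term, and it uses an exact SVD polar factor where Muon uses a Newton-Schulz iteration. The names `orthogonalize` and `trust_ratio_step`, and the values of `lr` and `weight_decay`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def orthogonalize(m):
    """Return the polar factor U @ Vt of m (SVD stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def trust_ratio_step(w, momentum, lr=0.1, weight_decay=0.01, eps=1e-12):
    """One layer-wise step; the update form is an assumption, not the paper's exact rule."""
    # Direction that will actually be applied in parameter space:
    # orthogonalized momentum plus coupled weight decay (assumed form).
    direction = orthogonalize(momentum) + weight_decay * w
    # Trust ratio: Frobenius norm of the weights over the Frobenius norm
    # of the applied direction (np.linalg.norm is Frobenius for 2-D input).
    ratio = np.linalg.norm(w) / (np.linalg.norm(direction) + eps)
    return w - lr * ratio * direction, ratio

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))   # toy weight matrix for one layer
m = rng.standard_normal((4, 3))   # toy momentum buffer for the same layer
new_w, ratio = trust_ratio_step(w, m)
```

Because the denominator is the norm of the same direction that is applied, the Frobenius norm of the step equals `lr` times the Frobenius norm of the layer's weights in this sketch, regardless of the layer's shape.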
