OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

Yuxuan Lou; Yang You

arXiv:2605.07815·cs.LG·May 11, 2026

TL;DR

OrScale extends Muon's orthogonalized updates with layer-wise trust ratios, yielding a nonconvex convergence guarantee and better empirical performance on image and language models.

Contribution

Introduces OrScale, a trust-ratio extension of Muon that stabilizes layer updates and provides theoretical guarantees and empirical improvements.

Findings

01

OrScale achieves state-of-the-art results on the CIFAR-10/DavidNet benchmark.

02

OrScale-LM outperforms Muon+Moonlight and AdamW on language model pre-training.

03

Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee with layer-adaptive descent.

Abstract

Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex…
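The denominator rule in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and several pieces are assumptions: the numerator is taken to be the LAMB-style weight norm, the direction is the orthogonalized momentum plus a coupled weight-decay term, no clipping is applied, and the classical cubic Newton-Schulz iteration stands in for whatever orthogonalization routine Muon uses in practice. The names `orthogonalize` and `trust_ratio_step` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def orthogonalize(m, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of m.

    Uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X
    (an assumption here, chosen for simplicity). Dividing by the Frobenius norm
    first keeps every singular value at most 1, where the iteration converges.
    """
    norm = frobenius_norm(m) + eps
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, cubic)]
    return x


def trust_ratio_step(weight, momentum, lr=0.1, weight_decay=0.0, eps=1e-12):
    """One hypothetical layer update with a trust ratio.

    The denominator is the Frobenius norm of the direction that is actually
    applied (orthogonalized momentum plus coupled weight decay), not the norm
    of the raw momentum.
    """
    ortho = orthogonalize(momentum)
    direction = [
        [o + weight_decay * w for o, w in zip(ro, rw)]
        for ro, rw in zip(ortho, weight)
    ]
    # Assumed LAMB-style numerator: the norm of the current weights.
    ratio = frobenius_norm(weight) / (frobenius_norm(direction) + eps)
    new_weight = [
        [w - lr * ratio * d for w, d in zip(rw, rd)]
        for rw, rd in zip(weight, direction)
    ]
    return new_weight, ratio
```

Under these assumptions the step applied to a layer has Frobenius norm `lr * ||W||_F` regardless of the layer's shape, because the ratio divides by the norm of the same direction it multiplies. The OrScale-LM additions described above (Moonlight shape scaling and one-time per-layer calibration) are not modeled in this sketch.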
