Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

Yury Demidovich; Abhishek Chakraborty; Grigory Malinovsky; Angelia Nedić; Peter Richtárik

arXiv:2605.18999·cs.LG·May 20, 2026

TL;DR

This paper introduces three adaptive step-scaling algorithms for Muon and related normalized optimizers in general norm geometries, backed by theoretical guarantees and by experiments showing reduced sensitivity to step-scale tuning.

Contribution

The paper develops three adaptive scaling rules for Muon: Distance-Adaptive Muon, Scale-Calibrated Muon, and Distance-Free Muon. Each comes with theoretical analysis, and the rules are validated empirically against fixed-scale baselines.

Findings

01

Distance-Adaptive Muon guarantees stationarity under bounded trajectories.

02

Scale-Calibrated Muon achieves O(1/T) objective-gap bounds in star-convex settings.

03

Experiments show adaptive rules reduce tuning sensitivity and match or outperform fixed-scale baselines.
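
For readers unfamiliar with the star-convex setting named in finding 02, the block below recalls one standard definition. It is a reference sketch only: the symbols f, x^*, and \lambda are notation chosen here, and the paper's exact variant of the assumption may differ.

```latex
% Star-convexity of f with respect to a minimizer x^*:
% for all x and all \lambda \in [0, 1],
f\bigl(\lambda x^* + (1 - \lambda)\, x\bigr)
  \;\le\; \lambda\, f(x^*) + (1 - \lambda)\, f(x).

% For differentiable f this implies the objective-gap inequality
f(x) - f(x^*) \;\le\; \langle \nabla f(x),\, x - x^* \rangle .
```

An O(1/T) objective-gap bound is then a bound on f(x_T) - f(x^*) after T iterations.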

Abstract

Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon's exponential moving average but sets the step length from a local descent…
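
The idea of scaling a normalized step by the distance the trajectory has explored can be illustrated with a toy sketch. This is not the paper's algorithm: the function name `distance_scaled_muon_sketch`, the parameters `r_init` and `scale`, and the specific rule "radius = largest Frobenius distance from the starting point so far" are all assumptions made for illustration. The direction is orthogonalized with an exact SVD for simplicity, whereas Muon implementations typically use a Newton-Schulz iteration.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ Vt from the thin SVD of m (all singular values set to 1)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def distance_scaled_muon_sketch(grad_fn, x0, steps=300, beta=0.5,
                                r_init=1e-3, scale=0.05):
    """Toy normalized update with a distance-based step scale.

    Assumption (not taken from the paper): the step length is
    scale * radius, where radius is the largest Frobenius distance
    from x0 reached so far, floored at r_init.
    """
    x = x0.copy()
    m = np.zeros_like(x0)          # exponential moving average of gradients
    radius = r_init
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1.0 - beta) * g
        # Normalized direction: the step has spectral norm scale * radius.
        x = x - scale * radius * orthogonalize(m)
        # The radius never shrinks; it tracks the explored distance.
        radius = max(radius, float(np.linalg.norm(x - x0)))
    return x, radius


# Hypothetical test problem: f(X) = 0.5 * ||X - A||_F^2, so grad f(X) = X - A.
A = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
x0 = np.zeros_like(A)
loss = lambda x: 0.5 * float(np.sum((x - A) ** 2))

x_final, r_final = distance_scaled_muon_sketch(lambda x: x - A, x0)
print("initial loss:", loss(x0))
print("final loss:", loss(x_final), "radius:", r_final)
```

Because the step length never decays in this sketch, the iterates settle into a small oscillation around the minimizer rather than converging exactly; the paper's rules and guarantees should be consulted for the actual step-scale schedules.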
