Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization

Tatsuhiro Nakamori; Laura Gomezjurado Gonzalez; Ganesh Talluri; Ansh Tiwari; Hideyuki Kawashima; Ioannis Mitliagkas; Guillaume Rabusseau; Hiroki Naganuma

arXiv:2605.16341·cs.LG·May 19, 2026

TL;DR

Orth-Dion improves distributed low-rank spectral optimization by replacing Dion's column normalization with QR orthogonalization, closing the convergence gap with full-rank spectral methods while keeping communication costs low.

Contribution

The paper introduces Orth-Dion, a method that corrects the geometric mismatch in Dion and attains spectral-level convergence rates while keeping communication efficient.

Findings

01. Orth-Dion matches the convergence rate of exact spectral methods.

02. Experiments confirm Orth-Dion's effectiveness on large-scale language models.

03. Orth-Dion reduces the convergence gap while maintaining low communication overhead.

Abstract

Low-rank gradient compression reduces communication in distributed training by representing updates with rank-$r$ factors. Dion is a recent method that approximates Muon, a spectral optimizer that orthogonalizes momentum, using one step of power iteration followed by column normalization (rescaling each column of the right factor to unit length). This makes it compatible with fully sharded data parallel training, but it converges more slowly than full-rank spectral methods. We show that this gap is geometric: column normalization does not yield the rank-$r$ polar factor that Muon implicitly targets, so the resulting direction violates the dual-norm constraint of the low-rank spectral geometry, and the rate picks up an extra factor of $r$ even though the low-rank approximation of the gradient itself is accurate. The same mismatch enters the smoothness term and the error-feedback…
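The difference between the two normalizations described in the abstract can be illustrated with a small single-process NumPy sketch. It is an assumption-laden illustration based only on the abstract's description, not the authors' implementation: the helper names (`power_step`, `column_normalize`, `qr_orthogonalize`), the matrix sizes, and the random momentum matrix `M` are all hypothetical, and sharding, error feedback, and step-size scaling are omitted. The sketch takes one power-iteration step, then builds the update direction once with column normalization and once with QR orthogonalization of the right factor, and compares the spectral norms of the two directions.

```python
import numpy as np

def power_step(M, Q):
    """One power-iteration step on M (m x n) with right factor Q (n x r).

    Returns an orthonormal left factor P (m x r) and the raw,
    unnormalized right factor R = M^T P (n x r).
    """
    P, _ = np.linalg.qr(M @ Q)  # reduced QR: P has orthonormal columns
    R = M.T @ P
    return P, R

def column_normalize(R, eps=1e-12):
    """Rescale each column of R to (approximately) unit Euclidean length."""
    return R / (np.linalg.norm(R, axis=0, keepdims=True) + eps)

def qr_orthogonalize(R):
    """Replace R by an orthonormal basis of its column space (reduced QR)."""
    Q_orth, _ = np.linalg.qr(R)
    return Q_orth

# Hypothetical sizes and data, chosen only for illustration.
rng = np.random.default_rng(0)
m, n, r = 64, 48, 8
M = rng.standard_normal((m, n))    # stand-in for a momentum matrix
Q0 = rng.standard_normal((n, r))   # stand-in for the current right factor

P, R = power_step(M, Q0)

# Update directions P Q^T under the two normalizations.
U_colnorm = P @ column_normalize(R).T
U_qr = P @ qr_orthogonalize(R).T

# Spectral norm (largest singular value) of each direction.
spec_colnorm = np.linalg.norm(U_colnorm, 2)
spec_qr = np.linalg.norm(U_qr, 2)

print("spectral norm, column-normalized:", spec_colnorm)
print("spectral norm, QR-orthogonalized:", spec_qr)
print("upper bound sqrt(r):", np.sqrt(r))
```

Because `P` has orthonormal columns, the spectral norm of `P @ Q.T` equals that of `Q`. An orthonormal `Q` gives exactly 1, whereas a `Q` with merely unit-length columns can have spectral norm anywhere between 1 and $\sqrt{r}$, which is one way to see why the column-normalized direction can leave the unit spectral-norm ball.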
