Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning
Thibaut Boissin (IRIT-MISFIT), Thomas Massena (DTIPG - SNCF, IRIT-MISFIT), Franck Mamalet, Mathieu Serrurier (IRIT-MISFIT)

TL;DR
This paper introduces a preconditioning method that accelerates orthogonality-based optimizers like Muon, significantly reducing computational costs and improving training efficiency without sacrificing model performance.
Contribution
We propose a novel preconditioning technique that speeds up Newton-Schulz convergence, enabling faster orthogonality-based optimization with minimal overhead and no hyperparameter tuning.
Findings
Achieves up to 2.8x speedup in Newton-Schulz approximation
Improves end-to-end training runtime by 5-10% in realistic scenarios
Maintains or improves model performance on complex tasks
Abstract
Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end…
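For readers unfamiliar with the orthogonalization step the abstract refers to, the following is a minimal sketch of a plain Newton-Schulz iteration in NumPy. It is an illustration under stated assumptions, not the authors' implementation: it uses the textbook cubic update with Frobenius-norm scaling rather than Muon's tuned polynomial, it does not include the paper's preconditioning (which this summary does not describe), and the function name `newton_schulz_orthogonalize` and the iteration count are hypothetical choices made for the example.

```python
import numpy as np


def newton_schulz_orthogonalize(g, num_iters=30, eps=1e-12):
    """Approximate the orthogonal polar factor of g.

    Textbook cubic Newton-Schulz iteration (an assumption for illustration;
    not Muon's tuned coefficients and not the paper's preconditioned variant).
    """
    x = np.asarray(g, dtype=np.float64)

    # Work on a tall matrix so the Gram matrix x.T @ x is the smaller one.
    transposed = x.shape[0] < x.shape[1]
    if transposed:
        x = x.T

    # Frobenius normalization puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    x = x / (np.linalg.norm(x) + eps)

    for _ in range(num_iters):
        gram = x.T @ x                   # matrix multiplication 1
        x = 1.5 * x - 0.5 * (x @ gram)   # matrix multiplication 2

    return x.T if transposed else x


rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 4))       # stand-in for a gradient matrix
q = newton_schulz_orthogonalize(grad)

# Distance of q.T @ q from the identity; small when q has orthonormal columns.
ortho_error = np.linalg.norm(q.T @ q - np.eye(4))
```

Each pass through the loop costs two matrix multiplications in this cubic variant, so the total cost grows linearly with the number of iterations. That is why a better-conditioned starting point, which lets the iteration stop earlier, translates directly into fewer matrix multiplications.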
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification · Machine Learning and Algorithms
