Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning

Thibaut Boissin (IRIT-MISFIT); Thomas Massena (DTIPG - SNCF; IRIT-MISFIT); Franck Mamalet; Mathieu Serrurier (IRIT-MISFIT)

arXiv:2512.04632·cs.AI·December 5, 2025

TL;DR

This paper introduces a preconditioning method that accelerates orthogonality-based optimizers such as Muon. It cuts the cost of the Newton-Schulz orthogonalization step (up to 2.8x faster) and improves end-to-end training runtime by 5-10% without sacrificing model performance.

Contribution

We propose a novel preconditioning technique that speeds up Newton-Schulz convergence, enabling faster orthogonality-based optimization with minimal overhead and no hyperparameter tuning.

Findings

01 Achieves up to 2.8x speedup in Newton-Schulz approximation

02 Improves end-to-end training runtime by 5-10% in realistic scenarios

03 Maintains or improves model performance on complex tasks

Abstract

Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end…
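
To make the orthogonalization step concrete, here is a minimal NumPy sketch of a Newton-Schulz iteration. It uses the classic cubic update X <- 1.5 X - 0.5 (X X^T) X as a stand-in, not the tuned polynomial used in Muon, and it does not include the paper's preconditioning, which the abstract does not specify. The function names `newton_schulz_orthogonalize` and `orthogonality_error` are hypothetical and chosen only for illustration.

```python
import numpy as np


def newton_schulz_orthogonalize(grad, steps=5):
    """Approximate the nearest semi-orthogonal matrix to `grad`.

    Illustrative cubic Newton-Schulz iteration (an assumption of this
    sketch, not the paper's preconditioned variant).
    """
    x = np.asarray(grad, dtype=np.float64)
    # Work in the wide orientation so X X^T is the smaller Gram matrix.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    # Frobenius normalization puts every singular value in (0, 1],
    # inside the cubic iteration's convergence region (0, sqrt(3)).
    x = x / (np.linalg.norm(x) + 1e-12)
    for _ in range(steps):
        gram = x @ x.T                  # matrix multiplication 1
        x = 1.5 * x - 0.5 * (gram @ x)  # matrix multiplication 2
    return x.T if transposed else x


def orthogonality_error(x):
    """Frobenius distance between the smaller Gram matrix and the identity."""
    gram = x @ x.T if x.shape[0] <= x.shape[1] else x.T @ x
    return float(np.linalg.norm(gram - np.eye(gram.shape[0])))


if __name__ == "__main__":
    g = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
    for steps in (4, 5):
        err = orthogonality_error(newton_schulz_orthogonalize(g, steps))
        print(f"steps={steps} orthogonality error={err:.2e}")
```

Every iteration costs a fixed number of matrix multiplications, so dropping one iteration out of five removes a fifth of the multiplications in the loop. In this sketch the error after four steps is larger than after five; the abstract's claim is that preconditioning closes that gap so the fifth iteration is no longer needed.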

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification · Machine Learning and Algorithms