Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials
Ekaterina Grishina, Matvey Smirnov, Maxim Rakhuba

TL;DR
This paper introduces CANS, a Chebyshev-optimized Newton-Schulz iteration whose polynomial coefficients are tuned to speed up matrix orthogonalization in machine learning settings such as Riemannian optimization on the Stiefel manifold and the Muon optimizer.
Contribution
The paper develops a Chebyshev-based approach to choosing the coefficients of the Newton-Schulz iteration, with the aim of improving orthogonalization methods used in deep learning.
Findings
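Theoretical derivation of optimal coefficients for the 3rd-order Newton-Schulz iteration, based on Chebyshev's alternance theorem.
Application of a Remez algorithm to compute optimal higher-degree polynomials.
Improved efficiency reported for orthogonalization tasks in deep learning, including Riemannian optimization on the Stiefel manifold and the Muon optimizer.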
Abstract
The problem of computing an optimal orthogonal approximation to a given matrix has attracted growing interest in machine learning. Notable applications include the recent Muon optimizer and Riemannian optimization on the Stiefel manifold. Among existing approaches, the Newton-Schulz iteration has emerged as a particularly effective solution, as it relies solely on matrix multiplications and thus achieves high computational efficiency on GPU hardware. Despite its efficiency, the method has an inherent limitation: its coefficients are fixed and thus not optimized for a given matrix. In this paper we address this issue by proposing a Chebyshev-optimized version of Newton-Schulz (CANS). Based on Chebyshev's alternance theorem, we theoretically derive optimal coefficients for the 3rd-order Newton-Schulz iteration and apply a Remez algorithm to compute optimal higher-degree polynomials. We…
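For orientation, the baseline that the abstract refers to can be sketched in a few lines. This is the classical cubic Newton-Schulz iteration with the fixed coefficients 3/2 and -1/2, not the CANS coefficients derived in the paper; the function name, the Frobenius-norm scaling, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(a, num_iters=30):
    """Classical cubic Newton-Schulz with fixed coefficients (illustrative only)."""
    # Scale by the Frobenius norm, which upper-bounds the spectral norm,
    # so every singular value of x starts in (0, 1].
    x = a / np.linalg.norm(a)
    for _ in range(num_iters):
        # Each step maps a singular value s to 1.5*s - 0.5*s**3,
        # using only matrix multiplications.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 4))          # hypothetical full-rank test matrix
q = newton_schulz_orthogonalize(a)

# Reference: the orthogonal polar factor obtained from an SVD.
u, _, vt = np.linalg.svd(a, full_matrices=False)
polar = u @ vt
ortho_error = np.linalg.norm(q.T @ q - np.eye(4))
```

Because the map 1.5*s - 0.5*s**3 is the same for every input, small singular values take many steps to approach 1. That fixed-coefficient behaviour is the limitation the abstract describes.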
Peer Reviews
Decision: Submitted to ICLR 2026
Through rigorous theory, optimal polynomials are found for this task (sections 3 and 4). A trick is proposed for using Gelfand's formula with almost no computational overhead in order to get accurate upper bounds on the spectral norm.
The theory seems to me to be more like a corollary of prior studies, but this does not necessarily undermine the value of this approach in this context. The polynomials could be better fit to the task; see the questions/suggestions section for more. Typo in Appendix K: it references Figure 1 instead of Figure 2.
1. The work presents a novel theoretical framework for optimizing Newton-Schulz iteration coefficients. While polynomial approximation theory is classical, its systematic application to this problem through Chebyshev's alternance theorem is creative. 2. The theoretical contributions are rigorous with complete proofs in the appendices. Proposition 2 provides closed-form solutions for degree-3 polynomials, and the convergence analysis establishing quadratic convergence is solid. The experimental
1. The paper cites concurrent work by Amsel et al. (2025). A slightly more detailed comparison in the related work section could help readers more clearly understand the overlapping and distinct contributions of the two papers regarding the exact case.
1. This paper systematically applies Chebyshev approximation theory to optimize the coefficients of the Newton-Schulz iteration, proposing the Chebyshev-Accelerated Newton-Schulz (CANS) framework for finding "provably optimal" odd polynomials. 2. The paper is built upon a solid mathematical theory, with detailed and rigorous proofs for each proposition and corollary provided in the appendix. This offers robust mathematical support for the uniqueness and key properties of the optimal odd polynomia
1. The typesetting for the proof of Proposition 1 in the appendix is slightly disorganized and could be improved. Additionally, the paper contains some errors; for instance, in the equation on line 50, the final exponent appears to be a transpose symbol 'T' when it should likely be 't'. The paper also mistakenly includes two distinct algorithms both labeled as "Algorithm 1". 2. The experiments on the Stiefel manifold are conducted solely with a Wide ResNet-16-10 on the CIFAR-10 dataset. The evalu
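One review above mentions using Gelfand's formula to bound the spectral norm from above. The sketch below shows only the general idea behind such a bound and should not be read as the paper's procedure: for B = AᵀA, the quantity ‖Bᵏ‖_F^(1/(2k)) is never smaller than ‖A‖₂ and tightens as k grows. The function name, the choice k = 4, and the test matrix are assumptions, and the excerpts here do not describe how the authors avoid the extra matrix products.

```python
import numpy as np

def spectral_norm_upper_bound(a, k=4):
    """Gelfand-style upper bound on the spectral norm of a (illustrative only)."""
    # B = A^T A is symmetric positive semidefinite, so ||B^k||_2 = ||B||_2**k,
    # and the Frobenius norm is at least the spectral norm.
    b = a.T @ a
    b_pow = np.linalg.matrix_power(b, k)
    # ||A||_2 = sqrt(||B||_2) <= ||B^k||_F ** (1 / (2k))
    return np.linalg.norm(b_pow) ** (1.0 / (2 * k))

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 4))          # hypothetical test matrix

exact = np.linalg.norm(a, 2)             # largest singular value
frobenius = np.linalg.norm(a)            # the cruder bound, matching k = 1/2
bound = spectral_norm_upper_bound(a, k=4)
```

A tighter bound matters for Newton-Schulz-type methods because the input is divided by it before iterating: the closer the bound is to the true spectral norm, the closer the largest singular value starts to 1.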
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Machine Learning in Materials Science
