Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials

Ekaterina Grishina; Matvey Smirnov; Maxim Rakhuba

arXiv:2506.10935·math.NA·February 25, 2026

TL;DR

This paper introduces CANS, a Chebyshev-optimized Newton-Schulz iteration whose polynomial coefficients are tuned to speed up matrix orthogonalization in machine learning applications such as Riemannian optimization on the Stiefel manifold and the Muon optimizer.

Contribution

It develops a Chebyshev-based approach to optimize Newton-Schulz iteration coefficients, enhancing orthogonalization methods for deep learning applications.

Findings

01

Theoretical derivation of optimal coefficients for 3rd order Newton-Schulz iteration.

02

Application of Remez algorithm for higher-degree polynomial optimization.

03

Demonstrated improved efficiency in orthogonalization tasks in deep learning.

Abstract

The problem of computing the optimal orthogonal approximation to a given matrix has attracted growing interest in machine learning. Notable applications include the recent Muon optimizer and Riemannian optimization on the Stiefel manifold. Among existing approaches, the Newton-Schulz iteration has emerged as a particularly effective solution, as it relies solely on matrix multiplications and thus achieves high computational efficiency on GPU hardware. Despite its efficiency, the method has an inherent limitation: its coefficients are fixed and thus not optimized for a given matrix. In this paper we address this issue by proposing a Chebyshev-optimized version of Newton-Schulz (CANS). Based on Chebyshev's alternance theorem, we theoretically derive optimal coefficients for the third-order Newton-Schulz iteration and apply a Remez algorithm to compute optimal higher-degree polynomials. We…
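For orientation, the fixed-coefficient baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the classical cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, and not the CANS coefficients derived in the paper. The function name `newton_schulz_orthogonalize`, the Frobenius-norm scaling, the iteration count, and the test matrix are all assumptions made for this example.

```python
import numpy as np


def newton_schulz_orthogonalize(a, num_iters=30):
    """Approximate the orthogonal polar factor of `a`.

    Uses the classical cubic Newton-Schulz step with the fixed
    coefficients (1.5, -0.5); only matrix multiplications are needed.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the classical iteration.
    x = a / np.linalg.norm(a)
    for _ in range(num_iters):
        # X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


# Hypothetical test matrix with known singular values (3, 2, 1, 0.5),
# so its exact polar factor is u @ vt.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
vt, _ = np.linalg.qr(rng.standard_normal((4, 4)))
a = u @ np.diag([3.0, 2.0, 1.0, 0.5]) @ vt

q = newton_schulz_orthogonalize(a)
orth_error = np.linalg.norm(q.T @ q - np.eye(4))
```

Because the coefficients here do not depend on the input, small singular values grow by only about a factor of 1.5 per step early on. That fixed behavior is the limitation the abstract says CANS targets by choosing the polynomial coefficients optimally.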

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Through rigorous theory, optimal polynomials are found for this task (Sections 3 and 4). A trick is proposed for using Gelfand's formula with almost no computational overhead in order to get accurate upper bounds on the spectral norm.

Weaknesses

The theory seems to me to be more like a corollary of prior studies, but this does not necessarily undermine the value of this approach in this context. Polynomials better fit to the task. See the questions/suggestions section for more. Typo in Appendix K: it references Figure 1 instead of Figure 2.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The work presents a novel theoretical framework for optimizing Newton-Schulz iteration coefficients. While polynomial approximation theory is classical, its systematic application to this problem through Chebyshev's alternance theorem is creative.
2. The theoretical contributions are rigorous with complete proofs in the appendices. Proposition 2 provides closed-form solutions for degree-3 polynomials, and the convergence analysis establishing quadratic convergence is solid. The experimental

Weaknesses

1. The paper cites concurrent work by Amsel et al. (2025). A slightly more detailed comparison in the related work section could help readers more clearly understand the overlapping and distinct contributions of the two papers regarding the exact case.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper systematically applies Chebyshev approximation theory to optimize the coefficients of the Newton-Schulz iteration, proposing the Chebyshev-Accelerated Newton-Schulz (CANS) framework for finding "provably optimal" odd polynomials.
2. The paper is built upon a solid mathematical theory, with detailed and rigorous proofs for each proposition and corollary provided in the appendix. This offers robust mathematical support for the uniqueness and key properties of the optimal odd polynomia

Weaknesses

1. The typesetting for the proof of Proposition 1 in the appendix is slightly disorganized and could be improved. Additionally, the paper contains some errors; for instance, in the equation on line 50, the final exponent appears to be a transpose symbol 'T' when it should likely be 't'. The paper also mistakenly includes two distinct algorithms both labeled as "Algorithm 1".
2. The experiments on the Stiefel manifold are conducted solely with a Wide ResNet-16-10 on the CIFAR-10 dataset. The evalu

Code & Models

Repositories

grishkate/accelerating_orthogonalization
PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Machine Learning in Materials Science