Global Convergence of Four-Layer Matrix Factorization under Random Initialization

Minrui Luo; Weihang Xu; Xiang Gao; Maryam Fazel; Simon Shaolei Du

arXiv:2511.09925·math.OC·November 20, 2025

Open Access·3 Reviews

TL;DR

This paper proves that gradient descent globally converges for four-layer matrix factorization with random initialization under certain conditions, advancing theoretical understanding of deep matrix factorization.

Contribution

It provides the first polynomial-time global convergence guarantee for deep matrix factorization with four layers under random initialization, using novel analytical techniques.

Findings

01

Gradient descent avoids saddle points in four-layer matrix factorization.

02

Convergence depends on conditions on the target matrix and regularization.

03

The analysis extends previous theories to deeper matrix factorizations.

Abstract

Gradient descent dynamics on the deep matrix factorization problem is extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient descent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.
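To make the setting in the abstract concrete, the following is a minimal NumPy sketch of gradient descent on a four-layer factorization with a balancing regularizer. It is an illustration under stated assumptions, not the paper's exact formulation: the objective form $\frac{1}{2}\|W_4 W_3 W_2 W_1 - \Sigma\|_F^2 + \frac{\lambda}{4}\sum_j \|W_{j+1}^\top W_{j+1} - W_j W_j^\top\|_F^2$, the square layer shapes, and the names `loss_and_grads`, `gradient_descent`, `reg`, `step`, and `init_scale` are all hypothetical choices made here, and the default values are arbitrary.

```python
import numpy as np


def product(Ws):
    """Return W_N ... W_2 W_1 for Ws = [W_1, ..., W_N]."""
    out = Ws[0]
    for W in Ws[1:]:
        out = W @ out
    return out


def loss_and_grads(Ws, target, reg):
    """Loss and per-layer gradients for the ASSUMED objective
    0.5 * ||W_N ... W_1 - target||_F^2
      + (reg / 4) * sum_j ||W_{j+1}^T W_{j+1} - W_j W_j^T||_F^2.
    All layers are assumed square with the same shape as `target`."""
    d = target.shape[0]
    resid = product(Ws) - target
    loss = 0.5 * np.sum(resid ** 2)
    grads = []
    for j in range(len(Ws)):
        # left = product of the layers applied after layer j,
        # right = product of the layers applied before layer j.
        left = product(Ws[j + 1:]) if j + 1 < len(Ws) else np.eye(d)
        right = product(Ws[:j]) if j > 0 else np.eye(d)
        grads.append(left.T @ resid @ right.T)
    for j in range(len(Ws) - 1):
        # Balancedness gap between adjacent layers (a symmetric matrix).
        gap = Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T
        loss += 0.25 * reg * np.sum(gap ** 2)
        grads[j + 1] = grads[j + 1] + reg * Ws[j + 1] @ gap
        grads[j] = grads[j] - reg * gap @ Ws[j]
    return loss, grads


def gradient_descent(target, depth=4, reg=1.0, step=1e-3, iters=200,
                     init_scale=0.3, seed=0):
    """Plain gradient descent from an i.i.d. Gaussian initialization.
    Returns the final layers and the loss recorded before each update."""
    rng = np.random.default_rng(seed)
    d = target.shape[0]
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(depth)]
    losses = []
    for _ in range(iters):
        loss, grads = loss_and_grads(Ws, target, reg)
        losses.append(loss)
        Ws = [W - step * g for W, g in zip(Ws, grads)]
    return Ws, losses


if __name__ == "__main__":
    # Target with identical singular values (a scaled identity), chosen
    # here only because the reviews note the theorem covers that case.
    _, losses = gradient_descent(2.0 * np.eye(3))
    print(losses[0], losses[-1])
```

The sketch only shows the mechanics of the update. It says nothing about the step-size, initialization-scale, or regularization conditions under which the paper's guarantee holds, which would need to be taken from the theorems themselves.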

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 8·Confidence 3

Strengths

- The paper is well-organized and, although technically challenging, is written in a way that makes the content as accessible as possible to readers. The problem settings and notations are clearly presented, which helps readers follow the subsequent sections. In addition, Section 4 (Gradient Flow under Balanced Gaussian Initialization) effectively serves as a warm-up, while Section 5 (Gradient Descent under Unbalanced Gaussian Initialization) presents the main results and provides a well-structu

Weaknesses

- The paper lacks a clear explanation of the convergence rate derived in Theorems 1 and 2. A more detailed discussion of this rate, including how tight the bound is, would strengthen the analysis. It would also be helpful to compare the convergence rate with that of the depth-2 case (e.g., [1]). In addition, although the authors note that it is difficult to analyze odd factorizations theoretically, it would still be valuable to include an empirical comparison of the convergence behavior across d

Reviewer 02·Rating 6·Confidence 4

Strengths

1. Strong theoretical novelty: This work establishes the first polynomial-time global convergence guarantee for gradient descent on a deep matrix factorization problem (N>2) under random initialization, specifically targeting the four-layer (N=4) architecture. This result addresses a longstanding open question regarding general deep matrix factorization and Deep Linear Networks. The paper is technically impressive and represents a meaningful step forward in understanding deep gradient dynamics.

Weaknesses

1. Limited scope and overstated generality: The analysis applies specifically to the four-layer case with identical singular values of the target matrix. The extension to general depth or non-isotropic targets is only conjectural. The paper’s framing (“global convergence for deep networks”) slightly overstates the reach of the results.
2. Dense presentation and navigational difficulty: Despite occasional intuitive remarks, the exposition remains heavy, with long theorem sequences spanning most

Reviewer 03·Rating 2·Confidence 3

Strengths

+ Proving global convergence of GD in the general $L$-layer case is an interesting problem

Weaknesses

- Quality of the writing makes it really hard to parse statements
- Quantities are used before they are defined
- Relevant literature is not taken into account
- 4-layer case with balancedness regularization is a very special setting and main theorem only applies to $\Sigma$ with flat singular spectrum, i.e., all singular values are identical

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Tensor decomposition and applications