Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness

Baekrok Shin; Chulhee Yun

arXiv:2603.04703·cs.LG·March 6, 2026

TL;DR

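This paper studies matrix completion with deep matrix factorization and identifies coupled training dynamics as a key mechanism behind the implicit low-rank bias. The coupling intensifies with depth, which the authors use to analyze the loss of plasticity phenomenon and to prove convergence results under gradient flow with block-diagonal observations.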

Contribution

The paper shows that network depth promotes low-rank bias through coupled dynamics, and it resolves an open question posed by Menon (2024) on rank-1 convergence in deep linear networks for a family of initializations.

Findings

01 Networks of depth ≥ 3 exhibit coupling unless initialized diagonally.

02 Under gradient flow with block-diagonal observations, convergence to rank-1 occurs if and only if the dynamics is coupled.

03 Deep models avoid plasticity loss due to their low-rank bias.

Abstract

We study matrix completion via deep matrix factorization (a.k.a. deep linear neural networks) as a simplified testbed to examine how network depth influences training dynamics. Despite the simplicity and importance of the problem, prior theory largely focuses on shallow (depth-2) models and does not fully explain the implicit low-rank bias observed in deeper networks. We identify coupled dynamics as a key mechanism behind this bias and show that it intensifies with increasing depth. Focusing on gradient flow under block-diagonal observations, we prove: (a) networks of depth $\geq 3$ exhibit coupling unless initialized diagonally, and (b) convergence to rank-1 occurs if and only if the dynamics is coupled -- resolving an open question by Menon (2024) for a family of initializations. We also revisit the loss of plasticity phenomenon in matrix completion (Kleinman et al., 2024), where…
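The setup described in the abstract can be illustrated with a small numerical sketch. The NumPy code below is a minimal sketch and not the authors' implementation: it approximates gradient flow with plain gradient descent at a small step size, minimizing the masked squared loss of a product W_L ⋯ W_1 under a block-diagonal observation mask. The function names, the 4×4 all-ones target, the Gaussian initialization scale, the step size, and the use of stable rank as a proxy for low-rankness are all assumptions chosen for illustration.

```python
import numpy as np


def block_diagonal_mask(block_sizes):
    """0/1 mask that observes only the entries inside each diagonal block."""
    d = sum(block_sizes)
    mask = np.zeros((d, d))
    start = 0
    for size in block_sizes:
        mask[start:start + size, start:start + size] = 1.0
        start += size
    return mask


def product(factors, dim):
    """Return W_L @ ... @ W_1 for factors = [W_1, ..., W_L] (identity if empty)."""
    out = np.eye(dim)
    for W in factors:
        out = W @ out
    return out


def loss_and_grads(factors, target, mask):
    """Masked squared loss 0.5 * ||mask * (W - target)||_F^2 and its gradients."""
    d = target.shape[0]
    residual = mask * (product(factors, d) - target)
    loss = 0.5 * np.sum(residual ** 2)
    grads = []
    for l in range(len(factors)):
        left = product(factors[l + 1:], d)   # W_L ... W_{l+1}
        right = product(factors[:l], d)      # W_{l-1} ... W_1
        grads.append(left.T @ residual @ right.T)
    return loss, grads


def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2


def train(depth, target, mask, init_scale=0.1, lr=0.02, steps=20000, seed=0):
    """Gradient descent as a discretization of gradient flow (assumed settings)."""
    rng = np.random.default_rng(seed)
    d = target.shape[0]
    # Hypothetical initialization: small i.i.d. Gaussian, square factors.
    factors = [init_scale * rng.standard_normal((d, d)) for _ in range(depth)]
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(factors, target, mask)
        losses.append(loss)
        factors = [W - lr * g for W, g in zip(factors, grads)]
    return factors, losses


if __name__ == "__main__":
    target = np.ones((4, 4))              # rank-1 ground truth (illustrative)
    mask = block_diagonal_mask([2, 2])    # two observed 2x2 diagonal blocks
    for depth in (2, 3):
        factors, losses = train(depth, target, mask)
        W = product(factors, target.shape[0])
        print(f"depth={depth} final_loss={losses[-1]:.3e} "
              f"stable_rank={stable_rank(W):.3f}")
```

Running the script prints the final training loss and the stable rank of the learned product for depth 2 and depth 3, which gives a quick way to probe how depth affects the solution in this toy setting. The numbers depend on the seed, initialization scale, and step size, so they should be read as exploratory rather than as a reproduction of the paper's results.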

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- This work addresses the important and thorny question of implicit bias in deep learning.
- Generalizes the independent observations mechanism beyond the two-layer case.
- Provides a potentially useful perspective on the plasticity phenomenon.

Weaknesses

Maybe due to the challenging nature of the questions studied, the connection between the highly restricted theoretical setting and the claims about the behavior of the real algorithm (deep learning or even just deep matrix completion) is somewhat tenuous at times. I am especially doubtful about the additional insight of the implicit characterization (8), (9). If it is possible to derive additional insight from it rather than just solving it numerically, this would significantly strengthen the pa

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. A clear, mechanism-first account that unifies several scattered observations about low-rank bias and depth.
2. Nontrivial analytical traction via diagonal observations and limiting SVD characterizations, with numerics that mirror the theory.
3. Bridges the mechanism to LoP with formal statements (stable-rank lower bounds) rather than just empirical anecdotes.

Weaknesses

1. Scope of formalism: The most rigorous theorems rely on diagonal or highly structured observation patterns and gradient-flow analysis. This leaves a substantial gap to practical regimes, for example finite-step SGD with noise/momentum and unstructured sparsity. Without non-asymptotic bounds in these regimes, it’s unclear how predictive the theory is for typical training runs.
2. Initialization dependence: Several results hinge on a specific initialization family that tunes coupling. While the

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- This paper is well-written and has a clear overall structure, making it relatively easy to read.
- The paper addresses a key issue in matrix completion, specifically the role of network depth in shaping training dynamics and the implicit bias toward low-rank solutions. This focus provides valuable insights into how deeper networks perform better in this context.
- The paper builds on and extends the work of Bai et al. (2024), with an emphasis on the coupling dynamics and their relation to low-r

Weaknesses

- While the paper’s theoretical results are solid, some claims feel weak or underexplored. For instance, Theorem 3.1 is focused on a very specific case (2x2 matrices), and while it does provide some insights, it lacks generality. Additionally, Theorem 3.3 is reduced to an implicit equation that is not deeply analyzed.
- Although numerical experiments (e.g., Figure 2) are used to validate the low-rank results, the theoretical argumentation feels limited. The paper would benefit from a more detail

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Tensor decomposition and applications