On the Crucial Role of Initialization for Matrix Factorization
Bingcong Li, Liang Zhang, Aryan Mokhtari, Niao He

TL;DR
This paper highlights the importance of initialization in matrix factorization, introducing Nyström initialization to speed up the convergence of Scaled Gradient Descent (ScaledGD) and applying it to improve low-rank adapters (LoRA) for large models.
Contribution
It introduces Nyström initialization for matrix factorization, under which ScaledGD achieves quadratic convergence in cases where only linear rates were previously known, and extends the method to low-rank adapters such as LoRA for large-scale models.
Findings
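Nyström initialization significantly improves the global convergence of ScaledGD in both symmetric and asymmetric matrix factorization.
NoRA (LoRA with Nyström initialization) outperforms standard LoRA across multiple downstream tasks and model scales from 1B to 7B parameters.
ScaledGD with Nyström initialization achieves quadratic convergence in cases where only linear rates were previously known.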
Abstract
This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nyström initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nyström initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nyström initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
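The idea in the abstract can be illustrated with a minimal NumPy sketch for the symmetric, exactly parameterized case. This is not the authors' code. It assumes the objective f(X) = 0.25 * ||X X^T - A||_F^2 with a rank-r positive semidefinite target A, a Gaussian sketch matrix Omega so that the Nyström-style initialization is X0 = A @ Omega, and a step size of 0.5; the problem sizes, the seed, and the names `scaled_gd`, `err_nystrom`, and `err_random` are illustrative choices only.

```python
import numpy as np


def scaled_gd(A, X0, eta=0.5, iters=50):
    """Run ScaledGD on f(X) = 0.25 * ||X X^T - A||_F^2.

    Returns the final iterate and the relative residual at every iteration.
    The step size eta=0.5 is an assumption made for this sketch.
    """
    X = X0.copy()
    norm_A = np.linalg.norm(A)
    errors = []
    for _ in range(iters):
        R = X @ X.T - A                      # residual
        errors.append(np.linalg.norm(R) / norm_A)
        G = R @ X                            # gradient of f at X
        # Right-precondition the gradient by (X^T X)^{-1}.
        # X^T X is symmetric, so G (X^T X)^{-1} = solve(X^T X, G^T)^T.
        X = X - eta * np.linalg.solve(X.T @ X, G.T).T
    errors.append(np.linalg.norm(X @ X.T - A) / norm_A)
    return X, errors


rng = np.random.default_rng(0)
n, r = 50, 5

# Rank-r PSD target with eigenvalues between 1 and 10 (illustrative).
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = U @ np.diag(np.linspace(1.0, 10.0, r)) @ U.T

# Nystrom-style initialization: sketch the target with a Gaussian matrix,
# which places X0 in the column space of A.
Omega = rng.standard_normal((n, r))
X_nystrom, err_nystrom = scaled_gd(A, A @ Omega)

# Baseline for comparison: small random initialization (assumed scale 1e-3).
X_random, err_random = scaled_gd(A, 1e-3 * rng.standard_normal((n, r)))

for t in (0, 5, 10, 15, 20):
    print(f"iter {t:2d}  nystrom {err_nystrom[t]:.3e}  random {err_random[t]:.3e}")
```

Printing the two error sequences side by side gives a rough feel for the difference between the initializations. Under the assumptions above, an iterate that starts in the column space of A stays there, so the update reduces to a Newton-like iteration on an r-by-r factor; a generic random start also carries a component outside that subspace, which this update only halves at each step.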
Peer Reviews
Decision: ICLR 2025 Poster
On one hand, I feel that the quadratic convergence with the Nyström initialization is significant. I have not directly worked in the area of matrix completion, so I do not have full confidence in this judgment, but this type of quadratic convergence result seems quite important. At the same time, however, I'm not quite sure if the quadratic convergence is coming from the scaled GD algorithm or from the Nyström initialization. Since the scaled GD algorithm is basically a Newton update, shouldn
On the other hand, I feel that the matrix completion result and the Nyström-initialized LoRA (NoRA) are not strongly connected. The theory doesn't apply to the NoRA setting, but I do see that the Nyström initialization is a nice conceptual motivation for a potentially better LoRA initialization. However, now with the flood of empirical LoRA papers, I think the community has a very high bar for recognizing work that purports to improve LoRA in a practical sense. I am not sure if the experimental v
Good coverage of matrix factorization problems (symmetric/asymmetric, exact/over/under-parametrization); the writing is clear; some connection to practical problems.
The reviewer has concerns regarding the significance and practical relevance of the theoretical results and the way results are connected to LoRA. Specifically: 1. The matrix factorization problem is not a practical problem to be solved, but rather a problem through which one builds a theoretical understanding of GD and its variants in a nonconvex setting. In this regard, any procedure using the target matrix itself, in the reviewer's opinion, is not allowed: if we know the target matrix, we ca
The current work studies how initialization affects the ScaledGD method in matrix factorization problems and considers its extension to LoRA. I personally find the topic valuable. The improvement in the theoretical bound seems huge for the classical matrix factorization problem. For LoRA, recent work has studied customized optimizers and step-size schedulers; NoRA proposes a new initialization scheme with theoretical grounding. Moreover, compared to PiSSA, NoRA is naturally zero-initialized and is
1. The convergence bound for the under-parameterized case involves weak optimality. I'm not very familiar with this notion. The authors show that all globally optimal solutions are weakly optimal; is there any intuition behind weak optimality, or was it introduced only for technical reasons? How much larger is the space of weakly optimal solutions compared to that of truly optimal ones? 2. Ideally to apply Nyström initialization to LoRA, one should learn $\Delta W$, which suggests that one should know how much weight change is in ord
Taxonomy
Topics: Matrix Theory and Algorithms
Methods: Diffusion
