On the Crucial Role of Initialization for Matrix Factorization

Bingcong Li; Liang Zhang; Aryan Mokhtari; Niao He

arXiv:2410.18965·cs.LG·December 16, 2024

TL;DR

This paper shows that initialization governs the convergence rate of low-rank matrix factorization. It introduces Nyström initialization, which speeds up the convergence of Scaled Gradient Descent (ScaledGD), and applies the same idea to low-rank adapters (LoRA) for finetuning large models.

Contribution

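The paper introduces Nyström initialization for matrix factorization, proves that ScaledGD with this initialization converges quadratically in cases where only linear rates were previously known, and extends the initialization to low-rank adapters such as LoRA for large-scale models.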

Findings

01

Nyström initialization significantly improves the global convergence of ScaledGD in both symmetric and asymmetric matrix factorization.

02

NoRA outperforms standard LoRA across multiple tasks.

03

ScaledGD with Nyström initialization achieves quadratic convergence where only linear rates were previously known.

Abstract

This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nyström initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nyström initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nyström initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
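To make the abstract's claim concrete, here is a minimal sketch for the symmetric, exactly parametrized case: it factorizes a rank-r positive semidefinite matrix A as X X^T. Several details are assumptions not stated on this page: the initialization is taken to be the Nyström-style sketch X0 = A Ω with a Gaussian Ω, the update is the standard ScaledGD rule X ← X − η (X X^T − A) X (X^T X)^{-1}, and the step size η = 0.5 is an illustrative choice. The helper names (`matmul`, `inverse`, `scaled_gd_step`, `run`) are hypothetical, and the code uses plain Python lists so that it runs without dependencies.

```python
import random

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for a small r x r matrix)."""
    k = len(M)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(M)]
    for c in range(k):
        piv = max(range(c, k), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for i in range(k):
            if i != c:
                f = aug[i][c]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[c])]
    return [row[k:] for row in aug]

def frob_error(X, A):
    """Frobenius norm of X X^T - A."""
    XXt = matmul(X, transpose(X))
    return sum((a - b) ** 2 for ra, rb in zip(XXt, A) for a, b in zip(ra, rb)) ** 0.5

def scaled_gd_step(X, A, eta):
    """One ScaledGD step: X <- X - eta * (X X^T - A) X (X^T X)^{-1}."""
    Xt = transpose(X)
    R = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(matmul(X, Xt), A)]
    D = matmul(matmul(R, X), inverse(matmul(Xt, X)))
    return [[x - eta * d for x, d in zip(rx, rd)] for rx, rd in zip(X, D)]

def run(n=6, r=2, steps=20, eta=0.5, seed=0):
    rng = random.Random(seed)
    Z = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(n)]
    A = matmul(Z, transpose(Z))      # rank-r PSD target matrix
    Omega = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(n)]
    X = matmul(A, Omega)             # assumed Nystrom-style init: X0 = A @ Omega
    errors = [frob_error(X, A)]
    for _ in range(steps):
        X = scaled_gd_step(X, A, eta)
        errors.append(frob_error(X, A))
    return errors

errors = run()
for t, e in enumerate(errors):
    print(t, e)
```

Under these assumptions X0 lies in the column space of A, and with η = 0.5 the update reduces to X ← (X + A X (X^T X)^{-1}) / 2, a matrix analogue of the Babylonian square-root iteration. In the printed errors, the number of correct digits should roughly double per step once the iterate is close to a solution. The error can grow during the first step or two before that phase begins.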

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 5·Confidence 3

Strengths

On one hand, I feel that the quadratic convergence with the Nyström initialization is significant. I have not directly worked in the area of matrix completion, so I do not have full confidence in this judgment, but this type of quadratic convergence result seems quite important. At the same time, however, I'm not quite sure if the quadratic convergence is coming from the scaled GD algorithm or from the Nyström initialization. Since the scaled GD algorithm is basically a Newton update, shouldn

Weaknesses

On the other hand, I feel that the matrix completion result and the Nyström-initialized LoRA (NoRA) are strongly connected. The theory doesn't apply to the NoRA setting, but I do see that the Nyström initialization is a nice conceptual motivation for a potentially better LoRA initialization. However, now with the flood of empirical LoRA papers, I think the community has a very high bar for recognizing work that purports to improve LoRA in a practical sense. I am not sure if the experimental v

Reviewer 02·Rating 5·Confidence 5

Strengths

Good coverage on matrix factorization problems (symmetric/asymmetric, exact/over/under-parametrization); Writings are clear; Some connection to practical problems.

Weaknesses

The reviewer has concerns regarding the significance and practical relevance of the theoretical results and the way results are connected to LoRA. Specifically: 1. The matrix factorization problem is not a practical problem to be solved, but rather a problem through which one builds a theoretical understanding of GD and its variants in a nonconvex setting. In this regard, any procedure using the target matrix itself, in the reviewer's opinion, is not allowed: if we know the target matrix, we ca

Reviewer 03·Rating 8·Confidence 4

Strengths

The current work studies how initialization affects ScaledGD method in matrix factorization problems, and considers its extension to LoRA model. I personally find the topic valuable. The theoretic bound improvement seems huge for classic matrix factorization problem. For LoRA model, recent work has studied customized optimizers and stepsize scheduler, NoRA proposes a new initialization scheme which has its theoretic grounds. Moreover, compared to PiSSA, NoRA is naturally zero-initialized and is

Weaknesses

1. The convergence bound for the under-parameterized case involves weak optimality. I'm not very familiar with this notion. The authors show that all globally optimal solutions are weakly optimal, but is there any intuition behind weak optimality, or is it just an artifact of the analysis? How much larger is the space of weakly optimal solutions than the space of truly optimal ones? 2. Ideally, to apply Nyström initialization to LoRA, one should learn $\Delta W$, which suggests that one should know how much weight change is in ord

Videos

On the Crucial Role of Initialization for Matrix Factorization· slideslive

Taxonomy

Topics·Matrix Theory and Algorithms

Methods·Diffusion