Convergence of Alternating Gradient Descent for Matrix Factorization

Rachel Ward; Tamara G. Kolda

arXiv:2305.06927·cs.LG·February 9, 2024·1 cites

TL;DR

This paper proves that alternating gradient descent converges efficiently to a low-rank matrix factorization for asymmetric matrices, with a simple proof technique and practical initialization that improves convergence.

Contribution

It provides a convergence guarantee for AGD on asymmetric matrix factorization with a novel, simple proof and effective random initialization.

Findings

01

AGD reaches $\frac{1}{\text{poly}(n)}$-accuracy in $O\left(\left(\frac{\text{cond}(\mathbf{A})}{\text{approximation}}\right)^2 \text{polylog}(n)\right)$ iterations.

02

A simple uniform PL inequality and Lipschitz smoothness are established for the iterates.

03

Experimental results show the proposed initialization significantly accelerates convergence.
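As a rough illustration of why the PL inequality and Lipschitz smoothness in finding 02 matter, the textbook argument below shows how the two properties combine into a linear rate. It is written for plain gradient descent on a generic objective $f$ with minimum value $0$; the constants $L$ and $\mu$, the stacked variable $z_t = (X_t, Y_t)$, and the single-step (rather than alternating) form are assumptions made for this sketch, not the paper's exact statements or constants.

```latex
% Generic sketch: L-smoothness + PL inequality => linear convergence.
% Assumptions: step size \eta \le 1/L, and \min f = 0 (exact factorization exists).
\begin{align*}
f(z_{t+1}) &\le f(z_t) - \eta\Bigl(1 - \tfrac{L\eta}{2}\Bigr)\,\|\nabla f(z_t)\|^2
  && \text{(smoothness, } z_{t+1} = z_t - \eta \nabla f(z_t)\text{)} \\
           &\le f(z_t) - \tfrac{\eta}{2}\,\|\nabla f(z_t)\|^2
  && (\eta \le 1/L) \\
           &\le (1 - \eta\mu)\, f(z_t)
  && \text{(PL: } \|\nabla f(z_t)\|^2 \ge 2\mu f(z_t)\text{)} \\
f(z_T)     &\le (1 - \eta\mu)^T f(z_0) \le e^{-\eta\mu T} f(z_0),
\end{align*}
% so T \ge \frac{1}{\eta\mu}\log(1/\epsilon) iterations give f(z_T) \le \epsilon f(z_0).
```

The iteration count therefore scales like $\frac{L}{\mu}\log(1/\epsilon)$ when $\eta$ is of order $1/L$, which is why a PL constant and a smoothness constant that hold uniformly along the iterates translate into a condition-number-type bound.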

Abstract

We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $A \in \mathbb{R}^{m \times n}$, $T = C \left(\frac{\sigma_{1}(A)}{\sigma_{r}(A)}\right)^{2} \log(1/\epsilon)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\|A - X Y^{T}\|^{2} \leq \epsilon \|A\|^{2}$ with high probability starting from an atypical random initialization. The factors have rank $d \geq r$ so that $X_{T} \in \mathbb{R}^{m \times d}$ and $Y_{T} \in \mathbb{R}^{n \times d}$, and mild overparameterization suffices for the constant $C$ in the iteration complexity $T$ to be an absolute constant. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but rather significantly…
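The alternating update described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the objective is taken as $\frac{1}{2}\|A - XY^{T}\|_F^2$ (the factor $\frac{1}{2}$ is a convention assumed here), the start is a plain small Gaussian initialization rather than the paper's proposed one (which this summary does not specify), and the names `agd_factorize`, `relative_sq_error`, `init_scale`, and the step size `0.2 / sigma[0]` are hypothetical choices for the example.

```python
import numpy as np

def agd_factorize(A, d, eta, num_iters, init_scale=1e-3, seed=0):
    """Alternating gradient descent on f(X, Y) = 0.5 * ||A - X Y^T||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Assumption: small i.i.d. Gaussian start, NOT the paper's initialization.
    X = init_scale * rng.standard_normal((m, d))
    Y = init_scale * rng.standard_normal((n, d))
    for _ in range(num_iters):
        # Gradient step in X with Y held fixed: grad_X f = (X Y^T - A) Y.
        X = X - eta * (X @ Y.T - A) @ Y
        # Gradient step in Y using the updated X: grad_Y f = (X Y^T - A)^T X.
        Y = Y - eta * (X @ Y.T - A).T @ X
    return X, Y

def relative_sq_error(A, X, Y):
    """Squared Frobenius error of X Y^T relative to ||A||_F^2."""
    return np.linalg.norm(A - X @ Y.T) ** 2 / np.linalg.norm(A) ** 2

# Synthetic rank-3 target with known singular values 3, 2, 1 (condition number 3).
rng = np.random.default_rng(1)
m, n, r, d = 30, 20, 3, 5          # d >= r: mildly overparameterized factors
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
sigma = np.array([3.0, 2.0, 1.0])
A = U @ np.diag(sigma) @ V.T

eta = 0.2 / sigma[0]               # fixed step size, scaled by sigma_1(A)
X, Y = agd_factorize(A, d=d, eta=eta, num_iters=3000)
err = relative_sq_error(A, X, Y)
print(f"relative squared error: {err:.3e}")
```

The only difference from simultaneous gradient descent is that the `Y` update sees the freshly updated `X`. Swapping in a different initialization only requires changing the two lines that create `X` and `Y`.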

Peer Reviews

Decision·NeurIPS 2023 spotlight

Reviewer 01 · Rating 8 · Strong Accept: Technically strong paper, with novel ideas, excellent impact on at least one area, or high-to-excellent impact on multiple areas, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations. · Confidence 4

Strengths

The (asymmetric) matrix factorization problem is both an important problem in itself, and an important test bed for developing techniques to analyze non-convex optimization. I think this paper takes a significant step towards understanding (alternating) gradient descent in this setting. It improves on the previous best known bound for GD which was ~ $\kappa^3$. I find the improvement due to the asymmetric initialization quite illuminating, and I think it will inspire other advantageous (yet quic

Weaknesses

I don't see any major weaknesses.

Reviewer 02 · Rating 8 · Strong Accept: Technically strong paper, with novel ideas, excellent impact on at least one area, or high-to-excellent impact on multiple areas, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations. · Confidence 3

Strengths

- Excellent novelty
- Theoretical proof is elegant
- Promising numerical results

Weaknesses

It is worth noting that there are numerous competitors beyond first-order methods in the field to solve an exact rank-r matrix decomposition problem. To further enrich the discussion, it would be valuable if the authors explored potential avenues for generalizing their results. For example, investigating the application of their findings in matrix completion or the singular value decomposition (SVD) of a low-rank matrix with added noise could provide insights into the broader applicability of th

Reviewer 03 · Rating 8 · Strong Accept: Technically strong paper, with novel ideas, excellent impact on at least one area, or high-to-excellent impact on multiple areas, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations. · Confidence 4

Strengths

- The main technical contribution of this paper is the introduction and analysis of an asymmetric warm starting rule for matrix factorization. To the best of my knowledge this is an original technical contribution.
- The theoretical analysis is interesting and advances the state of the art. Matrix factorization is a fundamental optimization task that among others, can be viewed as a subproblem of neural network training. As a result, progress in theoretical understanding of optimization algorith

Weaknesses

- It seems to me that the assumption in line 454 that $V^\top \Phi_1$ has i.i.d. entries is incorrect, and that it only has i.i.d. columns. In this case, a different lemma should be used in place of Proposition A.1. Does this change anything in the bounds?
- I have some comments about the experiments in Section 6. It seems like the authors compare different algorithms using the same step size; however, in my opinion it would be fairer to compare using the best fixed step size for each algorithm. This i

Videos

Convergence of Alternating Gradient Descent for Matrix Factorization· slideslive

Taxonomy

Topics: Face and Expression Recognition · Matrix Theory and Algorithms