Global Optimality in Low-rank Matrix Optimization

Zhihui Zhu; Qiuwei Li; Gongguo Tang; Michael B. Wakin

arXiv:1702.07945·cs.IT·July 4, 2018

TL;DR

This paper analyzes the geometric landscape of low-rank matrix optimization problems, showing that under certain conditions, the factored formulation has no spurious local minima and satisfies the strict saddle property, enabling global convergence of algorithms.

Contribution

The paper gives a geometric analysis of the factored low-rank matrix optimization problem for well-conditioned objective functions, establishing conditions under which the factored problem has no spurious local minima and satisfies the strict saddle property.

Findings

01. The reformulated problem has no spurious local minima.

02. The objective function satisfies the strict saddle property.

03. Gradient-based algorithms can provably find global solutions.

Abstract

This paper considers the minimization of a general objective function $f (X)$ over the set of rectangular $n \times m$ matrices that have rank at most $r$ . To reduce the computational burden, we factorize the variable $X$ into a product of two smaller matrices and optimize over these two matrices instead of $X$ . Despite the resulting nonconvexity, recent studies in matrix completion and sensing have shown that the factored problem has no spurious local minima and obeys the so-called strict saddle property (the function has a directional negative curvature at all critical points but local minima). We analyze the global geometry for a general and yet well-conditioned objective function $f (X)$ whose restricted strong convexity and restricted strong smoothness constants are comparable. In particular, we show that the reformulated objective function has no spurious local minima and obeys the…

Equations

$$\operatorname*{minimize}_{X\in\mathbb{R}^{n\times m}} \ f(X) \quad \text{subject to} \quad \operatorname{rank}(X)\leq r,$$

$$\operatorname*{minimize}_{U\in\mathbb{R}^{n\times r},\,V\in\mathbb{R}^{m\times r}} \ h(U,V):=f(UV^{\mathrm{T}}).$$

$$\alpha\|G\|_{F}^{2}\leq[\nabla^{2}f(X)](G,G)\leq\beta\|G\|_{F}^{2}$$

$$f(X^{\star})\leq f(X),\quad\forall\,X\in\mathbb{R}^{n\times m},\ \operatorname{rank}(X)\leq r$$

$$\nabla f(X^{\star})=0.$$

$$f(X)=f(X^{\star})+\langle\nabla f(X^{\star}),X-X^{\star}\rangle+\frac{1}{2}[\nabla^{2}f(\widetilde{X})](X-X^{\star},X-X^{\star}),$$

$$f(X)-f(X^{\star})=\frac{1}{2}[\nabla^{2}f(\widetilde{X})](X-X^{\star},X-X^{\star})\geq\frac{\alpha}{2}\|X-X^{\star}\|_{F}^{2}.$$

$$W=\begin{bmatrix}U\\ V\end{bmatrix},\quad\widehat{W}=\begin{bmatrix}U\\ -V\end{bmatrix},\quad X=UV^{\mathrm{T}}.$$

$$U^{\star}=Q_{U^{\star}}{\Sigma^{\star}}^{1/2},\quad V^{\star}=Q_{V^{\star}}{\Sigma^{\star}}^{1/2},$$

$$W^{\star}=\begin{bmatrix}U^{\star}\\ V^{\star}\end{bmatrix},\quad\widehat{W}^{\star}=\begin{bmatrix}U^{\star}\\ -V^{\star}\end{bmatrix}.$$

$$g(U,V)=\frac{\mu}{4}\left\|U^{\mathrm{T}}U-V^{\mathrm{T}}V\right\|_{F}^{2}$$

$$\operatorname*{minimize}_{U\in\mathbb{R}^{n\times r},\,V\in\mathbb{R}^{m\times r}} \ \rho(U,V):=f(UV^{\mathrm{T}})+g(U,V),$$

$$\mathcal{E}:=\left\{W=\begin{bmatrix}U\\ V\end{bmatrix}:U^{\mathrm{T}}U-V^{\mathrm{T}}V=0\right\}.$$

$$U^{\mathrm{T}}U-V^{\mathrm{T}}V=0.$$

$$\lambda_{\min}(\nabla^{2}\rho(W))\leq\begin{cases}-0.08\,\alpha\,\sigma_{r}(X^{\star}),&r=r^{\star},\\ -0.05\,\alpha\cdot\min\{\sigma_{r^{c}}^{2}(W),\,2\sigma_{r^{\star}}(X^{\star})\},&r>r^{\star},\\ -0.1\,\alpha\,\sigma_{r^{\star}}(X^{\star}),&r^{c}=0,\end{cases}$$

$$\Omega=\begin{bmatrix}\sqrt{1+a}&1\\ 1&\sqrt{1+a}\end{bmatrix}$$

$$X^{\star}=\begin{bmatrix}1&1\\ 1&1\end{bmatrix},\quad\text{and}\quad U=\begin{bmatrix}x\\ y\end{bmatrix}.$$

$$h(U)=\frac{1}{2}\left\|\Omega\odot(UU^{\mathrm{T}}-X^{\star})\right\|_{F}^{2}=\frac{1+a}{2}(x^{2}-1)^{2}+\frac{1+a}{2}(y^{2}-1)^{2}+(xy-1)^{2},$$

$$\nabla h(U)=2\begin{bmatrix}(a+1)(x^{2}-1)x+y(xy-1)\\ (a+1)(y^{2}-1)y+x(xy-1)\end{bmatrix},$$

$$\nabla^{2}h(U)=2\begin{bmatrix}y^{2}+(3x^{2}-1)(a+1)&2xy-1\\ 2xy-1&x^{2}+(3y^{2}-1)(a+1)\end{bmatrix}.$$

$$U=\begin{bmatrix}\sqrt{\frac{a}{a+2}}\\ -\sqrt{\frac{a}{a+2}}\end{bmatrix}$$

$$\nabla^{2}h(U)=\begin{bmatrix}4a+\frac{8}{a+2}-6&\frac{8}{a+2}-6\\ \frac{8}{a+2}-6&4a+\frac{8}{a+2}-6\end{bmatrix},$$

$$\lambda_{1}=\frac{4(a-2)(a+1)}{a+2}\ \begin{cases}<0,&a\in[0,2),\\ >0,&a>2,\end{cases}$$

$$f(X)=\frac{1}{2}\|\mathcal{A}(X-X^{\star})\|_{2}^{2}.$$

$$(1-\delta_{r})\|X\|_{F}^{2}\leq\|\mathcal{A}(X)\|^{2}\leq(1+\delta_{r})\|X\|_{F}^{2}$$

$$\nabla f(X^{\star})=\mathcal{A}^{*}\mathcal{A}(X^{\star}-X^{\star})=0,$$

$$\nabla^{2}f(X)[Y,Y]=\|\mathcal{A}(Y)\|^{2}.$$

$$(1-\delta_{4r})\|Y\|_{F}^{2}\leq\|\mathcal{A}(Y)\|^{2}\leq(1+\delta_{4r})\|Y\|_{F}^{2}$$

$$\operatorname*{minimize}_{U\in\mathbb{R}^{n\times r},\,V\in\mathbb{R}^{m\times r}} \ \frac{1}{2}\left\|\mathcal{A}(UV^{\mathrm{T}}-X^{\star})\right\|_{2}^{2}+g(U,V),$$

Full text

Global Optimality in Low-rank Matrix Optimization

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B. Wakin

This work was supported by NSF grant CCF-1409261, NSF grant CCF-1464205, and NSF CAREER grant CCF-1149225. Z. Zhu, Q. Li, G. Tang, and M. B. Wakin are with the Department of Electrical Engineering, Colorado School of Mines, Golden, CO 80401 USA. Email: {zzhu, qiuli, gtang, mwakin}@mines.edu.

Abstract

This paper considers the minimization of a general objective function $f(\boldsymbol{X})$ over the set of rectangular $n\times m$ matrices that have rank at most $r$ . To reduce the computational burden, we factorize the variable $\boldsymbol{X}$ into a product of two smaller matrices and optimize over these two matrices instead of $\boldsymbol{X}$ . Despite the resulting nonconvexity, recent studies in matrix completion and sensing have shown that the factored problem has no spurious local minima and obeys the so-called strict saddle property (the function has a directional negative curvature at all critical points but local minima). We analyze the global geometry for a general and yet well-conditioned objective function $f(\boldsymbol{X})$ whose restricted strong convexity and restricted strong smoothness constants are comparable. In particular, we show that the reformulated objective function has no spurious local minima and obeys the strict saddle property. These geometric properties imply that a number of iterative optimization algorithms (such as gradient descent) can provably solve the factored problem with global convergence.

Index Terms:

Low-rank matrix optimization, matrix sensing, nonconvex optimization, optimization geometry, strict saddle

I Introduction

Consider the minimization of a general objective function $f(\boldsymbol{X})$ over all low-rank $n\times m$ matrices:

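$$\operatorname*{minimize}_{\boldsymbol{X}\in\mathbb{R}^{n\times m}} \ f(\boldsymbol{X}) \quad \text{subject to} \quad \operatorname{rank}(\boldsymbol{X})\leq r, \tag{1}$$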

where the objective function $f:\mathbb{R}^{n\times m}\rightarrow\mathbb{R}$ is smooth. Low-rank matrix optimizations of the form (1) appear in a wide variety of applications, including quantum tomography [1, 2], collaborative filtering [3, 4], sensor localization [5], low-rank matrix recovery from compressive measurements [6, 7], and matrix completion [8, 9]. Due to the rank constraint, however, low-rank matrix optimizations of the form (1) are highly nonconvex and computationally NP-hard in general [10] even if $f$ itself is convex. In order to deal with the rank constraint and to find a low-rank solution, the nuclear norm is widely used in matrix inverse problems [11, 7] arising in machine learning [12], signal processing [13], and control [14]. Although nuclear norm minimization enjoys strong statistical guarantees [8], its computational complexity is very high (as most algorithms require performing an expensive singular value decomposition (SVD) in each iteration), prohibiting it from scaling to practical problems.

To relieve the computational bottleneck and provide an alternative way of dealing with the rank constraint, recent studies propose to factorize the variable into the Burer-Monteiro type decomposition [15, 16] with $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}$ , and optimize over the $n\times r$ and $m\times r$ matrices $\boldsymbol{U}$ and $\boldsymbol{V}$ . With this parameterization of $\boldsymbol{X}$ , we can recast (1) into the following program:

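$$\operatorname*{minimize}_{\boldsymbol{U}\in\mathbb{R}^{n\times r},\,\boldsymbol{V}\in\mathbb{R}^{m\times r}} \ h(\boldsymbol{U},\boldsymbol{V}):=f(\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}). \tag{2}$$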

The bilinear nature of the parameterization renders the objective function of (2) nonconvex even when $f(\boldsymbol{X})$ is a convex function. Hence, the objective function in (2) can potentially have spurious local minima (i.e., local minimizers that are not global minimizers) or “bad” saddle points that prevent a number of iterative algorithms from converging to the global solution. By analyzing the landscape of nonconvex functions, several recent works have shown that the factored objective function $h(\boldsymbol{U},\boldsymbol{V})$ in certain matrix inverse problems has no spurious local minima [17, 18, 19].

We generalize this line of work by focusing on a general objective function $f(\boldsymbol{X})$ in the optimization (1), not necessarily a quadratic loss function coming from a matrix inverse problem. By focusing on a general objective function, we attempt to provide a unifying framework for low-rank matrix optimizations with the factorization approach. We provide a geometric analysis for the factored program (2) and show that, under certain conditions on $f(\boldsymbol{X})$ , all critical points of the objective function $h(\boldsymbol{U},\boldsymbol{V})$ are well-behaved. Our characterization of the geometry of the objective function ensures that a number of iterative optimization algorithms converge to a global minimum.

I-A Summary of Results

The purpose of this paper is to analyze the geometry of the factored problem $h(\boldsymbol{U},\boldsymbol{V})$ in (2). In particular, we attempt to understand the behavior of all of the critical points of the objective function in the reformulated problem (2).

Before presenting our main results, we lay out the necessary assumptions on the objective function $f(\boldsymbol{X})$ . As is known, without any assumptions on the problem, even minimizing traditional quadratic objective functions is challenging. For this purpose, we focus on the model where $f(\boldsymbol{X})$ is $(2r,4r)$ -restricted strongly convex and smooth, i.e., for any $n\times m$ matrices $\boldsymbol{X},\boldsymbol{G}$ with $\operatorname{rank}(\boldsymbol{X})\leq 2r$ and $\operatorname{rank}(\boldsymbol{G})\leq 4r$ , the Hessian of $f(\boldsymbol{X})$ satisfies

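$$\alpha\|\boldsymbol{G}\|_{F}^{2}\leq[\nabla^{2}f(\boldsymbol{X})](\boldsymbol{G},\boldsymbol{G})\leq\beta\|\boldsymbol{G}\|_{F}^{2} \tag{3}$$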

for some positive $\alpha$ and $\beta$ . A similar assumption is also utilized in [20, Conditions 5.3 and 5.4]. With this assumption on $f(\boldsymbol{X})$ , we summarize our main results in the following informal theorem.

Theorem 1.

(informal) Suppose the function $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) and has a critical point $\boldsymbol{X}^{\star}\in\mathbb{R}^{n\times m}$ with $\operatorname{rank}(\boldsymbol{X}^{\star})=r^{\star}\leq r$ . Then the factored objective function $h(\boldsymbol{U},\boldsymbol{V})$ (with an additional regularizer, see Theorem 3) in (2) has no spurious local minima and obeys the strict saddle property (see Definition 3 in Section II).

Remark 1.

As guaranteed by Proposition 1 (in Section III), the $(2r,4r)$ -restricted strong convexity and smoothness property (3) ensures that $\boldsymbol{X}^{\star}$ is the unique global minimum of (1). Theorem 1 then implies that we can recover the rank- $r^{\star}$ global minimizer $\boldsymbol{X}^{\star}$ of (1) by many iterative algorithms (such as the trust region method [21] and stochastic gradient descent [22]) even from a random initialization. This is because 1) as guaranteed by Theorem 2, the strict saddle property ensures local search algorithms converge to a local minimum, and 2) there are no spurious local minima.

Remark 2.

Since our main result only requires the $(2r,4r)$ -restricted strong convexity and smoothness property (3), aside from low-rank matrix recovery [11], it can also be applied to many other low-rank matrix optimization problems [23] which do not necessarily involve quadratic loss functions. Typical examples include robust principal component analysis (PCA) [24, 25], 1-bit matrix completion [26, 27], and Poisson PCA [28].

Remark 3.

Similar results on positive semi-definite (PSD) matrix optimization problems (but without the rank constraint) with generic objective functions were obtained in [29]. We note that one cannot directly apply the results in [29] to the optimization (1) when the matrices under consideration are nonsymmetric or rectangular, even if we ignore the rank constraint. One could attempt to convert minimizing $f(\boldsymbol{X})$ over general $n\times m$ matrices into minimizing $q(\boldsymbol{Z})$ over the cone of PSD matrices of size $(m+n)\times(m+n)$ , where $\boldsymbol{X}$ and $\boldsymbol{X}^{\mathrm{T}}$ form the upper right and lower left blocks of $\boldsymbol{Z}$ . The problem with this transformation, however, is that $q(\boldsymbol{Z})$ will no longer satisfy the same properties as $f(\boldsymbol{X})$ , in particular the restricted strong convexity and smoothness condition (3) which is a key assumption utilized in [29]. For this reason, one cannot apply the results for the PSD optimization in [29] directly to our problem. In terms of the proof techniques, although the generalization from the PSD case might not seem technically challenging at first sight, quite a few technical difficulties had to be overcome to develop the theory for the general case in this paper. In fact, the non-triviality of extending to the nonsymmetric case is also highlighted in [30, 19].

I-B Related Works

Compared with the original program (1), the factored form (2) typically involves many fewer variables (or variables with much smaller size) and can be efficiently solved by simple but powerful methods (such as gradient descent [22, 31], the trust region method [32], and alternating methods [33]) for large-scale settings, though it is nonconvex. In recent years, tremendous effort has been devoted to analyzing nonconvex optimizations by exploiting the geometry of the corresponding objective functions. These works can be separated into two types based on whether the geometry is analyzed locally or globally. One type of work analyzes the behavior of the objective function in a small neighborhood containing the global optimum and requires a good initialization that is close enough to a global minimum. Problems such as phase retrieval [34], matrix sensing [30], and semi-definite optimization [35] have been studied in this way.

Another type of work attempts to analyze the landscape of the objective function and show that it obeys the strict saddle property. If this particular property holds, then simple algorithms such as gradient descent and the trust region method are guaranteed to converge to a local minimum from a random initialization [31, 22, 36] rather than requiring a good guess. We approach low-rank matrix optimization with general objective functions (1) via a similar geometric characterization. Similar geometric results are known for a number of problems including complete dictionary learning [36], phase retrieval [21], orthogonal tensor decomposition [22], and matrix inverse problems [17, 18, 37]. Empirical evidence also supports using the factorization approach for estimating a low-rank PSD matrix from a set of rank-one measurements corrupted by arbitrary outliers [38] and for recovering a dynamically evolving low-rank matrix from incomplete observations [39].

Our work is most closely related to certain recent works in low-rank matrix optimization. Bhojanapalli et al. [17] showed that the low-rank, PSD matrix sensing problem has no spurious local minima and obeys the strict saddle property. Similar results were exploited for PSD matrix completion [18], PSD matrix factorization [40] and low-rank, PSD matrix optimization problems with generic objective functions [29]. Our work extends this line of analysis to general low-rank matrix (not necessarily PSD or even square) optimization problems. Another closely related work considers the low-rank, non-square matrix sensing problem and matrix completion with the factorization approach [19, 41, 42]. We note that our general objective function framework includes the low-rank matrix sensing problem as a special case (see Section III-C). Furthermore, our result covers both over-parameterization where $r>r^{\star}$ and exact parameterization where $r=r^{\star}$ . Wang et al. [20] also considered the factored low-rank matrix minimization problem with a general objective function which satisfies the restricted strong convexity and smoothness condition. Their algorithms require good initializations for global convergence since they characterized only the local landscapes around the global optima. By categorizing the behavior of all the critical points, our work differs from [20] in that we instead characterize the global landscape of the factored objective function.

This paper continues in Section II with formal definitions for strict saddles and the strict saddle property. We present the main results and their implications in matrix sensing, weighted low-rank approximation, and 1-bit matrix completion in Section III. The proof of our main results is given in Section IV. We conclude the paper in Section VI.

II Preliminaries

II-A Notation

To begin, we first briefly introduce some notation used throughout the paper. The symbols ${\bf I}$ and ${\bf 0}$ respectively represent the identity matrix and zero matrix with appropriate sizes. The set of $r\times r$ orthonormal matrices is denoted by $\mathcal{O}_{r}:=\{\boldsymbol{R}\in\mathbb{R}^{r\times r}:\boldsymbol{R}^{\mathrm{T}}\boldsymbol{R}={\bf I}\}$ . If a function $h(\boldsymbol{U},\boldsymbol{V})$ has two arguments, $\boldsymbol{U}\in\mathbb{R}^{n\times r}$ and $\boldsymbol{V}\in\mathbb{R}^{m\times r}$ , we occasionally use the notation $h(\boldsymbol{W})$ when we put these two arguments into a new one as $\boldsymbol{W}=\begin{bmatrix}\boldsymbol{U}\\ \boldsymbol{V}\end{bmatrix}$ . For a scalar function $f(\boldsymbol{Z})$ with a matrix variable $\boldsymbol{Z}\in\mathbb{R}^{n\times m}$ , its gradient is an $n\times m$ matrix whose $(i,j)$ -th entry is $[\nabla f(\boldsymbol{Z})]_{ij}=\frac{\partial f(\boldsymbol{Z})}{\partial Z_{ij}}$ for all $i\in[n],j\in[m]$ . Here $[n]=\{1,2,\ldots,n\}$ for any $n\in\mathbb{N}$ and $Z_{ij}$ is the $(i,j)$ -th entry of the matrix $\boldsymbol{Z}$ . The Hessian of $f(\boldsymbol{Z})$ can be viewed as an $nm\times nm$ matrix $[\nabla^{2}f(\boldsymbol{Z})]_{ij}=\frac{\partial^{2}f(\boldsymbol{Z})}{\partial z_{i}\partial z_{j}}$ for all $i,j\in[nm]$ , where $z_{i}$ is the $i$ -th entry of the vectorization of $\boldsymbol{Z}$ . An alternative way to represent the Hessian is by a bilinear form defined via $[\nabla^{2}f(\boldsymbol{Z})](\boldsymbol{A},\boldsymbol{B})=\sum_{i,j,k,l}\frac{\partial^{2}f(\boldsymbol{Z})}{\partial Z_{ij}\partial Z_{kl}}A_{ij}B_{kl}$ for any $\boldsymbol{A},\boldsymbol{B}\in\mathbb{R}^{n\times m}$ . The bilinear form for the Hessian is widely utilized throughout the paper.

II-B Strict Saddle Property

Suppose $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a twice continuously differentiable objective function. We begin with the notion of strict saddles and the strict saddle property.

Definition 1 (Critical points).

We say $\boldsymbol{x}$ is a critical point if the gradient at $\boldsymbol{x}$ vanishes, i.e., $\nabla h(\boldsymbol{x})={\bf 0}$ .

Definition 2 (Strict saddles).

A critical point $\boldsymbol{x}$ is a strict saddle if the Hessian matrix evaluated at this point has a strictly negative eigenvalue, i.e., $\lambda_{\min}(\nabla^{2}h(\boldsymbol{x}))<0$ .

Definition 3 (Strict saddle property [22]).

A twice differentiable function satisfies the strict saddle property if each critical point either corresponds to a local minimum or is a strict saddle.

Intuitively, the strict saddle property requires a function to have a directional negative curvature at all critical points but local minima. This property allows a number of iterative algorithms such as noisy gradient descent [22] and the trust region method [43] to further decrease the function value at all the strict saddles and thus converge to a local minimum.
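Definitions 1 to 3 translate into a simple second-order test at a critical point. The following is a minimal sketch rather than anything from the paper: the function name `classify_critical_point`, the tolerance `tol`, and the two toy Hessians are illustrative assumptions.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Classify a critical point from the (symmetric) Hessian evaluated there."""
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    lam_min = np.linalg.eigvalsh(hessian)[0]
    if lam_min < -tol:
        return "strict saddle"          # Definition 2: a strictly negative eigenvalue
    if lam_min > tol:
        return "local minimum"          # positive definite Hessian
    return "degenerate (second-order test inconclusive)"

# Toy example h(x, y) = x^2 - y^2: the origin is a critical point with Hessian diag(2, -2).
saddle_hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
# Toy example h(x, y) = x^2 + y^2: the origin is a critical point with Hessian diag(2, 2).
bowl_hessian = np.array([[2.0, 0.0], [0.0, 2.0]])

print(classify_critical_point(saddle_hessian))
print(classify_critical_point(bowl_hessian))
```

A function has the strict saddle property when no critical point that fails to be a local minimum falls into the third, degenerate branch.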

Theorem 2.

[32, 22, 31] (informal) For a twice continuously differentiable objective function satisfying the strict saddle property, a number of iterative optimization algorithms (such as gradient descent and the trust region method) can find a local minimum.

III Problem Formulation and Main Results

III-A Problem Formulation

This paper considers the problem (1) of minimizing a general function $f(\boldsymbol{X})$ (over the set of low-rank matrices) which is assumed to have a low-rank critical point $\boldsymbol{X}^{\star}$ with $\operatorname{rank}(\boldsymbol{X}^{\star})=r^{\star}\leq r$ such that $\nabla f(\boldsymbol{X}^{\star})={\bf 0}$ . Because of the restricted strong convexity and smoothness condition (3), the following result establishes that if $f(\boldsymbol{X})$ has a critical point $\boldsymbol{X}^{\star}$ with $\operatorname{rank}(\boldsymbol{X}^{\star})\leq r$ , then it is the unique global minimum of (1).

Proposition 1.

Suppose $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) with positive $\alpha$ and $\beta$ . Assume $\boldsymbol{X}^{\star}$ is a critical point of $f(\boldsymbol{X})$ with $\operatorname{rank}(\boldsymbol{X}^{\star})=r^{\star}\leq r$ . Then $\boldsymbol{X}^{\star}$ is the global minimum of (1), i.e.,

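$$f(\boldsymbol{X}^{\star})\leq f(\boldsymbol{X}),\quad\forall\,\boldsymbol{X}\in\mathbb{R}^{n\times m},\ \operatorname{rank}(\boldsymbol{X})\leq r,$$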

and the equality holds only at $\boldsymbol{X}=\boldsymbol{X}^{\star}$ .

Proof of Proposition 1.

First note that if $\boldsymbol{X}^{\star}$ is a critical point of $f(\boldsymbol{X})$ , then

$$\nabla f(\boldsymbol{X}^{\star})={\bf 0}.$$

Now for any $\boldsymbol{X}\in\mathbb{R}^{n\times m}$ with $\operatorname{rank}(\boldsymbol{X})\leq r$ , the second order Taylor expansion gives

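$$f(\boldsymbol{X})=f(\boldsymbol{X}^{\star})+\langle\nabla f(\boldsymbol{X}^{\star}),\boldsymbol{X}-\boldsymbol{X}^{\star}\rangle+\frac{1}{2}[\nabla^{2}f(\widetilde{\boldsymbol{X}})](\boldsymbol{X}-\boldsymbol{X}^{\star},\boldsymbol{X}-\boldsymbol{X}^{\star}),$$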

where $\widetilde{\boldsymbol{X}}=t\boldsymbol{X}^{\star}+(1-t)\boldsymbol{X}$ for some $t\in[0,1]$ . This Taylor expansion together with $\nabla f(\boldsymbol{X}^{\star})={\bf 0}$ and (3) (both $\widetilde{\boldsymbol{X}}$ and $\boldsymbol{X}-\boldsymbol{X}^{\star}$ have rank at most $2r$ ) gives

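$$f(\boldsymbol{X})-f(\boldsymbol{X}^{\star})=\frac{1}{2}[\nabla^{2}f(\widetilde{\boldsymbol{X}})](\boldsymbol{X}-\boldsymbol{X}^{\star},\boldsymbol{X}-\boldsymbol{X}^{\star})\geq\frac{\alpha}{2}\|\boldsymbol{X}-\boldsymbol{X}^{\star}\|_{F}^{2}.$$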

∎

With this, in the sequel, we use $\boldsymbol{X}^{\star}$ to denote the global minimum of (1) (i.e., the low-rank critical point of $f(\boldsymbol{X})$ ), unless stated otherwise. We note that the assumption of the existence of a low-rank critical point $\boldsymbol{X}^{\star}$ is very mild and holds in many matrix inverse problems [7, 8], where the unknown matrix to be recovered is a critical point of $f$ .

We factorize the variable $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}$ with $\boldsymbol{U}\in\mathbb{R}^{n\times r},\boldsymbol{V}\in\mathbb{R}^{m\times r}$ and transform (1) into its factored counterpart (2). Throughout the paper, $\boldsymbol{X}$ , $\boldsymbol{W}$ and $\widehat{\boldsymbol{W}}$ are matrices depending on $\boldsymbol{U}$ and $\boldsymbol{V}$ :

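$$\boldsymbol{W}=\begin{bmatrix}\boldsymbol{U}\\ \boldsymbol{V}\end{bmatrix},\quad\widehat{\boldsymbol{W}}=\begin{bmatrix}\boldsymbol{U}\\ -\boldsymbol{V}\end{bmatrix},\quad\boldsymbol{X}=\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}.$$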

Although the new variable $\boldsymbol{W}$ has much smaller size than $\boldsymbol{X}$ when $r\ll\min\{n,m\}$ , the objective function in the factored problem (2) may have a much more complicated landscape due to the bilinear form about $\boldsymbol{U}$ and $\boldsymbol{V}$ . The reformulated objective function $h(\boldsymbol{U},\boldsymbol{V})$ could introduce spurious local minima or degenerate saddle points even when $f(\boldsymbol{X})$ is convex. Our goal is to guarantee that this does not happen.

Let $\boldsymbol{X}^{\star}=\boldsymbol{Q}_{\boldsymbol{U}^{\star}}\boldsymbol{\Sigma}^{\star}\boldsymbol{Q}_{\boldsymbol{V}^{\star}}^{\mathrm{T}}$ denote an SVD of $\boldsymbol{X}^{\star}$ , where $\boldsymbol{Q}_{\boldsymbol{U}^{\star}}\in\mathbb{R}^{n\times r}$ and $\boldsymbol{Q}_{\boldsymbol{V}^{\star}}\in\mathbb{R}^{m\times r}$ are orthonormal matrices of appropriate sizes, and $\boldsymbol{\Sigma}^{\star}\in\mathbb{R}^{r\times r}$ is a diagonal matrix with non-negative diagonal (but with some zeros on the diagonal if $r>r^{\star}=\operatorname{rank}(\boldsymbol{X}^{\star})$ ). We denote

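$$\boldsymbol{U}^{\star}=\boldsymbol{Q}_{\boldsymbol{U}^{\star}}{\boldsymbol{\Sigma}^{\star}}^{1/2},\quad\boldsymbol{V}^{\star}=\boldsymbol{Q}_{\boldsymbol{V}^{\star}}{\boldsymbol{\Sigma}^{\star}}^{1/2},$$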

where $\boldsymbol{X}^{\star}=\boldsymbol{U}^{\star}\boldsymbol{V}^{\star\mathrm{T}}$ forms a balanced factorization of $\boldsymbol{X}^{\star}$ since $\boldsymbol{U}^{\star}$ and $\boldsymbol{V}^{\star}$ have the same singular values. Throughout the paper, we utilize the following two ways to stack $\boldsymbol{U}^{\star}$ and $\boldsymbol{V}^{\star}$ together:

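$$\boldsymbol{W}^{\star}=\begin{bmatrix}\boldsymbol{U}^{\star}\\ \boldsymbol{V}^{\star}\end{bmatrix},\quad\widehat{\boldsymbol{W}}^{\star}=\begin{bmatrix}\boldsymbol{U}^{\star}\\ -\boldsymbol{V}^{\star}\end{bmatrix}.$$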

Before moving on, we note that for any solution $(\boldsymbol{U},\boldsymbol{V})$ to (2), $(\boldsymbol{U}\boldsymbol{\Psi},\boldsymbol{V}\boldsymbol{\Phi})$ is also a solution to (2) for any $\boldsymbol{\Psi},\boldsymbol{\Phi}\in\mathbb{R}^{r\times r}$ such that $\boldsymbol{U}\boldsymbol{\Psi}\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{V}^{\mathrm{T}}=\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}$ . In order to address this ambiguity (i.e., to reduce the search space of $\boldsymbol{W}$ for (2)), we utilize the trick in [30, 19, 20] by introducing a regularizer

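$$g(\boldsymbol{U},\boldsymbol{V})=\frac{\mu}{4}\left\|\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}-\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}\right\|_{F}^{2} \tag{4}$$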

and solving the following problem

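$$\operatorname*{minimize}_{\boldsymbol{U}\in\mathbb{R}^{n\times r},\,\boldsymbol{V}\in\mathbb{R}^{m\times r}} \ \rho(\boldsymbol{U},\boldsymbol{V}):=f(\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}})+g(\boldsymbol{U},\boldsymbol{V}), \tag{5}$$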

where $\mu>0$ controls the weight for the term $\left\|\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}-\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}\right\|_{F}^{2}$ , which will be discussed soon.

We remark that $\boldsymbol{W}^{\star}$ is still a global minimizer of the factored problem (5) since $f(\boldsymbol{X})$ achieves its global minimum over the low-rank set of matrices at $\boldsymbol{X}^{\star}$ and $g(\boldsymbol{W})$ also achieves its global minimum at $\boldsymbol{W}^{\star}$ . The regularizer $g(\boldsymbol{W})$ is applied to force the difference between the two Gram matrices of $\boldsymbol{U}$ and $\boldsymbol{V}$ to be as small as possible. The global minimum of $g(\boldsymbol{W})$ is $0$ , which is achieved when $\boldsymbol{U}$ and $\boldsymbol{V}$ have the same Gram matrices, i.e., when $\boldsymbol{W}$ belongs to

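$$\mathcal{E}:=\left\{\boldsymbol{W}=\begin{bmatrix}\boldsymbol{U}\\ \boldsymbol{V}\end{bmatrix}:\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}-\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}={\bf 0}\right\}. \tag{6}$$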

Informally, we can view (5) as finding a point from $\mathcal{E}$ that also minimizes $f(\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}})$ . This is formally established in Theorem 3.
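To make the role of the regularizer concrete, the sketch below builds the balanced factors $\boldsymbol{U}^{\star},\boldsymbol{V}^{\star}$ from an SVD and compares them with a rescaled pair that has the same product. The matrix sizes, the random seed, the scaling factor `c`, and the helper name `regularizer` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def regularizer(U, V, mu):
    """g(U, V) = (mu / 4) * ||U^T U - V^T V||_F^2."""
    D = U.T @ U - V.T @ V
    return 0.25 * mu * np.linalg.norm(D, "fro") ** 2

rng = np.random.default_rng(0)
n, m, r = 6, 4, 2
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r target

# Balanced factorization from a truncated SVD: U* = Q_U Sigma^{1/2}, V* = Q_V Sigma^{1/2}.
Q_U, sigma, Q_Vt = np.linalg.svd(X_star, full_matrices=False)
U_star = Q_U[:, :r] * np.sqrt(sigma[:r])
V_star = Q_Vt[:r, :].T * np.sqrt(sigma[:r])

# Rescaling the factors keeps the product U V^T but destroys the balance.
c = 3.0
U_scaled, V_scaled = c * U_star, V_star / c

g_balanced = regularizer(U_star, V_star, mu=1.0 / 16.0)
g_scaled = regularizer(U_scaled, V_scaled, mu=1.0 / 16.0)
print(g_balanced, g_scaled)
```

Both pairs reproduce $\boldsymbol{X}^{\star}$, but only the balanced pair lies in $\mathcal{E}$ and makes $g$ vanish (up to rounding), which is how the regularizer removes the scaling ambiguity.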

III-B Main Results

Our main argument is that, under certain conditions on $f(\boldsymbol{X})$ , the objective function $\rho(\boldsymbol{W})$ has no spurious local minima and satisfies the strict saddle property. This is equivalent to categorizing all the critical points into two types: 1) the global minima which correspond to the global solution of the original problem (1) and 2) strict saddles such that the Hessian matrix $\nabla^{2}\rho(\boldsymbol{W})$ evaluated at these points has a strictly negative eigenvalue. We formally establish this in the following theorem, whose proof is given in the next section.

Theorem 3.

For any $\mu>0$ , each critical point $\boldsymbol{W}=\begin{bmatrix}\boldsymbol{U}\\ \boldsymbol{V}\end{bmatrix}$ of $\rho(\boldsymbol{W})$ defined in (5) satisfies

$$\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}-\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}={\bf 0}. \tag{7}$$

Furthermore, suppose that the function $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) with positive constants $\alpha$ and $\beta$ satisfying $\frac{\beta}{\alpha}\leq 1.5$ and that the function $f(\boldsymbol{X})$ has a critical point $\boldsymbol{X}^{\star}\in\mathbb{R}^{n\times m}$ with $\operatorname{rank}(\boldsymbol{X}^{\star})=r^{\star}\leq r$ . Set $\mu\leq\frac{\alpha}{16}$ for the factored problem (5). Then $\rho(\boldsymbol{W})$ has no spurious local minima, i.e., any local minimum of $\rho(\boldsymbol{W})$ is a global minimum corresponding to the global solution of the original problem (1): $\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}=\boldsymbol{X}^{\star}.$ In addition, $\rho(\boldsymbol{W})$ obeys the strict saddle property that any critical point not being a local minimum is a strict saddle with

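$$\lambda_{\min}(\nabla^{2}\rho(\boldsymbol{W}))\leq\begin{cases}-0.08\,\alpha\,\sigma_{r}(\boldsymbol{X}^{\star}),&r=r^{\star},\\ -0.05\,\alpha\cdot\min\{\sigma_{r^{c}}^{2}(\boldsymbol{W}),\,2\sigma_{r^{\star}}(\boldsymbol{X}^{\star})\},&r>r^{\star},\\ -0.1\,\alpha\,\sigma_{r^{\star}}(\boldsymbol{X}^{\star}),&r^{c}=0,\end{cases} \tag{8}$$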

where $r^{c}\leq r$ is the rank of $\boldsymbol{W}$ , $\lambda_{\min}(\cdot)$ represents the smallest eigenvalue, and $\sigma_{\ell}(\cdot)$ denotes the $\ell$ -th largest singular value.

Remark 4.

Equation (7) shows that any critical point $\boldsymbol{W}$ belongs to $\mathcal{E}$ for the objective function in the factored problem (5) with any positive $\mu$ . This demonstrates the reason for adding the regularizer $g(\boldsymbol{U},\boldsymbol{V})$ . Thus, any iterative optimization algorithm converging to some critical point of $\rho(\boldsymbol{W})$ results in a solution within $\mathcal{E}$ . Furthermore, the strict saddle property along with the lack of spurious local minima ensures that a number of iterative optimization algorithms find the global minimum.
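As an informal illustration of this remark, the sketch below runs plain gradient descent on $\rho(\boldsymbol{U},\boldsymbol{V})$ from a small random initialization. It assumes the simplest objective satisfying (3), namely $f(\boldsymbol{X})=\frac{1}{2}\|\boldsymbol{X}-\boldsymbol{X}^{\star}\|_{F}^{2}$ with $\alpha=\beta=1$; the problem sizes, singular values, step size, and iteration count are hand-picked assumptions and do not come from the paper. The gradients used are $\nabla_{\boldsymbol{U}}\rho=\nabla f(\boldsymbol{X})\boldsymbol{V}+\mu\boldsymbol{U}(\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}-\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V})$ and $\nabla_{\boldsymbol{V}}\rho=\nabla f(\boldsymbol{X})^{\mathrm{T}}\boldsymbol{U}-\mu\boldsymbol{V}(\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}-\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 5, 4, 2

# Rank-r target with singular values 1 and 0.5 (illustrative choice).
Q_U, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q_V, _ = np.linalg.qr(rng.standard_normal((m, r)))
X_star = Q_U @ np.diag([1.0, 0.5]) @ Q_V.T

mu = 1.0 / 16.0   # alpha = 1 for this f, so mu <= alpha / 16
step = 0.1        # hand-picked step size (assumption)
U = 0.1 * rng.standard_normal((n, r))   # small random initialization
V = 0.1 * rng.standard_normal((m, r))

for _ in range(10000):
    R = U @ V.T - X_star          # gradient of f at X = U V^T
    D = U.T @ U - V.T @ V         # imbalance between the two Gram matrices
    grad_U = R @ V + mu * U @ D
    grad_V = R.T @ U - mu * V @ D
    U, V = U - step * grad_U, V - step * grad_V

rel_err = np.linalg.norm(U @ V.T - X_star) / np.linalg.norm(X_star)
imbalance = np.linalg.norm(U.T @ U - V.T @ V)
print(rel_err, imbalance)
```

In this toy setting the iterates end up both close to $\boldsymbol{X}^{\star}$ and close to the set $\mathcal{E}$, which is the behavior the remark describes. A single run is only an illustration and does not replace the guarantees cited in Theorem 2.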

Remark 5.

For any critical point $\boldsymbol{W}\in\mathbb{R}^{(n+m)\times r}$ that is not a local minimum, the right hand side of (8) is strictly negative, implying that $\boldsymbol{W}$ is a strict saddle. We also note that Theorem 3 not only covers exact parameterization where $r=r^{\star}$ , but also includes the over-parameterization case where $r>r^{\star}$ .

Remark 6.

The constants appearing in Theorem 3 are not optimized. We use $\mu\leq\frac{1}{16}\alpha$ simply to include $\mu=\frac{1}{16}$ which is utilized for the matrix sensing problem in [30]. If the ratio between the restricted strong convexity and smoothness constants $\frac{\beta}{\alpha}\leq 1.4$ , then we can show that $\rho(\boldsymbol{W})$ has no spurious local minima and obeys the strict saddle property for any $\mu\leq\frac{1}{4}\alpha$ (where $\mu=\frac{1}{4}$ is utilized for the matrix sensing problem in [19]). In all cases, a smaller $\mu$ yields a more negative constant in (8); see Section IV for more discussion on this. This implies that when the restricted strong convexity constant $\alpha$ is not provided a priori, one can always choose a small $\mu$ to ensure the strict saddle property holds, and hence guarantee the global convergence of many iterative optimization algorithms.

The constant $1.5$ for the dynamic range $\frac{\beta}{\alpha}$ in Theorem 3 is also not optimized and it is possible to slightly relax this constraint with more sophisticated analysis. However, the following example involving weighted symmetric matrix factorization implies that the room for improving this constant is rather limited. Let

$$\boldsymbol{\Omega}=\begin{bmatrix}\sqrt{1+a}&1\\ 1&\sqrt{1+a}\end{bmatrix}$$

for some $a\geq 0$ ,

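$$\boldsymbol{X}^{\star}=\begin{bmatrix}1&1\\ 1&1\end{bmatrix},\quad\text{and}\quad\boldsymbol{U}=\begin{bmatrix}x\\ y\end{bmatrix}.$$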

Now consider the following weighted low-rank matrix factorization:

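$$h(\boldsymbol{U})=\frac{1}{2}\left\|\boldsymbol{\Omega}\odot(\boldsymbol{U}\boldsymbol{U}^{\mathrm{T}}-\boldsymbol{X}^{\star})\right\|_{F}^{2}=\frac{1+a}{2}(x^{2}-1)^{2}+\frac{1+a}{2}(y^{2}-1)^{2}+(xy-1)^{2}, \tag{9}$$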

whose gradient $\nabla h(\boldsymbol{U})$ and Hessian $\nabla^{2}h(\boldsymbol{U})$ are given by:

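$$\nabla h(\boldsymbol{U})=2\begin{bmatrix}(a+1)(x^{2}-1)x+y(xy-1)\\ (a+1)(y^{2}-1)y+x(xy-1)\end{bmatrix},$$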

and

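$$\nabla^{2}h(\boldsymbol{U})=2\begin{bmatrix}y^{2}+(3x^{2}-1)(a+1)&2xy-1\\ 2xy-1&x^{2}+(3y^{2}-1)(a+1)\end{bmatrix}.$$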

Then,

$$\boldsymbol{U}=\begin{bmatrix}\sqrt{\frac{a}{a+2}}\\ -\sqrt{\frac{a}{a+2}}\end{bmatrix}$$

is a critical point with

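$$\nabla^{2}h(\boldsymbol{U})=\begin{bmatrix}4a+\frac{8}{a+2}-6&\frac{8}{a+2}-6\\ \frac{8}{a+2}-6&4a+\frac{8}{a+2}-6\end{bmatrix},$$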

which has eigenvalues

$$\lambda_{1}=\frac{4(a-2)(a+1)}{a+2}\ \begin{cases}<0,&a\in[0,2),\\ >0,&a>2,\end{cases}$$

and $\lambda_{2}=4a>0$ . We conclude that this $\boldsymbol{U}$ is a strict saddle point when $a<2$ and a spurious local minimum when $a>2$ . This weighted symmetric matrix factorization problem (9) satisfies the restricted strong convexity and smoothness condition (3) with constants $\alpha=\|\boldsymbol{\Omega}\|_{\min}^{2}=1$ and $\beta=\|\boldsymbol{\Omega}\|_{\max}^{2}=1+a$ (where $\|\boldsymbol{\Omega}\|_{\min}$ and $\|\boldsymbol{\Omega}\|_{\max}$ represent the smallest and largest entries in $\boldsymbol{\Omega}$ ; see Section III-C). Thus, we have a counterexample which demonstrates the existence of spurious local minima when $\frac{\beta}{\alpha}>3$ .
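The counterexample can be checked numerically. The sketch below assumes the critical point $x=-y=\sqrt{a/(a+2)}$, which is what setting the gradient of $h(\boldsymbol{U})=\frac{1+a}{2}(x^{2}-1)^{2}+\frac{1+a}{2}(y^{2}-1)^{2}+(xy-1)^{2}$ to zero along $y=-x$ yields; the function names and the two test values of $a$ are illustrative choices.

```python
import numpy as np

def h(x, y, a):
    return 0.5 * (1 + a) * ((x**2 - 1) ** 2 + (y**2 - 1) ** 2) + (x * y - 1) ** 2

def grad_h(x, y, a):
    return 2.0 * np.array([
        (a + 1) * (x**2 - 1) * x + y * (x * y - 1),
        (a + 1) * (y**2 - 1) * y + x * (x * y - 1),
    ])

def hess_h(x, y, a):
    return 2.0 * np.array([
        [y**2 + (3 * x**2 - 1) * (a + 1), 2 * x * y - 1],
        [2 * x * y - 1, x**2 + (3 * y**2 - 1) * (a + 1)],
    ])

results = {}
for a in (1.0, 3.0):                      # one value below and one above a = 2
    t = np.sqrt(a / (a + 2))
    x, y = t, -t                          # candidate critical point
    results[a] = {
        "grad_norm": np.linalg.norm(grad_h(x, y, a)),
        "eigs": np.linalg.eigvalsh(hess_h(x, y, a)),   # ascending order
        "value": h(x, y, a),
    }
    print(a, results[a])
```

For $a=1$ the smaller eigenvalue equals $4(a-2)(a+1)/(a+2)=-8/3$, so the point is a strict saddle. For $a=3$ both eigenvalues are positive ($16/5$ and $12$) while $h$ is strictly larger than the global minimum value $h=0$ attained at $x=y=1$, so the point is a spurious local minimum.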

Remark 7.

We finally remark that although Theorem 3 requires the additional regularizer (4), empirical evidence (see experiments in Section V) shows we can get rid of this regularizer for many iterative algorithms with random initialization.

We prove Theorem 3 in Section IV. Before proceeding, we present three stylized applications of Theorem 3: matrix sensing, weighted low-rank matrix factorization, and 1-bit matrix completion.

III-C Stylized Applications

III-C1 Matrix Sensing

We first consider the implication of Theorem 3 in the matrix sensing problem where

[TABLE]

Here $\mathcal{A}:\mathbb{R}^{n\times m}\rightarrow\mathbb{R}^{p}$ is a known measurement operator satisfying the following restricted isometry property.

Definition 4.

(Restricted Isometry Property (RIP) [7]) The map $\mathcal{A}:\mathbb{R}^{n\times m}\rightarrow\mathbb{R}^{p}$ satisfies the $r$ -RIP with constant $\delta_{r}$ if

[TABLE]

holds for any $n\times m$ matrix $\boldsymbol{X}$ with $\operatorname{rank}(\boldsymbol{X})\leq r$ .

Note that, in this case, the gradient of $f(\boldsymbol{X})$ at $\boldsymbol{X}^{\star}$ is

[TABLE]

which implies that $\boldsymbol{X}^{\star}$ is a critical point of $f(\boldsymbol{X})$ . The Hessian quadratic form $\nabla^{2}f(\boldsymbol{X})[\boldsymbol{Y},\boldsymbol{Y}]$ for any $n\times m$ matrices $\boldsymbol{X}$ and $\boldsymbol{Y}$ is given by

[TABLE]

If $\mathcal{A}$ satisfies the $4r$ -restricted isometry property with constant $\delta_{4r}$ , then $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) with constants $\alpha=1-\delta_{4r}$ and $\beta=1+\delta_{4r}$ since

[TABLE]

for any rank- $4r$ matrix $\boldsymbol{Y}$ . Now, applying Theorem 3, we can characterize the geometry for the following matrix sensing problem with the factorization approach:

[TABLE]

where $g(\boldsymbol{U},\boldsymbol{V})$ is the added regularizer defined in (4).

Corollary 1.

Suppose $\mathcal{A}$ satisfies the $4r$ -RIP with constant $\delta_{4r}\leq\frac{1}{5}$ , and set $\mu\leq\frac{1-\delta_{4r}}{16}$ . Then the objective function in (11) has no spurious local minima and satisfies the strict saddle property.

This result follows directly from Theorem 3 by noting that $\frac{\beta}{\alpha}=\frac{1+\delta_{4r}}{1-\delta_{4r}}\leq 1.5$ if $\delta_{4r}\leq\frac{1}{5}$ . We remark that Park et al. [19, Theorem 4.3] provided a similar geometric result for (11). Compared to their result which requires $\delta_{4r}\leq\frac{1}{100}$ , our result has a much weaker requirement on the RIP of the measurement operator.
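The arithmetic behind Corollary 1 is easy to tabulate. The following sketch maps an RIP constant to the constants $\alpha=1-\delta_{4r}$, $\beta=1+\delta_{4r}$, the ratio $\beta/\alpha$, and the largest regularization weight $\mu=\alpha/16$ allowed by the corollary. The function name and the returned dictionary layout are illustrative choices, not part of the paper.

```python
def corollary1_parameters(delta_4r):
    """Constants implied by a 4r-RIP constant delta_4r, with 0 <= delta_4r < 1."""
    if not 0.0 <= delta_4r < 1.0:
        raise ValueError("delta_4r must lie in [0, 1)")
    alpha = 1.0 - delta_4r          # restricted strong convexity constant
    beta = 1.0 + delta_4r           # restricted strong smoothness constant
    return {
        "alpha": alpha,
        "beta": beta,
        "ratio": beta / alpha,       # must be at most 1.5 for Theorem 3
        "mu_max": alpha / 16.0,      # largest mu allowed by Corollary 1
        "applies": delta_4r <= 0.2,  # equivalent to ratio <= 1.5
    }

for delta in (0.01, 0.1, 0.2, 0.3):
    print(delta, corollary1_parameters(delta))
```

The condition is checked on $\delta_{4r}$ directly rather than on the computed ratio, which avoids a floating-point comparison exactly at the boundary value $1.5$.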

III-C2 Weighted Low-Rank Matrix Factorization

We now consider the implication of Theorem 3 in the weighted matrix factorization problem [44], where

[TABLE]

Here $\boldsymbol{\Omega}$ is an $n\times m$ weight matrix consisting of positive elements and $\circ$ denotes the point-wise product between two matrices. In this case, the gradient of $f(\boldsymbol{X})$ at $\boldsymbol{X}^{\star}$ is

[TABLE]

which implies that $\boldsymbol{X}^{\star}$ is a critical point of $f(\boldsymbol{X})$ . The Hessian quadratic form $\nabla^{2}f(\boldsymbol{X})[\boldsymbol{Y},\boldsymbol{Y}]$ for any $n\times m$ matrices $\boldsymbol{X}$ and $\boldsymbol{Y}$ is given by

[TABLE]

Thus $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) with constants $\alpha=\|\boldsymbol{\Omega}\|_{\min}^{2}$ and $\beta=\|\boldsymbol{\Omega}\|_{\max}^{2}$ since

[TABLE]

where $\|\boldsymbol{\Omega}\|_{\min}$ and $\|\boldsymbol{\Omega}\|_{\max}$ represent the smallest and largest entries in $\boldsymbol{\Omega}$ , respectively. Now we consider the following weighted matrix factorization problem:

[TABLE]

where $g(\boldsymbol{U},\boldsymbol{V})$ is the added regularizer defined in (4). For an arbitrary weight matrix $\boldsymbol{\Omega}$ , it is proven that the weighted low-rank factorization can be NP-hard [45] and has spurious local minima. When the elements in the weight matrix $\boldsymbol{\Omega}$ are concentrated, it is expected that (12) can be efficiently solved by a number of iterative optimization algorithms as it is close to an (unweighted) matrix factorization problem (where $\boldsymbol{\Omega}$ is a matrix of ones) which obeys the strict saddle property [40]. The following result characterizes the geometric structure in the objective function of (12) by directly applying Theorem 3.

Corollary 2.

Suppose $\boldsymbol{\Omega}$ satisfies $\frac{\|\boldsymbol{\Omega}\|_{\max}^{2}}{\|\boldsymbol{\Omega}\|_{\min}^{2}}\leq 1.5$ . Set $\mu\leq\frac{\|\boldsymbol{\Omega}\|_{\min}^{2}}{16}$ . Then the objective function in (12) has no spurious local minima and satisfies the strict saddle property.

III-C3 1-bit Matrix Completion

Finally, we consider the problem of completing a low-rank matrix from a subset of 1-bit measurements [26]. Given $\boldsymbol{X}^{\diamond}\in\mathbb{R}^{n\times m}$ , a subset of indices $\Omega\subset[n]\times[m]$ , and a differentiable function $q:\mathbb{R}\rightarrow[0,1]$ , we observe

[TABLE]

for all $(i,j)\in\Omega$ . Typical choices for $q$ include the logistic regression model where $q(x)=\frac{e^{x}}{1+e^{x}}$ and the probit regression model where $q(x)=1-\Phi(-x/\sigma)=\Phi(x/\sigma)$ . Here $\Phi$ is the cumulative distribution function (CDF) of a standard Gaussian distribution, so the probit model corresponds to additive mean-zero Gaussian noise with variance $\sigma^{2}$ . In [26], the authors attempt to recover $\boldsymbol{X}^{\diamond}$ from the incomplete nonlinear measurements $\{Y_{ij}\}_{(i,j)\in\Omega}$ by minimizing the negative log-likelihood function

[TABLE]

which results in a maximum likelihood (ML) estimate.

We note that $F_{\Omega,\boldsymbol{Y}}$ is a convex function for both the logistic model and the probit model. The following result also establishes that $F_{\Omega,\boldsymbol{Y}}$ satisfies the restricted strong convexity and smoothness condition if we observe full 1-bit measurements, i.e., $\Omega=[n]\times[m]$ .

Lemma 1.

Suppose $\Omega=[n]\times[m]$ . Let

[TABLE]

and

[TABLE]

Then $F_{\Omega,\boldsymbol{Y}}$ satisfies the restricted strong convexity and smoothness condition:

[TABLE]

for any $\boldsymbol{G}\in\mathbb{R}^{n\times m}$ and $\|\boldsymbol{X}\|_{\infty}\leq\gamma$ .

The proof of Lemma 1 is given in Appendix A. Now we consider the logistic regression model where $q(x)=\frac{e^{x}}{1+e^{x}}$ .

Corollary 3.

Suppose $\Omega=[n]\times[m]$ and $\gamma\leq 1.3$ . Consider the logistic regression model where $q(x)=\frac{e^{x}}{1+e^{x}}$ . Then $F_{\Omega,\boldsymbol{Y}}$ satisfies the restricted strong convexity and smoothness condition with

[TABLE]

Proof of Corollary 3.

Applying Lemma 1 with direct calculation gives

[TABLE]

where $q^{\prime}(x)=\frac{e^{x}}{(1+e^{x})^{2}}$ . Now if we restrict $\|\boldsymbol{X}\|_{\infty}\leq 1.3$ , we have

[TABLE]

∎
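The role of the threshold $\gamma\leq 1.3$ can be seen with a short calculation. The sketch below assumes that, for the logistic model, the per-entry second derivative of the negative log-likelihood equals $q^{\prime}(x)=q(x)(1-q(x))$, so that the constants over $|x|\leq\gamma$ are $\alpha=q^{\prime}(\gamma)$ and $\beta=q^{\prime}(0)=\frac{1}{4}$. This reading of Lemma 1 and the function names are assumptions made for illustration.

```python
import math

def logistic_derivative(x):
    # q'(x) for q(x) = e^x / (1 + e^x), written as q(x) * (1 - q(x)).
    q = 1.0 / (1.0 + math.exp(-x))
    return q * (1.0 - q)

def logistic_constants(gamma):
    # ASSUMED constants: q' is even and decreasing in |x|, so its minimum over
    # |x| <= gamma is q'(gamma) and its maximum is q'(0) = 1/4.
    alpha = logistic_derivative(gamma)
    beta = 0.25
    return alpha, beta, beta / alpha

for gamma in (1.0, 1.3, 1.4):
    alpha, beta, ratio = logistic_constants(gamma)
    print(f"gamma={gamma}: alpha={alpha:.4f}, beta={beta}, beta/alpha={ratio:.4f}")
```

Under this assumption the ratio $\beta/\alpha$ stays below $1.5$ at $\gamma=1.3$ and exceeds it at $\gamma=1.4$, which matches the requirement of Theorem 3.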

Under the assumption that $\boldsymbol{X}^{\diamond}$ is low-rank, a nuclear norm constraint is utilized in [26] to force a low-rank solution. Corollary 3 implies that we can apply matrix factorization for 1-bit matrix recovery given that the elements of $\boldsymbol{X}$ are bounded. For the setting where $\Omega$ is only a subset of $[n]\times[m]$ , [46] considered the 1-bit matrix completion problem with the rank constraint and established a stronger statistical recovery guarantee than that in [26]. Empirical evidence (see [46] and Section V-C) supports that matrix factorization also works for 1-bit matrix completion.

IV Proof of Theorem 3

In this section, we provide a formal proof of Theorem 3. The main argument involves showing that each critical point of $\rho(\boldsymbol{W})$ either corresponds to the global solution of (1) or is a strict saddle whose Hessian $\nabla^{2}\rho(\boldsymbol{W})$ has a strictly negative eigenvalue. Specifically, we show that $\boldsymbol{W}$ is a strict saddle by arguing that the Hessian $\nabla^{2}\rho(\boldsymbol{W})$ has a strictly negative curvature along $\boldsymbol{\Delta}:=\boldsymbol{W}-\boldsymbol{W}^{\star}\boldsymbol{R}$ , i.e., $[\nabla^{2}\rho(\boldsymbol{W})](\boldsymbol{\Delta},\boldsymbol{\Delta})\leq-\tau\|\boldsymbol{\Delta}\|_{F}^{2}$ for some $\tau>0$ . Here $\boldsymbol{R}$ is an $r\times r$ orthonormal matrix such that the distance between $\boldsymbol{W}$ and $\boldsymbol{W}^{\star}$ rotated through $\boldsymbol{R}$ is as small as possible.
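The rotation $\boldsymbol{R}$ that best aligns $\boldsymbol{W}^{\star}$ with $\boldsymbol{W}$ is the solution of an orthogonal Procrustes problem, which has a closed form via the SVD of $\boldsymbol{W}^{\star\mathrm{T}}\boldsymbol{W}$. A minimal sketch follows; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def nearest_rotation(w, w_star):
    """Orthogonal R minimizing ||w - w_star @ R||_F (orthogonal Procrustes)."""
    # Minimizing the distance is equivalent to maximizing trace(R^T w_star^T w).
    a, _, bt = np.linalg.svd(w_star.T @ w)
    return a @ bt

def aligned_difference(w, w_star):
    # Delta = W - W* R, the direction used in the curvature argument.
    return w - w_star @ nearest_rotation(w, w_star)

rng = np.random.default_rng(0)
w_star = rng.standard_normal((8, 3))
w = rng.standard_normal((8, 3))
delta = aligned_difference(w, w_star)
print("aligned distance:", np.linalg.norm(delta))
print("unaligned distance:", np.linalg.norm(w - w_star))
```

Because the identity matrix is itself orthogonal, the aligned distance can never exceed the unaligned one.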

IV-A Supporting Results

We first present some useful results. The $(2r,4r)$ -restricted strong convexity and smoothness assumption (3) implies the following isometry property, whose proof is given in Appendix B.

Proposition 2.

Suppose the function $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) with positive $\alpha$ and $\beta$ . Then for any $n\times m$ matrices $\boldsymbol{Z},\boldsymbol{G},\boldsymbol{H}$ of rank at most $2r$ , we have

[TABLE]

The following result provides an upper bound on the energy of the difference $\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}-\boldsymbol{W}^{\star}\boldsymbol{W}^{\star\mathrm{T}}$ when projected onto the column space of $\boldsymbol{W}$ . Its proof is given in Appendix C.

Lemma 2.

Suppose $f(\boldsymbol{X})$ satisfies the $(2r,4r)$ -restricted strong convexity and smoothness condition (3). For any critical point $\boldsymbol{W}$ of (5), let $\boldsymbol{P}_{\boldsymbol{W}}\in\mathbb{R}^{(m+n)\times(m+n)}$ be the orthogonal projector onto the column space of $\boldsymbol{W}$ . Then

[TABLE]

We remark that Lemma 2 is a variant of [19, Lemma 3.2]. While the result there requires the $4r$ -RIP condition of the objective function, our result depends on the $(2r,4r)$ -restricted strong convexity and smoothness condition. Our result is also slightly tighter than [19, Lemma 3.2].

In addition, for any matrices $\boldsymbol{C},\boldsymbol{D}\in\mathbb{R}^{n\times r}$ , the following result relates the distance between $\boldsymbol{C}\boldsymbol{C}^{\mathrm{T}}$ and $\boldsymbol{D}\boldsymbol{D}^{\mathrm{T}}$ to the distance between $\boldsymbol{C}$ and $\boldsymbol{D}$ .

Lemma 3.

For any matrices $\boldsymbol{C},\boldsymbol{D}\in\mathbb{R}^{n\times r}$ with ranks $r_{1}$ and $r_{2}$ , respectively, let $\boldsymbol{R}=\operatorname*{\text{arg\leavevmode\nobreak\ min}}_{\boldsymbol{R}^{\prime}\in\mathcal{O}_{r}}\|\boldsymbol{C}-\boldsymbol{D}\boldsymbol{R}^{\prime}\|_{F}$ . Then

[TABLE]

If $\boldsymbol{C}={\bf 0}$ , then we have

[TABLE]

We present one more useful result in the following lemma.

Lemma 4.

[29, Lemma 3] For any matrices $\boldsymbol{C},\boldsymbol{D}\in\mathbb{R}^{n\times r}$ , let $\boldsymbol{P}_{\boldsymbol{C}}$ be the orthogonal projector onto the range of $\boldsymbol{C}$ . Let $\boldsymbol{R}=\operatorname*{\text{arg\leavevmode\nobreak\ min}}_{\boldsymbol{R}^{\prime}\in\mathcal{O}_{r}}\|\boldsymbol{C}-\boldsymbol{D}\boldsymbol{R}^{\prime}\|_{F}$ . Then

[TABLE]

Finally, we provide the gradient and Hessian expressions for $\rho(\boldsymbol{W})$ . The gradient of $\rho(\boldsymbol{W})$ is given by

[TABLE]

Standard computations give the Hessian quadratic form $[\nabla^{2}\rho(\boldsymbol{W})](\boldsymbol{\Delta},\boldsymbol{\Delta})$ for any $\boldsymbol{\Delta}=\begin{bmatrix}\boldsymbol{\Delta}_{\boldsymbol{U}}\\ \boldsymbol{\Delta}_{\boldsymbol{V}}\end{bmatrix}$ where $\boldsymbol{\Delta}_{\boldsymbol{U}}\in\mathbb{R}^{n\times r},\boldsymbol{\Delta}_{\boldsymbol{V}}\in\mathbb{R}^{m\times r}$ :

[TABLE]

where

[TABLE]

IV-B The Formal Proof

Proof of Theorem 3.

Any critical point $\boldsymbol{W}$ of $\rho(\boldsymbol{W})$ satisfies $\nabla\rho(\boldsymbol{W})={\bf 0}$ , i.e.,

[TABLE]

By (17), we obtain

[TABLE]

Multiplying (16) by $\boldsymbol{U}^{\mathrm{T}}$ and plugging in the expression for $\boldsymbol{U}^{\mathrm{T}}\nabla f(\boldsymbol{X})$ from the above equation gives

[TABLE]

which further implies

[TABLE]

Note that $\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}$ and $\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}$ are the principal square roots (i.e., PSD square roots) of $\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}$ and $\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}$ , respectively. Utilizing the result that a PSD matrix has a unique principal square root [47], we obtain

[TABLE]

Thus, we can simplify (16) and (17) by

[TABLE]

Now we turn to prove the strict saddle property and that there are no spurious local minima.

First, note that as guaranteed by Proposition 1, $\boldsymbol{X}^{\star}$ is the unique global minimizer of $f(\boldsymbol{X})$ among $n\times m$ matrices with rank at most $r$ . Also the gradient of $f(\boldsymbol{X})$ vanishes at $\boldsymbol{X}^{\star}$ since (1) is an unconstrained optimization problem. Denote the set of critical points of $\rho(\boldsymbol{W})$ by

[TABLE]

We separate $\mathcal{C}$ into two subsets:

[TABLE]

satisfying $\mathcal{C}=\mathcal{C}_{1}\cup\mathcal{C}_{2}$ . Since any critical point $\boldsymbol{W}$ satisfies (18), $g(\boldsymbol{W})$ achieves its global minimum at $\boldsymbol{W}$ . Also $f(\boldsymbol{X})$ achieves its global minimum at $\boldsymbol{X}^{\star}$ . We conclude that $\boldsymbol{W}$ is the globally optimal solution of $\rho$ for any $\boldsymbol{W}\in\mathcal{C}_{1}$ . If we show that any $\boldsymbol{W}\in\mathcal{C}_{2}$ is a strict saddle, then we prove that there are no spurious local minima as well as the strict saddle property. Thus, the remaining part is to show that $\mathcal{C}_{2}$ is the set of strict saddles.

To show that $\mathcal{C}_{2}$ is the set of strict saddles, it is sufficient to find a direction $\boldsymbol{\Delta}$ along which the Hessian has a strictly negative curvature for each of these points. We construct $\boldsymbol{\Delta}=\boldsymbol{W}-\boldsymbol{W}^{\star}\boldsymbol{R}$ , the difference from $\boldsymbol{W}$ to its nearest global factor $\boldsymbol{W}^{\star}$ , where

[TABLE]

Such $\boldsymbol{\Delta}$ satisfies $\boldsymbol{\Delta}\neq{\bf 0}$ because $\boldsymbol{X}\neq\boldsymbol{X}^{\star}$ implies $\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}\neq\boldsymbol{W}^{\star}\boldsymbol{W}^{\star\mathrm{T}}$ . We then evaluate the Hessian bilinear form along the direction $\boldsymbol{\Delta}$ :

[TABLE]

The following result (which is proved in Appendix E) states that $\Pi_{1}$ is strictly negative, while the remaining terms are relatively small, though they may be nonnegative:

[TABLE]

Now, substituting (22) into (21) gives

[TABLE]

where $(i)$ utilizes Lemmas 2 and 4, $(ii)$ utilizes the following inequality (which is proved in Appendix F)

[TABLE]

and $(iii)$ holds because $\frac{\beta}{\alpha}\leq 1.5$ and $\mu\leq\frac{1}{16}\alpha$ . Thus, if $\boldsymbol{X}\neq\boldsymbol{X}^{\star}$ , $\left[\nabla^{2}\rho(\boldsymbol{W})\right](\boldsymbol{\Delta},\boldsymbol{\Delta})$ is always negative. This implies that $\boldsymbol{W}$ is a strict saddle.

To complete the proof, we utilize Lemma 3 to further bound the last term in (23):

[TABLE]

where $r^{c}$ is the rank of $\boldsymbol{W}$ , the first inequality utilizes (24), and the second inequality follows from Lemma 3. We complete the proof of Theorem 3 by noting that $\sigma_{\ell}^{2}(\boldsymbol{W}^{\star})=2\sigma_{\ell}(\boldsymbol{X}^{\star})$ for all $\ell\in\{1,\ldots,r^{\star}\}$ since

[TABLE]

is an SVD of $\boldsymbol{W}^{\star}$ , where we recall that $\boldsymbol{X}^{\star}=\boldsymbol{Q}_{\boldsymbol{U}^{\star}}\boldsymbol{\Sigma}^{\star}\boldsymbol{Q}_{\boldsymbol{V}^{\star}}^{\mathrm{T}}$ is an SVD of $\boldsymbol{X}^{\star}$ . ∎
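The identity $\sigma_{\ell}^{2}(\boldsymbol{W}^{\star})=2\sigma_{\ell}(\boldsymbol{X}^{\star})$ can be confirmed numerically. The sketch below assumes the balanced factors $\boldsymbol{U}^{\star}=\boldsymbol{Q}_{\boldsymbol{U}^{\star}}\boldsymbol{\Sigma}^{\star 1/2}$ and $\boldsymbol{V}^{\star}=\boldsymbol{Q}_{\boldsymbol{V}^{\star}}\boldsymbol{\Sigma}^{\star 1/2}$ stacked as $\boldsymbol{W}^{\star}=\begin{bmatrix}\boldsymbol{U}^{\star}\\ \boldsymbol{V}^{\star}\end{bmatrix}$; this construction is taken as an assumption here, and the helper name is hypothetical.

```python
import numpy as np

def balanced_factors(x_star, r):
    # Split the top-r SVD of X* evenly: U* = Q_U sqrt(S), V* = Q_V sqrt(S).
    q_u, s, q_vt = np.linalg.svd(x_star, full_matrices=False)
    root = np.sqrt(s[:r])
    return q_u[:, :r] * root, q_vt[:r, :].T * root

rng = np.random.default_rng(0)
r = 3
x_star = rng.standard_normal((6, r)) @ rng.standard_normal((r, 5))  # rank-3 test matrix
u_star, v_star = balanced_factors(x_star, r)
w_star = np.vstack([u_star, v_star])

sv_w = np.linalg.svd(w_star, compute_uv=False)
sv_x = np.linalg.svd(x_star, compute_uv=False)[:r]
print("sigma(W*)^2  :", sv_w**2)
print("2 * sigma(X*):", 2.0 * sv_x)
```

The agreement follows from $\boldsymbol{W}^{\star\mathrm{T}}\boldsymbol{W}^{\star}=\boldsymbol{U}^{\star\mathrm{T}}\boldsymbol{U}^{\star}+\boldsymbol{V}^{\star\mathrm{T}}\boldsymbol{V}^{\star}=2\boldsymbol{\Sigma}^{\star}$ for balanced factors.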

Remark 8.

From (23), we observe that a smaller $\mu$ yields a more negative bound on $\left[\nabla^{2}\rho(\boldsymbol{W})\right](\boldsymbol{\Delta},\boldsymbol{\Delta})$ . This can be explained intuitively as follows. First note that any critical point $\boldsymbol{W}$ satisfies (18) provided $\mu>0$ , no matter how large or small $\mu$ is. The Hessian information about $g(\boldsymbol{W})$ is represented by the terms $\Pi_{3}$ and $\Pi_{4}$ . We have

[TABLE]

where the last line holds since for any $r\times r$ matrix $\boldsymbol{A}$ ,

[TABLE]

Thus the Hessian of $g$ evaluated at any critical point $\boldsymbol{W}$ is a PSD matrix instead of having a negative eigenvalue. (This can also be observed since any critical point $\boldsymbol{W}$ is a global minimum point of $g(\boldsymbol{W})$ , which directly indicates that $\nabla^{2}g(\boldsymbol{W})\succeq{\bf 0}$ .) In low-rank, PSD matrix optimization problems, the corresponding objective function (without any regularizer such as $g(\boldsymbol{W})$ ) is proved to have the strict saddle property [17, 29]. Therefore, $h(\boldsymbol{W})$ is also expected to have the strict saddle property, and so is $\rho(\boldsymbol{W})$ when $\mu$ is small, i.e., the Hessian of $g(\boldsymbol{W})$ has little influence on the Hessian of $\rho(\boldsymbol{W})$ when $\mu$ is small. Our results also indicate that when the restricted strong convexity constant $\alpha$ is not provided a priori, we can always choose a small $\mu$ to ensure the strict saddle property of $\rho(\boldsymbol{W})$ is met, and hence we are guaranteed the global convergence of a number of local search algorithms applied to (5).

V Experiments

In this section, we present a set of experiments on matrix sensing, matrix completion, and 1-bit matrix completion to demonstrate the performance of iterative algorithms for low-rank matrix optimization. Unless noted otherwise, we denote the matrix factorization approach by NVX and use the minFunc package (software available at https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html) to perform the local search algorithms for the factored problem.

V-A Matrix Sensing

We first present some experiments to illustrate the performance of local search algorithms for the matrix sensing problem with the factorization approach (11). In these experiments, we set $n=50$ , $m=50$ and vary the rank $r$ from $1$ to $19$ . We generate a rank- $r$ $n\times m$ random matrix $\boldsymbol{X}^{\star}$ by setting $\boldsymbol{X}^{\star}=\widetilde{\boldsymbol{U}}\widetilde{\boldsymbol{V}}^{\mathrm{T}}$ where $\widetilde{\boldsymbol{U}}$ and $\widetilde{\boldsymbol{V}}$ are respectively $n\times r$ and $m\times r$ matrices of normally distributed random numbers. We then obtain $p$ random measurements $\boldsymbol{y}=\mathcal{A}(\boldsymbol{X}^{\star})$ with

[TABLE]

where the entries of each $n\times m$ matrix $\boldsymbol{Y}_{i}$ are independent and identically distributed (i.i.d.) normal random variables with zero mean and variance $\frac{1}{p}$ for $i\in\{1,2,\ldots,p\}$ . For each pair of $r$ and the number of measurements $p$ , 10 Monte Carlo trials are carried out, and for each trial we declare matrix recovery to be successful if the relative reconstruction error satisfies

[TABLE]

where we denote by $\widehat{\boldsymbol{X}}$ the reconstructed matrix. Figure 1 displays the phase transition for factorized gradient descent starting from a random initialization, the singular value projection (SVP) method proposed in [48] which requires an SVD in each iteration, and the convex approach which solves

[TABLE]

We see that there are only negligible differences between the different approaches for matrix sensing; these approaches also have very similar performance guarantees when the Gaussian sensing operator $\mathcal{A}$ satisfies the RIP [11]. We note that with or without the regularizer $g$ as defined in (4), local search algorithms have similar performance with random initialization. Hence, throughout all of the experiments, we simply discard the regularizer $g$ , but we stress that identical performance is observed if we have this regularizer $g$ .
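A small-scale version of the factored approach, without the regularizer $g$, can be sketched as follows. This is not the minFunc-based setup used for the figures: it uses plain gradient descent with a fixed step size, a small random initialization, a smaller problem size, and a target normalized to unit spectral norm. All of these choices, along with the function names and default values, are assumptions made to keep the example short.

```python
import numpy as np

def factored_gradient_descent(a_mat, y, n, m, r, step=0.1, iters=3000, seed=0):
    """Minimize 0.5 * ||A(U V^T) - y||^2 over U (n x r) and V (m x r)."""
    rng = np.random.default_rng(seed)
    u = 0.1 * rng.standard_normal((n, r))   # small random initialization
    v = 0.1 * rng.standard_normal((m, r))
    for _ in range(iters):
        residual = a_mat @ (u @ v.T).ravel() - y
        grad_x = (a_mat.T @ residual).reshape(n, m)   # gradient of f at X = U V^T
        # Chain rule: grad_U = grad_x V, grad_V = grad_x^T U (simultaneous update).
        u, v = u - step * grad_x @ v, v - step * grad_x.T @ u
    return u @ v.T

def demo(n=20, m=20, r=2, p=360, seed=1):
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal((n, r)) @ rng.standard_normal((m, r)).T
    x_star /= np.linalg.norm(x_star, 2)     # unit spectral norm (assumption)
    # Each row is a vectorized Gaussian sensing matrix with entry variance 1/p.
    a_mat = rng.standard_normal((p, n * m)) / np.sqrt(p)
    y = a_mat @ x_star.ravel()
    x_hat = factored_gradient_descent(a_mat, y, n, m, r)
    return np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star)

print("relative reconstruction error:", demo())
```

The measurement operator is stored as a $p\times nm$ matrix acting on the vectorized $\boldsymbol{X}$, which keeps the forward map and its adjoint to one matrix-vector product each.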

The previous experiments suppose that $r$ is known for SVP and the matrix factorization approach. We note, however, that our result in Theorem 3 also covers the over-parameterization case where $r>r^{\star}$ . To illustrate the possible influence of over-parameterization, we generate a rank- $r^{\star}$ random matrix $\boldsymbol{X}^{\star}\in\mathbb{R}^{n\times m}$ with $r^{\star}=4$ and $n=m=50$ and obtain $p=4Rn$ random measurements (so that the measurement operator $\mathcal{A}$ satisfies the RIP of rank $R$ ), where $R=7$ . We then solve the matrix factorization problem with $r=4,5,6,7$ and display the corresponding convergence results in Figure 2. (To avoid tuning the parameters, such as the step size, for different $r$ , we use the minFunc package with the default setting, which solves the factored problem by the “LBFGS” algorithm [49].) As can be seen, the matrix factorization approach converges to the target matrix $\boldsymbol{X}^{\star}$ in both the exact-parameterization and over-parameterization cases. However, we also observe that it converges more slowly in the over-parameterization case (i.e., $r>r^{\star}$ ) than in the exact-parameterization case (i.e., $r=r^{\star}$ ).

V-B Matrix Completion

We compare the performance of the matrix factorization approach with SVP [48], the convex approach, and singular value thresholding (SVT) [50] (software available at http://svt.stanford.edu/) for matrix completion where we want to recover a low-rank matrix $\boldsymbol{X}^{\star}$ from incomplete measurements $\{X^{\star}_{ij}\}_{(i,j)\in\Omega}$ , where $\Omega\subset[n]\times[m]$ . Let $\mathcal{P}_{\Omega}$ denote the projection onto the index set $\Omega$ . The convex approach (denoted by CVX) uses the nuclear norm as a convex relaxation of the rank and solves

[TABLE]

To make the recovery of $\boldsymbol{X}^{\star}$ well-posed, we require $\boldsymbol{X}^{\star}$ to be incoherent such that the information in $\boldsymbol{X}^{\star}$ is not concentrated in a small number of entries [8]. A matrix $\boldsymbol{X}\in\mathbb{R}^{n\times m}$ with singular value decomposition $\boldsymbol{X}=\boldsymbol{L}\boldsymbol{\Sigma}\boldsymbol{Q}^{\mathrm{T}}$ is $u$ -incoherent if [48, Definition 2.1]

[TABLE]

Though $\mathcal{P}_{\Omega}$ does not satisfy the $r$ -RIP (10) for all low-rank matrices $\boldsymbol{X}$ , it satisfies the RIP when restricted to low-rank incoherent matrices.

Theorem 4.

[48, Theorem 4.2] Without loss of generality, assume $n\geq m$ . There exists a constant $C\geq 0$ such that for $\Omega\subset[n]\times[m]$ chosen according to the Bernoulli model with density greater than $Cu^{2}r^{2}\log n/(\delta^{2}m)$ , with probability at least $1-e^{-n\log n}$ , the RIP holds with constant $\delta$ for all $u$ -incoherent matrices $\boldsymbol{X}$ of rank at most $r$ .

Thus, if local search algorithms (such as gradient descent) start with a random initialization and the iterates remain incoherent, then Theorem 3 guarantees the global convergence of the matrix factorization approach with these algorithms. We note that this hypothesis is also required for SVP [48]. Though we can add a regularizer for incoherence as in [18], empirical evidence supports the hypothesis that the iterates in gradient descent are incoherent.

In the first set of experiments, we set $n=m=100$ and vary the rank $r$ from $1$ to $30$ . Similar to the setup for matrix sensing in Section V-A, we generate a rank- $r$ random matrix and randomly obtain $p$ entries, i.e., $|\Omega|=p$ . Figure 3 displays the phase transition for gradient descent with a random initialization, SVP [48], singular value thresholding (SVT) [50], and the convex approach. As can be seen, the matrix factorization approach has a similar phase transition to SVP, and is slightly better than SVT and the convex approach in terms of the number of measurements needed for successful recovery.

In the second set of experiments, we set $r=5$ and $p=3r(2n-r)$ (3 times the number of degrees of freedom within a rank- $r$ $n\times n$ matrix), and vary $n$ from $40$ to $5120$ . We compare the time needed for the four approaches in Figure 4; our matrix factorization approach is much faster than the other methods. The time savings for the matrix factorization approach come from avoiding the SVD, which is needed in each iteration of both SVT and SVP. We also observe that the convex approach has the highest computational complexity and is not scalable (which is the reason that we only present its time for $n$ up to $640$ ).
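To make the sampling budget concrete, the following sketch evaluates $p=3r(2n-r)$ and the resulting fraction of observed entries $p/n^{2}$. The doubling grid of sizes is an assumption for illustration; the text above only states that $n$ ranges from $40$ to $5120$.

```python
def sampling_budget(n, r=5, oversampling=3):
    # Degrees of freedom of a rank-r n x n matrix, and the number of samples used.
    dof = r * (2 * n - r)
    p = oversampling * dof
    return dof, p, p / (n * n)

sizes = [40 * 2**k for k in range(8)]   # assumed grid: 40, 80, ..., 5120
for n in sizes:
    dof, p, fraction = sampling_budget(n)
    print(f"n={n:5d}  dof={dof:6d}  p={p:7d}  observed fraction={fraction:.4f}")
```

The observed fraction shrinks roughly like $1/n$, so at the largest size well under one percent of the entries are sampled.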

V-C 1-bit Matrix Completion

In the last set of experiments, we compare the performance of the matrix factorization approach with the convex approach in [26] (software available at http://mdav.ece.gatech.edu/software/) for 1-bit matrix completion. We first note that to make the recovery problem well-posed, a constraint on $\|\boldsymbol{X}\|_{\infty}$ (the entry-wise maximum of the matrix $\boldsymbol{X}$ ) is applied in [26] to require that the matrix is not too “spiky”. Instead of using the constraint on $\|\boldsymbol{X}\|_{\infty}$ , we add a smooth regularizer $\|\boldsymbol{X}\|_{F}^{2}$ and turn to minimize the following objective function

[TABLE]

which is also a convex function over $\boldsymbol{X}$ and satisfies a similar restricted strong convexity and smoothness condition to $F_{\Omega,\boldsymbol{Y}}$ in Lemma 1. In the case where we only observe part of the entries, then in light of Theorem 4, the corresponding objective function is expected to satisfy the strong convexity and smoothness condition for all incoherent matrices. Thus, we factorize $\boldsymbol{X}$ into $\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}$ and solve the following optimization problem over the $n\times r$ and $m\times r$ matrices $\boldsymbol{U}$ and $\boldsymbol{V}$ :

[TABLE]

To evaluate the performance of this factorization approach on 1-bit matrix completion, we generate $n\times r$ matrices $\boldsymbol{U}^{\diamond}$ and $\boldsymbol{V}^{\diamond}$ with entries drawn i.i.d. from a uniform distribution on $[-\frac{1}{2},\frac{1}{2}]$ and construct a random $n\times n$ matrix $\boldsymbol{X}^{\diamond}$ with rank $r$ . Similar to the setup in [26], the matrix is then scaled so that $\|\boldsymbol{X}^{\diamond}\|_{\infty}=1$ . We obtain 1-bit observations $\{Y_{i,j}\}_{(i,j)\in\Omega}$ by adding Gaussian noise of variance $\sigma^{2}$ and recording the sign of the resulting value (15), where the subset of indices $\Omega$ is chosen at random with $\operatorname{E}|\Omega|=p$ . We compare the performance of the factorization approach and the convex approach [26] over a range of different values of $n$ , $p$ , $r$ or $\sigma$ . Figures 5(a)-(d) show the normalized squared Frobenius norm of the error $\frac{\|\widehat{\boldsymbol{X}}-\boldsymbol{X}^{\diamond}\|_{F}^{2}}{\|\boldsymbol{X}^{\diamond}\|_{F}^{2}}$ (where $\widehat{\boldsymbol{X}}$ denotes the reconstructed matrix), averaged over 10 Monte Carlo trials. We observe that the matrix factorization approach has slightly better performance than the convex approach for 1-bit matrix completion [26]. Note that this phenomenon (the factorization approach having better performance) is also observed in [46]. We repeat these experiments but obtain the 1-bit observations with the logistic regression model where $q(x)=\frac{e^{x}}{1+e^{x}}$ for (15) and display the results in Figure 6.
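The data-generation procedure described above can be sketched as follows. The observation rule assumed here is that $Y_{ij}=+1$ with probability $q(X^{\diamond}_{ij})$ and $-1$ otherwise, realized for the probit model by taking the sign of the noisy entry; the Bernoulli sampling of $\Omega$ with rate $p/n^{2}$, the function name, and its arguments are illustrative assumptions.

```python
import numpy as np

def one_bit_observations(n, r, p, sigma, model="probit", seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(-0.5, 0.5, size=(n, r))
    v = rng.uniform(-0.5, 0.5, size=(n, r))
    x = u @ v.T
    x /= np.abs(x).max()                       # scale so the largest |entry| is 1
    mask = rng.random((n, n)) < p / n**2       # Bernoulli sampling, E|Omega| = p
    if model == "probit":
        # Sign of the entry plus Gaussian noise of variance sigma^2.
        y = np.sign(x + sigma * rng.standard_normal((n, n)))
        y[y == 0] = 1.0                        # break exact ties (probability zero)
    else:
        # Logistic model: +1 with probability e^x / (1 + e^x).
        prob = 1.0 / (1.0 + np.exp(-x))
        y = np.where(rng.random((n, n)) < prob, 1.0, -1.0)
    return x, mask, np.where(mask, y, 0.0)     # unobserved entries stored as 0

x_true, omega_mask, y_obs = one_bit_observations(n=50, r=3, p=1000, sigma=0.3)
print("observed entries:", int(omega_mask.sum()))
print("fraction of +1 among observed:", float((y_obs[omega_mask] > 0).mean()))
```

Storing unobserved entries as zeros keeps the observations in a single array, so a likelihood can be evaluated by summing only over the entries where the mask is true.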

VI Conclusion

This paper considers low-rank matrix optimization on general (nonsymmetric and rectangular) matrices with general objective functions. By focusing on general objective functions, we provide a unifying framework for low-rank matrix optimizations with the factorization approach. Although the resulting optimization problem is not convex, we show that the reformulated objective function has a simple landscape: there are no spurious local minima and any critical point that is not a local minimum is a strict saddle, i.e., the Hessian evaluated at this point has a strictly negative eigenvalue. These properties guarantee that a number of iterative optimization algorithms (such as gradient descent and the trust region method) will converge to the global optimum from a random initialization.

Appendix A Proof of Lemma 1

Proof of Lemma 1.

We compute the partial derivative of $F_{\Omega,\boldsymbol{Y}}$ with respect to $X_{i,j}$ as

[TABLE]

which implies

[TABLE]

and

[TABLE]

for all $(k,\ell)\neq(i,j)$ . Thus, the bilinear form of the Hessian $\nabla^{2}F_{\Omega,\boldsymbol{Y}}(\boldsymbol{X})$ can be computed as

[TABLE]

for any $\boldsymbol{G}\in\mathbb{R}^{n\times m}$ . Now since by assumption $\|\boldsymbol{X}\|_{\infty}\leq\gamma$ , we have

[TABLE]

∎

Appendix B Proof of Proposition 2

Proof of Proposition 2.

This proof follows similar steps to the proof of [51, Lemma 2.1]. First note that the bilinear form $[\nabla^{2}f(\boldsymbol{Z})](\boldsymbol{G},\boldsymbol{H})=\sum_{i,j,k,l}\frac{\partial^{2}f(\boldsymbol{Z})}{\partial\boldsymbol{Z}_{ij}\partial\boldsymbol{Z}_{kl}}\boldsymbol{G}_{ij}\boldsymbol{H}_{kl}$ implies that $[\nabla^{2}f(\boldsymbol{Z})](\boldsymbol{G},\boldsymbol{H})$ scales bilinearly in $\boldsymbol{G}$ and $\boldsymbol{H}$ , i.e.,

[TABLE]

for any $a,b\in\mathbb{R}$ . If either $\boldsymbol{G}$ or $\boldsymbol{H}$ is zero, (3) holds since both sides are $0$ .

Now suppose both $\boldsymbol{G}$ and $\boldsymbol{H}$ are nonzero. By the scaling invariance property of both sides in (3), we assume $\|\boldsymbol{G}\|_{F}=\|\boldsymbol{H}\|_{F}=1$ without loss of generality. Note that the $(2r,4r)$ -restricted strong convexity and smoothness condition (3) implies

[TABLE]

Thus we have

[TABLE]

which further implies

[TABLE]

∎
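A quick numerical sanity check of this type of bound is possible for the weighted objective of Section III-C2. The sketch assumes that $f(\boldsymbol{X})=\frac{1}{2}\|\boldsymbol{\Omega}\circ(\boldsymbol{X}-\boldsymbol{X}^{\star})\|_{F}^{2}$, so that the Hessian bilinear form is $\langle\boldsymbol{\Omega}\circ\boldsymbol{G},\boldsymbol{\Omega}\circ\boldsymbol{H}\rangle$, and that Proposition 2 takes the form $\left|[\nabla^{2}f(\boldsymbol{Z})](\boldsymbol{G},\boldsymbol{H})-\frac{\alpha+\beta}{2}\langle\boldsymbol{G},\boldsymbol{H}\rangle\right|\leq\frac{\beta-\alpha}{2}\|\boldsymbol{G}\|_{F}\|\boldsymbol{H}\|_{F}$. Both are assumptions, since the displayed statements are not reproduced here; the function names are hypothetical.

```python
import numpy as np

def hessian_form(omega, g, h):
    # ASSUMED objective f(X) = 0.5 * ||Omega o (X - X*)||_F^2, whose Hessian
    # bilinear form is <Omega o G, Omega o H>, independent of X.
    return np.sum(omega**2 * g * h)

def isometry_gap(omega, g, h):
    # Returns both sides of the ASSUMED form of the Proposition 2 inequality.
    alpha, beta = omega.min() ** 2, omega.max() ** 2
    lhs = abs(hessian_form(omega, g, h) - 0.5 * (alpha + beta) * np.sum(g * h))
    rhs = 0.5 * (beta - alpha) * np.linalg.norm(g) * np.linalg.norm(h)
    return lhs, rhs

rng = np.random.default_rng(0)
omega = rng.uniform(1.0, 1.2, size=(6, 5))     # concentrated positive weights
for _ in range(3):
    g, h = rng.standard_normal((6, 5)), rng.standard_normal((6, 5))
    lhs, rhs = isometry_gap(omega, g, h)
    print(f"lhs={lhs:.4f} <= rhs={rhs:.4f}: {lhs <= rhs}")
```

For this diagonal Hessian the inequality holds for all $\boldsymbol{G}$ and $\boldsymbol{H}$, not only low-rank ones, because each weight $\Omega_{ij}^{2}$ lies within $\frac{\beta-\alpha}{2}$ of the midpoint $\frac{\alpha+\beta}{2}$.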

Appendix C Proof of Lemma 2

Proof of Lemma 2.

First recall the notation $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{V}^{\mathrm{T}}$ , $\boldsymbol{X}^{\star}=\boldsymbol{U}^{\star}\boldsymbol{V}^{\star\mathrm{T}}$ , and

[TABLE]

It follows from (19) and (20) that any critical point $\boldsymbol{W}$ satisfies

[TABLE]

which gives

[TABLE]

for any $\boldsymbol{Z}=\begin{bmatrix}\boldsymbol{Z}_{\boldsymbol{U}}\\ \boldsymbol{Z}_{\boldsymbol{V}}\end{bmatrix}\in\mathbb{R}^{(n+m)\times r}$ . Here the second line utilizes the fact $\nabla f(\boldsymbol{X}^{\star})={\bf 0}$ . We bound $\daleth_{1}$ by first using the integral form of the mean value theorem for $\nabla f(\boldsymbol{X})$ :

[TABLE]

Noting that all the three matrices $t\boldsymbol{X}+(1-t)\boldsymbol{X}^{\star}$ , $\boldsymbol{X}-\boldsymbol{X}^{\star}$ and $\boldsymbol{Z}_{\boldsymbol{U}}\boldsymbol{V}^{\mathrm{T}}+\boldsymbol{U}\boldsymbol{Z}_{\boldsymbol{V}}^{\mathrm{T}}$ have rank at most $2r$ , it follows from Proposition 2 that

[TABLE]

which when plugged into (28) gives

[TABLE]

Now let $\boldsymbol{Z}=(\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}-\boldsymbol{W}^{\star}\boldsymbol{W}^{\star\mathrm{T}}){\boldsymbol{W}^{T}}^{\dagger}$ , which gives $\boldsymbol{Z}\boldsymbol{W}^{\mathrm{T}}=(\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}-\boldsymbol{W}^{\star}\boldsymbol{W}^{\star\mathrm{T}})\boldsymbol{P}_{\boldsymbol{W}}$ . Here $\dagger$ denotes the pseudoinverse of a matrix and $\boldsymbol{P}_{\boldsymbol{W}}$ is the orthogonal projector onto the range of $\boldsymbol{W}$ . Utilizing the fact $\widehat{\boldsymbol{W}}^{\mathrm{T}}\boldsymbol{W}={\bf 0}$ from (7), we further connect the left hand side of (29) with $\left\|\left(\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}-\boldsymbol{W}^{\star}\boldsymbol{W}^{\star\mathrm{T}}\right)\boldsymbol{P}_{\boldsymbol{W}}\right\|_{F}^{2}$ by

[TABLE]

where the inequality follows because $\left\langle\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}},\boldsymbol{W}^{\star}\boldsymbol{W}^{\star\mathrm{T}}\boldsymbol{P}_{\boldsymbol{W}}\right\rangle=0$ (noting that $\widehat{\boldsymbol{W}}^{\star\mathrm{T}}\boldsymbol{W}^{\star}={\bf 0}$ ) and $\left\langle\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}},\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}\boldsymbol{P}_{\boldsymbol{W}}\right\rangle=\left\langle\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}},\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}\right\rangle\geq 0$ since it is the inner product between two PSD matrices.

On the other hand, we give an upper bound on the right-hand side of (29):

[TABLE]

where the last line follows because $\left\|\boldsymbol{Z}_{\boldsymbol{U}}\boldsymbol{V}^{\mathrm{T}}\right\|_{F}^{2}+\left\|\boldsymbol{Z}_{\boldsymbol{V}}\boldsymbol{U}^{\mathrm{T}}\right\|_{F}^{2}=\left\|\boldsymbol{Z}_{\boldsymbol{U}}\boldsymbol{U}^{\mathrm{T}}\right\|_{F}^{2}+\left\|\boldsymbol{Z}_{\boldsymbol{V}}\boldsymbol{V}^{\mathrm{T}}\right\|_{F}^{2}$ (since $\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}=\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}$ ), implying $2\left\|\boldsymbol{Z}_{\boldsymbol{U}}\boldsymbol{V}^{\mathrm{T}}\right\|_{F}^{2}+2\left\|\boldsymbol{U}\boldsymbol{Z}_{\boldsymbol{V}}^{\mathrm{T}}\right\|_{F}^{2}=\left\|\boldsymbol{Z}\boldsymbol{W}^{\mathrm{T}}\right\|_{F}^{2}$ . This together with (29) and (30) completes the proof. ∎
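The norm identity used in the last step can be checked numerically. The sketch below assumes NumPy and builds balanced factors ($\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}=\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}$) from the SVD of a random rank-$r$ matrix; the helper `balanced_factors` and the test dimensions are hypothetical, introduced only for illustration.

```python
import numpy as np

def balanced_factors(X, r):
    """Factor a rank-r matrix as X = U V^T with U^T U = V^T V, using the SVD."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    root = np.sqrt(s[:r])
    return P[:, :r] * root, Qt[:r].T * root   # U = P sqrt(S), V = Q sqrt(S)

def fro2(A):
    """Squared Frobenius norm."""
    return np.linalg.norm(A, "fro") ** 2

rng = np.random.default_rng(1)
n, m, r = 6, 5, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
U, V = balanced_factors(X, r)

# Arbitrary direction Z = [Z_U; Z_V] and stacked factor W = [U; V].
Z_U = rng.standard_normal((n, r))
Z_V = rng.standard_normal((m, r))
W = np.vstack([U, V])
Z = np.vstack([Z_U, Z_V])

lhs = 2 * fro2(Z_U @ V.T) + 2 * fro2(U @ Z_V.T)
rhs = fro2(Z @ W.T)                            # equals lhs when U^T U = V^T V
```

Without the balance condition the four blocks of $\boldsymbol{Z}\boldsymbol{W}^{\mathrm{T}}$ no longer pair up, so the equality generally fails for unbalanced factors.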

Appendix D Proof of Lemma 3

Proof of Lemma 3.

When $\boldsymbol{C}\neq{\bf 0}$ , the proof follows directly from the following results.

Lemma 5.

[29, Lemma 2] For any matrices $\boldsymbol{C},\boldsymbol{D}\in\mathbb{R}^{n\times r}$ with ranks $r_{1}$ and $r_{2}$, respectively, let $\boldsymbol{R}=\operatorname*{arg\,min}_{\widetilde{\boldsymbol{R}}\in\mathcal{O}_{r}}\|\boldsymbol{C}-\boldsymbol{D}\widetilde{\boldsymbol{R}}\|_{F}$. Then

[TABLE]

Lemma 6.

[30, Lemma 5.4] For any matrices $\boldsymbol{C},\boldsymbol{D}\in\mathbb{R}^{n\times r}$ with $\operatorname{rank}(\boldsymbol{D})=r$, let $\boldsymbol{R}=\operatorname*{arg\,min}_{\widetilde{\boldsymbol{R}}\in\mathcal{O}_{r}}\|\boldsymbol{C}-\boldsymbol{D}\widetilde{\boldsymbol{R}}\|_{F}$. Then

[TABLE]
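Both lemmas are stated in terms of the minimizer $\boldsymbol{R}$ of $\|\boldsymbol{C}-\boldsymbol{D}\widetilde{\boldsymbol{R}}\|_{F}$ over orthogonal matrices, which is the classical orthogonal Procrustes problem. As a hedged sketch (the standard SVD-based solution, not code from the paper), $\boldsymbol{R}$ can be computed from the SVD of $\boldsymbol{D}^{\mathrm{T}}\boldsymbol{C}$; the function name `procrustes_rotation` and the random inputs are assumptions made for illustration.

```python
import numpy as np

def procrustes_rotation(C, D):
    """Orthogonal R minimizing ||C - D R||_F, from the SVD D^T C = A S B^T."""
    A, _, Bt = np.linalg.svd(D.T @ C)
    return A @ Bt

rng = np.random.default_rng(2)
n, r = 7, 3
C = rng.standard_normal((n, r))
D = rng.standard_normal((n, r))

R = procrustes_rotation(C, D)
best = np.linalg.norm(C - D @ R)   # smallest distance over orthogonal matrices
aligned = (D @ R).T @ C            # equals B S B^T: symmetric and PSD
```

After alignment, $(\boldsymbol{D}\boldsymbol{R})^{\mathrm{T}}\boldsymbol{C}$ is symmetric positive semidefinite, a standard property of the Procrustes solution.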

If $\boldsymbol{C}={\bf 0}$ , then we have

[TABLE]

∎

Appendix E Proof of (22)

Proof of (22).

We prove the upper bounds for the four terms as follows.

Bounding term $\Pi_{1}$ : Utilizing the fact that $\boldsymbol{\Delta}_{\boldsymbol{U}}=\boldsymbol{U}-\boldsymbol{U}^{\star}\boldsymbol{R}$ and $\boldsymbol{\Delta}_{\boldsymbol{V}}=\boldsymbol{V}-\boldsymbol{V}^{\star}\boldsymbol{R}$ , we have

[TABLE]

where $(i)$ follows from (19) and (20), $(ii)$ uses $\nabla f(\boldsymbol{X}^{\star})={\bf 0}$, and $(iii)$ follows from the $(2r,4r)$-restricted strong convexity property (3):

[TABLE]

where the first line follows from the integral form of the mean value theorem for vector-valued functions, and the second line uses the fact that both $t\boldsymbol{X}+(1-t)\boldsymbol{X}^{\star}$ and $\boldsymbol{X}-\boldsymbol{X}^{\star}$ have rank at most $2r$ , and the $(2r,4r)$ -restricted strong convexity of the Hessian $\nabla^{2}f(\cdot)$ .
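To make the two steps above concrete, the sketch below checks them numerically for a hypothetical objective $f(\boldsymbol{X})=\sum_{ij}\left[\tfrac{1}{2}z_{ij}^{2}+\log\cosh z_{ij}\right]$ with $\boldsymbol{z}=\boldsymbol{X}-\boldsymbol{X}^{\star}$. This $f$ is not taken from the paper; it is chosen because its Hessian is diagonal with entries in $(1,2]$, so it satisfies the convexity and smoothness bounds with $\alpha=1$ and $\beta=2$ for all matrices, and $\nabla f(\boldsymbol{X}^{\star})={\bf 0}$. The integral over $t$ is approximated by Gauss-Legendre quadrature.

```python
import numpy as np

def grad_f(X, X_star):
    """Gradient of the illustrative objective: z + tanh(z), z = X - X_star."""
    z = X - X_star
    return z + np.tanh(z)

def hess_apply(Y, X_star, G):
    """Apply the (diagonal) Hessian of f at Y to the direction G."""
    z = Y - X_star
    return (1.0 + 1.0 / np.cosh(z) ** 2) * G

def integral_mvt(X, X_star, num_nodes=40):
    """Approximate int_0^1 [Hess f(tX + (1-t)X*)](X - X*) dt by Gauss-Legendre."""
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)
    G = X - X_star
    total = np.zeros_like(G)
    for x, w in zip(nodes, weights):
        t = 0.5 * (x + 1.0)              # map the node from [-1, 1] to [0, 1]
        total += 0.5 * w * hess_apply(t * X + (1 - t) * X_star, X_star, G)
    return total

rng = np.random.default_rng(3)
n, m, r = 6, 5, 2
X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# First line: grad f(X) - grad f(X*) equals the integral of the Hessian action.
mvt_gap = np.linalg.norm(
    integral_mvt(X, X_star) - (grad_f(X, X_star) - grad_f(X_star, X_star))
)

# Second line: <grad f(X), X - X*> >= alpha * ||X - X*||_F^2 with alpha = 1.
alpha = 1.0
inner = np.sum(grad_f(X, X_star) * (X - X_star))
lower = alpha * np.linalg.norm(X - X_star) ** 2
```

Here `mvt_gap` is zero up to quadrature error, and `inner` is at least `lower`, mirroring the bound obtained from restricted strong convexity.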

Bounding term $\Pi_{2}$ : By the smoothness condition (3), we have

[TABLE]

where the last line holds because $\left\|\boldsymbol{D}\boldsymbol{U}^{\mathrm{T}}\right\|_{F}^{2}=\left\|\boldsymbol{D}\boldsymbol{V}^{\mathrm{T}}\right\|_{F}^{2}$ for any $\boldsymbol{D}\in\mathbb{R}^{p\times r}$ with arbitrary $p\geq 1$ since any critical point $\boldsymbol{W}$ satisfies $\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}=\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}$ .

Bounding term $\Pi_{3}$ :

[TABLE]

Bounding term $\Pi_{4}$ :

[TABLE]

where $(i)$ holds because $\widehat{\boldsymbol{W}}^{\mathrm{T}}\boldsymbol{W}={\bf 0}$ , and $(ii)$ follows because $\widehat{\boldsymbol{W}}^{\star\mathrm{T}}\boldsymbol{W}^{\star}={\bf 0}$ and $\langle\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}},\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}\rangle\geq 0$ since it is the inner product between two PSD matrices. ∎
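For completeness, the nonnegativity of the inner product between the two PSD matrices invoked in $(ii)$ can be seen directly from the cyclic property of the trace. The following short derivation uses only the notation above and is added as an explanatory step rather than quoted from the paper:

```latex
\begin{align*}
\left\langle\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}},\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}\right\rangle
&=\operatorname{trace}\left(\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}}\boldsymbol{W}\boldsymbol{W}^{\mathrm{T}}\right)\\
&=\operatorname{trace}\left(\boldsymbol{W}^{\mathrm{T}}\widehat{\boldsymbol{W}}^{\star}\widehat{\boldsymbol{W}}^{\star\mathrm{T}}\boldsymbol{W}\right)\\
&=\left\|\widehat{\boldsymbol{W}}^{\star\mathrm{T}}\boldsymbol{W}\right\|_{F}^{2}\;\geq\;0.
\end{align*}
```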

Appendix F Proof of (24)

Proof of (24).

Expanding the left-hand side of (24), we see that (24) is equivalent to

[TABLE]

Expanding both sides of the above equation and using the facts that $\boldsymbol{U}^{\mathrm{T}}\boldsymbol{U}=\boldsymbol{V}^{\mathrm{T}}\boldsymbol{V}$ and $\boldsymbol{U}^{\star\mathrm{T}}\boldsymbol{U}^{\star}=\boldsymbol{V}^{\star\mathrm{T}}\boldsymbol{V}^{\star}$, it remains to show

[TABLE]

Thus, we obtain (24) by noting that the above equation is equivalent to

[TABLE]

∎

Bibliography

[1] S. Aaronson, “The learnability of quantum states,” in Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 463, pp. 3089–3114, The Royal Society, 2007.
[2] S. T. Flammia, D. Gross, Y.-K. Liu, and J. Eisert, “Quantum tomography via compressed sensing: Error bounds, sample complexity and efficient estimators,” New Journal of Physics, vol. 14, no. 9, p. 095022, 2012.
[3] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in Advances in Neural Information Processing Systems, pp. 1329–1336, 2004.
[4] D. De Coste, “Collaborative prediction using ensembles of maximum margin matrix factorizations,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 249–256, ACM, 2006.
[5] P. Biswas and Y. Ye, “Semidefinite programming for ad hoc wireless sensor network localization,” in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pp. 46–54, ACM, 2004.
[6] G. Tang and A. Nehorai, “Lower bounds on the mean-squared error of low-rank matrix reconstruction,” IEEE Transactions on Signal Processing, vol. 59, no. 10, pp. 4559–4571, 2011.
[7] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
[8] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
