Global Optimality in Low-rank Matrix Optimization
Zhihui Zhu, Qiuwei Li, Gongguo Tang, Michael B. Wakin

TL;DR
This paper analyzes the geometric landscape of low-rank matrix optimization problems, showing that under certain conditions, the factored formulation has no spurious local minima and satisfies the strict saddle property, enabling global convergence of algorithms.
Contribution
It provides a geometric analysis of the factored low-rank matrix optimization problem for well-conditioned functions, establishing conditions for no spurious minima and strict saddle property.
Findings
The reformulated problem has no spurious local minima.
The objective function satisfies the strict saddle property.
Gradient-based algorithms can provably find global solutions.
Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B. Wakin
This work was supported by NSF grant CCF-1409261, NSF grant CCF-1464205, and NSF CAREER grant CCF-1149225. Z. Zhu, Q. Li, G. Tang, and M. B. Wakin are with the Department of Electrical Engineering, Colorado School of Mines, Golden, CO 80401 USA. Email: {zzhu, qiuli, gtang, mwakin}@mines.edu.
Abstract
This paper considers the minimization of a general objective function over the set of rectangular matrices that have rank at most . To reduce the computational burden, we factorize the variable into a product of two smaller matrices and optimize over these two matrices instead of . Despite the resulting nonconvexity, recent studies in matrix completion and sensing have shown that the factored problem has no spurious local minima and obeys the so-called strict saddle property (the function has a directional negative curvature at all critical points but local minima). We analyze the global geometry for a general and yet well-conditioned objective function whose restricted strong convexity and restricted strong smoothness constants are comparable. In particular, we show that the reformulated objective function has no spurious local minima and obeys the strict saddle property. These geometric properties imply that a number of iterative optimization algorithms (such as gradient descent) can provably solve the factored problem with global convergence.
Index Terms:
Low-rank matrix optimization, matrix sensing, nonconvex optimization, optimization geometry, strict saddle
I Introduction
Consider the minimization of a general objective function over all low-rank matrices:
[TABLE]
where the objective function is smooth. Low-rank matrix optimizations of the form (1) appear in a wide variety of applications, including quantum tomography [1, 2], collaborative filtering [3, 4], sensor localization [5], low-rank matrix recovery from compressive measurements [6, 7], and matrix completion [8, 9]. Due to the rank constraint, however, low-rank matrix optimizations of the form (1) are highly nonconvex and computationally NP-hard in general [10] even if itself is convex. In order to deal with the rank constraint and to find a low-rank solution, the nuclear norm is widely used in matrix inverse problems [11, 7] arising in machine learning [12], signal processing [13], and control [14]. Although nuclear norm minimization enjoys strong statistical guarantees [8], its computational complexity is very high (as most algorithms require performing an expensive singular value decomposition (SVD) in each iteration), prohibiting it from scaling to practical problems.
To relieve the computational bottleneck and provide an alternative way of dealing with the rank constraint, recent studies propose to factorize the variable into the Burer-Monteiro type decomposition [15, 16] with , and optimize over the and matrices and . With this parameterization of , we can recast (1) into the following program:
[TABLE]
The bilinear nature of the parameterization renders the objective function of (2) nonconvex even when is a convex function. Hence, the objective function in (2) can potentially have spurious local minima (i.e., local minimizers that are not global minimizers) or “bad” saddle points that prevent a number of iterative algorithms from converging to the global solution. By analyzing the landscape of nonconvex functions, several recent works have shown that the factored objective function in certain matrix inverse problems has no spurious local minima [17, 18, 19].
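The loss of convexity under a bilinear parameterization can be seen already in the scalar case. The following is a minimal sketch, assuming a hypothetical convex loss `f` and its factored version `h` (neither is taken from the paper): two global minimizers of `h` have a midpoint with a strictly larger value, which a convex function would not allow.

```python
import numpy as np

def f(x):
    # Convex scalar loss (hypothetical example, not from the paper).
    return 0.5 * (x - 1.0) ** 2

def h(u, v):
    # Factored objective: the same loss evaluated at the product u * v.
    return f(u * v)

a = np.array([1.0, 1.0])     # global minimizer of h, since u * v = 1
b = np.array([-1.0, -1.0])   # another global minimizer, since u * v = 1
mid = 0.5 * (a + b)          # midpoint (0, 0)

h_a, h_b, h_mid = h(*a), h(*b), h(*mid)
# Convexity would require h(mid) <= (h(a) + h(b)) / 2.
convexity_violated = h_mid > 0.5 * (h_a + h_b)
print(h_a, h_b, h_mid, convexity_violated)
```

The midpoint here is the origin, which is a critical point of `h` but not a minimizer; whether such points are benign is what the landscape analysis addresses.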
We generalize this line of work by focusing on a general objective function in the optimization (1), not necessarily a quadratic loss function coming from a matrix inverse problem. By focusing on a general objective function, we attempt to provide a unifying framework for low-rank matrix optimizations with the factorization approach. We provide a geometric analysis for the factored program (2) and show that, under certain conditions on , all critical points of the objective function are well-behaved. Our characterization of the geometry of the objective function ensures that a number of iterative optimization algorithms converge to a global minimum.
I-A Summary of Results
The purpose of this paper is to analyze the geometry of the factored problem in (2). In particular, we attempt to understand the behavior of all of the critical points of the objective function in the reformulated problem (2).
Before presenting our main results, we lay out the necessary assumptions on the objective function . As is known, without any assumptions on the problem, even minimizing traditional quadratic objective functions is challenging. For this purpose, we focus on the model where is -restricted strongly convex and smooth, i.e., for any matrices with and , the Hessian of satisfies
[TABLE]
for some positive and . A similar assumption is also utilized in [20, Conditions 5.3 and 5.4]. With this assumption on , we summarize our main results in the following informal theorem.
Theorem 1 (informal).
Suppose the function satisfies the -restricted strong convexity and smoothness condition (3) and has a critical point with . Then the factored objective function (with an additional regularizer, see Theorem 3) in (2) has no spurious local minima and obeys the strict saddle property (see Definition 3 in Section II).
Remark 1.
As guaranteed by Proposition 1 (in Section III), the -restricted strong convexity and smoothness property (3) ensures that is the unique global minimum of (1). Theorem 1 then implies that we can recover the rank- global minimizer of (1) by many iterative algorithms (such as the trust region method [21] and stochastic gradient descent [22]) even from a random initialization. This is because 1) as guaranteed by Theorem 2, the strict saddle property ensures local search algorithms converge to a local minimum, and 2) there are no spurious local minima.
Remark 2.
Since our main result only requires the -restricted strong convexity and smoothness property (3), aside from low-rank matrix recovery [11], it can also be applied to many other low-rank matrix optimization problems [23] which do not necessarily involve quadratic loss functions. Typical examples include robust PCA [24, 25], 1-bit matrix completion [26, 27] and Poisson principal component analysis (PCA) [28].
Remark 3.
Similar results on positive semi-definite (PSD) matrix optimization problems (but without the rank constraint) with generic objective functions were obtained in [29]. We note that one cannot directly apply the results in [29] to the optimization (1) when the matrices under consideration are nonsymmetric or rectangular, even if we ignore the rank constraint. One could attempt to convert minimizing over general matrices into minimizing over the cone of PSD matrices of size , where and form the upper right and lower left blocks of . The problem with this transformation, however, is that will no longer satisfy the same properties as , in particular the restricted strong convexity and smoothness condition (3) which is a key assumption utilized in [29]. For this reason, one cannot apply the results for the PSD optimization in [29] directly to our problem. In terms of the proof techniques, although the generalization from the PSD case might not seem technically challenging at first sight, quite a few technical difficulties had to be overcome to develop the theory for the general case in this paper. In fact, the non-triviality of extending to the nonsymmetric case is also highlighted in [30, 19].
I-B Related Works
Compared with the original program (1), the factored form (2) typically involves many fewer variables (or variables with much smaller size) and can be efficiently solved by simple but powerful methods (such as gradient descent [22, 31], the trust region method [32], and alternating methods [33]) for large-scale settings, though it is nonconvex. In recent years, tremendous effort has been devoted to analyzing nonconvex optimizations by exploiting the geometry of the corresponding objective functions. These works can be separated into two types based on whether the geometry is analyzed locally or globally. One type of work analyzes the behavior of the objective function in a small neighborhood containing the global optimum and requires a good initialization that is close enough to a global minimum. Problems such as phase retrieval [34], matrix sensing [30], and semi-definite optimization [35] have been studied in this way.
Another type of work attempts to analyze the landscape of the objective function and show that it obeys the strict saddle property. If this particular property holds, then simple algorithms such as gradient descent and the trust region method are guaranteed to converge to a local minimum from a random initialization [31, 22, 36] rather than requiring a good guess. We approach low-rank matrix optimization with general objective functions (1) via a similar geometric characterization. Similar geometric results are known for a number of problems including complete dictionary learning [36], phase retrieval [21], orthogonal tensor decomposition [22], and matrix inverse problems [17, 18, 37]. Empirical evidence also supports using the factorization approach for estimating a low-rank PSD matrix from a set of rank-one measurements corrupted by arbitrary outliers [38] and for recovering a dynamically evolving low-rank matrix from incomplete observations [39].
Our work is most closely related to certain recent works in low-rank matrix optimization. Bhojanapalli et al. [17] showed that the low-rank, PSD matrix sensing problem has no spurious local minima and obeys the strict saddle property. Similar results were exploited for PSD matrix completion [18], PSD matrix factorization [40] and low-rank, PSD matrix optimization problems with generic objective functions [29]. Our work extends this line of analysis to general low-rank matrix (not necessarily PSD or even square) optimization problems. Another closely related work considers the low-rank, non-square matrix sensing problem and matrix completion with the factorization approach [19, 41, 42]. We note that our general objective function framework includes the low-rank matrix sensing problem as a special case (see Section III-C). Furthermore, our result covers both over-parameterization where and exact parameterization where . Wang et al. [20] also considered the factored low-rank matrix minimization problem with a general objective function which satisfies the restricted strong convexity and smoothness condition. Their algorithms require good initializations for global convergence since they characterized only the local landscapes around the global optima. By categorizing the behavior of all the critical points, our work differs from [20] in that we instead characterize the global landscape of the factored objective function.
This paper continues in Section II with formal definitions for strict saddles and the strict saddle property. We present the main results and their implications in matrix sensing, weighted low-rank approximation, and 1-bit matrix completion in Section III. The proof of our main results is given in Section IV. We conclude the paper in Section VI.
II Preliminaries
II-A Notation
To begin, we first briefly introduce some notation used throughout the paper. The symbols and respectively represent the identity matrix and zero matrix with appropriate sizes. The set of orthonormal matrices is denoted by . If a function has two arguments, and , we occasionally use the notation when we put these two arguments into a new one as . For a scalar function with a matrix variable , its gradient is an matrix whose -th entry is for all . Here for any and is the -th entry of the matrix . The Hessian of can be viewed as an matrix for all , where is the -th entry of the vectorization of . An alternative way to represent the Hessian is by a bilinear form defined via for any . The bilinear form for the Hessian is widely utilized throughout the paper.
II-B Strict Saddle Property
Suppose is a twice continuously differentiable objective function. We begin with the notion of strict saddles and the strict saddle property.
Definition 1 (Critical points).
We say a critical point if the gradient at vanishes, i.e., .
Definition 2 (Strict saddles).
A critical point is a strict saddle if the Hessian matrix evaluated at this point has a strictly negative eigenvalue, i.e., .
Definition 3 (Strict saddle property [22]).
A twice differentiable function satisfies the strict saddle property if each critical point either corresponds to a local minimum or is a strict saddle.
Intuitively, the strict saddle property requires a function to have a directional negative curvature at all critical points but local minima. This property allows a number of iterative algorithms such as noisy gradient descent [22] and the trust region method [43] to further decrease the function value at all the strict saddles and thus converge to a local minimum.
Theorem 2 (informal) [32, 22, 31].
For a twice continuously differentiable objective function satisfying the strict saddle property, a number of iterative optimization algorithms (such as gradient descent and the trust region method) can find a local minimum.
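Definitions 1 to 3 can be checked by hand on a toy problem. The sketch below assumes a hypothetical scalar factored objective h(u, v) = 0.5 (uv - 1)^2, which is not an example from the paper, and classifies a critical point by the smallest eigenvalue of its Hessian. The function names and the tolerance are illustrative choices.

```python
import numpy as np

def grad(u, v):
    # Gradient of h(u, v) = 0.5 * (u * v - 1)^2.
    res = u * v - 1.0
    return np.array([res * v, res * u])

def hessian(u, v):
    # Hessian of h: [[v^2, 2uv - 1], [2uv - 1, u^2]].
    off = 2.0 * u * v - 1.0
    return np.array([[v * v, off], [off, u * u]])

def classify(u, v, tol=1e-10):
    # Only critical points (vanishing gradient) are classified.
    assert np.linalg.norm(grad(u, v)) < tol, "not a critical point"
    lam_min = np.linalg.eigvalsh(hessian(u, v))[0]
    label = "strict saddle" if lam_min < -tol else "no negative curvature"
    return label, lam_min

print(classify(0.0, 0.0))   # the origin has a negative Hessian eigenvalue
print(classify(1.0, 1.0))   # a global minimizer (u * v = 1)
```

For this toy objective the critical points are the origin and the curve uv = 1, so every critical point is either a global minimizer or a strict saddle. Note that the label "no negative curvature" alone does not certify a local minimum; here it is the zero function value that does.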
III Problem Formulation and Main Results
III-A Problem Formulation
This paper considers the problem (1) of minimizing a general function (over the set of low-rank matrices) which is assumed to have a low-rank critical point with such that . Because of the restricted strong convexity and smoothness condition (3), the following result establishes that if has a critical point with , then it is the unique global minimum of (1).
Proposition 1.
Suppose satisfies the -restricted strong convexity and smoothness condition (3) with positive and . Assume is a critical point of with . Then is the global minimum of (1), i.e.,
[TABLE]
and the equality holds only at .
Proof of Proposition 1.
First note that if is a critical point of , then
[TABLE]
Now for any with , the second order Taylor expansion gives
[TABLE]
where for some . This Taylor expansion together with and (3) (both and have rank at most ) gives
[TABLE]
∎
With this, in the sequel, we use to denote the global minimum of (1) (i.e., the low-rank critical point of ), unless stated otherwise. We note that the assumption of the existence of a low-rank critical point is very mild and holds in many matrix inverse problems [7, 8], where the unknown matrix to be recovered is a critical point of .
We factorize the variable with and transform (1) into its factored counterpart (2). Throughout the paper, , and are matrices depending on and :
[TABLE]
Although the new variable has much smaller size than when , the objective function in the factored problem (2) may have a much more complicated landscape due to the bilinear form about and . The reformulated objective function could introduce spurious local minima or degenerate saddle points even when is convex. Our goal is to guarantee that this does not happen.
Let denote an SVD of , where and are orthonormal matrices of appropriate sizes, and is a diagonal matrix with non-negative diagonal (but with some zeros on the diagonal if ). We denote
[TABLE]
where forms a balanced factorization of since and have the same singular values. Throughout the paper, we utilize the following two ways to stack and together:
[TABLE]
Before moving on, we note that for any solution to (2), is also a solution to (2) for any such that . In order to address this ambiguity (i.e., to reduce the search space of for (2)), we utilize the trick in [30, 19, 20] by introducing a regularizer
[TABLE]
and solving the following problem
[TABLE]
where controls the weight for the term , which will be discussed soon.
We remark that is still a global minimizer of the factored problem (5) since achieves its global minimum over the low-rank set of matrices at and also achieves its global minimum at . The regularizer is applied to force the difference between the two Gram matrices of and to be as small as possible. The global minimum of is 0, which is achieved when and have the same Gram matrices, i.e., when belongs to
[TABLE]
Informally, we can view (5) as finding a point from that also minimizes . This is formally established in Theorem 3.
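The role of the regularizer can be illustrated numerically. The sketch below is written under stated assumptions: the regularizer is taken to be the squared Frobenius norm of the difference of the two Gram matrices (the helper `gram_gap` is a hypothetical name), the balanced factors are built from a truncated SVD, and the sizes, seed, and rescaling matrix `G` are arbitrary. It shows that an invertible rescaling keeps the product unchanged while unbalancing the factors, whereas a common orthonormal rotation keeps the Gram matrices equal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r matrix

# Balanced factors from the SVD X = P diag(s) Q^T, truncated to rank r.
P, s, Qt = np.linalg.svd(X, full_matrices=False)
U = P[:, :r] * np.sqrt(s[:r])
V = Qt[:r, :].T * np.sqrt(s[:r])

def gram_gap(U, V):
    # Assumed regularizer form: ||U^T U - V^T V||_F^2.
    return np.linalg.norm(U.T @ U - V.T @ V, "fro") ** 2

# An invertible rescaling keeps U V^T but unbalances the factors.
G = np.diag([3.0, 0.5])
U2, V2 = U @ G, V @ np.linalg.inv(G).T
same_product = np.allclose(U @ V.T, U2 @ V2.T)

# A common orthonormal rotation keeps both the product and the balance.
R, _ = np.linalg.qr(rng.standard_normal((r, r)))
gap_rotated = gram_gap(U @ R, V @ R)

print(gram_gap(U, V), gram_gap(U2, V2), same_product, gap_rotated)
```

In this sketch the regularizer removes the rescaling ambiguity but leaves the rotational one, which is consistent with the proof later measuring distance to the global factor up to an orthonormal matrix.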
III-B Main Results
Our main argument is that, under certain conditions on , the objective function has no spurious local minima and satisfies the strict saddle property. This is equivalent to categorizing all the critical points into two types: 1) the global minima which correspond to the global solution of the original convex problem (1) and 2) strict saddles such that the Hessian matrix evaluated at these points has a strictly negative eigenvalue. We formally establish this in the following theorem, whose proof is given in the next section.
Theorem 3.
For any , each critical point of defined in (5) satisfies
[TABLE]
Furthermore, suppose that the function satisfies the -restricted strong convexity and smoothness condition (3) with positive constants and satisfying and that the function has a critical point with . Set for the factored problem (5). Then has no spurious local minima, i.e., any local minimum of is a global minimum corresponding to the global solution of the original problem (1): In addition, obeys the strict saddle property that any critical point not being a local minimum is a strict saddle with
[TABLE]
where is the rank of , represents the smallest eigenvalue, and denotes the -th largest singular value.
Remark 4.
Equation (7) shows that any critical point belongs to for the objective function in the factored problem (5) with any positive . This demonstrates the reason for adding the regularizer . Thus, any iterative optimization algorithm converging to some critical point of results in a solution within . Furthermore, the strict saddle property along with the lack of spurious local minima ensures that a number of iterative optimization algorithms find the global minimum.
Remark 5.
For any critical point that is not a local minimum, the right hand side of (8) is strictly negative, implying that is a strict saddle. We also note that Theorem 3 not only covers exact parameterization where , but also includes the over-parameterization case where .
Remark 6.
The constants appearing in Theorem 3 are not optimized. We use simply to include which is utilized for the matrix sensing problem in [30]. If the ratio between the restricted strong convexity and smoothness constants , then we can show that has no spurious local minima and obeys the strict saddle property for any (where is utilized for the matrix sensing problem in [19]). In all cases, a smaller yields a more negative constant in (8); see Section IV for more discussion on this. This implies that when the restricted strong convexity constant is not provided a priori, one can always choose a small to ensure the strict saddle property holds, and hence guarantee the global convergence of many iterative optimization algorithms.
The constant for the dynamic range in Theorem 3 is also not optimized and it is possible to slightly relax this constraint with more sophisticated analysis. However, the following example involving weighted symmetric matrix factorization implies that the room for improving this constant is rather limited. Let
[TABLE]
for some ,
[TABLE]
Now consider the following weighted low-rank matrix factorization:
[TABLE]
whose gradient and Hessian are given by:
[TABLE]
and
[TABLE]
Then,
[TABLE]
is a critical point with
[TABLE]
which has eigenvalues
[TABLE]
and . We conclude that this is a strict saddle point when and a spurious local minimum when . This weighted symmetric matrix factorization problem (9) satisfies the restricted strong convexity and smoothness condition (3) with constants and (where and represent the smallest and largest entries in ; see Section III-C). Thus, we have a counterexample which demonstrates the existence of spurious local minima when .
Remark 7.
We finally remark that although Theorem 3 requires the additional regularizer (4), empirical evidence (see experiments in Section V) shows we can get rid of this regularizer for many iterative algorithms with random initialization.
We prove Theorem 3 in Section IV. Before proceeding, we present three stylized applications of Theorem 3 in matrix sensing, weighted low-rank matrix factorization, and 1-bit matrix completion.
III-C Stylized Applications
III-C1 Matrix Sensing
We first consider the implication of Theorem 3 in the matrix sensing problem where
[TABLE]
Here is a known measurement operator satisfying the following restricted isometry property.
Definition 4 (Restricted Isometry Property (RIP) [7]).
The map satisfies the -RIP with constant if
[TABLE]
holds for any matrix with .
Note that, in this case, the gradient of at is
[TABLE]
which implies that is a critical point of . The Hessian quadratic form for any matrices and is given by
[TABLE]
If satisfies the -restricted isometry property with constant , then satisfies the -restricted strong convexity and smoothness condition (3) with constants and since
[TABLE]
for any rank- matrix . Now, applying Theorem 3, we can characterize the geometry for the following matrix sensing problem with the factorization approach:
[TABLE]
where is the added regularizer defined in (4).
Corollary 1.
Suppose satisfies the -RIP with constant , and set . Then the objective function in (11) has no spurious local minima and satisfies the strict saddle property.
This result follows directly from Theorem 3 by noting that if . We remark that Park et al. [19, Theorem 4.3] provided a similar geometric result for (11). Compared to their result which requires , our result has a much weaker requirement on the RIP of the measurement operator.
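To make the factored matrix sensing program concrete, here is a minimal sketch of plain gradient descent from a small random initialization. Everything specific in it is an assumption for illustration: the problem sizes, the Gaussian operator stored as a matrix `A` with entries scaled by 1/sqrt(p), the weight `mu` with a mu/4 scaling on the Gram-matrix regularizer, the step size, and the iteration count. It is not the authors' experimental setup, which uses the minFunc package.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, p = 6, 5, 2, 4000          # hypothetical sizes
mu, step, iters = 0.05, 0.2, 3000   # hypothetical weight and step size

# Rank-r ground truth with singular values 1.0 and 0.5.
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
X_star = P @ np.diag([1.0, 0.5]) @ Q.T

# Gaussian measurement operator as a p x (n*m) matrix; y = A(X_star).
A = rng.standard_normal((p, n * m)) / np.sqrt(p)
y = A @ X_star.ravel()

# Extreme eigenvalues of A^T A bound the Hessian form of f over all
# directions, hence also over low-rank directions.
eigs = np.linalg.eigvalsh(A.T @ A)
ratio = eigs[-1] / eigs[0]

def grad_f(X):
    # Gradient of f(X) = 0.5 * ||A(X) - y||^2.
    return (A.T @ (A @ X.ravel() - y)).reshape(n, m)

U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((m, r))
for _ in range(iters):
    G = grad_f(U @ V.T)
    D = U.T @ U - V.T @ V           # Gram-matrix imbalance
    # Simultaneous update; the right-hand sides use the old U and V.
    U, V = U - step * (G @ V + mu * U @ D), V - step * (G.T @ U - mu * V @ D)

rel_err = np.linalg.norm(U @ V.T - X_star) / np.linalg.norm(X_star)
print(f"beta/alpha estimate {ratio:.2f}, relative error {rel_err:.2e}")
```

The printed eigenvalue ratio is only a crude global stand-in for the restricted constants; it is included to show how well conditioned this particular random instance is, not to verify the hypothesis of Corollary 1.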
III-C2 Weighted Low-Rank Matrix Factorization
We now consider the implication of Theorem 3 in the weighted matrix factorization problem [44], where
[TABLE]
Here is an weight matrix consisting of positive elements and denotes the point-wise product between two matrices. In this case, the gradient of at is
[TABLE]
which implies that is a critical point of . The Hessian quadratic form for any matrices and is given by
[TABLE]
Thus satisfies the -restricted strong convexity and smoothness condition (3) with constants and since
[TABLE]
where and represent the smallest and largest entries in , respectively. Now we consider the following weighted matrix factorization problem:
[TABLE]
where is the added regularizer defined in (4). For an arbitrary weight matrix , it is proven that the weighted low-rank factorization can be NP-hard [45] and has spurious local minima. When the elements in the weight matrix are concentrated, it is expected that (12) can be efficiently solved by a number of iterative optimization algorithms as it is close to an (unweighted) matrix factorization problem (where is a matrix of ones) which obeys the strict saddle property [40]. The following result characterizes the geometric structure in the objective function of (12) by directly applying Theorem 3.
Corollary 2.
Suppose satisfies . Set . Then the objective function in (12) has no spurious local minima and satisfies the strict saddle property.
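The link between the spread of the weights and the convexity and smoothness constants can be checked numerically. The sketch below assumes the weighted objective f(X) = 0.5 ||W * (X - X_star)||_F^2 with an entrywise product, which is one natural reading of the setup and is labeled as an assumption. Under that form the Hessian quadratic form is ||W * G||_F^2 at every X, so it is sandwiched between the squared smallest and squared largest weights times ||G||_F^2. The weight range and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 4
W = rng.uniform(1.0, 1.2, size=(n, m))   # positive, mildly spread weights

def hessian_form(G):
    # For the assumed f(X) = 0.5 * ||W * (X - X_star)||_F^2 the Hessian
    # quadratic form is <W * G, W * G>, independent of X.
    return np.sum((W * G) ** 2)

# Candidate constants: squared smallest and largest weights.
alpha, beta = W.min() ** 2, W.max() ** 2

ratios = []
for _ in range(1000):
    G = rng.standard_normal((n, m))
    ratios.append(hessian_form(G) / np.sum(G ** 2))

print(alpha, min(ratios), max(ratios), beta, beta / alpha)
```

Because the bounds hold for every direction `G`, they hold in particular for low-rank directions, so concentrated weights translate into a ratio beta/alpha close to one.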
III-C3 1-bit Matrix Completion
Finally, we consider the problem of completing a low-rank matrix from a subset of 1-bit measurements [26]. Given , a subset of indices , and a differentiable function , we observe
[TABLE]
for all . Typical choices for include the logistic regression model where and the probit regression model where . Here is the cumulative distribution function (CDF) of a mean-zero Gaussian distribution with variance . In [26], the authors attempt to recover from the incomplete nonlinear measurements by minimizing the negative log-likelihood function
[TABLE]
which results in a maximum likelihood (ML) estimate.
We note that is a convex function for both the logistic model and the probit model. The following result also establishes that satisfies the restricted strong convexity and smoothness condition if we observe full 1-bit measurements, i.e., .
Lemma 1.
Suppose . Let
[TABLE]
and
[TABLE]
Then satisfies the restricted strong convexity and smoothness condition:
[TABLE]
for any and .
The proof of Lemma 1 is given in Appendix A. Now we consider the logistic regression model where .
Corollary 3.
Suppose and . Consider the logistic regression model where . Then satisfies the restricted strong convexity and smoothness condition with
[TABLE]
Proof of Corollary 3.
Applying Lemma 1 with direct calculation gives
[TABLE]
where . Now if we restrict , we have
[TABLE]
∎
Under the assumption that is low-rank, a nuclear norm constraint is utilized in [26] to force a low-rank solution. Corollary 3 implies that we can apply matrix factorization for 1-bit matrix recovery given that the elements of are bounded. For the setting where is only a subset of , [46] considered the 1-bit matrix completion problem with the rank constraint and established a stronger statistical recovery guarantee than that in [26]. Empirical evidence (see [46] and Section V-C) supports that matrix factorization also works for 1-bit matrix completion.
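The dependence on bounded entries can be made explicit for the logistic model. The sketch below assumes the standard logistic function sigma(x) = 1 / (1 + exp(-x)) and a hypothetical entrywise bound `gamma` on the matrix entries. For either outcome of a 1-bit observation, the negative log-likelihood of a single entry has second derivative sigma(x)(1 - sigma(x)), which lies between its value at |x| = gamma and 1/4.

```python
import numpy as np

def sigma(x):
    # Logistic function.
    return 1.0 / (1.0 + np.exp(-x))

def curvature(x):
    # Second derivative in x of both -log(sigma(x)) and -log(1 - sigma(x)).
    return sigma(x) * (1.0 - sigma(x))

def curvature_bounds(gamma):
    # Smallest and largest curvature over |x| <= gamma.
    return curvature(gamma), 0.25

gamma = 1.0   # hypothetical entrywise bound
low, high = curvature_bounds(gamma)
xs = np.linspace(-gamma, gamma, 2001)
within = bool(np.all((curvature(xs) >= low - 1e-12) & (curvature(xs) <= high + 1e-12)))

print(low, high, high / low, within)
```

The ratio of the two bounds equals (1 + e^gamma)^2 / (4 e^gamma), so it grows with `gamma`; this is why the per-entry curvature is only well conditioned when the entries are bounded.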
IV Proof of Theorem 3
In this section, we provide a formal proof of Theorem 3. The main argument involves showing that each critical point of either corresponds to the global solution of (1) or is a strict saddle whose Hessian has a strictly negative eigenvalue. Specifically, we show that is a strict saddle by arguing that the Hessian has a strictly negative curvature along , i.e., for some . Here is an orthonormal matrix such that the distance between and rotated through is as small as possible.
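The orthonormal matrix that best aligns one stacked factor with another is the solution of an orthogonal Procrustes problem, which has a closed form via an SVD. The sketch below assumes generic tall matrices `W` and `W_star` (random stand-ins, not the paper's factors) and a hypothetical helper `nearest_rotation`. It also checks that the cross-Gram matrix W^T W_star R is symmetric PSD at the optimum.

```python
import numpy as np

def nearest_rotation(W, W_star):
    # Orthogonal Procrustes: minimize ||W - W_star @ R||_F over orthonormal R.
    # With W_star^T W = A diag(s) B^T, the minimizer is R = A B^T.
    A, _, Bt = np.linalg.svd(W_star.T @ W)
    return A @ Bt

rng = np.random.default_rng(2)
W = rng.standard_normal((9, 3))
W_star = rng.standard_normal((9, 3))

R = nearest_rotation(W, W_star)
best = np.linalg.norm(W - W_star @ R)

# Compare against random orthonormal candidates.
others = []
for _ in range(200):
    Qr, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    others.append(np.linalg.norm(W - W_star @ Qr))

M = W.T @ W_star @ R   # equals B diag(s) B^T, hence symmetric PSD
print(best, min(others), np.linalg.eigvalsh((M + M.T) / 2)[0])
```

The PSD property of the aligned cross-Gram matrix follows directly from the SVD construction, since W^T W_star R = B diag(s) B^T.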
IV-A Supporting Results
We first present some useful results. The -restricted strong convexity and smoothness assumption (3) implies the following isometry property, whose proof is given in Appendix B.
Proposition 2.
Suppose the function satisfies the -restricted strong convexity and smoothness condition (3) with positive and . Then for any matrices of rank at most , we have
[TABLE]
The following result provides an upper bound on the energy of the difference when projected onto the column space of . Its proof is given in Appendix C.
Lemma 2.
Suppose satisfies the -restricted strong convexity and smoothness condition (3). For any critical point of (5), let be the orthogonal projector onto the column space of . Then
[TABLE]
We remark that Lemma 2 is a variant of [19, Lemma 3.2]. While the result there requires the -RIP condition of the objective function, our result depends on the -restricted strong convexity and smoothness condition. Our result is also slightly tighter than [19, Lemma 3.2].
In addition, for any matrices , the following result relates the distance between and to the distance between and .
Lemma 3.
For any matrices with ranks and , respectively, let . Then
[TABLE]
If , then we have
[TABLE]
We present one more useful result in the following lemma.
Lemma 4 [29, Lemma 3].
For any matrices , let be the orthogonal projector onto the range of . Let . Then
[TABLE]
Finally, we provide the gradient and Hessian expressions for . The gradient of is given by
[TABLE]
Standard computations give the Hessian quadratic form for any where :
[TABLE]
where
[TABLE]
IV-B The Formal Proof
Proof of Theorem 3.
Any critical point of satisfies , i.e.,
[TABLE]
By (17), we obtain
[TABLE]
Multiplying (16) by and plugging in the expression for from the above equation gives
[TABLE]
which further implies
[TABLE]
Note that and are the principal square roots (i.e., PSD square roots) of and , respectively. Utilizing the result that a PSD matrix has a unique principal square root [47], we obtain
[TABLE]
Thus, we can simplify (16) and (17) by
[TABLE]
Now we turn to prove the strict saddle property and that there are no spurious local minima.
First, note that as guaranteed by Proposition 1, is the unique matrix with rank at most . Also the gradient of vanishes at since (1) is an unconstrained optimization problem. Denote the set of critical points of by
[TABLE]
We separate into two subsets:
[TABLE]
satisfying . Since any critical point satisfies (18), achieves its global minimum at . Also achieves its global minimum at . We conclude that is the globally optimal solution of for any . If we show that any is a strict saddle, then we prove that there are no spurious local minima as well as the strict saddle property. Thus, the remaining part is to show that is the set of strict saddles.
To show that is the set of strict saddles, it is sufficient to find a direction along which the Hessian has a strictly negative curvature for each of these points. We construct , the difference from to its nearest global factor , where
[TABLE]
Such satisfies since implying . Then we evaluate the Hessian bilinear form along the direction :
[TABLE]
The following result (which is proved in Appendix E) states that is strictly negative, while the remaining terms are relatively small, though they may be nonnegative:
[TABLE]
Now, substituting (22) into (21) gives
[TABLE]
where utilizes Lemmas 2 and 4, utilizes the following inequality (which is proved in Appendix F)
[TABLE]
and holds because and . Thus, if , is always negative. This implies that is a strict saddle.
To complete the proof, we utilize Lemma 3 to further bound the last term in (23):
[TABLE]
where is the rank of , the first inequality utilizes (24), and the second inequality follows from Lemma 3. We complete the proof of Theorem 3 by noting that for all since
[TABLE]
is an SVD of , where we recall that is an SVD of . ∎
Remark 8.
From (23), we observe that a smaller yields a more negative bound on . This can be explained intuitively as follows. First note that any critical point satisfies (18) provided , no matter how large or small is. The Hessian information about is represented by the terms and . We have
[TABLE]
where the last line holds since for any matrix ,
[TABLE]
Thus the Hessian of evaluated at any critical point is a PSD matrix (footnote 1: this can also be observed since any critical point is a global minimum point of , which directly indicates that ) instead of having a negative eigenvalue. In low-rank, PSD matrix optimization problems, the corresponding objective function (without any regularizer such as ) is proved to have the strict saddle property [17, 29]. Therefore, is also expected to have the strict saddle property, and so is when is small, i.e., the Hessian of has little influence on the Hessian of when is small. Our results also indicate that when the restricted strong convexity constant is not provided a priori, we can always choose a small to ensure that the strict saddle property of is met, and hence we are guaranteed the global convergence of a number of local search algorithms applied to (5).
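The claim that the regularizer contributes only nonnegative curvature at balanced points can be checked numerically. The sketch below assumes the regularizer in (4) has the common balancing form (μ/4)‖UᵀU − VᵀV‖_F²; this form is an assumption here, since the formula is not reproduced in this section, and the name `balance_reg`, the sizes, and the value of μ are hypothetical.

```python
import numpy as np

def balance_reg(U, V, mu):
    """Assumed regularizer: (mu / 4) * ||U^T U - V^T V||_F^2."""
    return 0.25 * mu * np.linalg.norm(U.T @ U - V.T @ V, "fro") ** 2

rng = np.random.default_rng(1)
n, m, r, mu = 9, 7, 3, 0.5  # hypothetical sizes and weight

# Build a balanced pair with U^T U = V^T V = diag(d).
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
d = rng.uniform(0.5, 2.0, size=r)
U, V = P * np.sqrt(d), Q * np.sqrt(d)

t = 1e-3
curvatures = []
for _ in range(20):
    dU = rng.standard_normal((n, r))
    dV = rng.standard_normal((m, r))
    # Central finite difference of the regularizer along (dU, dV).
    fd = (balance_reg(U + t * dU, V + t * dV, mu)
          + balance_reg(U - t * dU, V - t * dV, mu)
          - 2.0 * balance_reg(U, V, mu)) / t ** 2
    # Closed form at a balanced point: (mu / 2) * ||D1||_F^2.
    D1 = dU.T @ U + U.T @ dU - dV.T @ V - V.T @ dV
    closed = 0.5 * mu * np.linalg.norm(D1, "fro") ** 2
    curvatures.append((fd, closed))

print(min(fd for fd, _ in curvatures))
```

At a balanced point the second directional derivative reduces to (μ/2)‖ΔUᵀU + UᵀΔU − ΔVᵀV − VᵀΔV‖_F², which is a squared norm and therefore cannot be negative. This matches the PSD observation in the remark under the assumed form of the regularizer.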
V Experiments
In this section, we present a set of experiments on matrix sensing, matrix completion, and 1-bit matrix completion to demonstrate the performance of iterative algorithms for low-rank matrix optimization. Unless noted otherwise, we denote the matrix factorization approach by NVX and use the minFunc package (software available at https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html) to perform the local search algorithms for the factored problem.
V-A Matrix Sensing
We first present some experiments to illustrate the performance of local search algorithms for the matrix sensing problem with the factorization approach (11). In these experiments, we set , and vary the rank from to . We generate a rank- random matrix by setting where and are respectively and matrices of normally distributed random numbers. We then obtain random measurements with
[TABLE]
where the entries of each matrix are independent and identically distributed (i.i.d.) normal random variables with zero mean and variance for . For each pair of and the number of measurements, 10 Monte Carlo trials are carried out, and for each trial we declare matrix recovery successful if the relative reconstruction error satisfies
[TABLE]
where we denote by the reconstructed matrix. Figure 1 displays the phase transition for factorized gradient descent starting from a random initialization, the singular value projection (SVP) method proposed in [48], which requires an SVD in each iteration, and the convex approach, which solves
[TABLE]
We see that there are only negligible differences between the different approaches for matrix sensing; these approaches also have very similar performance guarantees when the Gaussian sensing operator satisfies the RIP [11]. We note that with or without the regularizer as defined in (4), local search algorithms have similar performance with random initialization. Hence, throughout all of the experiments, we simply discard the regularizer , but we stress that identical performance is observed if we keep this regularizer .
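The factored approach for matrix sensing can be sketched as plain gradient descent on the two factors. The code below is a minimal illustration under stated assumptions, not the experimental setup of this section: the sizes, number of measurements, step size, iteration count, initialization scale, and the rescaling of the target to unit spectral norm are all hypothetical choices, the regularizer is discarded, and plain gradient descent is used instead of the minFunc solvers.

```python
import numpy as np

def factored_gd(A, y, n, m, r, step=0.05, iters=6000, seed=0):
    """Gradient descent on 0.5 * ||A vec(U V^T) - y||^2 over U (n x r), V (m x r)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, r))  # small random initialization
    V = 0.1 * rng.standard_normal((m, r))
    for _ in range(iters):
        residual = A @ (U @ V.T).ravel() - y    # A(U V^T) - y
        G = (A.T @ residual).reshape(n, m)      # gradient with respect to X = U V^T
        # Chain rule: grad_U = G V and grad_V = G^T U (simultaneous update).
        U, V = U - step * (G @ V), V - step * (G.T @ U)
    return U, V

rng = np.random.default_rng(0)
n, m, r, p = 20, 20, 2, 320  # hypothetical sizes; p < n * m measurements

X_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
X_star /= np.linalg.norm(X_star, 2)  # unit spectral norm, so a fixed step is safe

# Each row of A is a vectorized Gaussian sensing matrix with variance 1 / p.
A = rng.standard_normal((p, n * m)) / np.sqrt(p)
y = A @ X_star.ravel()

U, V = factored_gd(A, y, n, m, r)
rel_err = np.linalg.norm(U @ V.T - X_star, "fro") / np.linalg.norm(X_star, "fro")
print(rel_err)
```

No SVD is computed inside the loop; each iteration costs two products with the measurement matrix plus two thin matrix products, which is the source of the efficiency of the factored formulation.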
The previous experiments suppose that is known for SVP and the matrix factorization approach. We note, however, that our result in Theorem 3 also covers the over-parameterization case where . To illustrate the possible influence of over-parameterization, we generate a rank- random matrix with and and obtain random measurements (so that the measurement operator satisfies the RIP of rank ), where . We then solve the matrix factorization problem (footnote 3: to avoid tuning the parameters (such as step-size) for different , we use the minFunc package with the default setting, which solves the factored problem by the “LBFGS” algorithm [49]) with and display the corresponding convergence results in Figure 2. As can be seen, the matrix factorization approach converges to the target matrix in both the exact-parameterization and over-parameterization cases. However, we also observe that it converges more slowly in the over-parameterization case (i.e., ) than in the exact-parameterization case (i.e., ).
V-B Matrix Completion
We compare the performance of the matrix factorization approach with SVP [48], the convex approach, and singular value thresholding (SVT; software available at http://svt.stanford.edu/) [50] for matrix completion where we want to recover a low-rank matrix from incomplete measurements , where . Let denote the projection onto the index set . The convex approach (denoted by CVX) uses the nuclear norm as a convex relaxation of the rank and solves
[TABLE]
To make the recovery of well-posed, we require to be incoherent such that the information in is not concentrated in a small number of entries [8]. A matrix with singular value decomposition is -incoherent if [48, Definition 2.1]
[TABLE]
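To make the incoherence condition concrete, the following sketch computes the smallest incoherence parameter of a matrix from its truncated SVD. It assumes the entrywise form of the definition, namely that every entry of the left singular vectors is at most √(μ/n) in magnitude and every entry of the right singular vectors is at most √(μ/m) for an n × m matrix; this form is recalled from [48] and is an assumption here, since the formula itself is not reproduced above. The function name `incoherence` and the two test matrices are hypothetical.

```python
import numpy as np

def incoherence(X, r):
    """Smallest mu such that the rank-r singular vectors of X (n x m) satisfy
    max|U_ij| <= sqrt(mu / n) and max|V_ij| <= sqrt(mu / m) (assumed definition)."""
    n, m = X.shape
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U_r, V_r = U[:, :r], Vt[:r].T
    return max(n * np.max(np.abs(U_r)) ** 2, m * np.max(np.abs(V_r)) ** 2)

# Rank-1 matrix whose energy is spread evenly over all entries.
flat = np.ones((6, 4))

# Rank-1 matrix whose energy sits in a single entry.
spiky = np.zeros((6, 4))
spiky[0, 0] = 1.0

mu_flat = incoherence(flat, 1)
mu_spiky = incoherence(spiky, 1)
print(mu_flat, mu_spiky)
```

Under this definition the all-ones matrix attains the smallest possible value, while the single-entry matrix attains a value equal to its larger dimension. The second case is exactly the kind of matrix that cannot be recovered from a small random subset of its entries.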
Though does not satisfy the -RIP (10) for all low-rank matrices , it satisfies the RIP when restricted to low-rank incoherent matrices.
Theorem 4.
[48, Theorem 4.2] Without loss of generality, assume . There exists a constant such that for chosen according to the Bernoulli model with density greater than , with probability at least , the RIP holds for all -incoherent matrices of rank at most .
Thus, if local search algorithms (such as gradient descent) start with a random initialization and the iterates remain incoherent, then Theorem 3 guarantees the global convergence of the matrix factorization approach with these algorithms. We note that this hypothesis is also required for SVP [48]. Though we can add a regularizer for incoherence as in [18], empirical evidence supports the hypothesis that the iterates in gradient descent are incoherent.
In the first set of experiments, we set and vary the rank from to . Similar to the setup for matrix sensing in Section V-A, we generate a rank- random matrix and randomly obtain entries, i.e., . Figure 3 displays the phase transition for gradient descent with a random initialization, SVP [48], singular value thresholding (SVT) [50], and the convex approach. As can be seen, the matrix factorization approach has a similar phase transition to SVP, and is slightly better than SVT and the convex approach in terms of the number of measurements needed for successful recovery.
In the second set of experiments, we set and (3 times the number of degrees of freedom within a rank- matrix), and vary from to . We compare the time needed for the four approaches in Figure 4; our matrix factorization approach is much faster than the other methods. The time savings of the matrix factorization approach come from avoiding the SVD, which both SVT and SVP require in each iteration. We also observe that the convex approach has the highest computational complexity and is not scalable (which is the reason that we only present its time for up to ).
V-C 1-bit Matrix Completion
In the last set of experiments, we compare the performance of the matrix factorization approach with the convex approach (software available at http://mdav.ece.gatech.edu/software/) in [26] for 1-bit matrix completion. We first note that to make the recovery problem well-posed, a constraint on (the entry-wise maximum of the matrix ) is applied in [26] to require that the matrix is not too “spiky”. Instead of using the constraint on , we add a smooth regularizer and instead minimize the following objective function
[TABLE]
which is also a convex function over and satisfies a similar restricted strong convexity and smoothness condition to in Lemma 1. In the case where we only observe part of the entries, then in light of Theorem 4, the corresponding objective function is expected to satisfy the strong convexity and smoothness condition for all incoherent matrices. Thus, we factorize into and solve the following optimization problem over the and matrices and :
[TABLE]
To evaluate the performance of this factorization approach on 1-bit matrix completion, we generate matrices and with entries drawn i.i.d. from a uniform distribution on and construct a random matrix with rank . Similar to the setup in [26], the matrix is then scaled so that . We obtain 1-bit observations by adding Gaussian noise of variance and recording the sign of the resulting value (15), where the subset of indices is chosen at random with . We compare the performance of the factorization approach and the convex approach [26] over a range of different values of , , or . Figures 5(a)-(d) show the normalized squared Frobenius norm of the error (where denotes the reconstructed matrix), averaged over 10 Monte Carlo trials. We observe that the matrix factorization approach has slightly better performance than the convex approach for 1-bit matrix completion [26]. Note that this phenomenon (the factorization approach having better performance) is also observed in [46]. We repeat these experiments but obtain the 1-bit observations with the logistic regression model where for (15) and display the results in Figure 6.
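The data generation described above can be sketched as follows. This is an illustration under stated assumptions: the uniform interval [−1/2, 1/2], the noise level, the matrix sizes, and the sampling fraction are hypothetical stand-ins for values not reproduced in this section, and the logistic link is assumed to be f(x) = 1/(1 + e^(−x)). The helper name `one_bit_observations` is also hypothetical.

```python
import numpy as np

def one_bit_observations(M, mask, rng, model="probit", sigma=0.5):
    """Return Y with +1 / -1 on observed entries and 0 on unobserved ones."""
    if model == "probit":
        # Sign of the entry after adding Gaussian noise with standard deviation sigma.
        signs = np.where(M + sigma * rng.standard_normal(M.shape) >= 0, 1, -1)
    elif model == "logistic":
        # +1 with probability f(M_ij), assuming f(x) = 1 / (1 + exp(-x)).
        prob_plus = 1.0 / (1.0 + np.exp(-M))
        signs = np.where(rng.uniform(size=M.shape) < prob_plus, 1, -1)
    else:
        raise ValueError("model must be 'probit' or 'logistic'")
    return np.where(mask, signs, 0)

rng = np.random.default_rng(0)
n, m, r = 40, 30, 2  # hypothetical sizes

# Low-rank target with uniform factors, rescaled so its largest entry has magnitude 1.
U = rng.uniform(-0.5, 0.5, size=(n, r))
V = rng.uniform(-0.5, 0.5, size=(m, r))
M = U @ V.T
M /= np.max(np.abs(M))

# Observe a random subset of entries (30 percent here, an arbitrary choice).
num_obs = (3 * n * m) // 10
mask = np.zeros(n * m, dtype=bool)
mask[rng.choice(n * m, size=num_obs, replace=False)] = True
mask = mask.reshape(n, m)

Y_probit = one_bit_observations(M, mask, rng, model="probit")
Y_logistic = one_bit_observations(M, mask, rng, model="logistic")
print(np.count_nonzero(Y_probit), np.count_nonzero(Y_logistic))
```

Encoding unobserved entries as 0 is only a convenience of this sketch; a likelihood-based objective would sum over the observed index set alone.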
VI Conclusion
This paper considers low-rank matrix optimization on general (nonsymmetric and rectangular) matrices with general objective functions. By focusing on general objective functions, we provide a unifying framework for low-rank matrix optimization with the factorization approach. Although the resulting optimization problem is not convex, we show that the reformulated objective function has a simple landscape: there are no spurious local minima, and any critical point that is not a local minimum is a strict saddle, i.e., the Hessian evaluated at this point has a strictly negative eigenvalue. These properties guarantee that a number of iterative optimization algorithms (such as gradient descent and the trust region method) will converge to the global optimum from a random initialization.
Appendix A Proof of Lemma 1
Proof of Lemma 1.
We compute the partial derivative of in terms of as
[TABLE]
which implies
[TABLE]
and
[TABLE]
for all . Thus, the bilinear form for the Hessian of can be computed as
[TABLE]
for any . Now since by assumption , we have
[TABLE]
∎
Appendix B Proof of Proposition 2
Proof of Proposition 2.
This proof follows similar steps to the proof of [51, Lemma 2.1]. First note that the bilinear form implies is invariant under all scalings for both and , i.e.,
[TABLE]
for any . If either or is zero, (3) holds since both sides are zero.
Now suppose both and are nonzero. By the scaling invariance property of both sides in (3), we assume without loss of generality. Note that the -restricted strong convexity and smoothness condition (3) implies
[TABLE]
Thus we have
[TABLE]
which further implies
[TABLE]
∎
Appendix C Proof of Lemma 2
Proof of Lemma 2.
First recall the notation , , and
[TABLE]
It follows from (19) and (20) that any critical point satisfies
[TABLE]
which gives
[TABLE]
for any . Here the second line utilizes the fact . We bound by first using the integral form of the mean value theorem for :
[TABLE]
Noting that all the three matrices , and have rank at most , it follows from Proposition 2 that
[TABLE]
which when plugged into (28) gives
[TABLE]
Now let , which gives . Here denotes the pseudoinverse of a matrix and is the orthogonal projector onto the range of . Utilizing the fact from (7), we further connect the left hand side of (29) with by
[TABLE]
where the inequality follows because (noting that ) and since it is the inner product between two PSD matrices.
On the other hand, we give an upper bound on the right hand side of (29):
[TABLE]
where the last line follows because (since ), implying . This together with (29) and (30) completes the proof. ∎
Appendix D Proof of Lemma 3
Proof of Lemma 3.
When , the proof follows directly from the following results.
Lemma 5.
[29, Lemma 2] For any matrices with rank and , respectively, let . Then
[TABLE]
Lemma 6.
[30, Lemma 5.4] For any matrices with , let . Then
[TABLE]
If , then we have
[TABLE]
∎
Appendix E Proof of (22)
Proof of (22).
We prove the upper bounds for the four terms as follows.
Bounding term : Utilizing the fact that and , we have
[TABLE]
where follows from (19) and (20), utilizes , and follows by using the -restricted strong convexity property (3):
[TABLE]
where the first line follows from the integral form of the mean value theorem for vector-valued functions, and the second line uses the fact that both and have rank at most , and the -restricted strong convexity of the Hessian .
Bounding term : By the smoothness condition (3), we have
[TABLE]
where the last line holds because for any with arbitrary since any critical point satisfies .
Bounding term :
[TABLE]
Bounding term :
[TABLE]
where holds because , and follows because and since it is the inner product between two PSD matrices. ∎
Appendix F Proof of (24)
Proof of (24).
Expanding the left-hand side of (24), we see that it is equivalent to show
[TABLE]
Expanding both sides of the above equation and utilizing the fact and , the remaining step is to show
[TABLE]
Thus, we obtain (24) by noting that the above equation is equivalent to
[TABLE]
∎
- [1] S. Aaronson, “The learnability of quantum states,” in Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 463, pp. 3089–3114, The Royal Society, 2007.
- [2] S. T. Flammia, D. Gross, Y.-K. Liu, and J. Eisert, “Quantum tomography via compressed sensing: Error bounds, sample complexity and efficient estimators,” New Journal of Physics, vol. 14, no. 9, p. 095022, 2012.
- [3] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in Advances in Neural Information Processing Systems, pp. 1329–1336, 2004.
- [4] D. De Coste, “Collaborative prediction using ensembles of maximum margin matrix factorizations,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 249–256, ACM, 2006.
- [5] P. Biswas and Y. Ye, “Semidefinite programming for ad hoc wireless sensor network localization,” in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pp. 46–54, ACM, 2004.
- [6] G. Tang and A. Nehorai, “Lower bounds on the mean-squared error of low-rank matrix reconstruction,” IEEE Transactions on Signal Processing, vol. 59, no. 10, pp. 4559–4571, 2011.
- [7] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
- [8] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
