Geometry of Factored Nuclear Norm Regularization
Qiuwei Li, Zhihui Zhu, Gongguo Tang

TL;DR
This paper explores the geometric structure of a nonconvex factorized formulation of nuclear norm regularization, showing that critical points are either global optima or saddle points, enabling scalable algorithms for matrix optimization.
Contribution
It proves that under certain conditions, the nonconvex factored problem's critical points are either optimal or saddle points, facilitating efficient optimization.
Findings
Critical points are either global optima or strict saddle points.
The geometric structure enables local search algorithms to find global solutions.
Conditions on the loss function ensure the favorable geometry.
Qiuwei Li, Zhihui Zhu and Gongguo Tang
Department of Electrical Engineering and Computer Science,
Colorado School of Mines, Golden, CO 80401, USA
This work was supported by NSF grants CCF-1464205 and CCF-1409261. Email: {qiuli,gtang,zzhu}@mines.edu
(March 5, 2024)
Abstract
This work investigates the geometry of a nonconvex reformulation of minimizing a general convex loss function $f(X)$ regularized by the matrix nuclear norm $\|X\|_*$. Nuclear-norm regularized matrix inverse problems are at the heart of many applications in machine learning, signal processing, and control. The statistical performance of nuclear norm regularization has been studied extensively in the literature using convex analysis techniques. Despite its optimal performance, the resulting optimization has high computational complexity when solved using standard or even tailored fast convex solvers. To develop faster and more scalable algorithms, we follow the proposal of Burer-Monteiro to factor the matrix variable $X$ into the product of two smaller rectangular matrices $X = UV^T$ and also replace the nuclear norm $\|X\|_*$ with $(\|U\|_F^2 + \|V\|_F^2)/2$. In spite of the nonconvexity of the factored formulation, we prove that when the convex loss function $f(X)$ is $(2r,4r)$-restricted well-conditioned, each critical point of the factored problem either corresponds to the optimal solution $X^\star$ of the original convex optimization or is a strict saddle point where the Hessian matrix has a strictly negative eigenvalue. Such a geometric structure of the factored formulation allows many local search algorithms to converge to the global optimum with random initializations.
1 Introduction
Nuclear-norm regularized inverse problems arise in many applications in machine learning [22], signal processing [4], and control [28]. In this work, we consider a general nuclear-norm regularized optimization:
$$\operatorname*{minimize}_{X \in \mathbb{R}^{n \times m}} \; f(X) + \lambda \|X\|_* \tag{1.1}$$
Here $f(X)$ is a general convex loss function, $\|X\|_*$ denotes the matrix nuclear norm of $X$, and $\lambda$ is a trade-off parameter. The statistical performance of (1.1) has been studied extensively in the literature using convex analysis techniques [12], for example, information-theoretically optimal sampling complexity [13], minimax denoising rate [10], and tight oracle inequalities [11]. In spite of its optimal performance, improving computational efficiency for (1.1) remains a challenge. Even fast first-order methods, such as the projected gradient descent algorithm [5, 14], require an expensive singular value decomposition in each iteration, forming the major computational bottleneck of the algorithms and thus preventing them from scaling to big-data applications [17].
1.1 Our Approach: Burer-Monteiro Parameterization
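To overcome the computational challenges, we use the Burer-Monteiro parameterization, which is recognized as an alternative to convex solvers in [6]. More precisely, when (1.1) admits a solution $X^\star$ with rank $r^\star$, the matrix variable $X$ is decomposed as the product of two smaller matrices:
$$X = UV^T, \tag{1.2}$$
where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ with $r \ge r^\star$ and $r \ll \min\{n, m\}$. Moreover, using the fact that [31, page 21]
$$\|X\|_* = \min_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r},\; X = UV^T} \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right), \tag{1.3}$$
where $r \ge \operatorname{rank}(X)$, we replace the matrix nuclear norm $\|X\|_*$ with $(\|U\|_F^2 + \|V\|_F^2)/2$ to obtain a factored formulation of the original convex optimization (1.1):
$$\operatorname*{minimize}_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} \; g(U,V) := f(UV^T) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right). \tag{1.4}$$
For simplicity, $g(U,V)$ in (1.4) is also represented by $g(W)$ with $W = [U^T \; V^T]^T$. The factored formulation (1.4), coupled with local search algorithms such as gradient descent and its variants, reduces computational complexity by avoiding expensive SVDs and decreasing the number of optimization variables from $nm$ to $(n+m)r$, a significant reduction when $r \ll \min\{n, m\}$. Such an increase in computational efficiency makes it possible to handle problems with millions of variables.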
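As a minimal sketch of how the factored formulation (1.4) can be run in practice, the snippet below applies plain gradient descent to the factored objective for the illustrative loss $f(X) = \frac{1}{2}\|X - Y\|_F^2$. This loss is an assumption made here because its convex solution is available in closed form (soft-thresholding the singular values of $Y$), which gives a reference to compare against. The function names `svt` and `factored_gd`, the step size, the iteration count, and the test matrix are hypothetical choices, not taken from the paper.

```python
import numpy as np

def svt(Y, lam):
    """Minimizer of 0.5*||X - Y||_F^2 + lam*||X||_*: soft-threshold the singular values."""
    P, s, Qt = np.linalg.svd(Y, full_matrices=False)
    return (P * np.maximum(s - lam, 0.0)) @ Qt

def factored_gd(Y, lam, r, step=0.05, iters=5000, seed=0):
    """Gradient descent on g(U,V) = 0.5*||U V^T - Y||_F^2 + lam/2*(||U||_F^2 + ||V||_F^2)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, r))  # small random initialization
    V = 0.1 * rng.standard_normal((m, r))
    for _ in range(iters):
        R = U @ V.T - Y                    # gradient of the assumed loss f at X = U V^T
        grad_U = R @ V + lam * U
        grad_V = R.T @ U + lam * V
        U, V = U - step * grad_U, V - step * grad_V
    return U, V

# Illustrative 8 x 6 data matrix with singular values 3, 2 and 0.2.
rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.standard_normal((8, 3)))
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
Y = P @ np.diag([3.0, 2.0, 0.2]) @ Q.T

lam = 0.5
U, V = factored_gd(Y, lam, r=3)            # r = 3 over-parameterizes the rank-2 solution
gap = np.linalg.norm(U @ V.T - svt(Y, lam))
print(f"||U V^T - X*||_F = {gap:.2e}")
```

No SVD is computed inside the loop; each iteration costs only a few thin matrix products, which is the source of the savings described above. The SVD in `svt` is used solely to build the reference solution for this small example.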
Although the Burer-Monteiro factored reformulation (1.4) allows faster implementations, the theoretical performance guarantees for the nuclear-norm regularization are not complete. In this work, we adopt the geometric approach [18, 20, 34, 36, 39, 40, 23, 19] to analyze the landscape of (1.4) and show that there are no spurious local minima or degenerate saddle points for (1.4) when the objective function is well-conditioned. An important implication is that local search algorithms, such as gradient descent and its variants, are able to converge to the global optima even with random initialization [18, 20]. Moreover, once we establish the equivalence between the convex and the factored formulations, it is unnecessary to rederive the statistical performance of the factored optimization (1.4), since it is inherited from that of the convex optimization (1.1).
1.2 Main Result
Before presenting our main result, we provide several necessary definitions. We call a vector $x$ a critical point of some differentiable function $h$ if the gradient $\nabla h(x) = 0$. When $h$ is twice continuously differentiable, a critical point $x$ is called a strict saddle or ridable saddle [35] if the Hessian $\nabla^2 h(x)$ has a strictly negative eigenvalue, i.e., $\lambda_{\min}(\nabla^2 h(x)) < 0$. A twice continuously differentiable function satisfies the strict saddle property if every critical point is either a local minimum or a strict saddle [20].
Heuristically, the strict saddle property describes a geometric structure of the landscape: every non-local-minimum critical point is a strict saddle, where the Hessian has a strictly negative eigenvalue. This property ensures that many local search algorithms, such as noisy gradient descent [18] and the trust region method [36], can escape from all the saddles along the directions associated with the Hessian's negative eigenvalues, and hence converge to a local minimum:
Theorem 1 (Escaping from saddles [18, 35, 24]).
(informal) The strict saddle property allows many local search algorithms, such as noisy gradient descent and the trust region method, to converge to a local minimum.
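To make the notion of a strict saddle concrete, the following sketch numerically forms the Hessian of the factored objective at the origin, which is always a critical point, and checks that its smallest eigenvalue is negative. It again assumes the illustrative loss $f(X) = \frac{1}{2}\|X - Y\|_F^2$, for which the smallest eigenvalue at the origin equals $\lambda - \sigma_1(Y)$. The helper names `grad_g` and `numerical_hessian` and all numerical values are hypothetical.

```python
import numpy as np

def grad_g(w, Y, lam, r):
    """Gradient of g(U,V) = 0.5*||U V^T - Y||_F^2 + lam/2*(||U||_F^2 + ||V||_F^2) at w = [vec(U); vec(V)]."""
    n, m = Y.shape
    U = w[: n * r].reshape(n, r)
    V = w[n * r:].reshape(m, r)
    R = U @ V.T - Y
    return np.concatenate([(R @ V + lam * U).ravel(), (R.T @ U + lam * V).ravel()])

def numerical_hessian(w, Y, lam, r, h=1e-5):
    """Central finite differences of the gradient, symmetrized."""
    d = w.size
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        H[:, i] = (grad_g(w + e, Y, lam, r) - grad_g(w - e, Y, lam, r)) / (2 * h)
    return 0.5 * (H + H.T)

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 4))
lam, r = 0.5, 2

w0 = np.zeros((5 + 4) * r)                      # the origin U = 0, V = 0
grad_norm = np.linalg.norm(grad_g(w0, Y, lam, r))
lam_min = np.linalg.eigvalsh(numerical_hessian(w0, Y, lam, r)).min()
sigma1 = np.linalg.svd(Y, compute_uv=False)[0]
print(grad_norm, lam_min, lam - sigma1)
```

Whenever $\sigma_1(Y) > \lambda$, the origin is therefore a strict saddle rather than a local minimum, and a perturbed descent method can leave it along the corresponding eigenvector.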
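Our main result builds on the assumption that the convex loss function $f(X)$ is $(2r,4r)$-restricted well-conditioned:
$$\alpha \|D\|_F^2 \le [\nabla^2 f(X)](D,D) \le \beta \|D\|_F^2 \ \text{ with } \beta/\alpha \le 1.5 \ \text{ whenever } \operatorname{rank}(X) \le 2r \text{ and } \operatorname{rank}(D) \le 4r. \tag{1.5}$$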
We note that the assumption (1.5) is standard in matrix inverse problems [2]. The main contribution of this work is establishing that under the restricted well-conditionedness of the convex loss function, the factored optimization (1.4) has no spurious local minima and satisfies the strict saddle property.
Theorem 2 (Strict saddle property).
Suppose the function $f(X)$ is twice continuously differentiable and satisfies the $(2r,4r)$-restricted well-conditioned property (1.5). Assume $X^\star$ is an optimal solution of the optimization (1.1) with $\operatorname{rank}(X^\star) = r^\star$. Set $r \ge r^\star$ in (1.4). Let $(U,V)$ be any critical point of $g(U,V)$ satisfying $\nabla g(U,V) = 0$. Then $(U,V)$ either corresponds to a factorization of $X^\star$, i.e.,
[TABLE]
or is a strict saddle of the factored problem (1.4). More precisely, denote . Then
[TABLE]
Here denotes 's smallest nonzero singular value.
Remark 1.
In addition to the strict saddle property, Theorem 2 also shows that there is no spurious local minimum. This allows a number of iterative optimization methods [18, 36, 24] to find $X^\star$ with random initialization.
Remark 2.
Theorem 2 establishes the strict saddle property for both over-parameterization ($r > r^\star$) and exact parameterization ($r = r^\star$). Thus, as long as we know an upper bound on $r^\star$, many simple iterative algorithms can find the global optimizer $X^\star$.
Remark 3.
The main result only requires $f(X)$ to be restricted well-conditioned. Hence, in addition to those with quadratic objective functions [2, 25, 3], a range of other low-rank matrix recovery problems are covered by our main theorem, including 1-bit matrix completion [16], robust principal component analysis (PCA) [26], Poisson PCA [32], and other more general low-rank matrix problems [38].
1.3 Related Work
This research is inspired by several previous works where nonconvex reformulations of various convex optimizations are proposed and analyzed [3, 2, 36, 34, 25]. Some of the proposed algorithms require initializing the first iterate in the attraction basin of the global optima [9, 37, 2], while others have guaranteed convergence with random initializations [18, 36, 34]. The latter is achieved by studying the (nonconvex) landscape of the optimizations' objective functions. Our work falls into the second category.
The most related work is non-square matrix sensing from linear observations, which minimizes the factored quadratic objective function [30]. The ambiguity in the factored parameterization
$$UV^T = (UR)(VR^{-T})^T \quad \text{for any invertible } R \in \mathbb{R}^{r \times r} \tag{1.6}$$
tends to make the factored quadratic objective function badly conditioned, especially when the matrix $R$ or its inverse is close to being singular [30, 27]. To overcome this problem, the regularizer
[TABLE]
is proposed to ensure that $U$ and $V$ have almost equal energy [37, 30, 27]. Our result shows that it is not necessary to introduce the extra regularization (1.7). Indeed, the representation (1.3) of the nuclear norm implicitly requires $U$ and $V$ to have equal energy. As a reformulation of the convex program (1.1), the nonconvex optimization (1.4) inherits all its statistical performance. Furthermore, by relating the first-order optimality condition of the factored problem to the global optimality of the original nuclear-norm regularized convex program, our work provides a more transparent theoretical analysis that shows how the convex geometry is transformed into a nonconvex one.
In [7], Cabral et al. worked on a similar problem and showed that all global optima of (1.4) correspond to the solution of the convex program (1.1). The work [21] applied the factorization approach to a broader class of problems. When specialized to matrix inverse problems, their results show that any local minimizer $U$ and $V$ with zero columns is a global minimum in the over-parameterized case, i.e., $r > r^\star$. However, none of these previous works discusses the existence of spurious local minima or degenerate saddles. We extend their work and further prove that as long as the loss function $f(X)$ is restricted well-conditioned, all local minima are global minima and there are no degenerate saddles, with no requirement on the size of the variables.
1.4 Notations
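In this section, we collect notations used throughout the paper. Denote $[n] = \{1, 2, \ldots, n\}$. We reserve the symbols $\mathbf{I}$ and $\mathbf{0}$ for the identity matrix and zero matrix/vector, respectively. $\mathbb{R}^{n \times m}$ represents the set of real $n \times m$ matrices. Matrix norms, such as the spectral, nuclear, and Frobenius norms, are denoted by $\|\cdot\|$, $\|\cdot\|_*$, and $\|\cdot\|_F$, respectively. The smallest nonzero singular value for any is denoted by .
For any row-block matrix $W = [U^T \; V^T]^T$, we denote $\widehat{W} = [U^T \; -V^T]^T$ by changing the sign of the second block of $W$. The gradient of a scalar function $f(Z)$ with $Z \in \mathbb{R}^{m \times n}$ is an $m \times n$ matrix, whose $(i,j)$th element is $[\nabla f(Z)]_{ij} = \partial f(Z) / \partial Z_{ij}$ for $i \in [m]$, $j \in [n]$. Meanwhile, the gradient can also be viewed as a linear form $[\nabla f(Z)](G) = \langle \nabla f(Z), G \rangle$ for any $G \in \mathbb{R}^{m \times n}$. We can view the Hessian of $f(Z)$ as a 4th order tensor of size $m \times n \times m \times n$, whose $(i,j,k,l)$th entry is $[\nabla^2 f(Z)]_{ijkl} = \partial^2 f(Z) / \partial Z_{ij} \partial Z_{kl}$ for $i,k \in [m]$, $j,l \in [n]$. Similarly, we can also view the Hessian as a bilinear form defined via $[\nabla^2 f(Z)](G,H) = \sum_{i,j,k,l} [\nabla^2 f(Z)]_{ijkl} G_{ij} H_{kl}$ for any $G, H \in \mathbb{R}^{m \times n}$. Yet another way to represent the Hessian is as an $mn \times mn$ matrix $[\nabla^2 f(Z)]_{ij} = \partial^2 f(Z) / \partial z_i \partial z_j$ for $i,j \in [mn]$, where $z_i$ is the $i$th element of the vectorization of $Z$. We will use these representations interchangeably whenever the specific form can be inferred from context.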
2 Problem Formulation
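In this work, we consider the nuclear norm regularization (1.1):
$$\operatorname*{minimize}_{X \in \mathbb{R}^{n \times m}} \; f(X) + \lambda \|X\|_*,$$
which is equivalent to [31, page 8]:
$$\operatorname*{minimize}_{X, \Phi, \Psi} \; f(X) + \frac{\lambda}{2}\left(\operatorname{trace}(\Phi) + \operatorname{trace}(\Psi)\right) \quad \text{subject to} \quad \begin{bmatrix} \Phi & X \\ X^T & \Psi \end{bmatrix} \succeq 0. \tag{2.1}$$
We can enforce the PSD constraint implicitly using the fact that any PSD variable can be reparameterized as $WW^T$. More precisely, let
$$\begin{bmatrix} \Phi & X \\ X^T & \Psi \end{bmatrix} = \begin{bmatrix} U \\ V \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix}^T, \tag{2.2}$$
implying
$$X = UV^T, \quad \Phi = UU^T, \quad \Psi = VV^T. \tag{2.3}$$
Plugging (2.3) into (2.1) gives the Burer-Monteiro factored reformulation (1.4):
$$\operatorname*{minimize}_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} \; g(U,V) = f(UV^T) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right),$$
where the PSD constraint is dropped by construction. As discussed in Section 1, this new factored formulation (1.4) can potentially increase computational efficiency in two ways: (i) avoiding expensive SVDs by replacing the nuclear norm with the squared term $(\|U\|_F^2 + \|V\|_F^2)/2$; (ii) a substantial reduction in the number of optimization variables from $nm$ to $(n+m)r$.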
2.1 The Necessity of Restricted Well-Conditionedness
The factored parameterization
[TABLE]
transforms the original convex optimization into a nonconvex one and introduces additional critical points (i.e., those $(U,V)$ with $\nabla g(U,V) = 0$ that are not global optima of the factored optimization (1.4)). As a result, the convex program (1.1) and its low-rank reformulation (1.4) are not equivalent in general. In particular, when the loss function is not well-conditioned, spurious local minima might emerge from these introduced critical points. For example, Srebro et al. [33] showed that for weighted low-rank approximation, if the objective function is not well-conditioned (e.g., the weight matrix has a few dominant entries), a non-global local minimum emerges for the factored problem. Similarly, as discussed in [3, Examples 1, 2], when the objective function of a matrix sensing problem does not satisfy the Restricted Isometry Property (RIP), there can be spurious local minima in the factored problem. Therefore, to make the two problems (1.1) and (1.4) equivalent, it is reasonable to introduce the restricted well-conditionedness assumption (1.5) for the general loss function in (1.1).
2.2 Consequences of Restricted Well-Conditionedness
We observe that the $(2r,4r)$-restricted well-conditionedness assumption (1.5) reduces to the RIP when the objective function is quadratic [3]. To see this, we note that the $(2r,4r)$-restricted well-conditionedness assumption (1.5) implies a restricted orthogonality property:
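Proposition 1.
Let $f(X)$ satisfy the $(2r,4r)$-restricted well-conditionedness assumption (1.5) with positive $\alpha$ and $\beta$. Then
$$\left| \frac{2}{\alpha+\beta} [\nabla^2 f(X)](G,H) - \langle G, H \rangle \right| \le \frac{\beta-\alpha}{\beta+\alpha} \|G\|_F \|H\|_F$$
for any $n \times m$ matrices $X, G, H$ of rank at most $2r$.
Proof of Proposition 1.
The proof is similar to that of [8]. If either $G$ or $H$ is zero, the claim holds since both sides are $0$. For nonzero $G$ and $H$, we can assume $\|G\|_F = \|H\|_F = 1$ without loss of generality. Then the assumption (1.5) implies
$$\alpha \|G-H\|_F^2 \le [\nabla^2 f(X)](G-H, G-H) \le \beta \|G-H\|_F^2, \qquad \alpha \|G+H\|_F^2 \le [\nabla^2 f(X)](G+H, G+H) \le \beta \|G+H\|_F^2.$$
Thus we have
$$\left| 2[\nabla^2 f(X)](G,H) - (\alpha+\beta) \langle G, H \rangle \right| \le \frac{\beta-\alpha}{2}\left(\|G\|_F^2 + \|H\|_F^2\right) = \beta - \alpha.$$
We complete the proof by dividing both sides by $\alpha+\beta$:
$$\left| \frac{2}{\alpha+\beta} [\nabla^2 f(X)](G,H) - \langle G, H \rangle \right| \le \frac{\beta-\alpha}{\beta+\alpha} = \frac{\beta-\alpha}{\beta+\alpha} \|G\|_F \|H\|_F.$$
∎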
Sharing a similar spirit with the standard RIP, Proposition 1 also implies that the Hessian operator $\nabla^2 f(X)$ preserves geometric structures for low-rank matrices. We intend to show that, when the function $f(X)$ in (1.1) satisfies the restricted well-conditioned assumption (1.5), the two programs (1.1) and (1.4) are equivalent: we can always find the global optimizer $X^\star$ by applying simple iterative optimization methods to the factored problem.
3 Understand the Transformed Landscape
It is interesting to understand how the parameterization $X = UV^T$ transforms the geometric structures of the convex objective function $f(X)$ by categorizing the critical points of the nonconvex factored function $g(U,V)$. In particular, we will illustrate how the globally optimal solution $X^\star$ of the convex program is transformed in the domain of $g(U,V)$. Furthermore, we will explore the properties of the additional critical points introduced by the parameterization and find a way of using these properties to prove the strict saddle property. For those purposes, the optimality conditions for the two programs (1.1) and (1.4) will be compared.
Before continuing this geometry-based argument, it is important to have a good understanding of the domain of the factored problem and set up a metric for this domain.
3.1 Metric in the Domain of the Factored Problem
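Since the parameterization $X = UV^T$ and the factored regularization $\|U\|_F^2 + \|V\|_F^2$ in the factored objective function $g(U,V)$ are both rotationally invariant:
$$(UR)(VR)^T = UV^T, \qquad \|UR\|_F^2 + \|VR\|_F^2 = \|U\|_F^2 + \|V\|_F^2 \quad \text{for any orthogonal } R \in \mathbb{R}^{r \times r},$$
we obtain that $g(U,V)$ is a rotationally invariant function, and the domain of $g$ is stratified into equivalence classes and can be treated as a quotient manifold [1].
Matrices lying in the same equivalence class differ from each other only by an orthogonal rotation. Hence, to measure the distance between $W_1$ and $W_2$ lying in the quotient manifold, we can define the distance on their corresponding equivalence classes:
$$\begin{aligned} \operatorname{dist}(W_1, W_2) &:= \min_{R_1, R_2 \in \mathbb{O}_r} \|W_1 R_1 - W_2 R_2\|_F \\ &= \min_{R_1, R_2 \in \mathbb{O}_r} \|W_1 - W_2 R_2 R_1^T\|_F \\ &= \min_{R \in \mathbb{O}_r} \|W_1 - W_2 R\|_F, \end{aligned}$$
where $\mathbb{O}_r$ denotes the set of $r \times r$ orthogonal matrices, the second line follows from the rotation invariance of $\|\cdot\|_F$, and the third line follows from the closure of the orthogonal group under multiplication [15, Definition 7.2].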
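The distance between equivalence classes is an orthogonal Procrustes problem, so it can be evaluated with a single small SVD. The sketch below is one hedged way to compute it: the minimizing rotation is the orthogonal polar factor of $W_2^T W_1$, a standard fact about the Procrustes problem. The function name `dist` and the test matrices are hypothetical.

```python
import numpy as np

def dist(W1, W2):
    """Return min over orthogonal R of ||W1 - W2 R||_F together with a minimizing R."""
    A, _, Bt = np.linalg.svd(W2.T @ W1)   # r x r SVD of the cross-Gram matrix
    R = A @ Bt                            # orthogonal polar factor of W2^T W1
    return np.linalg.norm(W1 - W2 @ R), R

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random 3 x 3 orthogonal matrix
W_other = rng.standard_normal((7, 3))

d_same, _ = dist(W, W @ Q)       # same equivalence class, so the distance should vanish
d_diff, R = dist(W, W_other)     # generic pair, strictly positive distance
print(d_same, d_diff)
```

Because the identity is a feasible rotation, `d_diff` can never exceed the plain Frobenius distance between the two matrices.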
3.2 Optimality Condition for the Convex Program
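As an unconstrained convex optimization, all critical points of (1.1) are global optima and are characterized by the necessary and sufficient KKT condition [5]:
$$\nabla f(X^\star) \in -\lambda \, \partial \|X^\star\|_*, \tag{3.1}$$
where $\partial \|X^\star\|_*$ denotes the subdifferential (the set of subgradients) of the nuclear norm evaluated at $X^\star$. The subdifferential of the matrix nuclear norm is defined by
$$\partial \|X\|_* = \left\{ D \in \mathbb{R}^{n \times m} : \|Z\|_* \ge \|X\|_* + \langle Z - X, D \rangle \ \text{for all } Z \in \mathbb{R}^{n \times m} \right\}.$$
We have a more explicit characterization of the subdifferential of the nuclear norm using the singular value decomposition. More specifically, suppose $X = P \Sigma Q^T$ is the (compact) singular value decomposition of $X \in \mathbb{R}^{n \times m}$ with $P \in \mathbb{R}^{n \times r}$, $Q \in \mathbb{R}^{m \times r}$ and $\Sigma$ being an $r \times r$ diagonal matrix. Then the subdifferential of the matrix nuclear norm at $X$ is given by [31, Equation (2.9)]
$$\partial \|X\|_* = \left\{ PQ^T + E : P^T E = 0, \ EQ = 0, \ \|E\| \le 1 \right\}.$$
By combining this representation of the subdifferential and the KKT condition (3.1), we present an equivalent expression for the optimality condition:
$$\nabla f(X^\star) Q^\star = -\lambda P^\star, \quad \nabla f(X^\star)^T P^\star = -\lambda Q^\star, \quad \|\nabla f(X^\star)\| \le \lambda, \tag{3.2}$$
where $P^\star$ and $Q^\star$ denote respectively the left and right singular matrices in the compact SVD $X^\star = P^\star \Sigma^\star Q^{\star T}$ with $P^\star \in \mathbb{R}^{n \times r^\star}$, $Q^\star \in \mathbb{R}^{m \times r^\star}$ and $\Sigma^\star$ being an $r^\star \times r^\star$ diagonal matrix, where we assume $\operatorname{rank}(X^\star) = r^\star$. Since we set $r \ge r^\star$ in the factored problem (1.4), in order to agree with the dimensions in (1.4), we define the optimal factors $U^\star \in \mathbb{R}^{n \times r}$, $V^\star \in \mathbb{R}^{m \times r}$ as
$$U^\star = P^\star \left[ \sqrt{\Sigma^\star} \;\; \mathbf{0}_{r^\star \times (r - r^\star)} \right] R, \qquad V^\star = Q^\star \left[ \sqrt{\Sigma^\star} \;\; \mathbf{0}_{r^\star \times (r - r^\star)} \right] R, \tag{3.3}$$
where $R \in \mathbb{O}_r$ is any $r \times r$ orthogonal matrix. Consequently, with the optimal factors defined in (3.3), we can rewrite the optimality condition (3.2) as
$$\nabla f(X^\star) V^\star = -\lambda U^\star, \quad \nabla f(X^\star)^T U^\star = -\lambda V^\star, \quad \|\nabla f(X^\star)\| \le \lambda. \tag{3.4}$$
Stacking the two variables into $W^\star = [U^{\star T} \; V^{\star T}]^T$, we obtain a more concise and equivalent form of (3.4):
$$\Xi(X^\star) W^\star = 0, \quad \|\nabla f(X^\star)\| \le \lambda, \tag{3.5}$$
with
$$\Xi(X) := \begin{bmatrix} \lambda \mathbf{I} & \nabla f(X) \\ \nabla f(X)^T & \lambda \mathbf{I} \end{bmatrix}. \tag{3.6}$$
An immediate result of (3.5) is
$$\Xi(X^\star) \succeq 0 \tag{3.7}$$
by the Schur complement theorem [5, A.5.5] in view of $\|\nabla f(X^\star)\| \le \lambda$ in (3.5).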
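The optimality condition can be checked numerically on a small instance. The sketch below assumes the illustrative loss $f(X) = \frac{1}{2}\|X - Y\|_F^2$, whose minimizer under nuclear norm regularization is obtained by soft-thresholding the singular values of $Y$. It then verifies that the gradient maps the right singular vectors of the solution to $-\lambda$ times the left ones (and vice versa), that the gradient has spectral norm at most $\lambda$, and that the symmetric block matrix with $\lambda \mathbf{I}$ on the diagonal and the gradient off the diagonal is positive semidefinite. The variable names and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 5))
lam = 1.0

# Closed-form solution for the assumed loss: soft-threshold the singular values of Y.
P, s, Qt = np.linalg.svd(Y, full_matrices=False)
keep = s > lam
P_star, Q_star = P[:, keep], Qt[keep].T            # compact SVD factors of X*
X_star = (P_star * (s[keep] - lam)) @ Q_star.T
G = X_star - Y                                     # gradient of f at X*

res_right = np.linalg.norm(G @ Q_star + lam * P_star)    # grad f(X*) Q* = -lam P*
res_left = np.linalg.norm(G.T @ P_star + lam * Q_star)   # grad f(X*)^T P* = -lam Q*
spec = np.linalg.norm(G, 2)                              # spectral norm of the gradient

n, m = Y.shape
Xi = np.block([[lam * np.eye(n), G], [G.T, lam * np.eye(m)]])
xi_min = np.linalg.eigvalsh(Xi).min()
print(res_right, res_left, spec, xi_min)
```

The smallest eigenvalue of the block matrix equals $\lambda$ minus the spectral norm of the gradient, so positive semidefiniteness and the norm bound are the same statement, which is the content of the Schur complement argument.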
3.3 Properties of the Critical Points of the Factored Program
First of all, the gradient of is given by
[TABLE]
By invoking the notation in (3.6), we obtain a more concise expression
[TABLE]
Let the set of the critical points of be denoted as
[TABLE]
It is easy to see that any critical point of the factored problem (1.4), i.e., any $W$ with $\nabla g(W) = 0$, also satisfies the left part of the optimality condition (3.5) of the convex program. If the critical point additionally satisfies $\|\nabla f(UV^T)\| \le \lambda$, then the pair $(U,V)$ corresponds to the global optimizer $X^\star = UV^T$. It remains to study the additional critical points that violate $\|\nabla f(UV^T)\| \le \lambda$, which are introduced by the parameterization $X = UV^T$.
To show that the nuclear norm reformulation guarantees that $U$ and $V$ have equal energy at every critical point, we define the notion of balanced pairs:
Definition 1 (Balanced pairs).
Let $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$. Then $(U,V)$ is a balanced pair if the Gram matrices of $U$ and $V$ are the same:
[TABLE]
All the balanced pairs form the balanced set, denoted by where indicates the dimensions of .
By stacking the variables into and invoking , we get
[TABLE]
Then the definition of the set simplifies as
[TABLE]
Proposition 2 claims that the critical points of are balanced pairs, whose proof is given in Appendix D.
Proposition 2.
Every critical point of $g$ in (1.4) belongs to the balanced set (3.11); that is, the set of critical points is a subset of the balanced set.
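The role of balancedness can be illustrated on a small example. The sketch below, again under the assumed loss $f(X) = \frac{1}{2}\|X - Y\|_F^2$, builds factors from the square roots of the singular values of the convex solution and checks that they are balanced and have zero gradient. It then rescales them to $(cU, V/c)$, which leaves the product $UV^T$ unchanged but destroys balance, and shows that the gradient of the factored objective no longer vanishes. The helper `grad_g`, the scaling `c`, and the data are hypothetical.

```python
import numpy as np

def grad_g(U, V, Y, lam):
    """Gradient blocks of g(U,V) = 0.5*||U V^T - Y||_F^2 + lam/2*(||U||_F^2 + ||V||_F^2)."""
    R = U @ V.T - Y
    return R @ V + lam * U, R.T @ U + lam * V

# Illustrative 8 x 6 data matrix with singular values 3, 2 and 0.2.
rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.standard_normal((8, 3)))
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
Y = P @ np.diag([3.0, 2.0, 0.2]) @ Q.T
lam = 0.5

# Factors built from the soft-thresholded singular values (2.5 and 1.5).
root = np.sqrt(np.array([3.0, 2.0]) - lam)
U, V = P[:, :2] * root, Q[:, :2] * root

gU, gV = grad_g(U, V, Y, lam)
balanced_grad = np.hypot(np.linalg.norm(gU), np.linalg.norm(gV))
balance_gap = np.linalg.norm(U.T @ U - V.T @ V)

c = 2.0                                   # hypothetical rescaling that keeps U V^T fixed
gU2, gV2 = grad_g(c * U, V / c, Y, lam)
rescaled_grad = np.hypot(np.linalg.norm(gU2), np.linalg.norm(gV2))
product_change = np.linalg.norm((c * U) @ (V / c).T - U @ V.T)
print(balanced_grad, balance_gap, rescaled_grad, product_change)
```

In this example the rescaled pair fails to be a critical point purely because of the regularizer, which is how the factored nuclear norm removes the scaling ambiguity without an extra balancing penalty.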
Next, we derive two properties of those points lying in that involve the relationship of the energy of the on-diagonal blocks and the off-diagonal blocks of certain block matrix, say with . More precisely, we define
[TABLE]
When and are acting on the product of two block matrices for and with and , we observe that
[TABLE]
Now we are ready to provide these two properties, summarized in Lemmas 1 and 2, whose proofs are in Appendix A and Appendix B, respectively.
Lemma 1.
Let . Then for every with consistent sizes, we have
[TABLE]
Lemma 2.
Let with and . Then
[TABLE]
3.4 Characterizing the Critical Points by the Hessian
First, Proposition 3 (proved in Appendix E) shows that the optimal factors given in (3.3) also form a global optimum of the factored program (1.4):
Proposition 3.
For any given in (3.3), we have
[TABLE]
implying is also a global optimum of the factored program (1.4).
However, due to the nonconvexity of the factored problem, characterizing only the global optimizers is not sufficient. One should also eliminate the possibility of spurious local minima or degenerate saddles. For this purpose, we analyze the Hessian quadratic form of :
[TABLE]
We intend to show the following: for any critical point , if , we can find a direction , along which the Hessian has a strictly negative curvature for some . We choose as the direction from the critical point to its closest globally optimal factor of the same size as :
[TABLE]
with Such a choice is inspired by the previous work [25], particularly [25, Example 1].
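The choice of direction can be tried out on a small instance. The sketch below assumes the illustrative loss $f(X) = \frac{1}{2}\|X - Y\|_F^2$, for which the Hessian quadratic form of the factored objective along a direction $(D_U, D_V)$ is $\|U D_V^T + D_U V^T\|_F^2 + 2\langle UV^T - Y, D_U D_V^T \rangle + \lambda(\|D_U\|_F^2 + \|D_V\|_F^2)$. It takes a critical point that captures only the second singular pair of the solution, forms the difference to the closest optimal factor (aligned by an orthogonal Procrustes rotation), and evaluates the curvature along that difference. All names and numerical values are hypothetical.

```python
import numpy as np

def curvature(U, V, DU, DV, Y, lam):
    """Hessian quadratic form of g along (DU, DV) for the assumed loss 0.5*||X - Y||_F^2."""
    R = U @ V.T - Y
    lin = U @ DV.T + DU @ V.T
    return (np.sum(lin ** 2) + 2.0 * np.sum(R * (DU @ DV.T))
            + lam * (np.sum(DU ** 2) + np.sum(DV ** 2)))

rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.standard_normal((8, 3)))
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
Y = P @ np.diag([3.0, 2.0, 0.2]) @ Q.T
lam = 0.5

# Optimal factors: singular values of the convex solution are 2.5 and 1.5.
U_opt = P[:, :2] * np.sqrt([2.5, 1.5])
V_opt = Q[:, :2] * np.sqrt([2.5, 1.5])

# A non-optimal critical point: keep only the second singular pair, pad with a zero column.
U = np.zeros((8, 2))
V = np.zeros((6, 2))
U[:, 0] = np.sqrt(1.5) * P[:, 1]
V[:, 0] = np.sqrt(1.5) * Q[:, 1]
R0 = U @ V.T - Y
grad_norm = np.hypot(np.linalg.norm(R0 @ V + lam * U), np.linalg.norm(R0.T @ U + lam * V))

# Align the optimal factor to the critical point with an orthogonal Procrustes rotation.
W, W_opt = np.vstack([U, V]), np.vstack([U_opt, V_opt])
A, _, Bt = np.linalg.svd(W_opt.T @ W)
D = W - W_opt @ (A @ Bt)
DU, DV = D[:8], D[8:]

curv = curvature(U, V, DU, DV, Y, lam)
print(grad_norm, curv, curv / np.sum(D ** 2))
```

For this instance the direction isolates the missing leading singular pair, and the curvature per unit squared norm equals minus the largest singular value of the convex solution.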
4 Proof of Theorem 2
We choose as the direction from to its closest globally optimal factor: with , and wish to show that, as long as , the Hessian has a strictly negative curvature along .
4.1 Supporting Lemmas
We start with several lemmas that will be used in the proof. The first two lemmas bound the distance by their square-root distance .
Lemma 3 ([25, Lemma 3]).
Let be of the same size. Then
[TABLE]
Lemma 4 ([37, Lemma 5.4]).
Let be of the same size and . Then
[TABLE]
Lemma 5 divides into two terms: and , where is the projector onto . The key is letting the first part have a small coefficient. Then Lemma 6 further controls the second part, with the proof listed in Appendix C.
Lemma 5 ([25, Lemma 4]).
Let and be of the same size and be PSD. Assume is an orthogonal matrix whose columns span . Then
[TABLE]
Lemma 6.
Suppose satisfies (1.5). Let be any critical point of (1.4), correspond to the optimal solution of (1.1) and be projection to . Then
[TABLE]
4.2 A Formal Proof
Let Denote and invoke to simplify notations. Then
[TABLE]
where (a) follows from and (3.8) and (b) holds since by in (3.5) and in (3.7). (c) follows from the integral form of the mean value theorem for vector-valued functions (see [29, Eq. (A.57)]). (d) follows from the restricted well-conditionedness assumption (1.5) since , and
Then
[TABLE]
where (a) follows from Lemma 2 and the fact . (b) follows from Lemma 1. For (c) to hold, we first use Lemma 5 to bound since by [25, Lemma 1]. Then use Lemma 6 to further bound . (d) holds when . For the last inequality, we apply Lemma 3 when and Lemma 4 when and observe when . Finally, the proof follows from the fact that by (3.3), we have
[TABLE]
5 Conclusion
In this work, we considered the minimization of a general convex loss function regularized by the matrix nuclear norm . To improve computational efficiency, we applied the Burer-Monteiro factored formulation and showed that, as long as the convex function is (restricted) well-conditioned, the factored problem has the following benign landscape: each critical point either produces a global optimum of the original convex program, or is a strict saddle where the Hessian matrix has a strictly negative eigenvalue. Such geometric structure then allows many iterative optimization methods to escape from the saddles and thus converge to a global minimizer with random initializations.
Appendix A Proof of Lemma 1
First, by (3.13), we can get
[TABLE]
since by (3.11).
Appendix B Proof of Lemma 2
Replacing and by (3.13) and expanding the inner products, we have
[TABLE]
where the third equality follows from (3.11) and the last inequality holds by recognizing these PSD matrices: , , and .
Appendix C Proof of Lemma 6
To simplify notations, we denote and invoke . Then
[TABLE]
where (a) follows from the integral form of the mean value theorem for vector-valued functions (see [29, Eq. (A.57)]). Then, in view of Proposition 1 and (3.13), we arrive at
[TABLE]
When , we further can show
[TABLE]
Finally, we complete the proof by combining (C.1) with (C.2),(C.3),(C.4) to get
[TABLE]
Show (C.2). First note that when . Then
[TABLE]
where the second equality holds since and by (3.5). The inequality is due to (3.7).
Show (C.3). First recognize that
[TABLE]
Then (C.3) follows from
[TABLE]
where the first equality follows from (3.11) and the inequality holds by recognizing those PSD matrices.
Show (C.4). By plugging , we get which is obviously no larger than
Appendix D Proof of Proposition 2
Note for . Hence
[TABLE]
implying by (3.8), which finishes the proof since by (3.10).
Appendix E Proof of Proposition 3
First of all, by (3.3), we have
[TABLE]
Thus,
[TABLE]
where the first inequality follows from the optimality of for (1.1). The second equality holds by choosing Finally, the last inequality holds since by (1.3).
References
- [1] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
- [2] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. In 29th Annual Conference on Learning Theory, pages 530–582, 2016.
- [3] Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221, 2016.
- [4] Thierry Bouwmans, Necdet Serhat Aybat, and El-hadi Zahzah. Handbook of Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing. CRC Press, 2016.
- [5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- [6] Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
- [7] Ricardo Cabral, Fernando De la Torre, João P. Costeira, and Alexandre Bernardino. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2488–2495, 2013.
- [8] Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
