A Non-Convex Relaxation for Fixed-Rank Approximation
Carl Olsson, Marcus Carlsson, Erik Bylow

TL;DR
This paper introduces a non-convex relaxation method for fixed-rank matrix approximation that avoids the bias of nuclear norm approaches and often converges to better solutions, especially under RIP conditions.
Contribution
It proposes a novel non-convex relaxation technique for low-rank matrix approximation that reduces bias and demonstrates favorable convergence properties compared to nuclear norm methods.
Findings
The non-convex relaxation often has a single local minimizer under RIP.
Numerical tests show better solutions than nuclear norm methods.
The approach performs well even when RIP does not hold.
Abstract
This paper considers the problem of finding a low rank matrix from observations of linear combinations of its elements. It is well known that if the problem fulfills a restricted isometry property (RIP), convex relaxations using the nuclear norm typically work well and come with theoretical performance guarantees. On the other hand these formulations suffer from a shrinking bias that can severely degrade the solution in the presence of noise. In this theoretical paper we study an alternative non-convex relaxation that in contrast to the nuclear norm does not penalize the leading singular values and thereby avoids this bias. We show that despite its non-convexity the proposed formulation will in many cases have a single local minimizer if a RIP holds. Our numerical tests show that our approach typically converges to a better solution than nuclear norm based alternatives even in cases…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSparse and Compressive Sensing Techniques · Advanced Image Processing Techniques · Statistical and numerical algorithms
A Non-Convex Relaxation for Fixed-Rank Approximation
Carl Olsson1,2 Marcus Carlsson1 Erik Bylow1
1Centre for Mathematical Sciences
Lund University
2Department of Electrical Engineering
Chalmers University of Technology
Abstract
This paper considers the problem of finding a low rank matrix from observations of linear combinations of its elements. It is well known that if the problem fulfills a restricted isometry property (RIP), convex relaxations using the nuclear norm typically work well and come with theoretical performance guarantees. On the other hand these formulations suffer from a shrinking bias that can severely degrade the solution in the presence of noise.
In this theoretical paper we study an alternative non-convex relaxation that in contrast to the nuclear norm does not penalize the leading singular values and thereby avoids this bias. We show that despite its non-convexity the proposed formulation will in many cases have a single local minimizer if a RIP holds. Our numerical tests show that our approach typically converges to a better solution than nuclear norm based alternatives even in cases when the RIP does not hold.
1 Introduction
Low rank approximation is an important tool in applications such as rigid and non rigid structure from motion, photometric stereo and optical flow [25, 4, 26, 13, 2, 12]. The rank of the approximating matrix typically describes the complexity of the solution. For example, in non-rigid structure from motion the rank measures the number of basis elements needed to describe the point motions [4]. Under the assumption of Gaussian noise the objective is typically to solve
[TABLE]
where is a measurement matrix and is the Frobenius norm. The problem can be solved optimally using SVD [9], but the strategy is limited to problems where all matrix elements are directly measured. In this paper we will consider low rank approximation problems where linear combinations of the elements are observed. We aim to solve problems of the form
[TABLE]
Here is [math] if and otherwise. The linear operator is assumed to fulfill a restricted isometry property (RIP) [24]
[TABLE]
for all matrices with . The standard approach for problems of this class is to replace the rank function with the convex nuclear norm [24, 5]. It was first observed that this is the convex envelope of the rank function over the set in [11]. Since then a number of generalizations that give performance guarantees for the nuclear norm relaxation have appeared, e.g. [24, 23, 5, 6]. The approach does however suffer from a shrinking bias that can severely degrade the solution in the presence of noise. In contrast to the rank constraint the nuclear norm penalizes both small singular values of , assumed to stem from measurement noise, and large singular values, assumed to make up the true signal, equally. In some sense the suppression of noise also requires an equal suppression of signal. Non-convex alternatives have been shown to improve performance [22, 20].
In this paper we will consider the relaxation
[TABLE]
where
[TABLE]
and , are the singular values of and respectively. The minimization over does not have any closed form solution, however it was shown in [18, 1] how to efficiently evaluate and compute its proximal operator. Figure 1 shows a three dimensional illustration of the level sets of the regularizer.
[TABLE]
is the the convex envelope of
[TABLE]
By itself the regularization term is not convex, but when adding a quadratic term the result is convex. It is shown in [1] that (7) and (6) have the same optimizer as long as the singular values of are distinct. (When this is not the case the minimizer is not unique.) Assuming that (3) holds will behave roughly like for matrices of rank less than , and therefore it seems reasonable that (4) should have some convexity properties. In this paper we study the stationary points of (4). We show that if a RIP property holds it is in many cases possible to guarantee that any stationary point of (4) (with rank ) is unique.
A number of recent works propose to use the related -norm [19, 10, 16, 15] (sometimes referred to as the spectral k-support norm). This is a generalization of the nuclear norm which is obtained when selecting . It can be shown [19] that the extreme points of the unit ball with this norm are rank matrices. Therefore this choice may be more appropriate than the nuclear norm when searching for solutions of a particular (known) rank. It can be seen (e.g. from the derivations in [16]) that is the convex envelope of (7) when , which gives . While the approach is convex the extra norm penalty adds a (usually) unwanted shrinking bias similar to what the nuclear norm does. In contrast, our approach avoids this bias since it uses a non-convex regularizer. Despite this non-convexity we are still able to derive strong optimality guarantees for an important class of problem instances.
1.1 Main Results and Contributions
Our main result, Theorem 2.4, shows that if is a stationary point of (4) and the singular values of the matrix fulfill then there can not be any other stationary point with rank less than or equal to . The matrix is related to the gradient of the objective function at (see Section 2). The term can be seen as a local approximation of close to (see [21]). If for example there is a rank matrix such that then it is easy to show that is a stationary point and the corrresponding is identical to . Since this means that our results certify that this is the only stationary point to the problem if fulfills (3) with . The following lemma clarifies the connection between the stationary point and .
Lemma 1.1**.**
The point is stationary in = if and only if , where , and if and only if
[TABLE]
(The proof is identical to that of Lemma 3.1 in [21].) In [1] it is shown that as long as the unique solution of (8) is the best rank approximation of . When there are several singular values that are equal to , (8) will have multiple solutions and some of them will not be of rank .
In [7] the relationship between minimizers of (4) and (2) is studied. We note that [7] shows that if then any local minimizer of (4) is also a local minimizer of (2), and that their global minimizers coincide. In this situation local minimizers of will therefore be rank approximations of . Hence, loosely speaking our results state that in this situation any local minimum of is likely to be unique.
Our work builds on that of [21] which derives similar results for the non-convex regularizer , where are the singular values of . In this case a trade-off between rank and residual error is optimized using the formulation
[TABLE]
While it can be argued that (9) and (4) are essentially equivalent since we can iteratively search for a that gives the desired rank, the results of [21] may not rule out the existence of multiple high rank stationary points. In contrast, when using (4) our results imply that if then is the unique stationary point of the problem. (To see this, note that if there are other stationary points we can by the preceding discussion assume that at least one is of rank or less which contradicts our main result in Theorem 2.4). Hence, in this sense our results are stronger than those of [21] and allow for directly searching for a matrix of the desired rank with an essentially parameter free formulation.
1.2 Notation
In this section we introduce some preliminary material and notation. In general we will use boldface to denote a vector and its th element . Unless otherwise stated the singular values of a matrix will be denoted and the vector of singular values . By we denote the standard euclidean norm . A diagonal matrix with diagonal elements will be denoted . For matrices we define the scalar product as , where tr is the trace function, and the Frobenius norm . The adjoint of the linear matrix operator is denoted . By we mean the set of subgradients of the function at and by a stationary point we mean a solution to .
2 Optimality Conditions
Let . We can equivalently write
[TABLE]
where and . The function is convex and sub-differentiable. Any stationary point of therefore has to fulfill
[TABLE]
Computation of the gradient gives the optimality conditions where .
2.1 Subgradients of
For our analysis we need to determine the subdifferential of the function . Let be the vector of singular values of and be the SVD. Using Von Neumann’s trace theorem it is easy to see [18] that the that maximizes (5) has to be of the form , where are singular values. If we let
[TABLE]
then we have . The function is linear in and concave in . Furthermore for any given the corresponding maximizers can be restricted to a compact set (because of the dominating quadratic term). By Danskin’s Theorem, see [3], the subgradients of are then given by
[TABLE]
where . We note that by concavity the maximizing set is convex. Since we get
[TABLE]
To find the set of subgradients we thus need to determine all maximizers of . Since the maximizing has the same and as what remains is to determine the singular values of . It can be shown [18] that these have the form
[TABLE]
for some number . (The case , is actually not addressed in [18]. However, it is easy to see that any value in works since vanishes from (12) when , . In fact, any value works, but we use the convention that singular values are positive. Note that the columns of that correspond to zero singular values of are not uniquely defined. We can always achieve a decreasing sequence with by changing signs and switching order.)
For a general matrix the value of can not be determined analytically but has to be computed numerically by maximizing a one dimensional concave and differentiable function [18]. If it is however clear that the optimal choice is . To see this we note that since the optimal is of the form we have
[TABLE]
if . Selecting and inserting (15) into (16) gives , which is clearly the maximum. Hence if we conclude that the subgradients of are given by where
[TABLE]
2.2 Growth estimates for the
Next we derive a bound on the growth of the subgradients that will be useful when considering the uniqueness of low rank stationary points.
Let and be two vectors both with at most non-zero (positive) elements, and and be the indexes of the largest elements of and respectively. We will assume that both and contain elements. If in particular has fewer than non-zero elements we also include some zero elements in . We define the corresponding sequences and by
[TABLE]
where and . If has fewer than non-zero elements then . Note that we do not require that the elements of the and vectors are ordered in decreasing order. We will see later (Lemma 2.2) that in order to estimate the effects of the and matrices we need to be able to handle permutations of the singular values. For our analysis we will also use the quantity .
Lemma 2.1**.**
If , where then
[TABLE]
Proof.
Since when and otherwise, we can write the inner product as
[TABLE]
Note that
[TABLE]
Since the second and third sum in (20) have the same number of terms it suffices to show that
[TABLE]
when , and , . By the assumption we know that . We further know that . This gives
[TABLE]
Inserting these inequalities into the left hand side of (22) gives the desired bound. ∎
The above result gives an estimate of the growth of the subdifferential in terms of the singular values. To derive a similar estimate for the matrix elements we need the following lemma:
Lemma 2.2**.**
Let ,,, be fixed vectors with non-increasing and non-negative elements such that and and fulfill (15) (with and respectively). Define , , , and as functions of unknown orthogonal matrices , , and . If
[TABLE]
then
[TABLE]
where belongs to the set of permutation matrices.
The proof is almost identical to that of Lemma 4.1 in [21] and therefore we omit it. While our subdifferential is different to the one studied in [21], for both of them we have that the and vectors fulfill which is the only property that is used in the proof.
Corollary 2.3**.**
Assume that is of rank and . If the singular values of the matrix fulfill , where , then for any with we have
[TABLE]
as long as .
Proof.
We let and be the singular values of the matrices and respectively. Our proof essentially follows that of Corollary 4.2 in [21], where a similar result is first proven under the assumption that and then generalized to the general case using a continuity argument. For this purpose we need to extend the infeasible interval somewhat. Since and are open there is an such that and . Now assume that in (25), then clearly
[TABLE]
since . Otherwise and we have
[TABLE]
According to Lemma 2.1 the right hand side is strictly larger than , which proves that (28) holds for all with .
It remains to show that
[TABLE]
for the case and . Since is arbitrary this proves the Corollary. This can be done as in [21] using continuity of the scalar product and the Frobenius norm. Specifically, a sequence , when , is defined by modifying the largest singular value and letting . It is easy to verify that fulfills (28) for every . Letting then proves (30). ∎
2.3 Uniqueness of Low Rank Stationary Points
In this section we show that if a RIP (3) holds and the singular values and are well separated there can only be one stationary point of that has rank . We first derive a bound on the gradients of . We have
[TABLE]
This gives
[TABLE]
By (3) if which gives
[TABLE]
This leads us to our main result
Theorem 2.4**.**
Assume that is a stationary point of , that is, , where , and the singular values of fulfill . If is another stationary point then .
Proof.
Assume that . Since both and are stationary we have
[TABLE]
where and . Taking the difference between the two equations yields
[TABLE]
which implies
[TABLE]
where has . By (33) the left hand side is less than . However, according to Corollary 2.3 (with ) the right hand side is larger than which contradicts . ∎
Remark.
Note that [7] shows that if then any local minimizer of (4) is also a local minimizer of (2) and therefore of rank . Hence any local minimizer obeying the conditions of the theorem will be unique.
3 Implementation and Experiments
In this section we test the proposed approach on some simple real and synthetic applications (some that fulfill (3) and some that do not). For our implementation we use the GIST approach from [14] because of its simplicity. Given a current iterate this method solves
[TABLE]
where . Note that if then any fixed point of (38) is a stationary point by Lemma 1.1. To solve (38) we use the proximal operator computed in [18].
Our algorithm consists of repeatedly solving (38) for a sequence of . We start from a larger value ( in our implementation) and reduce towards as long as this results in decreasing objective values. Specifically we set if the previous step was successful in reducing the objective value. Otherwise we increase according to .
3.1 Synthetic Data
We first evaluate the quality of the relaxation on a number of synthetic experiments. We compare the two formulations (4) and
[TABLE]
In Figure 2 (a) we tested these two relaxations on a number of synthetic problems with varying noise levels. The data was created so that the operator fulfills (3) with . By column stacking an matrix the linear mapping can be represented with a matrix of size . It is easy to see that if we let the term of (3) will be the same as the smallest singular value of squared. (In this case the RIP constraint will hold for any rank and we therefore suppress the subscript and only use .) For the data in Figure 3 (a) we selected matrices with random (gaussian mean [math] and variation ) entries and modified their singular values. We then generated a matrices of rank by sampling matrices and with entries and computed . The measurement vector was created by computing , where is for varying noise level between [math] and . In Figure 2 (a) we plotted the measurement fit versus the noise level for the solutions obtained with (4) and (39). Note that since the formulation (39) does not directly specify the rank of the sought matrix we iteratively searched for the smallest value of that gives the correct rank, using a bisection scheme. The reason for choosing the smallest is that this reduces the shrinking bias to a minimum while it still gives the correct rank.
In Figure 2 (b) we computed the matrix and plotted the fraction of problem instances where its singular values fulfilled , with . For these instances the obtained stationary points are also globally optimal according to our main results.
In Figure 2 (c) we did the same experiment as in (a) but with an under determined of size . It is known [24] that if is and the elements of are drawn from then fulfills (3) with high probability. The exact value of is however difficult to determine and therefore we are not able to verify optimality in this case.
3.2 Non-Rigid Structure from Motion
In this section we consider the problem of Non-Rigid Structure from Motion. We follow the aproach of Dai. et al. [8] and let
[TABLE]
where ,, are matrices containing the -,- and -coordinates of the tracked points in images . Under the assumption of an orthographic camera the projection of the points can be modeled using , where is a block diagonal matrix with blocks , consisting of two orthogonal rows that encode the camera orientation in image . The resulting measurement matrix consists of the - and -image coordinates or the tracked points. Under the assumption of a linear shape basis model [4] with deformation modes, the matrix can be factorized into , where the matrix contain the basis elements. It is clear that such a factorization is possible when is of rank . We therefore search for the matrix of rank that minimizes the residual error .
The linear operator defined by does by itself not obey (3) since there are typically low rank matrices in its nullspace. This can be seen by noting that if is the vector perpendicular to the two rows of , that is then
[TABLE]
where is any matrix, is in the null space of . Therefore any matrix of the form
[TABLE]
where are the elements of , vanishes under . Setting everything but the first row of to zero shows that there is a matrix of rank in the null space of . Moreover, if the rows of the optimal spans such a matrix it will not be unique since we may add without affecting the projections or the rank.
In Figure 7 we compare the two relaxations
[TABLE]
and
[TABLE]
on the four MOCAP sequences displayed in Figure 7, obtained from [8]. These consist of real motion capture data and therefore the ground truth solution is only approximatively of low rank.
In Figure 7 we plot the rank of the obtained solution versus the datafit . Since (44) does not allow us to directly specify the rank of the sought matrix, we solved the problem for values of between and (orange curve) and computed the resulting rank and datafit. Note that even if a change of is not large enough to change the rank of the solution it does affect the non-zero singular values. To achieve the best result for a specific rank with (44) we should select the smallest that gives the correct rank. Even though (3) does not hold, the relaxation (43) consistently gives better data fit with lower rank than (44). Figure 7 also shows the rank versus the distance to the ground truth solution. For high rank the distance is typically larger for (43) than (44). A feasible explanation is that when the rank is high it is more likely that the row space of contains a matrix of the type . Loosely speaking, when we allow too complex deformations it becomes more difficult to uniquely recover the shape. The nuclear norm’s built in bias to small solutions helps to regularize the problem when the rank constraint is not discriminative enough.
One way to handle the null space of is to add additional regularizes that penalize low rank matrices of the type . Dai et al. [8] suggested to use the derivative prior , where the matrix is a first order difference operator. The nullspace of consists of matrices that are constant in each column. Since this implies that the scene is rigid it is clear that is not in the nullspace of . We add this term and compare
[TABLE]
and
[TABLE]
Figures 7 and 7 show the results. In this case both the data fit and the distance to the ground truth is consistently better with (45) than (46). When the rank increases most of the regularization comes from the derivative prior leading to both methods providing similar results.
4 Conclusions
In this paper we studied the local minima of a non-convex rank regularization approach. Our main theoretical result shows that if a RIP property holds then there is often a unique local minimum. Since the proposed relaxation (4) and the original objective (2) is shown to have the same global minimizers if in [7] this result is also relevant for the original discontinuous problem. Our experimental evaluation shows that the proposed approach often gives better solutions than standard convex alternatives, even when the RIP constraint does not hold.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] F. Andersson, M. Carlsson, and C. Olsson. Convex envelopes for fixed rank approximation. Optimization Letters , pages 1–13, 2017.
- 2[2] R. Basri, D. Jacobs, and I. Kemelmacher. Photometric stereo with general, unknown lighting. Int. J. Comput. Vision , 72(3):239–257, May 2007.
- 3[3] D. Bertsekas. Nonlinear Programming . Athena Scientific, 1999.
- 4[4] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In IEEE Conference on Computer Vision and Pattern Recognition , 2000.
- 5[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM , 58(3):11:1–11:37, June 2011.
- 6[6] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics , 9(6):717–772, 2009.
- 7[7] M. Carlsson. On convexification/optimization of functionals including an l 2-misfit term. ar Xiv preprint ar Xiv:1609.09378 , 2016.
- 8[8] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision , 107(2):101–122, 2014.
