Quartic First-Order Methods for Low-Rank Minimization
Radu-Alexandru Dragomir, Alexandre d'Aspremont, J\'er\^ome Bolte

TL;DR
This paper introduces quartic first-order methods for low-rank matrix problems, leveraging non-Euclidean geometries to improve scalability and performance in matrix completion and factorization tasks.
Contribution
It develops a novel class of algorithms based on quartic kernels and Gram kernels, enhancing efficiency and scalability for nonconvex low-rank minimization.
Findings
Algorithms scale well to large problems
Achieve state-of-the-art performance
Improve numerical stability and efficiency
Abstract
We study a generalized nonconvex Burer-Monteiro formulation for low-rank minimization problems. We use recent results on non-Euclidean first order methods to provide efficient and scalable algorithms. Our approach uses geometries induced by quartic kernels on matrix spaces; for unconstrained cases we introduce a novel family of Gram kernels that considerably improves numerical performances. Numerical experiments for Euclidean distance matrix completion and symmetric nonnegative matrix factorization show that our algorithms scale well and reach state of the art performance when compared to specialized methods.
| Dataset | r | NoLips | PG | Beta | CD | SymHALS | SymANLS | ADMM |
|---|---|---|---|---|---|---|---|---|
| Coil-20 | 10 | 24.7 | 51.4 | - | 26.2 | 7.0 | 32.3 | - |
| 20 | 23.7 | 36.8 | - | 21.3 | 4.0 | 18.2 | - | |
| 30 | 20.7 | 40.8 | - | 35.4 | 6.5 | 20.2 | - | |
| 40 | 21.7 | 49.5 | - | 57.6 | 7.5 | 28.4 | - | |
| CBCL | 10 | 38.2 | 42.7 | 44.0 | 35.6 | 13.6 | 35.2 | 42.8 |
| 20 | 57.7 | 88.4 | - | 93.9 | 17.8 | 47.8 | - | |
| 30 | 60.9 | 134.3 | - | 135.0 | 15.1 | 43.4 | - | |
| 40 | 50.8 | 126.4 | - | 90.0 | 23.7 | 52.5 | - | |
| TDT2 | 10 | 35.2 | 54.2 | - | 97.5 | 11.0 | - | - |
| 20 | 52.4 | 76.1 | - | 109.9 | 20.1 | - | - | |
| 30 | 29.4 | 45.1 | - | - | 12.1 | - | - | |
| 40 | 28.0 | 49.8 | - | - | 17.7 | - | - | |
| Reuters | 10 | 6.5 | 10.0 | - | 33.0 | 3.0 | 54.2 | - |
| 20 | 28.7 | 32.8 | - | 71.7 | 9.5 | 74.7 | - | |
| 30 | 24.3 | 45.5 | - | 69.4 | 6.5 | 91.0 | - | |
| 40 | 40.2 | 68.5 | - | 83.2 | 10.6 | 108.3 | - |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
∎
11institutetext: Radu-Alexandru Dragomir, corresponding author 22institutetext: Université Toulouse 1 Capitole & D.I. École Normale Supérieure, Paris, France
[email protected] 33institutetext: Alexandre d’Aspremont 44institutetext: CNRS & D.I. École Normale Supérieure, Paris, France
[email protected]. 55institutetext: Jérôme Bolte 66institutetext: Université Toulouse 1 Capitole, Toulouse, France
Quartic First-Order Methods for Low-Rank Minimization
Radu-Alexandru Dragomir
Alexandre d’Aspremont
Jérôme Bolte
(Last revised on January 18, 2021)
Abstract
We study a general nonconvex formulation for low-rank minimization problems. We use recent results on non-Euclidean first-order methods to provide efficient and scalable algorithms. Our approach uses the geometry induced by the Bregman divergence of well-chosen kernel functions; for unconstrained problems we introduce a novel family of Gram quartic kernels that improve numerical performance.
Numerical experiments on Euclidean distance matrix completion and symmetric nonnegative matrix factorization show that our algorithms scale well and reach state of the art performance when compared to specialized methods.
Keywords:
Bregman first-order methods Low-rank minimization Burer-Monteiro Matrix factorization Euclidean Distance matrix completion
MSC:
90C06 90C26
††journal: JOTA
1 Introduction
We consider the problem of minimizing a smooth convex function over the set of low-rank positive semidefinite matrices. Fundamental applications of this problem arise in various areas of data analysis including matrix completion Cand2008 ; Completion2010 ; Jain2012 , matrix sensing Recht2007 , Euclidean matrix completion Mishra2011 ; Fang2012 , phase retrieval Candes2015 , robust principal component analysis Chen2015 , to name a few.
A popular approach to low-rank semidefinite minimization, known as the Burer-Monteiro formulation Burer2005 , consists in explicitly modeling the rank constraint by writing the matrix in a factorized form. This method is especially appealing for large-scale instances, since it requires storing much less variables than the standard semidefinite programming approaches; see Tu2015 ; Bhojanapalli2015 ; Zhao2015 ; Sun2015 ; Chen2015 ; Zheng2016 ; Park2016 and references therein.
This formulation comes however with an important drawback, as the problem becomes nonconvex, even if the original objective is convex. Therefore, local optimization methods can generally only hope to find a stationary point, or at best a local minimum. Nevertheless, recent work shows convergence towards a global optimum for a close enough initialization Tu2015 ; Bhojanapalli2015 ; Park2016 , or under additional statistical assumptions about the problem Chen2015 ; Zheng2015 ; Ge2016 . Although these global optimality results often impose restrictive assumptions that may not be satisfied in practice, they help to explain why using local algorithms to solve Burer-Monteiro problem formulations often leads to satisfactory solutions in practice.
The most commonly used algorithm to solve these problem formulations is some variant of the proximal gradient method. However, a critical issue with gradient schemes is the choice of step sizes, which significantly impacts performance. This step size choice is closely related to the smoothness of the objective. In particular, when it has a -Lipschitz continuous gradient with respect to the Euclidean norm, standard gradient methods can be applied with a step size lying in . This smoothness assumption is used in the broad majority of theoretical analyses of gradient algorithms, yet there are many cases where it is not satisfied Bauschke2017 ; Bolte2018 . In particular, it does not hold for the general Burer-Monteiro low-rank problem, as we will show in what follows.
Of course, there is a way to circumvent this issue in classical Euclidean methods, by using an Armijo line search Lin2007 . However, in some cases, this naive line search strategy generates very small step sizes which in turn involve costly subroutines. Other approaches impose a step size that is only proven to be valid in a small neighborhood of the optimum Bhojanapalli2015 ; Park2016 .
Non-Euclidean gradient methods.
We adopt an original approach based on a recent line of work on non-Euclidean gradient methods Bauschke2017 ; Bolte2018 ; van and subsequent work Lu2016 . Unlike standard gradient descent that uses the uniform Euclidean geometry, the NoLips method, also known as Bregman/Mirror descent, uses the Bregman divergence induced by a well-chosen convex kernel function. This allows the algorithm to take gradient steps that are more adapted to the geometry of the problem, advancing faster in directions where the gradient of the objective changes slowly, thus improving convergence speed. The kernel function is chosen so that the objective function satisfies a compatibility condition called relative smoothness Bauschke2017 ; Lu2016 , which is a generalization of the usual smoothness assumption mentioned earlier.
In our setting, the objective has a quartic growth, hence choosing the geometry induced by a quartic polynomial will prove to be efficient.
Contributions.
In this work, we focus on deriving efficient algorithms to find stationary points of nonconvex low-rank problems. Our main contribution is to identify favorable non-Euclidean geometries for these problems, induced by well-chosen quartic kernels.
We first study a simple quartic norm kernel that is compatible with various regularization terms. We then introduce a novel family of quartic kernels that we call Gram kernels, which can be applied to unregularized problems. They provide richer geometries which greatly improve convergence speed with little impact on the iteration complexity. We also extend the NoLips scheme to Dyn-NoLips, allowing for adaptive step size strategies.
To highlight the benefits of our approach, we study applications to symmetric nonnegative matrix factorization and Euclidean distance matrix completion and show competitive numerical performance compared to specialized algorithms for these problems.
Notations
For a square matrix , we denote its trace . For two matrices and of same size, we denote the standard Euclidean inner product and norm by and . For a function , we denote by its gradient matrix and by the second derivative at in the directions . denotes the identity matrix of size . For two square matrices , we write if the matrix is positive semidefinite. We write for the operator norm of a linear application .
2 Quartic Geometries for Low-Rank Minimization
2.1 Problem Setup
Let and consider a low-rank semidefinite program, written
[TABLE]
in the variable , where is a smooth convex function and is the target rank. The Burer-Monteiro formulation Burer2005 consists in representing as to solve instead
[TABLE]
in the variable , where is a simple convex regularization function. is typically a quadratic loss function, and enforces penalties on the factor such as sparsity when choosing the norm, or nonnegativity when choosing the indicator function of the nonnegative orthant. We will write the factorized function defined by
[TABLE]
Throughout the paper, we make the following standing assumptions.
Assumption 2.1
- (a)
* is a twice continuously differentiable function which is -strongly convex and -smooth, i.e.,*
[TABLE] 2. (b)
* is a closed convex proper function,* 3. (c)
.
Our analysis will involve the following lemma.
Lemma 1
Let be a twice differentiable -strongly convex and -smooth function. Then, the function is convex and -smooth.
Proof
It suffices to use the second-order characterization Nesterov2004 and notice that, for , we have and hence
[TABLE]
∎
2.2 Relative Smoothness and the Bregman Iteration Map
In this section, we recall the framework of Bauschke2017 ; Bolte2018 to derive non-Euclidean gradient methods.
The first essential step is the choice of a distance kernel. In our context, we choose a differentiable strictly convex function , with (although more general distance kernels can be used). The distance kernel induces in turn a Bregman distance
[TABLE]
Note that is not a proper distance, it is sometimes referred to as a Bregman divergence. However enjoys a distance-like separation property: and for . The choice of a distance kernel suited to the function is guided by the following relative smoothness condition, also called generalized Lipschitz property.
Definition 1 (Relative smoothness Bauschke2017 )
We say that a differentiable function is -smooth relatively to the distance kernel if there exists such that for every ,
[TABLE]
For twice differentiable functions, relative smoothness has an elementary characterization: is -smooth relatively to if and only if
[TABLE]
where denotes the second derivative of at in the direction . Notice that if , then and we recover the standard Euclidean descent lemma that would be implied by Lipschitz continuity of the gradient of .
Bregman iteration map
Now that we are equipped with a non-Euclidean geometry generated by , we define the Bregman proximal iteration map with step size as follows.
[TABLE]
which consists in minimizing a surrogate for where has been replaced by the upper approximation given by (RelSmooth) and the nonsmooth part is kept intact, generalizing thus the approach used in the proximal gradient method. The relative smoothness condition ensures that this operation decreases the objective when . This iteration map is the basic brick for non-Euclidean methods à la Bregman. The simplest method is NoLips Bauschke2017 and its extension Dyn-NoLips (Algorithm 1), which simply amounts to iterating , but other possibilities exist using momentum ideas Auslender2006 ; Hanzely2018 ; Mukkamala2019 .
2.3 The Quartic Geometry
In order to provide some insight into the quartic geometry of our problem, let us consider the example where is a quadratic function, i.e.,
[TABLE]
where and is some linear map. Then, writes
[TABLE]
Clearly, is a quartic function and its gradient is not Lipschitz continuous on , as the Hessian “grows” to infinity when . In other words, (RelSmooth) does not hold with the Euclidean kernel . We now show that relative smoothness holds with a family of well-chosen quartic kernels, which are more adapted to the geometry of .
2.3.1 The Quartic Norm Kernel
We begin with the simplest quartic kernel, which depends solely on the Frobenius norm of . Define the norm kernel as
[TABLE]
where are fixed parameters. Note that this kernel is not new by itself, as it has been already studied in Bolte2018 for vectors in . Our first contribution is to show that it is adapted to every function of our class of problems.
Proposition 1 (Norm kernel)
The function is -smooth relative to the norm kernel for and .
Proof
As is twice differentiable, then so if and we can use the Hessian characterization (2). For , the second derivative of is written
[TABLE]
On the other hand, the second derivative of is
[TABLE]
Since has a Lipschitz continuous gradient, the standard second derivative inequality yields
[TABLE]
Now, the second term can be bounded by using the triangle inequality, the Cauchy-Schwarz inequality and the gradient Lipschitz property, to get
[TABLE]
hence
[TABLE]
where we used the submultiplicative property of the Frobenius norm, and our choice of parameters . Combining (6) and (8) gives that
[TABLE]
for all , hence that is -smooth relatively to Bauschke2017 .∎
The Bregman iteration map (3) associated with the kernel can be computed easily in closed form. We give its expression in the unconstrained case Bolte2018 .
Proposition 2 (Bregman iteration map for , unconstrained case)
Assume that there is no penalty term, i.e., that . The Bregman iteration map of the norm kernel with step size is given by
[TABLE]
where
[TABLE]
and denotes the unique real solution to the cubic equation .
Note that can be computed in closed form using Cardano’s method
[TABLE]
Compared to a standard gradient iteration, the additional operations are elementary and have a minimal impact on the arithmetic complexity.
Constraints and regularization terms.
Following the ideas in Bolte2018 , the Bregman iteration map of can also be easily computed in closed form when is the norm or the pseudonorm. As we will show in Section 4.1, this is also elementary when is the indicator function of the nonnegative orthant.
2.3.2 A More Refined Kernel for Unregularized Problems: the Gram Kernel
While the kernel is simple and compatible with many penalties , a better kernel can be derived for unconstrained instances by considering a richer geometry involving the Gram matrix. Define the Gram kernel as
[TABLE]
where are given parameters. The Gram kernel is more refined than the previous norm kernel since it incorporates some nonisotropic information with the term. To show where this term stems from, observe that following Lemma 1, can be decomposed as where is -smooth. Hence writes
[TABLE]
Since , the first term can be directly incorporated into the kernel, which allows to prove a tighter relative smoothness inequality.
Proposition 3 (Gram kernel)
* is -smooth relatively to the Gram kernel for , and .*
Proof
This amounts to refine the analysis of the proof of Proposition 1. Let . The second derivative of at in the direction writes
[TABLE]
On the other hand, following (7) the second derivative of satisfies
[TABLE]
To bound the second term, we use Lemma 1 which states that the function is convex and smooth with constant . Using the gradient Lipschitz property of yields
[TABLE]
using that , and so we have
[TABLE]
which shows that, for the prescribed choice of , the function is 1-smooth relatively to . ∎
Approximation quality for well-conditioned .
Let us illustrate the advantage of the Gram kernel when is well-conditioned. For simplicity, assume here that is a quadratic function, as in (4), i.e., where is a positive semidefinite linear operator on , and hence has a quartic and a quadratic term
[TABLE]
The gap between and with the choice of coefficients prescribed by Proposition 3 writes, for ,
[TABLE]
where we separated the gap into a quartic term and a quadratic term . It can be seen from (2) that the quality of approximation of the kernel is given by the difference of the Hessians. Focusing on the quartic part, the Hessian difference is
[TABLE]
for . Recalling that is -smooth and -strongly convex, we have that , therefore
[TABLE]
which shows that the quality of approximation of the quartic part of by the Gram kernel depends on the condition number of . Note that one could actually refine the analysis by replacing with the condition number of restricted to the set of matrices of rank at most , which can be much smaller. This is the case when the linear map satisfies the restricted isometry property (RIP), which occurs with high probability in matrix sensing applications with a sufficiently large number of samples Recht2007 ; Meka2009 ; Jain2012 .
Computing the iteration map.
We show now that, when there is no penalty term , the Bregman iteration map of can be computed efficiently, as it involves solving an easy quartic minimization subproblem of size .
Proposition 4 (Gram’s iteration map)
Assume that . For , the Bregman iteration map of for the Gram kernel with step size , called Gram’s iteration map, is given by
[TABLE]
where the matrices are computed through the routine:
- •
Set ,
- •
diagonalize as where and ,
- •
let be the unique solution of the convex minimization problem
[TABLE]
- •
finally set Z=P^{T}\mathop{\bf diag}\big{[}\mu_{1}^{2},\dots,\mu_{r}^{2}\big{]}P.
Proof
When , The Bregman iteration map of writes, for ,
[TABLE]
where we remove constant terms and defined . Write for the sake of clarity . The optimization problem (14) is strictly convex and the unique solution satisfies , meaning that
[TABLE]
Define . Then, the knowledge of determines , since and therefore .
Now, taking (15) and multiplying by its transpose implies that
[TABLE]
This shows that is a polynomial in , and therefore that they admit the same eigenvectors. Write the diagonalization and where and for . It follows from diagonalizing (16) and taking the square root that
[TABLE]
This is exactly the first-order optimality condition on for the problem
[TABLE]
Note that we do not need to enforce the nonnegativity constraint on , since we chose it follows that the optimal solution will be nonnegative. Hence, we can reconstruct from the diagonalization of and the solution of Problem (18), and thus we get the procedure described in the theorem for computing . ∎
Complexity
Note that the order of multiplication is important: we only need to compute the eigendecomposition of , which is of size . We additionally need to solve a small minimization problem of size , which can be done efficiently using the quartic NoLips algorithm with norm kernel (see Appendix A for implementation details). Due to this, the complexity of computing the Bregman iteration map of is , where is the number of iterations needed to solve the subproblem. Since is usually much smaller than by several orders of magnitude, the main computational bottleneck remains in most applications computing the gradient .
2.3.3 Comparison: how to choose the most appropriate kernel
In order to devise efficient methods, one should search for the kernel such that the upper approximation of in (RelSmooth) is as tight as possible, or, equivalently, such that the Hessian of the residual is small. On the other hand, has to be simple enough so that the iteration map (3) is easy to compute (which precludes choosing , as the iteration would be as hard to solve as the initial problem). This trade-off is key in choosing the appropriate kernel. Let us review these two conflicting criteria in our situation.
Complexity of the Bregman iteration map.
For the norm kernel , one iteration involves computing the gradient of , then solving a simple scalar equation. The Gram kernel involves solving a subproblem which requires additional operations. This overhead is negligible for the typical regime where ; however, the iterate can be computed easily only for unconstrained problems.
Quality of Hessian approximation.
We showed in Section 2.3.2 that the quality of the approximation of the quartic component of by the Gram kernel is bounded by . Therefore, it is expected to show good performance when is sufficiently well-conditioned. The norm kernel, however, has no such property, as its approximation of is much coarser. The difference stems from the supplementary term, which can be much smaller than , especially when the columns of are nearly orthogonal.
Note that even if is not globally strongly convex or is unknown, the Gram kernel can take advantage of local strong convexity through adaptive step sizes, as we show in the sequel.
3 Algorithms for Quartic Low-Rank Minimization
Now that we are equipped with a non-Euclidean geometry induced by one of the kernels and , we are ready to define the minimization scheme Dyn-NoLips in Algorithm 1. It extends the NoLips algorithm from Bolte2018 to allow step sizes larger than the theoretical value .
Step size choice
The step size is chosen so that the new iterate satisfies
[TABLE]
There are two ways to ensure this condition holds.
- •
Fixed step size. Since is -smooth relatively to , (19) holds as soon as .
- •
Dynamical step size. In some cases, the relative Lipschitz constant might be too conservative, and better numerical performance can be achieved by taking larger steps. We therefore can use a dynamical strategy for extending the step size, ensuring that (19) holds at each iteration. There are many strategies to efficiently adjust the step size; see, e.g., Nesterov2007 . In our case, we choose a simple strategy similar in spirit to the Armijo line search: at iteration , start with a tentative step size , then find the smallest integer such that (19) is satisfied with step size . Then, set .
Convergence to a stationary point.
We now extend the theoretical convergence results from Bolte2018 to handle the dynamical step size strategy.
Theorem 3.1 (Convergence results)
Let be the sequence generated by Algorithm 1. Assume that
* is -smooth relatively to a distance kernel such that is strongly convex and twice continuously differentiable on , and the penalty function is convex.* 2. 2.
The function is coercive (meaning that when ) and semialgebraic.
Then, the sequence is nonincreasing, and the sequence converges towards a critical point of problem (P).
Proof
First, the step size can be bounded for as
[TABLE]
Indeed, the upper bound holds by construction of the algorithm. The lower bound comes from the relative smoothness property: condition (19) is true for every , so the inner loop will stop whenever gets below .
Let us now prove the result. Since Condition (19) holds at each iteration , we can write
[TABLE]
On the other hand, the optimality condition characterizing writes
[TABLE]
where denotes the subdifferential of the convex function . Combining (22) with the subgradient inequality for yields
[TABLE]
[TABLE]
which yields
[TABLE]
From this inequality, we can now prove the same convergence properties as for the standard NoLips scheme. Indeed, the monotonicity of the sequence is a direct consequence of the above. Since , it follows that at every iteration ,
[TABLE]
Now, this inequality is the same as the one needed to prove convergence in the case of the fixed step size in Bolte2018 . Thus, global convergence towards a critical point is a consequence of (Bolte2018, , Th. 4.1), since all the assumptions are met: the kernel is defined over the entire space , it is strongly convex, and is Lipschitz continuous on bounded subsets of (because we assumed it is ). We also need the fact that the sequence is bounded, which is a consequence of the monotonicity of and the fact that the function is coercive.∎
The semialgebraicity assumption is needed to establish the crucial nonsmooth Lojasiewicz property Bolte2007 , required to show convergence to a critical point. It holds for all the applications we cited, since the class of semialgebraic functions includes polynomial functions, and norms, the seminorm and indicators of polynomial sets.
4 Applications
We now illustrate applications of our methodology to two different low-rank problems, symmetric nonnegative matrix factorization and Euclidean distance matrix completion. We show that good numerical performance can be reached using the dynamical step strategy, and that, for Euclidean matrix completion, it can be further improved by using the Gram kernel.
4.1 Symmetric Nonnegative Matrix Factorization
Symmetric Nonnegative Matrix Factorization (SymNMF) is the task of finding, given a symmetric nonnegative matrix , a nonnegative matrix such that . This is done by solving
[TABLE]
in the variable , where the inequality constraint is meant componentwise and is the target rank.
(SymNMF) is used as a probabilistic clustering or graph clustering technique Ding2005 ; He2011 . Numerical experiments by Kuang2015 have shown that it achieves state-of-the-art clustering accuracy on several text and image datasets.
4.1.1 Solving SymNMF.
While (SymNMF) looks similar to the well-known asymmetric NMF problem , it is actually harder. This is because NMF has a favorable block structure that allows the application of efficient alternating algorithms Controller2013 ; Cichocki2009 . SymNMF, however, does not enjoy the same block structure. Current solvers fall into two categories:
Direct solvers. There have been several attempts at solving the original problem, including multiplicative update rules He2011 , projected gradient algorithm quasi-Newton schemes Kuang2015 , and coordinate descent Vandaele2016 .
Nonsymmetric relaxations. Another idea is to use a mere penalty method Kuang2015 ; Lu2017 ; zhu2018a , relaxing (SymNMF) to the following penalized nonsymmetric problem
[TABLE]
in the variables , with parameter . This formulation is very similar to asymmetric NMF and can be solved by the same fast alternating algorithms that exploit the block structure, such as Alternating Nonnegative Least Squares (ANLS) and Hierarchical Alternating Least Squares zhu2018a (HALS), which are arguably the fastest SymNMF solvers.
Applying NoLips
We propose to apply NoLips for optimizing the original objective function. Problem (SymNMF) falls within our framework with , which has a Lipschitz gradient with constant , and the indicator function of the nonnegative orthant. Therefore, Proposition 1 implies that is -smooth relatively to the kernel with and . Since, in addition, is polynomial and is the indicator of a polynomial set, is semialgebraic, and it is also coercive, so Theorem 3.1 guarantees that NoLips will converge towards a stationary point of problem (SymNMF).
In this problem, the Bregman iteration map is solved by simply adding a projection step
[TABLE]
where , has been defined in Proposition 2 and is the projection on the nonnegative orthant: (entrywise).
Computational complexity for NoLips.
The computational complexity of an iteration is dominated by gradient computations and objective function evaluations, as all other operations are linear in the size of the variable.
If is a dense matrix, each gradient and function evaluation uses floating point operations. If is represented as a sparse matrix with nonzero elements, then we can take advantage of this structure (Vandaele2016, , Rmk. 2) by using
[TABLE]
which yields a much improved complexity per iteration.
4.1.2 Numerical experiments
We implemented the following algorithms: Algorithm 1 with dynamical step size and the norm kernel (Dyn-NoLips), the -SNMF scheme from He2011 , where we set as advised by the authors, the projected gradient algorithm (PG) with Armijo line search from Kuang2015 , where we use the line search parameters and , the coordinate descent scheme (CD) from Vandaele2016 , the ADMM algorithm Lu2017 , and the two fast algorithms from zhu2018a for solving the penalized problem (P-NMF): SymANLS and SymHALS. For the last two, we tuned the penalization parameter for best performance. We left out the quasi-Newton algorithm from Kuang2015 because of its prohibitive complexity for large datasets.
All algorithms were implemented in Julia Bezanson2017 which is a highly-optimized numerical computing language. Since our algorithms have different complexity per iteration, it is essential to compare them in terms of running time, and Julia provides a fairly accurate way to do so as there is little interpreter overhead in loops.111Tests were run on a PC Intel CORE i7-4910MQ CPU @ 2.90 GHz x 8 with 32 Go RAM.
We used two image and two text datasets.
- •
Image.
- –
CBCL222http://cbcl.mit.edu/software-datasets/FaceData2.html: 2,429 images of faces of size
- –
Coil-20333http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php: 1440 images of size representing 20 objects under various angles.
- •
Text.
- –
**TDT2444http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html **: dataset of 11,201 news articles classified in 96 semantic categories. We used the version provided by Cai et al. CWH09 ; CMHZ08 ; CHZH07 ; CHH05 , which has been restricted to the largest 30 categories, leaving a total of 9,394 documents.
- –
Reuters4: dataset of news articles, which we restricted to the largest 25 categories, leaving a total of 7,963 documents.
For all image and text datasets, we construct a sparse similarity matrix following the procedure described in (Kuang2015, , Section 7.1). We begin by computing the similarity graph between data points, using cosine similarity on term frequency vectors for text, and a Gaussian kernel for image (with the self-tuning method for the scale). The graph obtained is sparsified by keeping only the edges connecting the k-nearest neighbors, with . Then, is taken as a normalized version of the graph adjacency matrix.
We use the usual convergence criterion for constrained nonconvex problems
[TABLE]
where is the projected gradient defined as
[TABLE]
Table 1 reports the average time needed to reach a convergence criterion of , for 10 random initializations. For each dataset, we test several values for the rank parameter . In addition, Figure 1 shows the average evolution of the normalized objective gap , where is the minimal objective value encountered in all initializations.
Overall, the algorithm that shows the best convergence speed is SymHALS, but it has the disadvantage of needing to tune the penalization parameter . In the experiments we report, small values of yielded optimal performance, while the convergence theory of Zhu2018b only holds for large values for which the algorithm is much slower. By contrast, Dyn-NoLips is hyperparameter-free and has the second best overall performance. The gap with the other methods is particularly significant on the larger TDT2 and Reuters datasets, showing that the method scales well with problem dimension.
4.2 Euclidean Distance Matrix Completion
Euclidean distance matrix completion (EDMC) is the task of recovering the position of points , given the knowledge of a partial set of pairwise distances for , where . It is a fundamental problem with applications in sensor network localization and the study of conformation of molecules; see Fang2012 ; Qi2013 ; Dokmanic2015 and references therein.
The Burer-Monteiro nonconvex formulation for solving this problem writes
[TABLE]
in the variable . It can be rewritten where is the matrix of known distances, denotes the projection operator such that if , and elsewhere, and is the linear operator defined for by
[TABLE]
Applying NoLips with the norm kernel.
Problem (EDMC) falls within our framework with , which can be shown to have a Lipschitz gradient with constant
[TABLE]
Therefore, as in the case of SymNMF, the norm kernel can be used with an initial step size and parameters and .
Using the Gram kernel
As the problem is unconstrained, we can also apply minimization using the Gram kernel . We use the parameters , and , which ensure that is -smooth relatively to by Proposition 3.
Computational complexity for NoLips.
As before, the main computational bottleneck for an iteration consists in computing the value and gradient of the objective function. If denotes the number of known distances, then the computational complexity is . If the Gram kernel is used, each iteration requires an additional flops (see Section 2.3.2), which is negligible compared to the latter in the usual setting where and is small.
Numerical experiments
We implement the following algorithms: NoLips with a dynamical step size and the norm kernel (Dyn-NoLips), NoLips with a dynamical step size and the Gram kernel (Dyn-NoLips-Gram), gradient descent with Armijo line search (GD), the Riemannian trust region algorithm from Mishra2011 (TR). We leave out semidefinite relaxations because of their memory requirement which is prohibitive on large data. As the implementation for TR is provided in Matlab, we run our experiments on Matlab as well, with the same setup as in Section 4.1.
We try the algorithms on a standard EDMC problem, the 3-dimensional Helix dataset Mishra2011 which is generated as where are sampled uniformly in . We randomly keep only 10 % on the pairwise distances, and test on two different problem sizes: and . Figure 2 reports the normalized root mean squared error (RMSE) over all distances (known and unknown) averaged on random initializations. All the algorithms manage to recover the ground truth; the Dyn-NoLips-Gram algorithm shows the best numerical performance, which demonstrates the advantage of using the Gram geometry.
5 Conclusion
We proposed a generic approach for solving Burer-Monteiro formulations of low-rank minimization problems using the methodology of Bregman gradient methods and relative smoothness. We studied two quartic kernels, including a new Gram kernel, and demonstrated their benefits on numerical experiments. In future work, performance could be improved further by studying inertial variants Mukkamala2019 ; Hanzely2018 . New kernels could also be explored beyond the class of quartic functions to tackle other problems with inherent non-Euclidean geometries.
Code
The code for reproducing experiments for SymNMF and Euclidean Distance Matrix Completion can be downloaded from the public repository
https://github.com/RaduAlexandruDragomir/QuarticLowRankOptimization
Acknowledgments
The authors would like to thank the anonymous reviewers for their insightful comments.
Radu-Alexandru Dragomir would like to acknowledge support from an AMX fellowship, the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA9550-18-1-0226, as well as from Sébastien Gadat.
Jérôme Bolte was partially supported by ANR-3IA Artificial and Natural Intelligence Toulouse Institute, and Air Force Office of Scientific Research, and Air Force Material Command, USAF, under grant numbers FA9550-18-1-0226 & FA9550-19-1-7026.
AA is at CNRS & département d’informatique, École normale supérieure, UMR CNRS 8548, 45 rue d’Ulm 75005 Paris, France, INRIA and PSL Research University. AA acknowledges support from the French government under management of Agence Nationale de la Recherche as part of the ”Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), the ML & Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, as well as a Google focused award.
Appendix
Appendix A Solving the Subproblem for Computing the Bregman Iteration Map of the Gram Kernel
While it seems that computing the Bregman iteration map of the Gram kernel involves solving another difficult quartic subproblem, it is actually of small size ( is typically not larger than a few dozens) and can be solved efficiently with the NoLips scheme.
Indeed, the objective function of problem (18) is 1-smooth relatively to the norm kernel in with a choice of parameters and .
Algorithm 2 details the procedure. We initialize with the values for the previous iteration of the outer procedure. This proves to be efficient as the values will not vary much from one iteration to another. For the stopping criterion, we use the scaled gradient norm and a tolerance value .
The subproblem being very well conditionned, it is minimized easily; in numerical experiments, it usually convergences in no more than iterations.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Emmanuel J. Candès and Benjamin Recht. Exact Matrix Completion Via Convex Optimization. Found Comput Math , 9(6), 2009.
- 2[2] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A Singular Value Tresholding Algorithm for Matrix Completion. SIAM Journal on Optimization , 20(4):1956–1982, 2010.
- 3[3] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank Matrix Completion using Alternating Minimization. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing , pages 665—-674, 2013.
- 4[4] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review , 52(3):471–501, 2007.
- 5[5] Bandev Mishra, Gilles Meyer, and Rodolphe Sepulchre. Low-rank optimization for distance matrix completion. In Proceedings of the IEEE Conference on Decision and Control , pages 4455–4460, 2011.
- 6[6] Haw Ren Fang and Dianne P. O’Leary. Euclidean distance matrix completion problems. Optimization Methods and Software , 27(4):695–717, 2012.
- 7[7] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory , 61(4):1985–2007, 2015.
- 8[8] Yudong Chen and Martin J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. ar Xiv preprint ar Xiv:1509.03025 , 2015.
