Quartic First-Order Methods for Low-Rank Minimization

Radu-Alexandru Dragomir; Alexandre d'Aspremont; J\'er\^ome Bolte

arXiv:1901.10791·math.OC·February 18, 2021·J. Optim. Theory Appl.

Quartic First-Order Methods for Low-Rank Minimization

Radu-Alexandru Dragomir, Alexandre d'Aspremont, J\'er\^ome Bolte

PDF

TL;DR

This paper introduces quartic first-order methods for low-rank matrix problems, leveraging non-Euclidean geometries to improve scalability and performance in matrix completion and factorization tasks.

Contribution

It develops a novel class of algorithms based on quartic kernels and Gram kernels, enhancing efficiency and scalability for nonconvex low-rank minimization.

Findings

01

Algorithms scale well to large problems

02

Achieve state-of-the-art performance

03

Improve numerical stability and efficiency

Abstract

We study a generalized nonconvex Burer-Monteiro formulation for low-rank minimization problems. We use recent results on non-Euclidean first order methods to provide efficient and scalable algorithms. Our approach uses geometries induced by quartic kernels on matrix spaces; for unconstrained cases we introduce a novel family of Gram kernels that considerably improves numerical performances. Numerical experiments for Euclidean distance matrix completion and symmetric nonnegative matrix factorization show that our algorithms scale well and reach state of the art performance when compared to specialized methods.

Tables1

Table 1. Table 1: CPU time (in seconds) needed to reach a decrease of ϵ = 10 − 3 italic-ϵ superscript 10 3 \epsilon=10^{-3} in projected gradient norm (see ( 26 ) for definition). Results have been averaged over 10 random initializations. Hyperparameters for SymHALS, SymANLS and ADMM have been tuned for best performance. Missing values indicate failure of convergence.

Dataset	r	NoLips	PG	Beta	CD	SymHALS	SymANLS	ADMM
Coil-20	10	24.7	51.4	-	26.2	7.0	32.3	-
	20	23.7	36.8	-	21.3	4.0	18.2	-
	30	20.7	40.8	-	35.4	6.5	20.2	-
	40	21.7	49.5	-	57.6	7.5	28.4	-
CBCL	10	38.2	42.7	44.0	35.6	13.6	35.2	42.8
	20	57.7	88.4	-	93.9	17.8	47.8	-
	30	60.9	134.3	-	135.0	15.1	43.4	-
	40	50.8	126.4	-	90.0	23.7	52.5	-
TDT2	10	35.2	54.2	-	97.5	11.0	-	-
	20	52.4	76.1	-	109.9	20.1	-	-
	30	29.4	45.1	-	-	12.1	-	-
	40	28.0	49.8	-	-	17.7	-	-
Reuters	10	6.5	10.0	-	33.0	3.0	54.2	-
	20	28.7	32.8	-	71.7	9.5	74.7	-
	30	24.3	45.5	-	69.4	6.5	91.0	-
	40	40.2	68.5	-	83.2	10.6	108.3	-

Equations121

min F (Y) subject to Y ⪰ 0, rank (Y) \leq r

min F (Y) subject to Y ⪰ 0, rank (Y) \leq r

min Ψ (X) := F (X X^{T}) + g (X)

min Ψ (X) := F (X X^{T}) + g (X)

f (X) := F (X X^{T}) .

f (X) := F (X X^{T}) .

⟨ \nabla F (X) - \nabla F (Y), X - Y ⟩

⟨ \nabla F (X) - \nabla F (Y), X - Y ⟩

∥\nabla F (X) - \nabla F (Y) ∥

μ_{F} ∥ U ∥^{2} \leq \nabla^{2} F (Y) [U, U] \leq L_{F} ∥ U ∥^{2} ⟹ 0 \leq \nabla^{2} G (Y) [U, U] \leq (L_{F} - μ_{F}) ∥ U ∥^{2} .

μ_{F} ∥ U ∥^{2} \leq \nabla^{2} F (Y) [U, U] \leq L_{F} ∥ U ∥^{2} ⟹ 0 \leq \nabla^{2} G (Y) [U, U] \leq (L_{F} - μ_{F}) ∥ U ∥^{2} .

D_{h} (X, Y) = h (X) - h (Y) - ⟨ \nabla h (Y), X - Y ⟩ .

D_{h} (X, Y) = h (X) - h (Y) - ⟨ \nabla h (Y), X - Y ⟩ .

f (X) \leq f (Y) + ⟨ \nabla f (Y), X - Y ⟩ + L D_{h} (X, Y) .

f (X) \leq f (Y) + ⟨ \nabla f (Y), X - Y ⟩ + L D_{h} (X, Y) .

\nabla^{2} f (X) [U, U] \leq L \nabla^{2} h (X) [U, U], \forall X, U \in R^{n \times r}

\nabla^{2} f (X) [U, U] \leq L \nabla^{2} h (X) [U, U], \forall X, U \in R^{n \times r}

T_{λ} (X) = argmin_{U \in R^{n \times r}} {g (U) + f (X) + ⟨ \nabla f (X), U - X ⟩ + \frac{1}{λ} D_{h} (U, X)},

T_{λ} (X) = argmin_{U \in R^{n \times r}} {g (U) + f (X) + ⟨ \nabla f (X), U - X ⟩ + \frac{1}{λ} D_{h} (U, X)},

F (Y) = \frac{1}{2} ⟨ A Y, Y ⟩ + ⟨ B, Y ⟩ \forall Y \in R^{n \times n},

F (Y) = \frac{1}{2} ⟨ A Y, Y ⟩ + ⟨ B, Y ⟩ \forall Y \in R^{n \times n},

f (X) = F (X X^{T}) = \frac{1}{2} ⟨ A (X X^{T}), X X^{T} ⟩ + ⟨ B X, X ⟩ \forall X \in R^{n \times r} .

f (X) = F (X X^{T}) = \frac{1}{2} ⟨ A (X X^{T}), X X^{T} ⟩ + ⟨ B X, X ⟩ \forall X \in R^{n \times r} .

h_{N} (X) = \frac{α}{4} ∥ X ∥^{4} + \frac{σ}{2} ∥ X ∥^{2} \forall X \in R^{n \times r},

h_{N} (X) = \frac{α}{4} ∥ X ∥^{4} + \frac{σ}{2} ∥ X ∥^{2} \forall X \in R^{n \times r},

\nabla^{2} h_{N} (X) [U, U] = α (∥ X ∥^{2} ∥ U ∥^{2} + 2 ⟨ X, U ⟩^{2}) + σ ∥ U ∥^{2} \geq α ∥ X ∥^{2} ∥ U ∥^{2} + σ ∥ U ∥^{2} .

\nabla^{2} h_{N} (X) [U, U] = α (∥ X ∥^{2} ∥ U ∥^{2} + 2 ⟨ X, U ⟩^{2}) + σ ∥ U ∥^{2} \geq α ∥ X ∥^{2} ∥ U ∥^{2} + σ ∥ U ∥^{2} .

\nabla^{2} f (X) [U, U] = \nabla^{2} F (X X^{T}) [U X^{T} + X U^{T}, U X^{T} + X U^{T}] + 2 ⟨ \nabla F (X X^{T}), U U^{T} ⟩ .

\nabla^{2} f (X) [U, U] = \nabla^{2} F (X X^{T}) [U X^{T} + X U^{T}, U X^{T} + X U^{T}] + 2 ⟨ \nabla F (X X^{T}), U U^{T} ⟩ .

\nabla^{2} F (X X^{T}) [U X^{T} + X U^{T}, U X^{T} + X U^{T}] \leq L_{F} ∥ U X^{T} + X U^{T} ∥^{2} .

\nabla^{2} F (X X^{T}) [U X^{T} + X U^{T}, U X^{T} + X U^{T}] \leq L_{F} ∥ U X^{T} + X U^{T} ∥^{2} .

\begin{split}\langle\nabla F(XX^{T}),UU^{T}\rangle&=\langle\nabla F(0),UU^{T}\rangle+\langle\nabla F(XX^{T})-\nabla F(0),UU^{T}\rangle\\ &\leq\|\nabla F(0)\|\,\|U\|^{2}+\|\nabla F(XX^{T})-\nabla F(0)\|\,\|UU^{T}\|\\ &\leq\Big{(}\|\nabla F(0)\|+L_{F}\|XX^{T}\|\Big{)}\|U\|^{2}\end{split}

\begin{split}\langle\nabla F(XX^{T}),UU^{T}\rangle&=\langle\nabla F(0),UU^{T}\rangle+\langle\nabla F(XX^{T})-\nabla F(0),UU^{T}\rangle\\ &\leq\|\nabla F(0)\|\,\|U\|^{2}+\|\nabla F(XX^{T})-\nabla F(0)\|\,\|UU^{T}\|\\ &\leq\Big{(}\|\nabla F(0)\|+L_{F}\|XX^{T}\|\Big{)}\|U\|^{2}\end{split}

\nabla^{2} f (X) [U, U] \leq L_{F} ∥ U X^{T} + X U^{T} ∥^{2} + 2 (L_{F} ∥ X X^{T} ∥ + ∥\nabla F (0) ∥) ∥ U ∥^{2} \leq 2 L_{F} (∥ U X^{T} ∥^{2} + ∥ X U^{T} ∥^{2}) + 2 (L_{F} ∥ X X^{T} ∥ + ∥\nabla F (0) ∥) ∥ U ∥^{2} \leq 6 L_{F} ∥ X ∥^{2} ∥ U ∥^{2} + 2∥\nabla F (0) ∥∥ U ∥^{2} \leq α ∥ X ∥^{2} ∥ U ∥^{2} + σ ∥ U ∥^{2}

\nabla^{2} f (X) [U, U] \leq L_{F} ∥ U X^{T} + X U^{T} ∥^{2} + 2 (L_{F} ∥ X X^{T} ∥ + ∥\nabla F (0) ∥) ∥ U ∥^{2} \leq 2 L_{F} (∥ U X^{T} ∥^{2} + ∥ X U^{T} ∥^{2}) + 2 (L_{F} ∥ X X^{T} ∥ + ∥\nabla F (0) ∥) ∥ U ∥^{2} \leq 6 L_{F} ∥ X ∥^{2} ∥ U ∥^{2} + 2∥\nabla F (0) ∥∥ U ∥^{2} \leq α ∥ X ∥^{2} ∥ U ∥^{2} + σ ∥ U ∥^{2}

\nabla^{2} f (X) [U, U] \leq \nabla^{2} h_{N} (X) [U, U]

\nabla^{2} f (X) [U, U] \leq \nabla^{2} h_{N} (X) [U, U]

T_{λ} (X) = \frac{1}{τ _{σ} ( α ∥ U ∥ ^{2} )} U

T_{λ} (X) = \frac{1}{τ _{σ} ( α ∥ U ∥ ^{2} )} U

U = \nabla h_{N} (X) - λ \nabla f (X) = (α ∥ X ∥^{2} + σ) X - λ \nabla f (X)

U = \nabla h_{N} (X) - λ \nabla f (X) = (α ∥ X ∥^{2} + σ) X - λ \nabla f (X)

τ_{σ} (c) = \frac{σ}{3} + 3 \frac{c + Δ}{2} + \frac{σ ^{3}}{27} + 3 \frac{c - Δ}{2} + \frac{σ ^{3}}{27} where Δ = c^{2} + \frac{4}{27} c σ^{3} .

τ_{σ} (c) = \frac{σ}{3} + 3 \frac{c + Δ}{2} + \frac{σ ^{3}}{27} + 3 \frac{c - Δ}{2} + \frac{σ ^{3}}{27} where Δ = c^{2} + \frac{4}{27} c σ^{3} .

h_{G} (X) = \frac{α}{4} ∥ X ∥^{4} + \frac{β}{4} ∥ X^{T} X ∥^{2} + \frac{σ}{2} ∥ X ∥^{2} \forall X \in R^{n \times r},

h_{G} (X) = \frac{α}{4} ∥ X ∥^{4} + \frac{β}{4} ∥ X^{T} X ∥^{2} + \frac{σ}{2} ∥ X ∥^{2} \forall X \in R^{n \times r},

f (X) = F (X X^{T}) = \frac{μ _{F}}{2} ∥ X X^{T} ∥^{2} + \tilde{F} (X X^{T}) .

f (X) = F (X X^{T}) = \frac{μ _{F}}{2} ∥ X X^{T} ∥^{2} + \tilde{F} (X X^{T}) .

\nabla^{2} h_{G} (X) [U, U] = α (∥ X ∥^{2} ∥ U ∥^{2} + 2 ⟨ X, U ⟩^{2}) + β (\frac{1}{2} ∥ U X^{T} + X U^{T} ∥^{2} + ∥ U^{T} X ∥^{2}) + σ ∥ U ∥^{2} \geq α ∥ X ∥^{2} ∥ U ∥^{2} + β (\frac{1}{2} ∥ U X^{T} + X U^{T} ∥^{2} + ∥ U^{T} X ∥^{2}) + σ ∥ U ∥^{2} .

\nabla^{2} h_{G} (X) [U, U] = α (∥ X ∥^{2} ∥ U ∥^{2} + 2 ⟨ X, U ⟩^{2}) + β (\frac{1}{2} ∥ U X^{T} + X U^{T} ∥^{2} + ∥ U^{T} X ∥^{2}) + σ ∥ U ∥^{2} \geq α ∥ X ∥^{2} ∥ U ∥^{2} + β (\frac{1}{2} ∥ U X^{T} + X U^{T} ∥^{2} + ∥ U^{T} X ∥^{2}) + σ ∥ U ∥^{2} .

\nabla^{2} f (X) [U, U] \leq L_{F} ∥ U X^{T} + X U^{T} ∥^{2} + 2 ⟨ \nabla F (X X^{T}), U U^{T} ⟩

\nabla^{2} f (X) [U, U] \leq L_{F} ∥ U X^{T} + X U^{T} ∥^{2} + 2 ⟨ \nabla F (X X^{T}), U U^{T} ⟩

⟨ \nabla F (X X^{T}), U U^{T} ⟩

⟨ \nabla F (X X^{T}), U U^{T} ⟩

+ ⟨ \nabla F (X X^{T}) - \nabla F (0) - μ_{F} (X X^{T} - 0), U U^{T} ⟩

= ⟨ \nabla F (0), U U^{T} ⟩ + μ_{F} ∥ U^{T} X ∥^{2} + ⟨ \nabla G (X X^{T}) - \nabla G (0), U U^{T} ⟩

\leq ∥\nabla F (0) ∥ ∥ U ∥^{2} + μ_{F} ∥ U^{T} X ∥^{2} + (L_{F} - μ_{F}) ∥ X X^{T} ∥ ∥ U U^{T} ∥

\leq ∥\nabla F (0) ∥ ∥ U ∥^{2} + L_{F} ∥ U^{T} X ∥^{2} + (L_{F} - μ_{F}) ∥ X ∥^{2} ∥ U ∥^{2},

\nabla^{2} f (X) [U, U]

\nabla^{2} f (X) [U, U]

+ 2∥\nabla F (0) ∥ ∥ U ∥^{2}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

∎

11institutetext: Radu-Alexandru Dragomir, corresponding author 22institutetext: Université Toulouse 1 Capitole & D.I. École Normale Supérieure, Paris, France

[email protected] 33institutetext: Alexandre d’Aspremont 44institutetext: CNRS & D.I. École Normale Supérieure, Paris, France

[email protected]. 55institutetext: Jérôme Bolte 66institutetext: Université Toulouse 1 Capitole, Toulouse, France

[email protected]

Quartic First-Order Methods for Low-Rank Minimization

Radu-Alexandru Dragomir

Alexandre d’Aspremont

Jérôme Bolte

(Last revised on January 18, 2021)

Abstract

We study a general nonconvex formulation for low-rank minimization problems. We use recent results on non-Euclidean first-order methods to provide efficient and scalable algorithms. Our approach uses the geometry induced by the Bregman divergence of well-chosen kernel functions; for unconstrained problems we introduce a novel family of Gram quartic kernels that improve numerical performance.

Numerical experiments on Euclidean distance matrix completion and symmetric nonnegative matrix factorization show that our algorithms scale well and reach state of the art performance when compared to specialized methods.

Keywords:

Bregman first-order methods Low-rank minimization Burer-Monteiro Matrix factorization Euclidean Distance matrix completion

MSC:

90C06 90C26

††journal: JOTA

1 Introduction

We consider the problem of minimizing a smooth convex function over the set of low-rank positive semidefinite matrices. Fundamental applications of this problem arise in various areas of data analysis including matrix completion Cand2008 ; Completion2010 ; Jain2012 , matrix sensing Recht2007 , Euclidean matrix completion Mishra2011 ; Fang2012 , phase retrieval Candes2015 , robust principal component analysis Chen2015 , to name a few.

A popular approach to low-rank semidefinite minimization, known as the Burer-Monteiro formulation Burer2005 , consists in explicitly modeling the rank constraint by writing the matrix in a factorized form. This method is especially appealing for large-scale instances, since it requires storing much less variables than the standard semidefinite programming approaches; see Tu2015 ; Bhojanapalli2015 ; Zhao2015 ; Sun2015 ; Chen2015 ; Zheng2016 ; Park2016 and references therein.

This formulation comes however with an important drawback, as the problem becomes nonconvex, even if the original objective is convex. Therefore, local optimization methods can generally only hope to find a stationary point, or at best a local minimum. Nevertheless, recent work shows convergence towards a global optimum for a close enough initialization Tu2015 ; Bhojanapalli2015 ; Park2016 , or under additional statistical assumptions about the problem Chen2015 ; Zheng2015 ; Ge2016 . Although these global optimality results often impose restrictive assumptions that may not be satisfied in practice, they help to explain why using local algorithms to solve Burer-Monteiro problem formulations often leads to satisfactory solutions in practice.

The most commonly used algorithm to solve these problem formulations is some variant of the proximal gradient method. However, a critical issue with gradient schemes is the choice of step sizes, which significantly impacts performance. This step size choice is closely related to the smoothness of the objective. In particular, when it has a $L$ -Lipschitz continuous gradient with respect to the Euclidean norm, standard gradient methods can be applied with a step size lying in $]0,1/L]$ . This smoothness assumption is used in the broad majority of theoretical analyses of gradient algorithms, yet there are many cases where it is not satisfied Bauschke2017 ; Bolte2018 . In particular, it does not hold for the general Burer-Monteiro low-rank problem, as we will show in what follows.

Of course, there is a way to circumvent this issue in classical Euclidean methods, by using an Armijo line search Lin2007 . However, in some cases, this naive line search strategy generates very small step sizes which in turn involve costly subroutines. Other approaches impose a step size that is only proven to be valid in a small neighborhood of the optimum Bhojanapalli2015 ; Park2016 .

Non-Euclidean gradient methods.

We adopt an original approach based on a recent line of work on non-Euclidean gradient methods Bauschke2017 ; Bolte2018 ; van and subsequent work Lu2016 . Unlike standard gradient descent that uses the uniform Euclidean geometry, the NoLips method, also known as Bregman/Mirror descent, uses the Bregman divergence induced by a well-chosen convex kernel function. This allows the algorithm to take gradient steps that are more adapted to the geometry of the problem, advancing faster in directions where the gradient of the objective changes slowly, thus improving convergence speed. The kernel function is chosen so that the objective function satisfies a compatibility condition called relative smoothness Bauschke2017 ; Lu2016 , which is a generalization of the usual smoothness assumption mentioned earlier.

In our setting, the objective has a quartic growth, hence choosing the geometry induced by a quartic polynomial will prove to be efficient.

Contributions.

In this work, we focus on deriving efficient algorithms to find stationary points of nonconvex low-rank problems. Our main contribution is to identify favorable non-Euclidean geometries for these problems, induced by well-chosen quartic kernels.

We first study a simple quartic norm kernel that is compatible with various regularization terms. We then introduce a novel family of quartic kernels that we call Gram kernels, which can be applied to unregularized problems. They provide richer geometries which greatly improve convergence speed with little impact on the iteration complexity. We also extend the NoLips scheme to Dyn-NoLips, allowing for adaptive step size strategies.

To highlight the benefits of our approach, we study applications to symmetric nonnegative matrix factorization and Euclidean distance matrix completion and show competitive numerical performance compared to specialized algorithms for these problems.

Notations

For a square matrix $M$ , we denote its trace $\mathop{\bf Tr}M=\sum_{i=1}^{n}M_{ii}$ . For two matrices $X$ and $Y$ of same size, we denote the standard Euclidean inner product and norm by $\langle X,Y\rangle=\mathop{\bf Tr}(X^{T}Y)$ and $\|X\|=\sqrt{\mathop{\bf Tr}(X^{T}X)}$ . For a function $f:{\mathbb{R}}^{n\times r}\rightarrow{\mathbb{R}}$ , we denote by $\nabla F(X)$ its gradient matrix $\nabla F(X)_{ij}=\frac{\partial F(X)}{\partial x_{ij}}$ and by $\nabla^{2}F(X)[U,V]$ the second derivative at $X$ in the directions $U,V\in{\mathbb{R}}^{n\times r}$ . $I_{r}$ denotes the identity matrix of size $r\times r$ . For two square matrices $X,Y$ , we write $X\preceq Y$ if the matrix $Y-X$ is positive semidefinite. We write $\|\mathcal{A}\|_{\text{op}}$ for the operator norm of a linear application $\mathcal{A}$ .

2 Quartic Geometries for Low-Rank Minimization

2.1 Problem Setup

Let $n\geq 1$ and consider a low-rank semidefinite program, written

[TABLE]

in the variable $Y\in{\mathbb{R}}^{n\times n}$ , where $F$ is a smooth convex function and $r\leq n$ is the target rank. The Burer-Monteiro formulation Burer2005 consists in representing $Y$ as $Y=XX^{T}$ to solve instead

[TABLE]

in the variable $X\in{\mathbb{R}}^{n\times r}$ , where $g$ is a simple convex regularization function. $F$ is typically a quadratic loss function, and $g$ enforces penalties on the factor $X$ such as sparsity when choosing the $\ell_{1}$ norm, or nonnegativity when choosing the indicator function of the nonnegative orthant. We will write $f:{\mathbb{R}}^{n\times r}\rightarrow{\mathbb{R}}$ the factorized function defined by

[TABLE]

Throughout the paper, we make the following standing assumptions.

Assumption 2.1

(a)

$F:{\mathbb{R}}^{n\times n}\rightarrow{\mathbb{R}}$ * is a twice continuously differentiable function which is $\mu_{F}$ -strongly convex and $L_{F}$ -smooth, i.e.,*

[TABLE] 2. (b)

$g:{\mathbb{R}}^{n\times n}\rightarrow{\mathbb{R}}\cup\{+\infty\}$ * is a closed convex proper function,* 3. (c)

$\min_{{\mathbb{R}}^{n\times r}}\Psi>-\infty$ .

Our analysis will involve the following lemma.

Lemma 1

Let $F:{\mathbb{R}}^{n\times n}\rightarrow{\mathbb{R}}$ be a twice differentiable $\mu_{F}$ -strongly convex and $L_{F}$ -smooth function. Then, the function $G:=F-\frac{\mu_{F}}{2}\|\cdot\|^{2}$ is convex and $(L_{F}-\mu_{F})$ -smooth.

Proof

It suffices to use the second-order characterization Nesterov2004 and notice that, for ${Y,U\in{\mathbb{R}}^{n\times n}}$ , we have $\nabla^{2}G(Y)[U,U]=\nabla^{2}F(Y)[U,U]-\mu_{F}\|U\|^{2}$ and hence

[TABLE]

∎

2.2 Relative Smoothness and the Bregman Iteration Map

In this section, we recall the framework of Bauschke2017 ; Bolte2018 to derive non-Euclidean gradient methods.

The first essential step is the choice of a distance kernel. In our context, we choose a differentiable strictly convex function $h:{\mathbb{R}}^{n\times r}\rightarrow{\mathbb{R}}$ , with ${\mathop{\bf dom}h={\mathbb{R}}^{n\times r}}$ (although more general distance kernels can be used). The distance kernel $h$ induces in turn a Bregman distance

[TABLE]

Note that $D_{h}$ is not a proper distance, it is sometimes referred to as a Bregman divergence. However $D_{h}$ enjoys a distance-like separation property: $D_{h}(X,X)=0$ and $D_{h}(X,Y)>0$ for $X\neq Y$ . The choice of a distance kernel suited to the function $f$ is guided by the following relative smoothness condition, also called generalized Lipschitz property.

Definition 1 (Relative smoothness Bauschke2017 )

We say that a differentiable function $f:{\mathbb{R}}^{n\times r}\rightarrow{\mathbb{R}}$ is $L$ -smooth relatively to the distance kernel $h$ if there exists $L>0$ such that for every $X,Y\in{\mathbb{R}}^{n\times r}$ ,

[TABLE]

For twice differentiable functions, relative smoothness has an elementary characterization: $f$ is $L$ -smooth relatively to $h$ if and only if

[TABLE]

where $\nabla^{2}f(X)[U,U]$ denotes the second derivative of $f$ at $X$ in the direction $U$ . Notice that if $h(X)=\frac{1}{2}\|X\|^{2}$ , then $D_{h}(X,Y)=\frac{1}{2}\|X-Y\|^{2}$ and we recover the standard Euclidean descent lemma that would be implied by Lipschitz continuity of the gradient of $f$ .

Bregman iteration map

Now that we are equipped with a non-Euclidean geometry generated by $h$ , we define the Bregman proximal iteration map with step size $\lambda$ as follows.

[TABLE]

which consists in minimizing a surrogate for $\Psi$ where $f$ has been replaced by the upper approximation given by (RelSmooth) and the nonsmooth part $g$ is kept intact, generalizing thus the approach used in the proximal gradient method. The relative smoothness condition ensures that this operation decreases the objective $\Psi$ when $\lambda\in]0,1/L]$ . This iteration map is the basic brick for non-Euclidean methods à la Bregman. The simplest method is NoLips Bauschke2017 and its extension Dyn-NoLips (Algorithm 1), which simply amounts to iterating $X^{k+1}=T_{\lambda_{k}}(X^{k})$ , but other possibilities exist using momentum ideas Auslender2006 ; Hanzely2018 ; Mukkamala2019 .

2.3 The Quartic Geometry

In order to provide some insight into the quartic geometry of our problem, let us consider the example where $F$ is a quadratic function, i.e.,

[TABLE]

where $B\in{\mathbb{R}}^{n\times n}$ and $\mathcal{A}:{\mathbb{R}}^{n\times n}\rightarrow{\mathbb{R}}^{n\times n}$ is some linear map. Then, $f$ writes

[TABLE]

Clearly, $f$ is a quartic function and its gradient is not Lipschitz continuous on ${\mathbb{R}}^{n\times r}$ , as the Hessian “grows” to infinity when $\|X\|\rightarrow\infty$ . In other words, (RelSmooth) does not hold with the Euclidean kernel $h=\frac{1}{2}\|\cdot\|^{2}$ . We now show that relative smoothness holds with a family of well-chosen quartic kernels, which are more adapted to the geometry of $f$ .

2.3.1 The Quartic Norm Kernel

We begin with the simplest quartic kernel, which depends solely on the Frobenius norm of $X$ . Define the norm kernel $h_{N}$ as

[TABLE]

where $\alpha,\sigma>0$ are fixed parameters. Note that this kernel is not new by itself, as it has been already studied in Bolte2018 for vectors in ${\mathbb{R}}^{n}$ . Our first contribution is to show that it is adapted to every function of our class of problems.

Proposition 1 (Norm kernel)

The function $f$ is $1$ -smooth relative to the norm kernel $h_{N}$ for $\alpha\geq 6L_{F}$ and $\sigma\geq 2\|\nabla F(0)\|$ .

Proof

As $F$ is twice differentiable, then so if $f$ and we can use the Hessian characterization (2). For $X,U\in{\mathbb{R}}^{n\times r}$ , the second derivative of $h_{N}$ is written

[TABLE]

On the other hand, the second derivative of $f$ is

[TABLE]

Since $F$ has a Lipschitz continuous gradient, the standard second derivative inequality yields

[TABLE]

Now, the second term can be bounded by using the triangle inequality, the Cauchy-Schwarz inequality and the gradient Lipschitz property, to get

[TABLE]

hence

[TABLE]

where we used the submultiplicative property of the Frobenius norm, and our choice of parameters $\alpha,\sigma$ . Combining (6) and (8) gives that

[TABLE]

for all $X,U\in{\mathbb{R}}^{n\times r}$ , hence that $f$ is $1$ -smooth relatively to $h$ Bauschke2017 .∎

The Bregman iteration map (3) associated with the kernel $h_{N}$ can be computed easily in closed form. We give its expression in the unconstrained case Bolte2018 .

Proposition 2 (Bregman iteration map for $h_{N}$ , unconstrained case)

Assume that there is no penalty term, i.e., that $g\equiv 0$ . The Bregman iteration map of the norm kernel $h_{N}$ with step size $\lambda>0$ is given by

[TABLE]

where

[TABLE]

and $\tau_{\sigma}(c)$ denotes the unique real solution $z$ to the cubic equation $z^{2}(z-\sigma)=c$ .

Note that $\tau_{\sigma}(c)$ can be computed in closed form using Cardano’s method

[TABLE]

Compared to a standard gradient iteration, the additional operations are elementary and have a minimal impact on the arithmetic complexity.

Constraints and regularization terms.

Following the ideas in Bolte2018 , the Bregman iteration map of $h_{N}$ can also be easily computed in closed form when $g$ is the $\ell_{1}$ norm or the $\ell_{0}$ pseudonorm. As we will show in Section 4.1, this is also elementary when $g$ is the indicator function of the nonnegative orthant.

2.3.2 A More Refined Kernel for Unregularized Problems: the Gram Kernel

While the kernel $h_{N}$ is simple and compatible with many penalties $g$ , a better kernel can be derived for unconstrained instances by considering a richer geometry involving the Gram matrix. Define the Gram kernel as

[TABLE]

where $\alpha,\beta\geq 0,\,\sigma>0$ are given parameters. The Gram kernel is more refined than the previous norm kernel since it incorporates some nonisotropic information with the $\|X^{T}X\|^{2}$ term. To show where this term stems from, observe that following Lemma 1, $F$ can be decomposed as $F=\frac{\mu_{F}}{2}\|\cdot\|^{2}+\tilde{F}$ where $\tilde{F}$ is $(L_{F}-\mu_{F})$ -smooth. Hence $f$ writes

[TABLE]

Since $\|XX^{T}\|^{2}=\|X^{T}X\|^{2}$ , the first term can be directly incorporated into the kernel, which allows to prove a tighter relative smoothness inequality.

Proposition 3 (Gram kernel)

$f$ * is $1$ -smooth relatively to the Gram kernel $h_{G}$ for $\alpha\geq 2(L_{F}-\mu_{F})$ , $\beta\geq 2L_{F}$ and $\sigma\geq 2\|\nabla F(0)\|$ .*

Proof

This amounts to refine the analysis of the proof of Proposition 1. Let $X,U\in{\mathbb{R}}^{n\times r}$ . The second derivative of $h_{G}$ at $X$ in the direction $U$ writes

[TABLE]

On the other hand, following (7) the second derivative of $f$ satisfies

[TABLE]

To bound the second term, we use Lemma 1 which states that the function $G(Y):=F(Y)-\mu_{F}\|Y\|^{2}/2$ is convex and smooth with constant $L_{F}-\mu_{F}$ . Using the gradient Lipschitz property of $G$ yields

[TABLE]

using that $\mu_{F}\leq L_{F}$ , and so we have

[TABLE]

which shows that, for the prescribed choice of $\alpha,\beta,\sigma$ , the function $f$ is 1-smooth relatively to $h_{G}$ . ∎

Approximation quality for well-conditioned $F$ .

Let us illustrate the advantage of the Gram kernel when $F$ is well-conditioned. For simplicity, assume here that $F$ is a quadratic function, as in (4), i.e., $F(Y)=\frac{1}{2}\langle\mathcal{A}Y,Y\rangle+\langle B,Y\rangle$ where $\mathcal{A}$ is a positive semidefinite linear operator on ${\mathbb{R}}^{n\times r}$ , and hence $f$ has a quartic and a quadratic term

[TABLE]

The gap between $f$ and $h_{G}$ with the choice of coefficients prescribed by Proposition 3 writes, for $X\in{\mathbb{R}}^{n\times r}$ ,

[TABLE]

where we separated the gap into a quartic term $d_{4}$ and a quadratic term $d_{2}$ . It can be seen from (2) that the quality of approximation of the kernel is given by the difference of the Hessians. Focusing on the quartic part, the Hessian difference is

[TABLE]

for $X,U\in{\mathbb{R}}^{n\times r}$ . Recalling that $F$ is $L_{F}$ -smooth and $\mu_{F}$ -strongly convex, we have that ${\|L_{F}I-\mathcal{A}\|_{\text{op}}}\leq(L_{F}-\mu_{F})$ , therefore

[TABLE]

which shows that the quality of approximation of the quartic part of $f$ by the Gram kernel depends on the condition number $\kappa_{F}:=L_{F}/\mu_{F}$ of $F$ . Note that one could actually refine the analysis by replacing $\kappa_{F}$ with the condition number of $F$ restricted to the set of matrices of rank at most $2r$ , which can be much smaller. This is the case when the linear map $\mathcal{A}$ satisfies the restricted isometry property (RIP), which occurs with high probability in matrix sensing applications with a sufficiently large number $n$ of samples Recht2007 ; Meka2009 ; Jain2012 .

Computing the iteration map.

We show now that, when there is no penalty term $g$ , the Bregman iteration map of $h_{G}$ can be computed efficiently, as it involves solving an easy quartic minimization subproblem of size $r$ .

Proposition 4 (Gram’s iteration map)

Assume that $g\equiv 0$ . For $X\in{\mathbb{R}}^{n\times r}$ , the Bregman iteration map of $f$ for the Gram kernel $h_{G}$ with step size $\lambda>0$ , called Gram’s iteration map, is given by

[TABLE]

where the matrices $V,Z$ are computed through the routine:

•

Set $V=\nabla h_{G}(X)-\lambda\nabla f(X)$ ,

•

diagonalize $V^{T}V$ as $V^{T}V=P^{T}DP$ where $P\in\mathcal{O}_{r}$ and $D=\mathop{\bf diag}(\eta_{1}^{2},\dots,\eta_{r}^{2})$ ,

•

let $\mu=(\mu_{1},\dots,\mu_{r})$ be the unique solution of the convex minimization problem

[TABLE]

•

finally set $Z=P^{T}\mathop{\bf diag}\big{[}\mu_{1}^{2},\dots,\mu_{r}^{2}\big{]}P.$

Proof

When $g\equiv 0$ , The Bregman iteration map of $h_{G}$ writes, for $X\in{\mathbb{R}}^{n\times r}$ ,

[TABLE]

where we remove constant terms and defined $V:=\nabla h_{G}(X)-\lambda\nabla f(X)$ . Write for the sake of clarity $U^{\star}:=T_{\lambda}(X)$ . The optimization problem (14) is strictly convex and the unique solution $U^{\star}$ satisfies $\nabla h_{G}(U^{\star})=V$ , meaning that

[TABLE]

Define $Z:=U^{\star T}U^{\star}\in{\mathbb{R}}^{r\times r}$ . Then, the knowledge of $Z$ determines $U^{\star}$ , since $\|U^{\star}\|^{2}=\mathop{\bf Tr}(Z)$ and therefore $U^{\star}=V(\alpha\mathop{\bf Tr}(Z)I_{r}+\beta Z+\sigma I_{r})^{-1}$ .

Now, taking (15) and multiplying by its transpose implies that

[TABLE]

This shows that $V^{T}V$ is a polynomial in $Z$ , and therefore that they admit the same eigenvectors. Write the diagonalization $V^{T}V=P^{T}\mathop{\bf diag}(\eta_{1}^{2},\dots,\eta_{r}^{2})P$ and $Z=P^{T}\mathop{\bf diag}(\mu_{1}^{2},\dots,\mu_{r}^{2})P$ where $P\in\mathcal{O}_{r}$ and $\mu_{i},\eta_{i}\geq 0$ for $i=1\dots r$ . It follows from diagonalizing (16) and taking the square root that

[TABLE]

This is exactly the first-order optimality condition on $\mu=(\mu_{1},\dots,\mu_{r})$ for the problem

[TABLE]

Note that we do not need to enforce the nonnegativity constraint on $x$ , since we chose $\eta_{i}\geq 0$ it follows that the optimal solution will be nonnegative. Hence, we can reconstruct $Z$ from the diagonalization of $V^{T}V$ and the solution of Problem (18), and thus we get the procedure described in the theorem for computing $U^{*}=T_{\lambda}(X)$ . ∎

Complexity

Note that the order of multiplication is important: we only need to compute the eigendecomposition of $V^{T}V$ , which is of size $r\times r$ . We additionally need to solve a small minimization problem of size $r$ , which can be done efficiently using the quartic NoLips algorithm with norm kernel (see Appendix A for implementation details). Due to this, the complexity of computing the Bregman iteration map of $h_{G}$ is $O(nr^{2}+r^{3}+Kr)$ , where $K$ is the number of iterations needed to solve the subproblem. Since $r$ is usually much smaller than $n$ by several orders of magnitude, the main computational bottleneck remains in most applications computing the gradient $\nabla f(X)$ .

2.3.3 Comparison: how to choose the most appropriate kernel

In order to devise efficient methods, one should search for the kernel $h$ such that the upper approximation of $f$ in (RelSmooth) is as tight as possible, or, equivalently, such that the Hessian of the residual $Lh-f$ is small. On the other hand, $h$ has to be simple enough so that the iteration map (3) is easy to compute (which precludes choosing $h=f$ , as the iteration would be as hard to solve as the initial problem). This trade-off is key in choosing the appropriate kernel. Let us review these two conflicting criteria in our situation.

Complexity of the Bregman iteration map.

For the norm kernel $h_{N}$ , one iteration involves computing the gradient of $f$ , then solving a simple scalar equation. The Gram kernel $h_{G}$ involves solving a subproblem which requires $O(nr^{2}+r^{3})$ additional operations. This overhead is negligible for the typical regime where $r\ll n$ ; however, the iterate can be computed easily only for unconstrained problems.

Quality of Hessian approximation.

We showed in Section 2.3.2 that the quality of the approximation of the quartic component of $f$ by the Gram kernel is bounded by $O(1-\mu_{F}/L_{F})$ . Therefore, it is expected to show good performance when $F$ is sufficiently well-conditioned. The norm kernel, however, has no such property, as its approximation of $f$ is much coarser. The difference stems from the supplementary $\|X^{T}X\|^{2}$ term, which can be much smaller than $\|X\|^{4}$ , especially when the columns of $X$ are nearly orthogonal.

Note that even if $F$ is not globally strongly convex or $\mu_{F}$ is unknown, the Gram kernel can take advantage of local strong convexity through adaptive step sizes, as we show in the sequel.

3 Algorithms for Quartic Low-Rank Minimization

Now that we are equipped with a non-Euclidean geometry induced by one of the kernels $h_{N}$ and $h_{G}$ , we are ready to define the minimization scheme Dyn-NoLips in Algorithm 1. It extends the NoLips algorithm from Bolte2018 to allow step sizes larger than the theoretical value $1/L$ .

Step size choice

The step size $\lambda_{k}$ is chosen so that the new iterate $X^{k}=T_{\lambda_{k}}(X^{k-1})$ satisfies

[TABLE]

There are two ways to ensure this condition holds.

•

Fixed step size. Since $f$ is $L$ -smooth relatively to $h$ , (19) holds as soon as $0<\lambda_{k}\leq{1}/{L}$ .

•

Dynamical step size. In some cases, the relative Lipschitz constant might be too conservative, and better numerical performance can be achieved by taking larger steps. We therefore can use a dynamical strategy for extending the step size, ensuring that (19) holds at each iteration. There are many strategies to efficiently adjust the step size; see, e.g., Nesterov2007 . In our case, we choose a simple strategy similar in spirit to the Armijo line search: at iteration $k$ , start with a tentative step size $\lambda_{k}$ , then find the smallest integer $j$ such that (19) is satisfied with step size $2^{-j}\lambda_{k}$ . Then, set $\lambda_{k+1}=2^{-j+1}\lambda_{k}$ .

Convergence to a stationary point.

We now extend the theoretical convergence results from Bolte2018 to handle the dynamical step size strategy.

Theorem 3.1 (Convergence results)

Let $\{X^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 1. Assume that

$f$ * is $L$ -smooth relatively to a distance kernel $h$ such that $h$ is strongly convex and twice continuously differentiable on ${\mathbb{R}}^{n\times r}$ , and the penalty function $g$ is convex.* 2. 2.

The function $\Psi=f+g$ is coercive (meaning that $\Psi(X)\rightarrow+\infty$ when $\|X\|\rightarrow+\infty$ ) and semialgebraic.

Then, the sequence $\{\Psi(X^{k})\}_{k\geq 0}$ is nonincreasing, and the sequence $\{X^{k}\}_{k\geq 0}$ converges towards a critical point $X^{*}$ of problem (P).

Proof

First, the step size $\lambda_{k}$ can be bounded for $k\geq 0$ as

[TABLE]

Indeed, the upper bound holds by construction of the algorithm. The lower bound comes from the relative smoothness property: condition (19) is true for every $\lambda\in(0,\frac{1}{L}]$ , so the inner loop will stop whenever $\lambda$ gets below $1/L$ .

Let us now prove the result. Since Condition (19) holds at each iteration $k$ , we can write

[TABLE]

On the other hand, the optimality condition characterizing $X^{k+1}=T_{\lambda_{k}}(X^{k})$ writes

[TABLE]

where $\partial g$ denotes the subdifferential of the convex function $g$ . Combining (22) with the subgradient inequality for $g$ yields

[TABLE]

Summing (21) and (23) gives

[TABLE]

which yields

[TABLE]

From this inequality, we can now prove the same convergence properties as for the standard NoLips scheme. Indeed, the monotonicity of the sequence $\{\Psi(X^{k})\}_{k\geq 0}$ is a direct consequence of the above. Since $\lambda_{k}\leq\lambda_{\rm max}$ , it follows that at every iteration $k\geq 0$ ,

[TABLE]

Now, this inequality is the same as the one needed to prove convergence in the case of the fixed step size in Bolte2018 . Thus, global convergence towards a critical point is a consequence of (Bolte2018, , Th. 4.1), since all the assumptions are met: the kernel $h$ is defined over the entire space ${\mathbb{R}}^{n\times r}$ , it is strongly convex, and $\nabla h$ is Lipschitz continuous on bounded subsets of ${\mathbb{R}}^{n\times r}$ (because we assumed it is $C^{2}$ ). We also need the fact that the sequence $\{X^{k}\}_{k\geq 0}$ is bounded, which is a consequence of the monotonicity of $\{\Psi(X^{k})\}_{k\geq 0}$ and the fact that the function $\Psi$ is coercive.∎

The semialgebraicity assumption is needed to establish the crucial nonsmooth Lojasiewicz property Bolte2007 , required to show convergence to a critical point. It holds for all the applications we cited, since the class of semialgebraic functions includes polynomial functions, $\ell_{1}$ and $\ell_{2}$ norms, the $\ell_{0}$ seminorm and indicators of polynomial sets.

4 Applications

We now illustrate applications of our methodology to two different low-rank problems, symmetric nonnegative matrix factorization and Euclidean distance matrix completion. We show that good numerical performance can be reached using the dynamical step strategy, and that, for Euclidean matrix completion, it can be further improved by using the Gram kernel.

4.1 Symmetric Nonnegative Matrix Factorization

Symmetric Nonnegative Matrix Factorization (SymNMF) is the task of finding, given a symmetric nonnegative matrix $M\in{\mathbb{R}}^{n\times n}$ , a nonnegative matrix $X\in{\mathbb{R}}^{n\times r}$ such that $M\approx XX^{T}$ . This is done by solving

[TABLE]

in the variable $X\in{\mathbb{R}}^{n\times r}$ , where the inequality constraint is meant componentwise and $r\leq n$ is the target rank.

(SymNMF) is used as a probabilistic clustering or graph clustering technique Ding2005 ; He2011 . Numerical experiments by Kuang2015 have shown that it achieves state-of-the-art clustering accuracy on several text and image datasets.

4.1.1 Solving SymNMF.

While (SymNMF) looks similar to the well-known asymmetric NMF problem $\min_{X,Y}\frac{1}{2}\|M-XY^{T}\|$ , it is actually harder. This is because NMF has a favorable block structure that allows the application of efficient alternating algorithms Controller2013 ; Cichocki2009 . SymNMF, however, does not enjoy the same block structure. Current solvers fall into two categories:

Direct solvers. There have been several attempts at solving the original problem, including multiplicative update rules He2011 , projected gradient algorithm quasi-Newton schemes Kuang2015 , and coordinate descent Vandaele2016 .

Nonsymmetric relaxations. Another idea is to use a mere penalty method Kuang2015 ; Lu2017 ; zhu2018a , relaxing (SymNMF) to the following penalized nonsymmetric problem

[TABLE]

in the variables $X,Y\in{\mathbb{R}}^{n\times r}$ , with parameter $\mu\geq 0$ . This formulation is very similar to asymmetric NMF and can be solved by the same fast alternating algorithms that exploit the block structure, such as Alternating Nonnegative Least Squares (ANLS) and Hierarchical Alternating Least Squares zhu2018a (HALS), which are arguably the fastest SymNMF solvers.

Applying NoLips

We propose to apply NoLips for optimizing the original objective function. Problem (SymNMF) falls within our framework with ${F(Y)=\frac{1}{2}\|M-Y\|^{2}}$ , which has a Lipschitz gradient with constant $1$ , and $g(X)=i_{\{X\geq 0\}}$ the indicator function of the nonnegative orthant. Therefore, Proposition 1 implies that $f(X):=\frac{1}{2}\|M-XX^{T}\|^{2}$ is $1$ -smooth relatively to the kernel $h_{N}$ with $\alpha=6$ and $\sigma=2\|\nabla F(0)\|=2\|M\|$ . Since, in addition, $f$ is polynomial and $g$ is the indicator of a polynomial set, $f+g$ is semialgebraic, and it is also coercive, so Theorem 3.1 guarantees that NoLips will converge towards a stationary point of problem (SymNMF).

In this problem, the Bregman iteration map is solved by simply adding a projection step

[TABLE]

where $U=\nabla h_{N}(X)-\lambda\nabla f(X)$ , $\tau_{\sigma}$ has been defined in Proposition 2 and $\Pi_{+}$ is the projection on the nonnegative orthant: $\Pi_{+}(U)=\max(U,0)$ (entrywise).

Computational complexity for NoLips.

The computational complexity of an iteration is dominated by gradient computations and objective function evaluations, as all other operations are linear in the size of the variable.

If $M$ is a $n\times n$ dense matrix, each gradient and function evaluation uses $O(n^{2}r+nr^{2})$ floating point operations. If $M$ is represented as a sparse matrix with $p<<n^{2}$ nonzero elements, then we can take advantage of this structure (Vandaele2016, , Rmk. 2) by using

[TABLE]

which yields a much improved $O\left((r^{2}+p)n\right)$ complexity per iteration.

4.1.2 Numerical experiments

We implemented the following algorithms: Algorithm 1 with dynamical step size and the norm kernel (Dyn-NoLips), the $\beta$ -SNMF scheme from He2011 , where we set $\beta=0.99$ as advised by the authors, the projected gradient algorithm (PG) with Armijo line search from Kuang2015 , where we use the line search parameters $\beta=0.1$ and $\sigma=0.01$ , the coordinate descent scheme (CD) from Vandaele2016 , the ADMM algorithm Lu2017 , and the two fast algorithms from zhu2018a for solving the penalized problem (P-NMF): SymANLS and SymHALS. For the last two, we tuned the $\mu$ penalization parameter for best performance. We left out the quasi-Newton algorithm from Kuang2015 because of its prohibitive $O(n^{3})$ complexity for large datasets.

All algorithms were implemented in Julia Bezanson2017 which is a highly-optimized numerical computing language. Since our algorithms have different complexity per iteration, it is essential to compare them in terms of running time, and Julia provides a fairly accurate way to do so as there is little interpreter overhead in loops.111Tests were run on a PC Intel CORE i7-4910MQ CPU @ 2.90 GHz x 8 with 32 Go RAM.

We used two image and two text datasets.

•

Image.

–

CBCL222http://cbcl.mit.edu/software-datasets/FaceData2.html: 2,429 images of faces of size $19\times 19$

–

Coil-20333http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php: 1440 images of size $128\times 128$ representing 20 objects under various angles.

•

Text.

–

**TDT2444http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html **: dataset of 11,201 news articles classified in 96 semantic categories. We used the version provided by Cai et al. CWH09 ; CMHZ08 ; CHZH07 ; CHH05 , which has been restricted to the largest 30 categories, leaving a total of 9,394 documents.

–

Reuters4: dataset of news articles, which we restricted to the largest 25 categories, leaving a total of 7,963 documents.

For all image and text datasets, we construct a sparse similarity matrix $M$ following the procedure described in (Kuang2015, , Section 7.1). We begin by computing the similarity graph between data points, using cosine similarity on term frequency vectors for text, and a Gaussian kernel for image (with the self-tuning method for the scale). The graph obtained is sparsified by keeping only the edges connecting the k-nearest neighbors, with $k=\lfloor\log_{2}n\rfloor+1$ . Then, $M$ is taken as a normalized version of the graph adjacency matrix.

We use the usual convergence criterion for constrained nonconvex problems

[TABLE]

where $\nabla^{P}f(X)$ is the projected gradient defined as

[TABLE]

Table 1 reports the average time needed to reach a convergence criterion of $\epsilon=10^{-3}$ , for 10 random initializations. For each dataset, we test several values for the rank parameter $r$ . In addition, Figure 1 shows the average evolution of the normalized objective gap $\left(f(X^{k})-f_{min}\right)/\|M\|^{2}$ , where $f_{min}$ is the minimal objective value encountered in all initializations.

Overall, the algorithm that shows the best convergence speed is SymHALS, but it has the disadvantage of needing to tune the penalization parameter $\mu$ . In the experiments we report, small values of $\mu$ yielded optimal performance, while the convergence theory of Zhu2018b only holds for large values for which the algorithm is much slower. By contrast, Dyn-NoLips is hyperparameter-free and has the second best overall performance. The gap with the other methods is particularly significant on the larger TDT2 and Reuters datasets, showing that the method scales well with problem dimension.

4.2 Euclidean Distance Matrix Completion

Euclidean distance matrix completion (EDMC) is the task of recovering the position of $n$ points $x_{1}^{*},\dots,x_{n}^{*}\in{\mathbb{R}}^{r}$ , given the knowledge of a partial set of pairwise distances $d_{ij}=\|x_{i}^{*}-x_{j}^{*}\|^{2}$ for $(i,j)\in\Omega$ , where $\Omega\subset[1,n]\times[1,n]$ . It is a fundamental problem with applications in sensor network localization and the study of conformation of molecules; see Fang2012 ; Qi2013 ; Dokmanic2015 and references therein.

The Burer-Monteiro nonconvex formulation for solving this problem writes

[TABLE]

in the variable $X\in{\mathbb{R}}^{n\times r}$ . It can be rewritten $f(X)=\frac{1}{2}\|\mathcal{P}_{\Omega}(\kappa(XX^{T})-D)\|^{2}$ where $D$ is the matrix of known distances, $\mathcal{P}_{\Omega}$ denotes the projection operator such that $\mathcal{P}_{\Omega}(Y)_{ij}=Y_{ij}$ if $(i,j)\in\Omega$ , and $\mathcal{P}_{\Omega}(Y)_{ij}=0$ elsewhere, and $\kappa$ is the linear operator defined for $Y\in{\mathbb{R}}^{n\times n}$ by

[TABLE]

Applying NoLips with the norm kernel.

Problem (EDMC) falls within our framework with $F(Y)=\frac{1}{2}\|\mathcal{P}_{\Omega}(\kappa(Y)-D)\|^{2}$ , which can be shown to have a Lipschitz gradient with constant

[TABLE]

Therefore, as in the case of SymNMF, the norm kernel $h_{N}$ can be used with an initial step size $1$ and parameters $\alpha=6L_{EDM}$ and $\sigma=\frac{1}{3}\|\nabla F(0)\|=2\|\mathcal{P}_{\Omega}(D)\|$ .

Using the Gram kernel

As the problem is unconstrained, we can also apply minimization using the Gram kernel $h_{G}$ . We use the parameters $\alpha=2L_{EDM}$ , ${\beta=L_{EDM}}$ and $\sigma=2\|\mathcal{P}_{\Omega}(D)\|$ , which ensure that $f$ is $1$ -smooth relatively to $h_{G}$ by Proposition 3.

Computational complexity for NoLips.

As before, the main computational bottleneck for an iteration consists in computing the value and gradient of the objective function. If $p=|\Omega|$ denotes the number of known distances, then the computational complexity is $O(pr)$ . If the Gram kernel is used, each iteration requires an additional $O(nr^{2}+r^{3})$ flops (see Section 2.3.2), which is negligible compared to the latter in the usual setting where $p>>n$ and $r$ is small.

Numerical experiments

We implement the following algorithms: NoLips with a dynamical step size and the norm kernel (Dyn-NoLips), NoLips with a dynamical step size and the Gram kernel (Dyn-NoLips-Gram), gradient descent with Armijo line search (GD), the Riemannian trust region algorithm from Mishra2011 (TR). We leave out semidefinite relaxations because of their memory requirement which is prohibitive on large data. As the implementation for TR is provided in Matlab, we run our experiments on Matlab as well, with the same setup as in Section 4.1.

We try the algorithms on a standard EDMC problem, the 3-dimensional Helix dataset Mishra2011 which is generated as $X_{i}=(\cos(3t_{i}),\sin(3t_{i}),2t_{i})$ where $\{t_{i}\}_{i=1}^{n}$ are sampled uniformly in $[0,2\pi]$ . We randomly keep only 10 % on the pairwise distances, and test on two different problem sizes: $n=2000$ and $n=5000$ . Figure 2 reports the normalized root mean squared error (RMSE) over all distances (known and unknown) averaged on $10$ random initializations. All the algorithms manage to recover the ground truth; the Dyn-NoLips-Gram algorithm shows the best numerical performance, which demonstrates the advantage of using the Gram geometry.

5 Conclusion

We proposed a generic approach for solving Burer-Monteiro formulations of low-rank minimization problems using the methodology of Bregman gradient methods and relative smoothness. We studied two quartic kernels, including a new Gram kernel, and demonstrated their benefits on numerical experiments. In future work, performance could be improved further by studying inertial variants Mukkamala2019 ; Hanzely2018 . New kernels could also be explored beyond the class of quartic functions to tackle other problems with inherent non-Euclidean geometries.

Code

The code for reproducing experiments for SymNMF and Euclidean Distance Matrix Completion can be downloaded from the public repository

https://github.com/RaduAlexandruDragomir/QuarticLowRankOptimization

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments.

Radu-Alexandru Dragomir would like to acknowledge support from an AMX fellowship, the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA9550-18-1-0226, as well as from Sébastien Gadat.

Jérôme Bolte was partially supported by ANR-3IA Artificial and Natural Intelligence Toulouse Institute, and Air Force Office of Scientific Research, and Air Force Material Command, USAF, under grant numbers FA9550-18-1-0226 & FA9550-19-1-7026.

AA is at CNRS & département d’informatique, École normale supérieure, UMR CNRS 8548, 45 rue d’Ulm 75005 Paris, France, INRIA and PSL Research University. AA acknowledges support from the French government under management of Agence Nationale de la Recherche as part of the ”Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), the ML & Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, as well as a Google focused award.

Appendix

Appendix A Solving the Subproblem for Computing the Bregman Iteration Map of the Gram Kernel

While it seems that computing the Bregman iteration map of the Gram kernel involves solving another difficult quartic subproblem, it is actually of small size ( $r$ is typically not larger than a few dozens) and can be solved efficiently with the NoLips scheme.

Indeed, the objective function $\phi$ of problem (18) is 1-smooth relatively to the norm kernel in ${\mathbb{R}}^{r}$ $h_{N}(x)=\frac{\alpha_{u}}{4}\|x\|^{4}+\frac{\sigma_{u}}{2}\|x\|^{2}$ with a choice of parameters $\alpha_{u}=\alpha+3\beta$ and $\sigma_{u}=\sigma$ .

Algorithm 2 details the procedure. We initialize $\mu$ with the values for the previous iteration of the outer procedure. This proves to be efficient as the values will not vary much from one iteration to another. For the stopping criterion, we use the scaled gradient norm $\|\nabla\phi(v)\|/\|\eta\|$ and a tolerance value $\epsilon=10^{-6}$ .

The subproblem being very well conditionned, it is minimized easily; in numerical experiments, it usually convergences in no more than $20$ iterations.

Bibliography45

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Emmanuel J. Candès and Benjamin Recht. Exact Matrix Completion Via Convex Optimization. Found Comput Math , 9(6), 2009.
2[2] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A Singular Value Tresholding Algorithm for Matrix Completion. SIAM Journal on Optimization , 20(4):1956–1982, 2010.
3[3] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank Matrix Completion using Alternating Minimization. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing , pages 665—-674, 2013.
4[4] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review , 52(3):471–501, 2007.
5[5] Bandev Mishra, Gilles Meyer, and Rodolphe Sepulchre. Low-rank optimization for distance matrix completion. In Proceedings of the IEEE Conference on Decision and Control , pages 4455–4460, 2011.
6[6] Haw Ren Fang and Dianne P. O’Leary. Euclidean distance matrix completion problems. Optimization Methods and Software , 27(4):695–717, 2012.
7[7] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory , 61(4):1985–2007, 2015.
8[8] Yudong Chen and Martin J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. ar Xiv preprint ar Xiv:1509.03025 , 2015.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Quartic First-Order Methods for Low-Rank Minimization

Abstract

Keywords:

MSC:

1 Introduction

Non-Euclidean gradient methods.

Contributions.

Notations

2 Quartic Geometries for Low-Rank Minimization

2.1 Problem Setup

Assumption 2.1

Lemma 1

Proof

2.2 Relative Smoothness and the Bregman Iteration Map

Definition 1 (Relative smoothness Bauschke2017 )

Bregman iteration map

2.3 The Quartic Geometry

2.3.1 The Quartic Norm Kernel

Proposition 1 (Norm kernel)

Proof

Proposition 2 (Bregman iteration map for hNh_{N}hN​, unconstrained case)

Constraints and regularization terms.

2.3.2 A More Refined Kernel for Unregularized Problems: the Gram Kernel

Proposition 3 (Gram kernel)

Proof

Approximation quality for well-conditioned FFF.

Computing the iteration map.

Proposition 4 (Gram’s iteration map)

Proof

Complexity

2.3.3 Comparison: how to choose the most appropriate kernel

Complexity of the Bregman iteration map.

Quality of Hessian approximation.

3 Algorithms for Quartic Low-Rank Minimization

Step size choice

Convergence to a stationary point.

Theorem 3.1 (Convergence results)

Proof

4 Applications

4.1 Symmetric Nonnegative Matrix Factorization

4.1.1 Solving SymNMF.

Applying NoLips

Computational complexity for NoLips.

4.1.2 Numerical experiments

4.2 Euclidean Distance Matrix Completion

Applying NoLips with the norm kernel.

Using the Gram kernel

Computational complexity for NoLips.

Numerical experiments

5 Conclusion

Code

Acknowledgments

Appendix

Appendix A Solving the Subproblem for Computing the Bregman Iteration Map of the Gram Kernel

Proposition 2 (Bregman iteration map for $h_{N}$ , unconstrained case)

Approximation quality for well-conditioned $F$ .