A Non-Convex Relaxation for Fixed-Rank Approximation

Carl Olsson; Marcus Carlsson; Erik Bylow

arXiv:1706.05855·math.OC·November 13, 2017·ICCV

TL;DR

This paper introduces a non-convex relaxation method for fixed-rank matrix approximation that avoids the bias of nuclear norm approaches and often converges to better solutions, especially under RIP conditions.

Contribution

It proposes a novel non-convex relaxation technique for low-rank matrix approximation that reduces bias and demonstrates favorable convergence properties compared to nuclear norm methods.

Findings

01

The non-convex relaxation often has a single local minimizer under RIP.

02

Numerical tests show better solutions than nuclear norm methods.

03

The approach performs well even when RIP does not hold.

Abstract

This paper considers the problem of finding a low rank matrix from observations of linear combinations of its elements. It is well known that if the problem fulfills a restricted isometry property (RIP), convex relaxations using the nuclear norm typically work well and come with theoretical performance guarantees. On the other hand these formulations suffer from a shrinking bias that can severely degrade the solution in the presence of noise. In this theoretical paper we study an alternative non-convex relaxation that in contrast to the nuclear norm does not penalize the leading singular values and thereby avoids this bias. We show that despite its non-convexity the proposed formulation will in many cases have a single local minimizer if a RIP holds. Our numerical tests show that our approach typically converges to a better solution than nuclear norm based alternatives even in cases…

Equations

$$\min_{\text{rank}(X)\leq r}\|X-X_{0}\|_{F}^{2},\tag{1}$$

$$\min_{X}\ \mathbb{I}(\text{rank}(X)\leq r)+\|\mathcal{A}X-{\bf b}\|^{2}.\tag{2}$$

$$(1-\delta_{q})\|X\|_{F}^{2}\leq\|\mathcal{A}X\|^{2}\leq(1+\delta_{q})\|X\|_{F}^{2},\tag{3}$$

$$\min_{X}\ \mathcal{R}_{r}(X)+\|\mathcal{A}X-{\bf b}\|^{2},\tag{4}$$

$$\mathcal{R}_{r}(X)=\max_{Z}\ \sum_{i=r+1}^{N}z_{i}^{2}-\|X-Z\|_{F}^{2},\tag{5}$$

$$\mathcal{R}_{r}(X)+\|X-X_{0}\|_{F}^{2},\tag{6}$$

$$\mathbb{I}(\text{rank}(X)\leq r)+\|X-X_{0}\|_{F}^{2}.\tag{7}$$

$$X_{s}\in\operatorname*{arg\,min}_{X}\ \mathcal{R}_{r}(X)+\|X-Z\|_{F}^{2}.\tag{8}$$

$$\min_{X}\ \mathcal{R}_{\mu}(X)+\|\mathcal{A}X-{\bf b}\|^{2}.\tag{9}$$

$$F(X)=G(X)-\delta_{q}\|X\|_{F}^{2}+H(X)+\|{\bf b}\|^{2},\tag{10}$$

$$2\delta_{q}X_{s}-\nabla H(X_{s})\in\partial G(X_{s}).\tag{11}$$

$$L(X,Z)=-\sum_{i=1}^{r}z_{i}^{2}+2\langle Z,X\rangle,\tag{12}$$

$$\partial G(X)=\operatorname{convhull}\{\nabla_{X}L(X,Z),\ Z\in\mathcal{Z}(X)\},\tag{13}$$

$$\partial G(X)=2\operatorname*{arg\,max}_{Z}L(X,Z).\tag{14}$$

$$z_{i}\in\begin{cases}\max(x_{i},s)&i\leq r\\ s&i\geq r,\ x_{i}\neq 0\\ [0,s]&i>r,\ x_{i}=0.\end{cases}\tag{15}$$

$$L(X,Z)=-\sum_{i=1}^{r}(z_{i}-x_{i})^{2}+\sum_{i=1}^{r}x_{i}^{2},\tag{16}$$

$$z_{i}\in\begin{cases}x_{i}&i\leq r\\ [0,x_{r}]&i\geq r.\end{cases}\tag{17}$$

$$z_{i}\in\begin{cases}x_{i}&i\in I\\ [0,\overline{s}]&i\notin I\end{cases},\qquad z_{i}^{\prime}\in\begin{cases}x_{i}^{\prime}&i\in I^{\prime}\\ [0,\overline{s}^{\prime}]&i\notin I^{\prime}\end{cases},\tag{18}$$

$$\langle{\bf z}^{\prime}-{\bf z},{\bf x}^{\prime}-{\bf x}\rangle>\frac{1-c}{2}\|{\bf x}^{\prime}-{\bf x}\|^{2}.\tag{19}$$

$$\sum_{\substack{i\in I\\ i\in I^{\prime}}}(x_{i}-x_{i}^{\prime})^{2}+\sum_{\substack{i\in I\\ i\notin I^{\prime}}}x_{i}(x_{i}-z_{i}^{\prime})+\sum_{\substack{i\notin I\\ i\in I^{\prime}}}x_{i}^{\prime}(x_{i}^{\prime}-z_{i}).\tag{20}$$

$$\sum_{\substack{i\in I\\ i\in I^{\prime}}}(x_{i}-x_{i}^{\prime})^{2}+\sum_{\substack{i\in I\\ i\notin I^{\prime}}}x_{i}^{2}+\sum_{\substack{i\notin I\\ i\in I^{\prime}}}x_{i}^{\prime 2}.\tag{21}$$

$$x_{i}(x_{i}-z_{i}^{\prime})+x_{j}^{\prime}(x_{j}^{\prime}-z_{j})\geq\frac{1-c}{2}(x_{i}^{2}+x_{j}^{\prime 2}),\tag{22}$$

$$x_{i}z_{i}^{\prime}\leq x_{i}x_{j}^{\prime}\leq\frac{x_{i}^{2}+x_{j}^{\prime 2}}{2},\tag{23}$$

$$x_{j}^{\prime}z_{j}<c\,x_{j}^{\prime}x_{i}\leq c\,\frac{x_{i}^{2}+x_{j}^{\prime 2}}{2},\tag{24}$$

$$a^{*}=\min_{U,V,U^{\prime},V^{\prime}}\frac{\langle Z^{\prime}-Z,X^{\prime}-X\rangle}{\|X^{\prime}-X\|_{F}^{2}}\leq 1\tag{25}$$

$$a^{*}=\min_{M_{\pi}}\frac{\langle M_{\pi}{\bf z}^{\prime}-{\bf z},M_{\pi}{\bf x}^{\prime}-{\bf x}\rangle}{\|M_{\pi}{\bf x}^{\prime}-{\bf x}\|^{2}},\tag{26}$$

$$\langle Z^{\prime}-Z,X^{\prime}-X\rangle>\frac{1-c}{2}\|X^{\prime}-X\|_{F}^{2},\tag{27}$$

$$\langle Z^{\prime}-Z,X^{\prime}-X\rangle>\frac{1-c+\epsilon}{2}\|X^{\prime}-X\|_{F}^{2},\tag{28}$$

$$\frac{\langle Z^{\prime}-Z,X^{\prime}-X\rangle}{\|X^{\prime}-X\|_{F}^{2}}\geq\frac{\langle M_{\pi}{\bf z}^{\prime}-{\bf z},M_{\pi}{\bf x}^{\prime}-{\bf x}\rangle}{\|M_{\pi}{\bf x}^{\prime}-{\bf x}\|^{2}}.\tag{29}$$

$$\langle Z^{\prime}-Z,X^{\prime}-X\rangle\geq\frac{1-c+\epsilon}{2}\|X^{\prime}-X\|_{F}^{2},\tag{30}$$

$$\nabla H(X)=2\delta_{q}X+2(\mathcal{A}^{*}\mathcal{A}-I)X-2\mathcal{A}^{*}{\bf b}.\tag{31}$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Image Processing Techniques · Statistical and numerical algorithms

Full text

A Non-Convex Relaxation for Fixed-Rank Approximation

Carl Olsson1,2 Marcus Carlsson1 Erik Bylow1

1Centre for Mathematical Sciences

Lund University

2Department of Electrical Engineering

Chalmers University of Technology

Abstract

This paper considers the problem of finding a low rank matrix from observations of linear combinations of its elements. It is well known that if the problem fulfills a restricted isometry property (RIP), convex relaxations using the nuclear norm typically work well and come with theoretical performance guarantees. On the other hand these formulations suffer from a shrinking bias that can severely degrade the solution in the presence of noise.

In this theoretical paper we study an alternative non-convex relaxation that in contrast to the nuclear norm does not penalize the leading singular values and thereby avoids this bias. We show that despite its non-convexity the proposed formulation will in many cases have a single local minimizer if a RIP holds. Our numerical tests show that our approach typically converges to a better solution than nuclear norm based alternatives even in cases when the RIP does not hold.

1 Introduction

Low rank approximation is an important tool in applications such as rigid and non rigid structure from motion, photometric stereo and optical flow [25, 4, 26, 13, 2, 12]. The rank of the approximating matrix typically describes the complexity of the solution. For example, in non-rigid structure from motion the rank measures the number of basis elements needed to describe the point motions [4]. Under the assumption of Gaussian noise the objective is typically to solve

$$\min_{\text{rank}(X)\leq r}\|X-X_{0}\|_{F}^{2},\tag{1}$$

where $X_{0}$ is a measurement matrix and $\|\cdot\|_{F}$ is the Frobenius norm. The problem can be solved optimally using SVD [9], but the strategy is limited to problems where all matrix elements are directly measured. In this paper we will consider low rank approximation problems where linear combinations of the elements are observed. We aim to solve problems of the form

$$\min_{X}\ \mathbb{I}(\text{rank}(X)\leq r)+\|\mathcal{A}X-{\bf b}\|^{2}.\tag{2}$$

Here ${\mathbb{I}}({\text{ rank}}(X)\leq r)$ is $0$ if ${\text{ rank}}(X)\leq r$ and $\infty$ otherwise. The linear operator ${\mathcal{A}}:\mathbb{R}^{m\times n}\rightarrow\mathbb{R}^{p}$ is assumed to fulfill a restricted isometry property (RIP) [24]

$$(1-\delta_{q})\|X\|_{F}^{2}\leq\|\mathcal{A}X\|^{2}\leq(1+\delta_{q})\|X\|_{F}^{2},\tag{3}$$

for all matrices with ${\text{ rank}}(X)\leq q$ . The standard approach for problems of this class is to replace the rank function with the convex nuclear norm $\|X\|_{*}=\sum_{i}\sigma_{i}(X)$ [24, 5]. It was first observed that this is the convex envelope of the rank function over the set $\{X;\sigma_{1}(X)\leq 1\}$ in [11]. Since then a number of generalizations that give performance guarantees for the nuclear norm relaxation have appeared, e.g. [24, 23, 5, 6]. The approach does however suffer from a shrinking bias that can severely degrade the solution in the presence of noise. In contrast to the rank constraint the nuclear norm penalizes both small singular values of $X$ , assumed to stem from measurement noise, and large singular values, assumed to make up the true signal, equally. In some sense the suppression of noise also requires an equal suppression of signal. Non-convex alternatives have been shown to improve performance [22, 20].
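The shrinking bias is easiest to see in the fully observed case. The following is a minimal NumPy sketch, not taken from the paper: it assumes the standard singular value soft-thresholding solution of $\min_X \mu\|X\|_{*}+\|X-X_{0}\|_{F}^{2}$ (every singular value is reduced by $\mu/2$) and compares it with plain rank-$r$ truncation. The function names, matrix sizes and noise level are illustrative choices.

```python
import numpy as np

def truncate_rank(X0, r):
    """Best rank-r approximation of X0 in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt

def soft_threshold_singular_values(X0, mu):
    """Minimizer of mu*||X||_* + ||X - X0||_F^2: shrink each singular value by mu/2."""
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    return (U * np.maximum(s - mu / 2.0, 0.0)) @ Vt

rng = np.random.default_rng(0)
X_true = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))  # rank 3 signal
X0 = X_true + 0.1 * rng.standard_normal((20, 20))                     # noisy measurement

# Smallest mu that gives rank 3: the threshold mu/2 equals the 4th singular value.
s0 = np.linalg.svd(X0, compute_uv=False)
mu = 2.0 * s0[3]
X_nuc = soft_threshold_singular_values(X0, mu)
X_trunc = truncate_rank(X0, 3)

s_nuc = np.linalg.svd(X_nuc, compute_uv=False)
s_trunc = np.linalg.svd(X_trunc, compute_uv=False)
print("leading singular values of X0:   ", s0[:3])
print("after rank-3 truncation:         ", s_trunc[:3])  # left untouched
print("after nuclear norm thresholding: ", s_nuc[:3])    # each reduced by s0[3]
```

Even with the smallest $\mu$ that produces the desired rank, the leading singular values are reduced by the size of the largest suppressed one, which is the bias the rank constraint does not have.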

In this paper we will consider the relaxation

$$\min_{X}\ \mathcal{R}_{r}(X)+\|\mathcal{A}X-{\bf b}\|^{2},\tag{4}$$

where

$$\mathcal{R}_{r}(X)=\max_{Z}\ \sum_{i=r+1}^{N}z_{i}^{2}-\|X-Z\|_{F}^{2},\tag{5}$$

and $z_{i}$, $i=1,...,N$ are the singular values of $Z$. The maximization over $Z$ does not have any closed form solution, however it was shown in [18, 1] how to efficiently evaluate the regularizer and compute its proximal operator. Figure 1 shows a three dimensional illustration of the level sets of the regularizer.

In [18, 17] it was shown that

$$\mathcal{R}_{r}(X)+\|X-X_{0}\|_{F}^{2}\tag{6}$$

is the convex envelope of

$$\mathbb{I}(\text{rank}(X)\leq r)+\|X-X_{0}\|_{F}^{2}.\tag{7}$$

By itself the regularization term $\mathcal{R}_{r}(X)$ is not convex, but when adding a quadratic term $\|X\|_{F}^{2}$ the result is convex. It is shown in [1] that (7) and (6) have the same optimizer as long as the singular values of $X_{0}$ are distinct. (When this is not the case the minimizer is not unique.) Assuming that (3) holds, $\|{\mathcal{A}}X\|^{2}$ will behave roughly like $\|X\|_{F}^{2}$ for matrices of rank at most $q$, and therefore it seems reasonable that (4) should have some convexity properties. In this paper we study the stationary points of (4). We show that if a RIP holds it is in many cases possible to guarantee that any stationary point of (4) (with rank $r$) is unique.

A number of recent works propose to use the related $\|\cdot\|_{r*}$ -norm [19, 10, 16, 15] (sometimes referred to as the spectral k-support norm). This is a generalization of the nuclear norm which is obtained when selecting $r=1$ . It can be shown [19] that the extreme points of the unit ball with this norm are rank $r$ matrices. Therefore this choice may be more appropriate than the nuclear norm when searching for solutions of a particular (known) rank. It can be seen (e.g. from the derivations in [16]) that $\|X\|_{r*}$ is the convex envelope of (7) when $X_{0}=0$ , which gives $\|X\|^{2}_{r*}=\mathcal{R}_{r}(X)+\|X\|_{F}^{2}$ . While the approach is convex the extra norm penalty adds a (usually) unwanted shrinking bias similar to what the nuclear norm does. In contrast, our approach avoids this bias since it uses a non-convex regularizer. Despite this non-convexity we are still able to derive strong optimality guarantees for an important class of problem instances.

1.1 Main Results and Contributions

Our main result, Theorem 2.4, shows that if $X_{s}$ is a stationary point of (4) and the singular values $z_{i}$ of the matrix $Z=(I-{\mathcal{A}}^{*}{\mathcal{A}})X_{s}+{\mathcal{A}}^{*}{\bf b}$ fulfill $z_{r+1}<(1-2\delta_{2r})z_{r}$ then there cannot be any other stationary point with rank less than or equal to $r$. The matrix $Z$ is related to the gradient of the objective function at $X_{s}$ (see Section 2). The term $\|X-Z\|_{F}^{2}$ can be seen as a local approximation of $\|{\mathcal{A}}X-{\bf b}\|^{2}$ close to $X_{s}$ (see [21]). If for example there is a rank $r$ matrix $X_{0}$ such that ${\bf b}={\mathcal{A}}X_{0}$ then it is easy to show that $X_{0}$ is a stationary point and the corresponding $Z$ is identical to $X_{0}$. Since this means that $z_{r+1}=0$ our results certify that this is the only stationary point of the problem if ${\mathcal{A}}$ fulfills (3) with $\delta_{2r}<\frac{1}{2}$. The following lemma clarifies the connection between the stationary point $X_{s}$ and $Z$.

Lemma 1.1.

The point $X_{s}$ is stationary in $F(X)=\mathcal{R}_{r}(X)+\|{\mathcal{A}}X-{\bf b}\|^{2}$ if and only if $2Z\in\partial G(X_{s})$, where $G(X)=\mathcal{R}_{r}(X)+\|X\|_{F}^{2}$, and if and only if

$$X_{s}\in\operatorname*{arg\,min}_{X}\ \mathcal{R}_{r}(X)+\|X-Z\|_{F}^{2}.\tag{8}$$

(The proof is identical to that of Lemma 3.1 in [21].) In [1] it is shown that as long as $z_{r}\neq z_{r+1}$ the unique solution of (8) is the best rank $r$ approximation of $Z$ . When there are several singular values that are equal to $z_{r}$ , (8) will have multiple solutions and some of them will not be of rank $r$ .

In [7] the relationship between minimizers of (4) and (2) is studied. We note that [7] shows that if $\|{\mathcal{A}}\|\leq 1$ then any local minimizer of (4) is also a local minimizer of (2), and that their global minimizers coincide. In this situation local minimizers of $F$ will therefore be rank $r$ approximations of $Z$ . Hence, loosely speaking our results state that in this situation any local minimum of $F$ is likely to be unique.

Our work builds on that of [21] which derives similar results for the non-convex regularizer $\mathcal{R}_{\mu}(X)=\sum_{i}\mu-\max(\sqrt{\mu}-x_{i},0)^{2}$, where $x_{i}$ are the singular values of $X$. In this case a trade-off between rank and residual error is optimized using the formulation

$$\min_{X}\ \mathcal{R}_{\mu}(X)+\|\mathcal{A}X-{\bf b}\|^{2}.\tag{9}$$

While it can be argued that (9) and (4) are essentially equivalent since we can iteratively search for a $\mu$ that gives the desired rank, the results of [21] may not rule out the existence of multiple high rank stationary points. In contrast, when using (4) our results imply that if $z_{r+1}<(1-2\delta_{2r})z_{r}$ then $X_{s}$ is the unique stationary point of the problem. (To see this, note that if there are other stationary points we can by the preceding discussion assume that at least one is of rank $r$ or less, which contradicts our main result in Theorem 2.4.) Hence, in this sense our results are stronger than those of [21] and allow for directly searching for a matrix of the desired rank with an essentially parameter free formulation.

1.2 Notation

In this section we introduce some preliminary material and notation. In general we will use boldface to denote a vector ${\bf x}$ and its $i$th element $x_{i}$. Unless otherwise stated the singular values of a matrix $X$ will be denoted $x_{i}$ and the vector of singular values ${\bf x}$. By $\|{\bf x}\|$ we denote the standard Euclidean norm $\|{\bf x}\|=\sqrt{{\bf x}^{T}{\bf x}}$. A diagonal matrix with diagonal elements ${\bf x}$ will be denoted $D_{\bf x}$. For matrices we define the scalar product as $\langle X,Y\rangle={\text{ tr}}(X^{T}Y)$, where tr is the trace function, and the Frobenius norm $\|X\|_{F}=\sqrt{\langle X,X\rangle}=\sqrt{\sum_{i=1}^{n}x_{i}^{2}}$. The adjoint of the linear matrix operator ${\mathcal{A}}$ is denoted ${\mathcal{A}}^{*}$. By $\partial F(X)$ we mean the set of subgradients of the function $F$ at $X$ and by a stationary point we mean a solution to $0\in\partial F(X)$.

2 Optimality Conditions

Let $F(X)=\mathcal{R}_{r}(X)+\|{\mathcal{A}}X-{\bf b}\|^{2}$ . We can equivalently write

$$F(X)=G(X)-\delta_{q}\|X\|_{F}^{2}+H(X)+\|{\bf b}\|^{2},\tag{10}$$

where $G(X)=\mathcal{R}_{r}(X)+\|X\|_{F}^{2}$ and $H(X)=\delta_{q}\|X\|^{2}_{F}+\left(\|{\mathcal{A}}X\|^{2}-\|X\|_{F}^{2}\right)-2\langle{\mathcal{A}}X,{\bf b}\rangle$. The function $G$ is convex and sub-differentiable. Any stationary point $X_{s}$ of $F$ therefore has to fulfill

$$2\delta_{q}X_{s}-\nabla H(X_{s})\in\partial G(X_{s}).\tag{11}$$

Computation of the gradient gives the optimality conditions $2Z\in\partial G(X_{s})$ where $Z=(I-{\mathcal{A}}^{*}{\mathcal{A}})X_{s}+{\mathcal{A}}^{*}{\bf b}$ .

2.1 Subgradients of $G$

For our analysis we need to determine the subdifferential $\partial G(X)$ of the function $G(X)$ . Let ${\bf x}$ be the vector of singular values of $X$ and $X=UD_{\bf x}V^{T}$ be the SVD. Using Von Neumann’s trace theorem it is easy to see [18] that the $Z$ that maximizes (5) has to be of the form $Z=UD_{\bf z}V^{T}$ , where ${\bf z}$ are singular values. If we let

$$L(X,Z)=-\sum_{i=1}^{r}z_{i}^{2}+2\langle Z,X\rangle,\tag{12}$$

then we have $G(X)=\max_{Z}L(X,Z)$ . The function $L$ is linear in $X$ and concave in $Z$ . Furthermore for any given $X$ the corresponding maximizers can be restricted to a compact set (because of the dominating quadratic term). By Danskin’s Theorem, see [3], the subgradients of $G$ are then given by

$$\partial G(X)=\operatorname{convhull}\{\nabla_{X}L(X,Z),\ Z\in\mathcal{Z}(X)\},\tag{13}$$

where $\mathcal{Z}(X)=\operatorname*{arg\,max}_{Z}L(X,Z)$ . We note that by concavity the maximizing set $\mathcal{Z}(X)$ is convex. Since $\nabla_{X}L(X,Z)=2Z$ we get

$$\partial G(X)=2\operatorname*{arg\,max}_{Z}L(X,Z).\tag{14}$$

To find the set of subgradients we thus need to determine all maximizers of $L$ . Since the maximizing $Z$ has the same $U$ and $V$ as $X$ what remains is to determine the singular values of $Z$ . It can be shown [18] that these have the form

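$$z_{i}\in\begin{cases}\max(x_{i},s)&i\leq r\\ s&i\geq r,\ x_{i}\neq 0\\ [0,s]&i>r,\ x_{i}=0\end{cases}\tag{15}$$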

for some number $s\geq x_{r}$. (The case $x_{i}=0$, $i>r$ is actually not addressed in [18]. However, it is easy to see that any value in $[0,s]$ works since $z_{i}$ vanishes from (12) when $x_{i}=0$, $i>r$. In fact, any value in $[-s,s]$ works, but we use the convention that singular values are non-negative. Note that the columns of $U$ that correspond to zero singular values of $X$ are not uniquely defined. We can always achieve a decreasing sequence with $z_{i}\in[0,s]$ by changing signs and switching order.)
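Given the structure (15), evaluating $\mathcal{R}_{r}(X)$ reduces to a search over the single number $s$. The sketch below is my own derivation from (5) and (15), not the procedure of [18]: inserting $z_{i}=\max(x_{i},s)$ for $i\leq r$ and $z_{i}=s$ for $i>r$ into (5) gives a concave, piecewise quadratic function of $s$, and the code tests the stationary point of each quadratic piece. The function name `fixed_rank_regularizer` is hypothetical, and $1\leq r\leq\min(m,n)$ is assumed.

```python
import numpy as np

def fixed_rank_regularizer(X, r):
    """Evaluate R_r(X) = max_Z sum_{i>r} z_i^2 - ||X - Z||_F^2 by a scalar search over s."""
    x = np.linalg.svd(X, compute_uv=False)  # singular values, non-increasing

    def objective(s):
        # z_i = max(x_i, s) for i <= r and z_i = s for i > r, as in (15)
        head = -np.sum(np.maximum(s - x[:r], 0.0) ** 2)
        tail = np.sum(2.0 * x[r:] * s - x[r:] ** 2)
        return head + tail

    # If the k smallest of the r leading singular values lie below s, the derivative
    # of the objective vanishes at s = (x_{r-k+1} + ... + x_N) / k. Every candidate
    # is a feasible choice of s, so the largest objective value is the maximum.
    candidates = [np.sum(x[r - k:]) / k for k in range(1, r + 1)]
    return max(objective(s) for s in candidates)

# A matrix of rank at most r is not penalized at all.
rng = np.random.default_rng(1)
Y = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))
print(fixed_rank_regularizer(Y, 2))                     # numerically zero

# For r = 1 the text states that R_r(X) + ||X||_F^2 is the squared nuclear norm.
W = rng.standard_normal((4, 4))
lhs = fixed_rank_regularizer(W, 1) + np.linalg.norm(W, "fro") ** 2
rhs = np.sum(np.linalg.svd(W, compute_uv=False)) ** 2
print(lhs, rhs)
```

The two checks at the end correspond to statements made in the text: rank-$r$ matrices have zero penalty, and $r=1$ recovers the nuclear norm through $\|X\|_{r*}^{2}=\mathcal{R}_{r}(X)+\|X\|_{F}^{2}$.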

For a general matrix $X$ the value of $s$ can not be determined analytically but has to be computed numerically by maximizing a one dimensional concave and differentiable function [18]. If ${\text{ rank}}(X)\leq r$ it is however clear that the optimal choice is $s=x_{r}$ . To see this we note that since the optimal $Z$ is of the form $UD_{\bf z}V^{T}$ we have

$$L(X,Z)=-\sum_{i=1}^{r}(z_{i}-x_{i})^{2}+\sum_{i=1}^{r}x_{i}^{2},\tag{16}$$

if ${\text{ rank}}(X)\leq r$. Selecting $s=x_{r}$ and inserting (15) into (16) gives $L(X,Z)=\sum_{i=1}^{r}x_{i}^{2}$, which is clearly the maximum. Hence if ${\text{ rank}}(X)=r$ we conclude that the subgradients of $G$ are given by $2Z=2UD_{\bf z}V^{T}$ where

$$z_{i}\in\begin{cases}x_{i}&i\leq r\\ [0,x_{r}]&i\geq r.\end{cases}\tag{17}$$

2.2 Growth estimates for $\partial G(X)$

Next we derive a bound on the growth of the subgradients that will be useful when considering the uniqueness of low rank stationary points.

Let ${\bf x}$ and ${\bf x}^{\prime}$ be two vectors both with at most $r$ non-zero (positive) elements, and $I$ and $I^{\prime}$ be the indexes of the $r$ largest elements of ${\bf x}$ and ${\bf x}^{\prime}$ respectively. We will assume that both $I$ and $I^{\prime}$ contain $r$ elements. If in particular ${\bf x}^{\prime}$ has fewer than $r$ non-zero elements we also include some zero elements in $I^{\prime}$ . We define the corresponding sequences ${\bf z}$ and ${\bf z}^{\prime}$ by

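$$z_{i}\in\begin{cases}x_{i}&i\in I\\ [0,\overline{s}]&i\notin I\end{cases},\qquad z_{i}^{\prime}\in\begin{cases}x_{i}^{\prime}&i\in I^{\prime}\\ [0,\overline{s}^{\prime}]&i\notin I^{\prime}\end{cases},\tag{18}$$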

where $\overline{s}=\min_{i\in I}x_{i}$ and $\overline{s}^{\prime}=\min_{i\in I^{\prime}}x^{\prime}_{i}$. If ${\bf x}^{\prime}$ has fewer than $r$ non-zero elements then $\overline{s}^{\prime}=0$. Note that we do not require that the elements of the ${\bf x},{\bf x}^{\prime},{\bf z}$ and ${\bf z}^{\prime}$ vectors are ordered in decreasing order. We will see later (Lemma 2.2) that in order to estimate the effects of the $U$ and $V$ matrices we need to be able to handle permutations of the singular values. For our analysis we will also use the quantity $\underline{s}=\max_{i\notin I}z_{i}$.

Lemma 2.1.

If $\underline{s}<c\overline{s}$ , where $0<c<1$ then

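$$\langle{\bf z}^{\prime}-{\bf z},{\bf x}^{\prime}-{\bf x}\rangle>\frac{1-c}{2}\|{\bf x}^{\prime}-{\bf x}\|^{2}.\tag{19}$$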

Proof.

Since $z_{i}=x_{i}$ when $i\in I$ and $x_{i}=0$ otherwise, we can write the inner product $\langle{\bf z}^{\prime}-{\bf z},{\bf x}^{\prime}-{\bf x}\rangle$ as

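$$\sum_{\substack{i\in I\\ i\in I^{\prime}}}(x_{i}-x_{i}^{\prime})^{2}+\sum_{\substack{i\in I\\ i\notin I^{\prime}}}x_{i}(x_{i}-z_{i}^{\prime})+\sum_{\substack{i\notin I\\ i\in I^{\prime}}}x_{i}^{\prime}(x_{i}^{\prime}-z_{i}).\tag{20}$$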

Note that $\|{\bf x}^{\prime}-{\bf x}\|^{2}=$

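$$\sum_{\substack{i\in I\\ i\in I^{\prime}}}(x_{i}-x_{i}^{\prime})^{2}+\sum_{\substack{i\in I\\ i\notin I^{\prime}}}x_{i}^{2}+\sum_{\substack{i\notin I\\ i\in I^{\prime}}}x_{i}^{\prime 2}.\tag{21}$$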

Since the second and third sum in (20) have the same number of terms it suffices to show that

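$$x_{i}(x_{i}-z_{i}^{\prime})+x_{j}^{\prime}(x_{j}^{\prime}-z_{j})\geq\frac{1-c}{2}(x_{i}^{2}+x_{j}^{\prime 2}),\tag{22}$$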

when $i\in I$ , $i\notin I^{\prime}$ and $j\notin I$ , $j\in I^{\prime}$ . By the assumption $\underline{s}<c\overline{s}$ we know that $z_{j}<cx_{i}$ . We further know that $z^{\prime}_{i}\leq\overline{s}^{\prime}\leq x^{\prime}_{j}$ . This gives

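$$x_{i}z_{i}^{\prime}\leq x_{i}x_{j}^{\prime}\leq\frac{x_{i}^{2}+x_{j}^{\prime 2}}{2},\tag{23}$$

$$x_{j}^{\prime}z_{j}<c\,x_{j}^{\prime}x_{i}\leq c\,\frac{x_{i}^{2}+x_{j}^{\prime 2}}{2}.\tag{24}$$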

Inserting these inequalities into the left hand side of (22) gives the desired bound. ∎
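Written out, the final step of the proof is a short calculation. It assumes the two bounds $x_{i}z^{\prime}_{i}\leq\frac{1}{2}(x_{i}^{2}+x_{j}^{\prime 2})$ and $x^{\prime}_{j}z_{j}\leq\frac{c}{2}(x_{i}^{2}+x_{j}^{\prime 2})$ that follow from $z^{\prime}_{i}\leq x^{\prime}_{j}$, $z_{j}<cx_{i}$ and the inequality $2ab\leq a^{2}+b^{2}$:

```latex
\begin{aligned}
x_{i}(x_{i}-z_{i}^{\prime})+x_{j}^{\prime}(x_{j}^{\prime}-z_{j})
  &= x_{i}^{2}+x_{j}^{\prime 2}-x_{i}z_{i}^{\prime}-x_{j}^{\prime}z_{j} \\
  &\geq x_{i}^{2}+x_{j}^{\prime 2}-\frac{x_{i}^{2}+x_{j}^{\prime 2}}{2}
        -c\,\frac{x_{i}^{2}+x_{j}^{\prime 2}}{2} \\
  &= \frac{1-c}{2}\left(x_{i}^{2}+x_{j}^{\prime 2}\right).
\end{aligned}
```

Summing this over the paired indices, and noting that the terms with $i\in I\cap I^{\prime}$ appear identically in both sums, gives the bound of the lemma.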

The above result gives an estimate of the growth of the subdifferential in terms of the singular values. To derive a similar estimate for the matrix elements we need the following lemma:

Lemma 2.2.

Let ${\bf x}$ , ${\bf x}^{\prime}$ , ${\bf z}$ , ${\bf z}^{\prime}$ be fixed vectors with non-increasing and non-negative elements such that ${\bf x}\neq{\bf x}^{\prime}$ and ${\bf z}$ and ${\bf z}^{\prime}$ fulfill (15) (with ${\bf x}$ and ${\bf x}^{\prime}$ respectively). Define $X^{\prime}=U^{\prime}D_{{\bf x}^{\prime}}V^{\prime T}$ , $X=UD_{\bf x}V^{T}$ , $Z^{\prime}=U^{\prime}D_{{\bf z}^{\prime}}V^{\prime T}$ , and $Z=UD_{\bf z}V^{T}$ as functions of unknown orthogonal matrices $U$ , $V$ , $U^{\prime}$ and $V^{\prime}$ . If

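$$a^{*}=\min_{U,V,U^{\prime},V^{\prime}}\frac{\langle Z^{\prime}-Z,X^{\prime}-X\rangle}{\|X^{\prime}-X\|_{F}^{2}}\leq 1\tag{25}$$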

then

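$$a^{*}=\min_{M_{\pi}}\frac{\langle M_{\pi}{\bf z}^{\prime}-{\bf z},M_{\pi}{\bf x}^{\prime}-{\bf x}\rangle}{\|M_{\pi}{\bf x}^{\prime}-{\bf x}\|^{2}},\tag{26}$$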

where $M_{\pi}$ belongs to the set of permutation matrices.

The proof is almost identical to that of Lemma 4.1 in [21] and therefore we omit it. While our subdifferential is different to the one studied in [21], for both of them we have that the ${\bf z}$ and ${\bf x}$ vectors fulfill $z_{i}\geq x_{i}\geq 0$ which is the only property that is used in the proof.

Corollary 2.3.

Assume that $X$ is of rank $r$ and $2Z\in\partial G(X)$ . If the singular values of the matrix $Z$ fulfill $z_{r+1}<cz_{r}$ , where $0<c<1$ , then for any $2Z^{\prime}\in\partial G(X^{\prime})$ with ${\text{ rank}}(X^{\prime})\leq r$ we have

$$\langle Z^{\prime}-Z,X^{\prime}-X\rangle>\frac{1-c}{2}\|X^{\prime}-X\|_{F}^{2},\tag{27}$$

as long as $\|X^{\prime}-X\|_{F}\neq 0$ .

Proof.

We let ${\bf x},{\bf x}^{\prime},{\bf z}$ and ${\bf z}^{\prime}$ be the singular values of the matrices $X,X^{\prime},Z$ and $Z^{\prime}$ respectively. Our proof essentially follows that of Corollary 4.2 in [21], where a similar result is first proven under the assumption that ${\bf x}\neq{\bf x}^{\prime}$ and then generalized to the general case using a continuity argument. For this purpose we need to extend the infeasible interval somewhat. Since $0<c<1$ and $z_{r+1}<cz_{r}$ are strict inequalities there is an $\epsilon>0$ such that $z_{r+1}<(c-\epsilon)z_{r}$ and $0<c-\epsilon<1$. Now assume that $a^{*}>1$ in (25), then clearly

$$\langle Z^{\prime}-Z,X^{\prime}-X\rangle>\frac{1-c+\epsilon}{2}\|X^{\prime}-X\|_{F}^{2},\tag{28}$$

since $\frac{1-c+\epsilon}{2}<1$ . Otherwise $a^{*}\leq 1$ and we have

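$$\frac{\langle Z^{\prime}-Z,X^{\prime}-X\rangle}{\|X^{\prime}-X\|_{F}^{2}}\geq\frac{\langle M_{\pi}{\bf z}^{\prime}-{\bf z},M_{\pi}{\bf x}^{\prime}-{\bf x}\rangle}{\|M_{\pi}{\bf x}^{\prime}-{\bf x}\|^{2}}.\tag{29}$$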

According to Lemma 2.1 the right hand side is strictly larger than $\frac{1-c+\epsilon}{2}$ , which proves that (28) holds for all $X^{\prime}$ with ${\bf x}^{\prime}\neq{\bf x}$ .

It remains to show that

$$\langle Z^{\prime}-Z,X^{\prime}-X\rangle\geq\frac{1-c+\epsilon}{2}\|X^{\prime}-X\|_{F}^{2},\tag{30}$$

for the case ${\bf x}^{\prime}={\bf x}$ and $\|X^{\prime}-X\|_{F}\neq 0$ . Since $\epsilon>0$ is arbitrary this proves the Corollary. This can be done as in [21] using continuity of the scalar product and the Frobenius norm. Specifically, a sequence $X(t)\rightarrow X$ , when $t\rightarrow 0$ , is defined by modifying the largest singular value and letting $\sigma_{1}(X(t))=\sigma_{1}(X)+t$ . It is easy to verify that $X(t)$ fulfills (28) for every $t>0$ . Letting $t\rightarrow 0$ then proves (30). ∎

2.3 Uniqueness of Low Rank Stationary Points

In this section we show that if a RIP (3) holds and the singular values $z_{r}$ and $z_{r+1}$ are well separated there can only be one stationary point of $F$ that has rank $r$ . We first derive a bound on the gradients of $H$ . We have

$$\nabla H(X)=2\delta_{q}X+2(\mathcal{A}^{*}\mathcal{A}-I)X-2\mathcal{A}^{*}{\bf b}.\tag{31}$$

This gives $\langle\nabla H(X^{\prime})-\nabla H(X),X^{\prime}-X\rangle=$

[TABLE]

By (3) $\left|\|{\mathcal{A}}(X^{\prime}-X)\|^{2}-\|X^{\prime}-X\|_{F}^{2}\right|\leq\delta_{q}\|X^{\prime}-X\|_{F}^{2},$ if ${\text{ rank}}(X^{\prime}-X)\leq q$ which gives

[TABLE]

This leads us to our main result

Theorem 2.4.

Assume that $X_{s}$ is a stationary point of $F$ , that is, $(I-{\mathcal{A}}^{*}{\mathcal{A}})X_{s}+{\mathcal{A}}^{*}{\bf b}=Z$ , where $2Z\in\partial G(X_{s})$ , ${\text{ rank}}(X_{s})=r$ and the singular values of $Z$ fulfill $z_{r+1}<(1-2\delta_{2r})z_{r}$ . If $X^{\prime}_{s}$ is another stationary point then $\text{rank}(X^{\prime}_{s})>r$ .

Proof.

Assume that ${\text{ rank}}(X^{\prime}_{s})\leq r$ . Since both $X_{s}$ and $X^{\prime}_{s}$ are stationary we have

[TABLE]

where $2Z\in\partial G(X_{s})$ and $2Z^{\prime}\in\partial G(X^{\prime}_{s})$ . Taking the difference between the two equations yields

[TABLE]

which implies

[TABLE]

where $V=X^{\prime}_{s}-X_{s}$ has ${\text{ rank}}(V)\leq 2r$ . By (33) the left hand side is less than $2\delta_{2r}\|V\|_{F}^{2}$ . However, according to Corollary 2.3 (with $c=1-2\delta_{2r}$ ) the right hand side is larger than $2\delta_{2r}\|V\|_{F}^{2}$ which contradicts ${\text{ rank}}(X^{\prime}_{s})\leq r$ . ∎
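The contradiction can be written out in full. The chain below is a reconstruction from the stationarity condition $2\delta_{q}X_{s}-\nabla H(X_{s})=2Z$, the gradient $\nabla H(X)=2\delta_{q}X+2({\mathcal{A}}^{*}{\mathcal{A}}-I)X-2{\mathcal{A}}^{*}{\bf b}$ and the RIP with $q=2r$. It is not copied from the paper's displayed equations, so the intermediate steps may be arranged differently there.

```latex
\begin{aligned}
2\delta_{2r}V-\bigl(\nabla H(X^{\prime}_{s})-\nabla H(X_{s})\bigr)
  &= 2(I-\mathcal{A}^{*}\mathcal{A})V = 2(Z^{\prime}-Z), \\
2\bigl(\|V\|_{F}^{2}-\|\mathcal{A}V\|^{2}\bigr)
  &= 2\langle Z^{\prime}-Z,V\rangle
  && \text{(inner product with } V\text{)}, \\
2\bigl(\|V\|_{F}^{2}-\|\mathcal{A}V\|^{2}\bigr)
  &\leq 2\delta_{2r}\|V\|_{F}^{2}
  && \text{(RIP, } \operatorname{rank}(V)\leq 2r\text{)}, \\
2\langle Z^{\prime}-Z,V\rangle
  &> (1-c)\|V\|_{F}^{2} = 2\delta_{2r}\|V\|_{F}^{2}
  && \text{(Corollary 2.3, } c=1-2\delta_{2r}\text{)}.
\end{aligned}
```

The second and third lines bound the same quantity from above by $2\delta_{2r}\|V\|_{F}^{2}$, while the last line bounds it strictly from below by the same number, which is impossible when $V\neq 0$. Note that $c=1-2\delta_{2r}$ lies in $(0,1)$ only when $0<\delta_{2r}<\frac{1}{2}$.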

Remark.

Note that [7] shows that if $\|{\mathcal{A}}\|\leq 1$ then any local minimizer of (4) is also a local minimizer of (2) and therefore of rank $r$ . Hence any local minimizer $X_{s}$ obeying the conditions of the theorem will be unique.
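The condition of Theorem 2.4 is straightforward to test numerically once a candidate point is available. The sketch below assumes that ${\mathcal{A}}$ is stored as a $p\times mn$ matrix `A` acting on the column-stacked $X$ and that a RIP constant `delta` is known. The function name `certificate_holds` is hypothetical, and the function does not verify that `X_s` is actually stationary.

```python
import numpy as np

def certificate_holds(A, b, X_s, r, delta):
    """Check z_{r+1} < (1 - 2*delta) * z_r for Z = (I - A^T A) X_s + A^T b.

    A is a (p, m*n) matrix acting on the column-stacked X_s and b has length p.
    Requires r < min(m, n). Returns the outcome of the test together with Z.
    """
    m, n = X_s.shape
    x = X_s.flatten(order="F")                 # column stacking
    z = x - A.T @ (A @ x) + A.T @ b            # vec of (I - A^*A) X_s + A^* b
    Z = z.reshape((m, n), order="F")
    sv = np.linalg.svd(Z, compute_uv=False)
    return bool(sv[r] < (1.0 - 2.0 * delta) * sv[r - 1]), Z

# Noise-free example: b = A X_0 with X_0 of rank r, so that Z = X_0 and z_{r+1} = 0.
rng = np.random.default_rng(2)
m, n, r = 6, 5, 2
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
A, _ = np.linalg.qr(rng.standard_normal((m * n, m * n)))   # orthogonal, so (3) holds for any delta >= 0
b = A @ X0.flatten(order="F")
ok, Z = certificate_holds(A, b, X0, r, delta=0.2)
print(ok)
```

This mirrors the noise-free case discussed in Section 1.1, where the corresponding $Z$ equals $X_{0}$.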

3 Implementation and Experiments

In this section we test the proposed approach on some simple real and synthetic applications (some that fulfill (3) and some that do not). For our implementation we use the GIST approach from [14] because of its simplicity. Given a current iterate $X_{k}$ this method solves

[TABLE]

where $M_{k}=X_{k}-\frac{1}{\tau_{k}}({\mathcal{A}}^{*}{\mathcal{A}}X_{k}-{\mathcal{A}}^{*}{\bf b})$. Note that if $\tau_{k}=1$ then any fixed point of (38) is a stationary point by Lemma 1.1. To solve (38) we use the proximal operator computed in [18].

Our algorithm consists of repeatedly solving (38) for a sequence of $\{\tau_{k}\}$ . We start from a larger value ( $\tau_{0}=5$ in our implementation) and reduce towards $1$ as long as this results in decreasing objective values. Specifically we set $\tau_{k+1}=\frac{\tau_{k}-1}{1.1}+1$ if the previous step was successful in reducing the objective value. Otherwise we increase $\tau$ according to $\tau_{k+1}=1.5(\tau_{k}-1)+1$ .
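The step-size rule and the iteration can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the proximal operator of [18] needed for $\tau_{k}\neq 1$ is not reproduced, so the loop below fixes $\tau_{k}=1$, where by Lemma 1.1 and the discussion of (8) the subproblem is solved by rank-$r$ truncation of $M_{k}$ (provided its $r$th and $(r+1)$th singular values differ). The names `next_tau`, `truncate_rank` and `iterate_tau_one`, and the test operator, are illustrative.

```python
import numpy as np

def next_tau(tau, decreased):
    """Update from the text: move tau towards 1 after a successful step, away otherwise."""
    return (tau - 1.0) / 1.1 + 1.0 if decreased else 1.5 * (tau - 1.0) + 1.0

def truncate_rank(M, r):
    """Best rank-r approximation of M (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt

def iterate_tau_one(A, b, shape, r, iters=100):
    """Iterate X <- rank-r truncation of M = X - A^T (A vec(X) - b), i.e. tau_k = 1.

    Assumption: only the tau = 1 case is handled; the general prox of [18] is omitted.
    """
    X = np.zeros(shape)
    for _ in range(iters):
        x = X.flatten(order="F")
        M = (x - A.T @ (A @ x - b)).reshape(shape, order="F")
        X = truncate_rank(M, r)
    return X

# Illustrative noise-free problem with squared singular values of A in [0.8, 1.2].
rng = np.random.default_rng(0)
m, n, r = 6, 6, 2
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
Q1, _ = np.linalg.qr(rng.standard_normal((m * n, m * n)))
Q2, _ = np.linalg.qr(rng.standard_normal((m * n, m * n)))
A = Q1 @ np.diag(np.sqrt(rng.uniform(0.8, 1.2, m * n))) @ Q2.T
b = A @ X0.flatten(order="F")
X = iterate_tau_one(A, b, (m, n), r)
print(np.linalg.norm(X - X0))
```

In this well-conditioned noise-free setting the error contracts by a factor of at most $0.4$ per iteration, since $\|I-A^{T}A\|\leq 0.2$ and truncation at most doubles the distance to the rank-$r$ matrix $X_{0}$. A full implementation would wrap the loop with `next_tau`, starting from $\tau_{0}=5$.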

3.1 Synthetic Data

We first evaluate the quality of the relaxation on a number of synthetic experiments. We compare the two formulations (4) and

[TABLE]

In Figure 2 (a) we tested these two relaxations on a number of synthetic problems with varying noise levels. The data was created so that the operator ${\mathcal{A}}$ fulfills (3) with $\delta=0.2$. By column stacking an $m\times n$ matrix $X$ the linear mapping ${\mathcal{A}}$ can be represented with a matrix $A$ of size $p\times mn$. It is easy to see that if we let $p=mn$ the term $(1-\delta_{q})$ of (3) will be the same as the smallest singular value of $A$ squared. (In this case the RIP constraint will hold for any rank and we therefore suppress the subscript $q$ and only use $\delta$.) For the data in Figure 3 (a) we selected $400\times 400$ matrices $A$ with random $\mathcal{N}(0,1)$ (Gaussian with mean $0$ and variance $1$) entries and modified their singular values. We then generated $20\times 20$ matrices $X$ of rank $5$ by sampling $20\times 5$ matrices $U$ and $V$ with $\mathcal{N}(0,1)$ entries and computing $X=UV^{T}$. The measurement vector ${\bf b}$ was created by computing ${\mathcal{A}}X+{\bm{\epsilon}}$, where ${\bm{\epsilon}}$ is $\mathcal{N}(0,\sigma^{2})$ for varying noise level $\sigma$ between $0$ and $1$. In Figure 2 (a) we plotted the measurement fit $\|{\mathcal{A}}X-{\bf b}\|$ versus the noise level $\sigma$ for the solutions obtained with (4) and (39). Note that since the formulation (39) does not directly specify the rank of the sought matrix we iteratively searched for the smallest value of $\mu$ that gives the correct rank, using a bisection scheme. The reason for choosing the smallest $\mu$ is that this reduces the shrinking bias to a minimum while it still gives the correct rank.
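A problem instance of this kind can be generated along the following lines. The text does not say how the singular values of $A$ were modified, so the linear rescaling of the squared singular values into $[1-\delta,1+\delta]$ used here is an assumption, as are the function names `make_operator` and `make_problem`.

```python
import numpy as np

def make_operator(size, delta, rng):
    """Gaussian matrix whose squared singular values are rescaled into [1-delta, 1+delta].

    Assumption: a linear rescaling; the paper only states that the singular values were modified.
    """
    G = rng.standard_normal((size, size))
    U, s, Vt = np.linalg.svd(G)
    s2 = (1.0 - delta) + 2.0 * delta * (s - s.min()) / (s.max() - s.min())
    return (U * np.sqrt(s2)) @ Vt

def make_problem(sigma, rng, m=20, n=20, r=5, delta=0.2):
    """Return (A, b, X) with X = U V^T of rank r and b = A vec(X) + Gaussian noise."""
    A = make_operator(m * n, delta, rng)
    X = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
    b = A @ X.flatten(order="F") + sigma * rng.standard_normal(m * n)
    return A, b, X

A, b, X = make_problem(sigma=0.5, rng=np.random.default_rng(0))
sv = np.linalg.svd(A, compute_uv=False)
print(sv.min() ** 2, sv.max() ** 2)   # approximately 1 - delta and 1 + delta
```

With $p=mn$ the smallest squared singular value of $A$ equals $1-\delta$, as noted in the text, so the RIP constant of the generated operator is known exactly.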

In Figure 2 (b) we computed the $Z$ matrix and plotted the fraction of problem instances where its singular values fulfilled $z_{r+1}<(1-2\delta)z_{r}$ , with $\delta=0.2$ . For these instances the obtained stationary points are also globally optimal according to our main results.

In Figure 2 (c) we did the same experiment as in (a) but with an underdetermined $A$ of size $300\times 400$. It is known [24] that if $A$ is $p\times mn$ and the elements of $A$ are drawn from $\mathcal{N}(0,\frac{1}{p})$ then ${\mathcal{A}}$ fulfills (3) with high probability. The exact value of $\delta_{q}$ is however difficult to determine and therefore we are not able to verify optimality in this case.

3.2 Non-Rigid Structure from Motion

In this section we consider the problem of Non-Rigid Structure from Motion. We follow the approach of Dai et al. [8] and let

[TABLE]

where $X_{i}$, $Y_{i}$, $Z_{i}$ are $1\times m$ matrices containing the $x$-, $y$- and $z$-coordinates of the tracked points in image $i$. Under the assumption of an orthographic camera the projection of the $3D$ points can be modeled using $M=RX$, where $R$ is a $2F\times 3F$ block diagonal matrix with $2\times 3$ blocks $R_{i}$, consisting of two orthogonal rows that encode the camera orientation in image $i$. The resulting $2F\times m$ measurement matrix $M$ consists of the $x$- and $y$-image coordinates of the tracked points. Under the assumption of a linear shape basis model [4] with $r$ deformation modes, the matrix $X^{\#}$ can be factorized into $X^{\#}=CB$, where the $r\times 3m$ matrix $B$ contains the basis elements. It is clear that such a factorization is possible when $X^{\#}$ is of rank $r$. We therefore search for the matrix $X^{\#}$ of rank $r$ that minimizes the residual error $\|RX-M\|_{F}^{2}$.

The linear operator defined by ${\mathcal{A}}(X^{\#})=RX$ does by itself not obey (3) since there are typically low rank matrices in its nullspace. This can be seen by noting that if $N_{i}$ is the $3\times 1$ vector perpendicular to the two rows of $R_{i}$ , that is $R_{i}N_{i}=0$ then

[TABLE]

where $C_{i}$ is any $1\times m$ matrix, is in the null space of $R_{i}$ . Therefore any matrix of the form

[TABLE]

where $n_{ij}$ are the elements of $N_{i}$, vanishes under ${\mathcal{A}}$. Setting everything but the first row of $N(C)$ to zero shows that there is a matrix of rank $1$ in the null space of ${\mathcal{A}}$. Moreover, if the rows of the optimal $X^{\#}$ span such a matrix it will not be unique since we may add $N(C)$ without affecting the projections or the rank.
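The per-frame part of this argument is easy to check numerically. The snippet below is a small illustration with made-up data: it builds one camera block $R_{i}$ from two orthonormal rows, takes $N_{i}$ as their cross product, and confirms that $N_{i}C_{i}$ projects to zero for an arbitrary $1\times m$ matrix $C_{i}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal rows of a random orthogonal matrix play the role of R_i (2 x 3).
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R_i = Q[:2, :]

N_i = np.cross(R_i[0], R_i[1]).reshape(3, 1)   # perpendicular to both rows of R_i
C_i = rng.standard_normal((1, 8))              # arbitrary 1 x m coefficients, m = 8

residual = np.abs(R_i @ (N_i @ C_i)).max()
print(residual)                                 # numerically zero
```

Since $N_{i}C_{i}$ has rank one, stacking such blocks gives low rank matrices that the projection cannot see, which is why (3) fails for this operator.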

In Figure 7 we compare the two relaxations

[TABLE]

and

[TABLE]

on the four MOCAP sequences displayed in Figure 7, obtained from [8]. These consist of real motion capture data and therefore the ground truth solution is only approximately of low rank.

In Figure 7 we plot the rank of the obtained solution versus the data fit $\|RX-M\|^{2}_{F}$. Since (44) does not allow us to directly specify the rank of the sought matrix, we solved the problem for $50$ values of $\mu$ between $1$ and $100$ (orange curve) and computed the resulting rank and data fit. Note that even if a change of $\mu$ is not large enough to change the rank of the solution, it does affect the non-zero singular values. To achieve the best result for a specific rank with (44) we should therefore select the smallest $\mu$ that gives the correct rank. Even though (3) does not hold, the relaxation (43) consistently gives a better data fit with lower rank than (44). Figure 7 also shows the rank versus the distance to the ground truth solution. For high rank the distance is typically larger for (43) than for (44). A plausible explanation is that when the rank is high it is more likely that the row space of $X^{\#}$ contains a matrix of the type $N(C)$. Loosely speaking, when we allow too complex deformations it becomes more difficult to uniquely recover the shape. The nuclear norm's built-in bias towards small solutions helps to regularize the problem when the rank constraint is not discriminative enough.
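The effect of $\mu$ on the non-zero singular values is easiest to see in the simplified case where the linear operator is the identity, that is, for $\mu\|X\|_{*}+\|X-X_{0}\|_{F}^{2}$ rather than the NRSfM objective. In that case the minimizer is obtained by soft thresholding the singular values of $X_{0}$ at $\mu/2$, while the best rank-$r$ approximation keeps the leading singular values untouched. The sketch below uses a made-up matrix with singular values $10, 6, 1, 0.5$; the function names and the data are assumptions for illustration.

```python
import numpy as np

def svt(X0, mu):
    """Minimizer of mu*||X||_* + ||X - X0||_F^2: soft threshold at mu/2."""
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    return (U * np.maximum(s - mu / 2.0, 0.0)) @ Vt

def truncate(X0, r):
    """Best rank-r approximation of X0 in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def numerical_rank(X, tol=1e-8):
    """Number of singular values above tol."""
    return int(np.sum(np.linalg.svd(X, compute_uv=False) > tol))

def data_fit(X, X0):
    """Squared Frobenius distance ||X - X0||_F^2."""
    return np.linalg.norm(X - X0, "fro") ** 2

# Synthetic 6 x 5 matrix with prescribed singular values (illustrative data).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((5, 4)))
s0 = np.array([10.0, 6.0, 1.0, 0.5])
X0 = (U * s0) @ V.T

X_mu3 = svt(X0, 3.0)      # threshold 1.5
X_mu6 = svt(X0, 6.0)      # threshold 3.0
X_r2 = truncate(X0, 2)    # no shrinkage of the two leading singular values

for name, X in [("mu=3", X_mu3), ("mu=6", X_mu6), ("rank-2", X_r2)]:
    print(name, numerical_rank(X), round(data_fit(X, X0), 4))
```

Both values of $\mu$ give a rank $2$ solution, but the larger $\mu$ shrinks the two retained singular values more and so worsens the data fit, whereas the truncated SVD attains the smallest error for that rank.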

One way to handle the null space of ${\mathcal{A}}$ is to add additional regularizers that penalize low rank matrices of the type $N(C)$. Dai et al. [8] suggested using the derivative prior $\|DX^{\#}\|_{F}^{2}$, where the matrix $D:\mathbb{R}^{F}\rightarrow\mathbb{R}^{F-1}$ is a first order difference operator. The null space of $D$ consists of matrices that are constant in each column. Since this implies that the scene is rigid, it is clear that $N(C)$ is not in the null space of $D$. We add this term and compare
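A minimal sketch of the derivative prior, assuming the common convention that row $i$ of $DX^{\#}$ is the difference between rows $i+1$ and $i$ of $X^{\#}$ (the sign convention and the test data are assumptions, not taken from the paper). A rigid scene, where every frame has the same shape, gives zero penalty, while a generic matrix with rows $[n_{i1}C_{i},\,n_{i2}C_{i},\,n_{i3}C_{i}]$ does not.

```python
import numpy as np

def difference_operator(F):
    """(F-1) x F first order difference matrix: (D X)[i] = X[i+1] - X[i]."""
    D = np.zeros((F - 1, F))
    idx = np.arange(F - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def derivative_prior(X_sharp):
    """Squared Frobenius norm ||D X#||_F^2 of the frame-to-frame differences."""
    D = difference_operator(X_sharp.shape[0])
    return np.linalg.norm(D @ X_sharp, "fro") ** 2

rng = np.random.default_rng(0)
F, m = 5, 4

# Rigid scene: all F rows of X# are identical, so every column is constant.
rigid = np.tile(rng.standard_normal((1, 3 * m)), (F, 1))

# Generic matrix of the type N(C); the N_i here are random stand-ins for
# the camera null vectors, which is enough to illustrate the penalty.
N = rng.standard_normal((F, 3))
C = rng.standard_normal((F, m))
NC = np.hstack([N[:, [j]] * C for j in range(3)])

print(derivative_prior(rigid))   # zero: constant columns lie in null(D)
print(derivative_prior(NC))      # strictly positive for this generic N(C)
```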

$$\mathcal{R}_{r}(X^{\#})+\|RX-M\|_{F}^{2}+\|DX^{\#}\|_{F}^{2}\tag{45}$$

and

$$\mu\|X^{\#}\|_{*}+\|RX-M\|_{F}^{2}+\|DX^{\#}\|_{F}^{2}\tag{46}$$

Figures 7 and 7 show the results. In this case both the data fit and the distance to the ground truth are consistently better with (45) than with (46). When the rank increases, most of the regularization comes from the derivative prior, so both methods give similar results.

4 Conclusions

In this paper we studied the local minima of a non-convex rank regularization approach. Our main theoretical result shows that if a RIP holds then there is often a unique local minimum. Since the proposed relaxation (4) and the original objective (2) are shown in [7] to have the same global minimizers if $\|{\mathcal{A}}\|\leq 1$, this result is also relevant for the original discontinuous problem. Our experimental evaluation shows that the proposed approach often gives better solutions than standard convex alternatives, even when the RIP does not hold.

Bibliography

[1] F. Andersson, M. Carlsson, and C. Olsson. Convex envelopes for fixed rank approximation. Optimization Letters, pages 1–13, 2017.
[2] R. Basri, D. Jacobs, and I. Kemelmacher. Photometric stereo with general, unknown lighting. Int. J. Comput. Vision, 72(3):239–257, May 2007.
[3] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[4] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011.
[6] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[7] M. Carlsson. On convexification/optimization of functionals including an l2-misfit term. arXiv preprint arXiv:1609.09378, 2016.
[8] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision, 107(2):101–122, 2014.
