Best Pair Formulation & Accelerated Scheme for Non-convex Principal Component Pursuit

Aritra Dutta; Filip Hanzely; Jingwei Liang; Peter Richtárik

arXiv:1905.10598·math.OC·December 2, 2020

TL;DR

This paper formulates robust principal component analysis as a best pair problem and introduces an accelerated proximal gradient method with proven convergence and superior performance in experiments.

Contribution

It is the first to formulate RPCA as a best pair problem and develops an accelerated scheme with theoretical convergence guarantees.

Findings

01

The proposed algorithm outperforms baseline methods in experiments.

02

Global convergence and local linear rate are established for the scheme.

03

Numerical results demonstrate superior efficiency on real and synthetic data.

Abstract

The best pair problem aims to find a pair of points that minimize the distance between two disjoint sets. In this paper, we formulate the classical robust principal component analysis (RPCA) as a best pair problem, which has not been considered before. We design an accelerated proximal gradient scheme to solve it, for which we show global convergence as well as a local linear rate. Our extensive numerical experiments on both real and synthetic data suggest that the algorithm outperforms relevant baseline algorithms in the literature.

Tables

Table 1: Quantitative performance of different algorithms in the inlier detection experiment. All methods except R-SGD2 are competitive.

Metric	SGD	R-SGD1	R-SGD2	Inc	R-Inc	MD	R-MD	RPCA-F	Best pair	SVT
$\frac{\|P_{L} - P_{L^{*}}\|_{F}}{3\sqrt{2}}$	0.7	0.86	4.66	0.77	0.72	0.67	0.67	0.78	0.76	0.79

Table 2: Algorithms compared in this paper.

Algorithm	Abbreviation	Appearing in Experiment	Reference
Inexact Augmented Lagrange Method of Multipliers	iEALM	Fig. 1, 3, 4	(Lin et al., 2010)
Accelerated Proximal Gradient	APG	Fig. 3, 4	(Wright et al., 2009)
Singular Value Thresholding	SVT	Table 1	(Cai et al., 2010)
Grassmannian Robust Adaptive Subspace Tracking Algorithm	GRASTA	Fig. 5	(He et al., 2012)
Go Decomposition	GoDec	Fig. 4, 10	(Zhou and Tao, 2011)
Robust PCA Gradient Descent	RPCA GD	Fig. 2, 3, 4, 5, 9, 10, 11	(Yi et al., 2016)
Robust PCA Nonconvex Feasibility	RPCA NCF	Fig. 1, 3, 4, 5, 10, 11, 12	(Dutta et al., 2018a)
Robust stochastic PCA Algorithms	SGD, R-SGD1, R-SGD2, Inc, R-Inc, MD, R-MD	Fig. 12, Table 1	(Goes et al., 2014)

Equations

$$\min_{L \in \mathbb{R}^{m \times n}} \mathcal{F}(A, L) + \lambda \mathcal{R}(L),$$

$$\min_{L \in \mathbb{R}^{m \times n}} \|A - L\|_{F}^{2} + \iota_{\mathrm{rank}(L) \leq r}(L),$$

$$\min_{L \in \mathbb{R}^{m \times n}} \|A - L\|_{\ell_{0}} + \lambda\,\mathrm{rank}(L).$$

$$\min_{L \in \mathbb{R}^{m \times n}} \|A - L\|_{\ell_{1}} + \lambda \|L\|_{\star}.$$

$$\min_{L, S \in \mathbb{R}^{m \times n}} \|S\|_{\ell_{1}} + \lambda \|L\|_{\star} \quad \text{subject to} \quad L + S = A.$$

$$(\mathrm{P}_{\Omega}[A])_{ij} = \begin{cases} A_{ij} & (i,j) \in \Omega, \\ 0 & \text{otherwise.} \end{cases}$$

$$\min_{L, S \in \mathbb{R}^{m \times n}} \|S\|_{\ell_{1}} + \lambda \|L\|_{\star} \quad \text{subject to} \quad \mathrm{P}_{\Omega}(L + S) = \mathrm{P}_{\Omega}(A).$$

$$\mathcal{X} \stackrel{\mathrm{def}}{=} \big\{X : KX = A\big\}, \quad \mathcal{Y} \stackrel{\mathrm{def}}{=} \big\{X : \mathrm{rank}(L) \leq r,\ \|S_{i,\cdot}\|_{0} \leq \alpha m,\ \|S_{\cdot,j}\|_{0} \leq \alpha n,\ i \in [m],\ j \in [n]\big\}.$$

$$\text{find } X \in \mathbb{R}^{2m \times n} \text{ such that } X \in \mathcal{X} \cap \mathcal{Y}.$$

$$\min_{X \in \mathcal{X},\, Y \in \mathcal{Y}} \tfrac{1}{2}\|X - Y\|^{2}.$$

$$\iota_{\mathcal{X}}(X) \stackrel{\mathrm{def}}{=} \begin{cases} 0 & : X \in \mathcal{X}, \\ +\infty & : \text{otherwise.} \end{cases}$$

$$\min_{X, Y \in \mathbb{R}^{2m \times n}} \iota_{\mathcal{X}}(X) + \tfrac{1}{2}\|X - Y\|^{2} + \iota_{\mathcal{Y}}(Y).$$

$${}^{1}\big(\iota_{\mathcal{X}}(Y)\big) \stackrel{\mathrm{def}}{=} \min_{X \in \mathbb{R}^{2m \times n}} \tfrac{1}{2}\|X - Y\|^{2} + \iota_{\mathcal{X}}(X).$$

$$\boxed{\min_{Y \in \mathbb{R}^{2m \times n}} \iota_{\mathcal{Y}}(Y) + {}^{1}\big(\iota_{\mathcal{X}}(Y)\big).}$$

$$\nabla\big({}^{1}(\iota_{\mathcal{X}}(Y))\big) = (\mathrm{Id} - \mathrm{P}_{\mathcal{X}})(Y)$$

$$\begin{gathered} Z_{a,k} = Y_{k} + a_{k}(Y_{k} - Y_{k-1}), \\ Z_{b,k} = Y_{k} + b_{k}(Y_{k} - Y_{k-1}), \\ Y_{k+1} = \mathrm{P}_{\mathcal{Y}}\big(Z_{a,k} - \gamma(Z_{b,k} - \mathrm{P}_{\mathcal{X}}(Z_{b,k}))\big). \end{gathered}$$

$$\min_{X \in \mathbb{R}^{2m \times n}} \iota_{\mathcal{X}}(X) + {}^{1}\big(\iota_{\mathcal{Y}}(X)\big),$$

$$\mathrm{P}_{\mathcal{X}}(X) = \begin{pmatrix} S \\ L \end{pmatrix} + \frac{1}{2}\begin{pmatrix} \mathrm{P}_{\Omega}[A - S - L] \\ \mathrm{P}_{\Omega}[A - S - L] \end{pmatrix}.$$

$$[\eta_{1} < R < \eta_{2}] \stackrel{\mathrm{def}}{=} \{Y \in \mathbb{R}^{n} : \eta_{1} < R(Y) < \eta_{2}\}.$$

$$\varphi'\big(R(Y) - R(\overline{Y})\big)\,\mathrm{dist}\big(0, \partial R(Y)\big) \geq 1.$$

$$\min_{Y \in \mathbb{R}^{2m \times n}} \big\{\Phi(Y) \stackrel{\mathrm{def}}{=} \mathcal{R}(Y) + \mathcal{F}(Y)\big\},$$

$$\beta_{k} \stackrel{\mathrm{def}}{=} \frac{1 - \gamma L - a_{k} - \nu}{2\gamma}, \quad \underline{\beta} \stackrel{\mathrm{def}}{=} \liminf_{k \in \mathbb{N}} \beta_{k} \quad \text{and} \quad \alpha_{k} \stackrel{\mathrm{def}}{=} \frac{\gamma b_{k}^{2} L^{2} + \nu a_{k}}{2\nu\gamma}, \quad \overline{\alpha} \stackrel{\mathrm{def}}{=} \limsup_{k \in \mathbb{N}} \alpha_{k}.$$

$$\delta \stackrel{\mathrm{def}}{=} \underline{\beta} - \overline{\alpha} > 0.$$

$$\mathcal{Y}_{L} \stackrel{\mathrm{def}}{=} \Big\{Y = \begin{pmatrix} S \\ L \end{pmatrix} : \mathrm{rank}(L) \leq r\Big\} \quad \text{and} \quad \mathcal{Y}_{S} \stackrel{\mathrm{def}}{=} \Big\{Y = \begin{pmatrix} S \\ L \end{pmatrix} : S \text{ is } \alpha\text{-sparse}\Big\}.$$

$$X^{\star} \in \mathcal{X} \quad \text{and} \quad S^{\star} \in \mathcal{S},\ L^{\star} \in \mathcal{Y}_{L}.$$

$$D_{k} \stackrel{\mathrm{def}}{=} \begin{pmatrix} Y_{k} - Y^{\star} \\ Y_{k-1} - Y^{\star} \end{pmatrix} \quad \text{and} \quad \mathcal{Q} \stackrel{\mathrm{def}}{=} \begin{bmatrix} (1+a)\mathcal{P} & -a\mathcal{P} \\ \mathrm{Id} & 0 \end{bmatrix} \quad \text{with } a \in [0,1].$$

$$D_{k+1} = \mathcal{Q} D_{k} + o(\|D_{k}\|).$$

$$\mathcal{T}_{\alpha}[S] \stackrel{\mathrm{def}}{=} \big\{\mathrm{P}_{\Omega_{\alpha}}(S) \in \mathbb{R}^{m \times n} : (i,j) \in \Omega_{\alpha} \ \text{if} \ |S_{ij}| \geq |S_{(i,\cdot)}^{(\alpha n)}| \ \text{and} \ |S_{ij}| \geq |S_{(\cdot,j)}^{(\alpha m)}|\big\}.$$

$$G_{k+1} \stackrel{\mathrm{def}}{=} \frac{1}{\gamma}(Z_{a,k} - Y_{k+1}) - \nabla\mathcal{F}(Z_{b,k}) + \nabla\mathcal{F}(Y_{k+1}).$$

$$\|G_{k+1}\| \leq \Big(\frac{1}{\gamma} + L\Big)\Delta_{k+1} + \Big(\frac{a_{k}}{\gamma} + b_{k}L\Big)\Delta_{k}.$$

Full text

Best Pair Formulation & Accelerated Scheme for Non-convex Principal Component Pursuit

Aritra Dutta (KAUST, Thuwal, KSA)

Filip Hanzely (KAUST, Thuwal, KSA)

Jingwei Liang (University of Cambridge, Cambridge, UK)

Peter Richtárik (KAUST, Thuwal, KSA)

Abstract

The best pair problem aims to find a pair of points that minimize the distance between two disjoint sets. In this paper, we formulate the classical robust principal component analysis (RPCA) as a best pair problem, which has not been considered before. We design an accelerated proximal gradient scheme to solve it, for which we show global convergence as well as a local linear rate. Our extensive numerical experiments on both real and synthetic data suggest that the algorithm outperforms relevant baseline algorithms in the literature.

1 Introduction

Let $A\in\mathbb{R}^{m\times n}$ be a given matrix. The generalized low-rank recovery model can be written as

$$\min_{L \in \mathbb{R}^{m \times n}} \mathcal{F}(A, L) + \lambda \mathcal{R}(L), \tag{1}$$

where $\mathcal{F}(A,L)$ is a loss function, $\mathcal{R}(L)\stackrel{\mathrm{def}}{=}\sum_{i=1}^{n}\mathcal{R}_{i}(L)$ is a suitable regularizer, and $\lambda>0$ is a balancing parameter. By an appropriate choice of the loss function and the regularizer, (1) can express a wide range of low-rank matrix approximation problems. For example, setting $\mathcal{F}(A,L)=\|A-L\|_{F}^{2}$, $\lambda=1$, and $\mathcal{R}(L)=\iota_{{\rm rank}(L)\leq r}(L)$, the characteristic function (10) of the set $\{L\in\mathbb{R}^{m\times n}:{\rm rank}(L)\leq r\}$, specializes (1) to:

$$\min_{L \in \mathbb{R}^{m \times n}} \|A - L\|_{F}^{2} + \iota_{\mathrm{rank}(L) \leq r}(L), \tag{2}$$

which is a best approximation formulation of the classical principal component analysis (PCA). The solution to problem (2) is given by $\hat{L}=U{\mathbf{H}}_{r}(\Sigma)V^{\top}$, where $U\Sigma V^{\top}=A$ is a singular value decomposition (SVD) of $A$ and ${\mathbf{H}}_{r}(\cdot)$ is the hard-thresholding operator that keeps the $r$ largest singular values. Although PCA is widely used and is a successful design tool in many engineering applications, it can only handle uniformly distributed noise and is rather sensitive to sparse outliers in the data matrix (Lin et al., 2010; Wright et al., 2009; Candès et al., 2011). To overcome this shortcoming and to deal with sparse errors, (Chandrasekaran et al., 2011; Candès et al., 2011) replaced the Frobenius norm in (2) by the $\ell_{0}$ pseudo-norm and introduced the celebrated principal component pursuit (PCP) problem:

$$\min_{L \in \mathbb{R}^{m \times n}} \|A - L\|_{\ell_{0}} + \lambda\,\mathrm{rank}(L). \tag{3}$$

However, the above problem is non-convex and NP-hard. One of the most commonly used tractable surrogate reformulations of (3) replaces the rank function with the nuclear norm $\|L\|_{\star}$ and the $\ell_{0}$ pseudo-norm with the $\ell_{1}$-norm $\|A-L\|_{\ell_{1}}$ (Cai et al., 2010; Recht et al., 2010). Exploiting this idea, Robust PCA (RPCA) was introduced as a convex surrogate of the PCP problem (Wright et al., 2009; Lin et al., 2010; Candès et al., 2011):

$$\min_{L \in \mathbb{R}^{m \times n}} \|A - L\|_{\ell_{1}} + \lambda \|L\|_{\star}. \tag{4}$$

It was shown in (Chandrasekaran et al., 2011; Candès et al., 2011) that under a rank-sparsity incoherence assumption, problem (3) can be provably solved via (4), as their solutions lie close to each other with high probability.
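For reference, the closed-form PCA solution $\hat{L}=U{\mathbf{H}}_{r}(\Sigma)V^{\top}$ of problem (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper; the function name `hard_threshold_rank` and the random test matrix are assumptions made for the example.

```python
import numpy as np

def hard_threshold_rank(A, r):
    """Best rank-r approximation of A in Frobenius norm (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_r = s.copy()
    s_r[r:] = 0.0               # H_r: keep only the r largest singular values
    return (U * s_r) @ Vt       # U diag(s_r) V^T

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))          # illustrative data, not from the paper
L_hat = hard_threshold_rank(A, 2)

# The approximation error equals the norm of the discarded singular values.
tail = np.linalg.svd(A, compute_uv=False)[2:]
err = np.linalg.norm(A - L_hat)
```

The same truncated SVD is the low-rank half of the projection onto the rank constraint used later in the paper.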

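Besides (4), there are other formulations of RPCA. One of the most popular ways is to introduce an auxiliary variable, $S$, and add the constraint $L+S=A$, which yields:

$$\min_{L, S \in \mathbb{R}^{m \times n}} \|S\|_{\ell_{1}} + \lambda \|L\|_{\star} \quad \text{subject to} \quad L + S = A. \tag{5}$$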

This constrained formulation opens several avenues for solving RPCA. Popular ones include the exact and inexact augmented Lagrangian method of multipliers (Lin et al., 2010), the accelerated proximal gradient method (Wright et al., 2009), the alternating direction method (Yuan and Yang, 2013), alternating projection with intermediate denoising (Netrapalli et al., 2014), the dual approach (Lin et al., 2009), SpaRCS (Waters et al., 2011), and manifold optimization (Yi et al., 2016; Zhang and Yang, 2018). We refer to (Bouwmans and Zahzah, 2014) for a comprehensive review of RPCA algorithms.

In the discussion above, $A$ is fully observed with no missing data. One can also consider the case where $A$ is partially observed, that is, there exists a projection operator (or simply a Bernoulli binary mask) $\mathrm{P}_{\Omega}$ onto the set of observed data entries $\Omega\subseteq[m]\times[n]$, defined by

$$(\mathrm{P}_{\Omega}[A])_{ij} = \begin{cases} A_{ij} & (i,j) \in \Omega, \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

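The partially observed version of (5) reads

$$\min_{L, S \in \mathbb{R}^{m \times n}} \|S\|_{\ell_{1}} + \lambda \|L\|_{\star} \quad \text{subject to} \quad \mathrm{P}_{\Omega}(L + S) = \mathrm{P}_{\Omega}(A). \tag{7}$$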

Besides (5) and (7), other tractable reformulations of (3) exist. For example, if the target rank and target sparsity are user-inferred, then it is common practice to relax the equality constraint in (5) and move it into the objective function as a penalty. This, together with explicit constraints on the target rank, $r$, and target sparsity level, $\alpha$ (user-inferred hyperparameters), leads to the GoDec formulation (Zhou and Tao, 2011). One can also extend the above model to the case of partially observed data, which leads to a more general class of problems commonly known as the robust matrix completion (RMC) problem (Chen et al., 2011; Tao and Yang, 2011; Cherapanamjeri et al., 2017b, a) and contains the variant proposed in (Zhou and Tao, 2011) as a special case. With $S=0$, the matrix completion (MC) problem is also a special case of the RMC problem (Candès and Plan, 2009; Jain et al., 2013; Cai et al., 2010; Jain and Netrapalli, 2015; Candès and Recht, 2009; Keshavan et al., 2010; Candès and Tao, 2010; Mareček et al., 2017; Wen et al., 2012). Lastly, when the whole matrix is observed, the RMC problem is nothing but (5).

Recently, (Dutta et al., 2018a) reformulated (3) as a non-convex feasibility problem, which does not require any objective function, convex relaxation, or surrogate convex constraints. Rather, it exploits the following idea: if one considers both the target rank $r$ and the target sparsity $\alpha$ as hyperparameters, the solution to the PCP problem lies in the intersection of two sets, one convex and one non-convex. Let $X=\begin{pmatrix}S\\ L\end{pmatrix}\in\mathbb{R}^{2m\times n}$ and $K=[\mathrm{Id},\mathrm{Id}]$, where $\mathrm{Id}$ is the identity operator of $\mathbb{R}^{m\times n}$, and define

$$\mathcal{X} \stackrel{\mathrm{def}}{=} \big\{X : KX = A\big\}, \quad \mathcal{Y} \stackrel{\mathrm{def}}{=} \big\{X : \mathrm{rank}(L) \leq r,\ \|S_{i,\cdot}\|_{0} \leq \alpha m,\ \|S_{\cdot,j}\|_{0} \leq \alpha n,\ i \in [m],\ j \in [n]\big\}.$$

Note that $\mathcal{X}$ is convex and $\mathcal{Y}$ is non-convex (see footnote 1).

Footnote 1: The $\alpha$-sparsity constraint on $S$ means that for $\alpha\in(0,1)$, each row and column of $S$ contains no more than $\alpha n$ and $\alpha m$ non-zero entries, respectively. This is slightly more complicated than directly applying a $\|\cdot\|_{0}$ constraint. However, it often works better in practice.

Given these sets, (Dutta et al., 2018a) reformulated (3) as a non-convex feasibility problem:

$$\text{find } X \in \mathbb{R}^{2m \times n} \text{ such that } X \in \mathcal{X} \cap \mathcal{Y}. \tag{8}$$

Note that if we replace the $\mathrm{Id}$ in $K$ with a Bernoulli binary mask, then we obtain the reformulation of the PCP problem with partial observation.

1.1 Formulation and Contributions

In this paper, we reformulate the feasibility problem (8) as a best pair problem. Given two sets $\mathcal{X},\mathcal{Y}\subset\mathbb{R}^{2m\times n}$, the best pair problem aims to find a pair of points $(X^{\star},Y^{\star})\in\mathcal{X}\times\mathcal{Y}$ that are closest to each other, that is, $(X^{\star},Y^{\star})$ is a solution of the problem below:

$$\min_{X \in \mathcal{X},\, Y \in \mathcal{Y}} \tfrac{1}{2}\|X - Y\|^{2}. \tag{9}$$

When the intersection of $\mathcal{X}$ and $\mathcal{Y}$ is non-empty, that is $\mathcal{X}\cap\mathcal{Y}\neq\emptyset$ , (9) reduces to the feasibility problem, with $X^{\star}=Y^{\star}\in\mathcal{X}\cap\mathcal{Y}$ . Given a set $\mathcal{X}$ , define its characteristic function by

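$$\iota_{\mathcal{X}}(X) \stackrel{\mathrm{def}}{=} \begin{cases} 0 & : X \in \mathcal{X}, \\ +\infty & : \text{otherwise.} \end{cases} \tag{10}$$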

Then (9) can be equivalently written as

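$$\min_{X, Y \in \mathbb{R}^{2m \times n}} \iota_{\mathcal{X}}(X) + \tfrac{1}{2}\|X - Y\|^{2} + \iota_{\mathcal{Y}}(Y). \tag{11}$$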

Observe that for a given $Y$, problem (11) becomes $\min_{X\in\mathbb{R}^{2m\times n}}\iota_{\mathcal{X}}(X)+\frac{1}{2}\|X-Y\|^{2}$, which is the Moreau envelope (Bauschke and Combettes, 2011) of $\iota_{\mathcal{X}}$ of index $1$:

$${}^{1}\big(\iota_{\mathcal{X}}(Y)\big) \stackrel{\mathrm{def}}{=} \min_{X \in \mathbb{R}^{2m \times n}} \tfrac{1}{2}\|X - Y\|^{2} + \iota_{\mathcal{X}}(X).$$

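As a result, we can reduce (11) to a problem in $Y$ only,

$$\boxed{\min_{Y \in \mathbb{R}^{2m \times n}} \iota_{\mathcal{Y}}(Y) + {}^{1}\big(\iota_{\mathcal{X}}(Y)\big).} \tag{12}$$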

For the rest of the paper, we focus on (12). Our main contributions are summarised below:

• New formulation and a new algorithm for non-convex PCP. We reformulate the non-convex set feasibility formulation of RPCA as a best pair problem. Although our formulation was inspired by formulation (8) from (Dutta et al., 2018a), to the best of our knowledge, we are the first to formulate and solve RPCA via the best pair. To this end, we design a fast and efficient algorithm, an accelerated proximal gradient method, to solve it.

• Theoretical convergence guarantees. Both global and local convergence analyses of the scheme are provided. Globally, we show that our algorithm converges to a critical point. If the algorithm additionally starts sufficiently close to the optimum, we show that it converges to a global minimizer. Locally, our algorithm enjoys a fast linear rate, which we can sharply estimate. We owe this novelty to our best pair formulation. In contrast, the non-convex projection RPCA from (Dutta et al., 2018a) or GoDec (Zhou and Tao, 2011) can only guarantee local linear convergence.

• Numerical experiments and applications to real-world problems. We apply the proposed method to several well-tested applications in computer vision. Our extensive experiments on both real and synthetic data suggest that our algorithm matches or outperforms relevant baseline algorithms in a fraction of their execution time. Additionally, in the supplementary material, we empirically examine the hyperparameter sensitivity of our approach.

1.2 Notations

Throughout the paper, $\mathbb{N}$ is the set of non-negative integers. For a nonempty closed convex set $\Omega\subset\mathbb{R}^{n}$, denote by $\mathrm{P}_{\Omega}$ the orthogonal projector onto $\Omega$. Let $\mathcal{R}:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ be a lower semi-continuous (lsc) function. Its domain is defined as $\mathrm{dom}(\mathcal{R})\stackrel{\mathrm{def}}{=}\{x\in\mathbb{R}^{n}:\mathcal{R}(x)<+\infty\}$, and it is said to be proper if $\mathrm{dom}(\mathcal{R})\neq\emptyset$. We need the following notions from variational analysis, see e.g. (Rockafellar and Wets, 1998) for details. Given $x\in\mathrm{dom}(\mathcal{R})$, the Fréchet subdifferential $\partial^{F}\mathcal{R}(x)$ of $\mathcal{R}$ at $x$ is the set of vectors $v\in\mathbb{R}^{n}$ that satisfy $\liminf_{z\to x,\,z\neq x}\frac{1}{\|x-z\|}(\mathcal{R}(z)-\mathcal{R}(x)-\langle v,\,z-x\rangle)\geq 0$. If $x\notin\mathrm{dom}(\mathcal{R})$, then $\partial^{F}\mathcal{R}(x)=\emptyset$. The limiting subdifferential (or simply subdifferential) of $\mathcal{R}$ at $x$, written as $\partial\mathcal{R}(x)$, is defined as $\partial\mathcal{R}(x)\stackrel{\mathrm{def}}{=}\{v\in\mathbb{R}^{n}:\exists x_{k}\to x,\mathcal{R}(x_{k})\to\mathcal{R}(x),v_{k}\in\partial^{F}\mathcal{R}(x_{k})\to v\}$. Denote $\mathrm{dom}(\partial\mathcal{R})\stackrel{\mathrm{def}}{=}\{x\in\mathbb{R}^{n}:\partial\mathcal{R}(x)\neq\emptyset\}$. Both $\partial^{F}\mathcal{R}(x)$ and $\partial\mathcal{R}(x)$ are closed, with $\partial^{F}\mathcal{R}(x)$ convex and $\partial^{F}\mathcal{R}(x)\subset\partial\mathcal{R}(x)$ (Rockafellar and Wets, 1998, Proposition 8.5). Since $\mathcal{R}$ is lsc, it is (subdifferentially) regular at $x$ if and only if $\partial^{F}\mathcal{R}(x)=\partial\mathcal{R}(x)$ (Rockafellar and Wets, 1998, Corollary 8.11). A necessary condition for $x$ to be a minimizer of $\mathcal{R}$ is $0\in\partial\mathcal{R}(x)$. The set of critical points of $\mathcal{R}$ is $\mathrm{crit}(\mathcal{R})=\{x\in\mathbb{R}^{n}:0\in\partial\mathcal{R}(x)\}$.

2 An accelerated proximal gradient method

In this section, we describe a gradient-based optimization method for solving (12). Denote by $\mathrm{P}_{\mathcal{X}},\mathrm{P}_{\mathcal{Y}}$ the projection operators onto $\mathcal{X}$ and $\mathcal{Y}$, respectively. Since $\mathcal{X}$ is a non-empty closed convex set, its characteristic function $\iota_{\mathcal{X}}$ is proper, convex and lower semi-continuous. Owing to (Bauschke and Combettes, 2011), the Moreau envelope is convex and differentiable, with gradient

$$\nabla\big({}^{1}(\iota_{\mathcal{X}}(Y))\big) = (\mathrm{Id} - \mathrm{P}_{\mathcal{X}})(Y),$$

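which is $1$-Lipschitz continuous. Clearly, (12) admits a “non-smooth + smooth” structure, and one prevailing algorithm in the literature for such problems is the proximal gradient method (Lions and Mercier, 1979), a.k.a. Forward–Backward splitting. In this paper, we consider an accelerated version of the method, see Algorithm 1, which is based on an inertial technique. With step size $\gamma$ and inertial parameters $a_{k},b_{k}$, its iteration reads

$$\begin{gathered} Z_{a,k} = Y_{k} + a_{k}(Y_{k} - Y_{k-1}), \\ Z_{b,k} = Y_{k} + b_{k}(Y_{k} - Y_{k-1}), \\ Y_{k+1} = \mathrm{P}_{\mathcal{Y}}\big(Z_{a,k} - \gamma(Z_{b,k} - \mathrm{P}_{\mathcal{X}}(Z_{b,k}))\big). \end{gathered} \tag{13}$$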
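The gradient formula $(\mathrm{Id}-\mathrm{P}_{\mathcal{X}})(Y)$ for the Moreau envelope can be checked numerically in the fully observed case, where $\mathcal{X}=\{(S,L):S+L=A\}$. The sketch below compares the formula with a central finite difference along a random direction. The helper names and the random data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def proj_X(S, L, A):
    """Projection onto the affine set {(S, L): S + L = A}."""
    R = 0.5 * (A - S - L)        # split the residual equally between S and L
    return S + R, L + R

def envelope(S, L, A):
    """Moreau envelope of the indicator of X: half the squared distance to X."""
    PS, PL = proj_X(S, L, A)
    return 0.5 * (np.sum((S - PS) ** 2) + np.sum((L - PL) ** 2))

def envelope_grad(S, L, A):
    """Gradient of the envelope, (Id - P_X)(Y), returned blockwise."""
    PS, PL = proj_X(S, L, A)
    return S - PS, L - PL

rng = np.random.default_rng(1)
m, n = 4, 3
A, S, L = (rng.standard_normal((m, n)) for _ in range(3))
dS, dL = (rng.standard_normal((m, n)) for _ in range(2))   # random direction

gS, gL = envelope_grad(S, L, A)
analytic = np.sum(gS * dS) + np.sum(gL * dL)               # <grad, direction>

eps = 1e-6
fd = (envelope(S + eps * dS, L + eps * dL, A)
      - envelope(S - eps * dS, L - eps * dL, A)) / (2 * eps)
```

Because the envelope is quadratic in this setting, the central difference `fd` agrees with `analytic` up to rounding error.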

Remark 2.1.

• If we choose $\gamma=1$ and $a_{k},b_{k}\equiv 0$, Algorithm 1 becomes Backward–Backward splitting, which is the method of alternating projections for the feasibility problem (8). Therefore, we recover the method from (Dutta et al., 2018a) as a special case.

• In passing from (8) to (12), we can instead consider the Moreau envelope of the non-convex set $\mathcal{Y}$, that is

$$\min_{X \in \mathbb{R}^{2m \times n}} \iota_{\mathcal{X}}(X) + {}^{1}\big(\iota_{\mathcal{Y}}(X)\big),$$

which also works well in practice.

• Algorithm 1 is a special case of the multi-step inertial proximal gradient descent method considered in (Liang et al., 2016) for general non-convex composite optimization.

Note that the two projection operators $\mathrm{P}_{\mathcal{X}},\mathrm{P}_{\mathcal{Y}}$ are very easy to compute. Given $X=\begin{pmatrix}S\\ L\end{pmatrix}$, since $\mathcal{X}$ is an affine subspace, the projection of $X$ onto $\mathcal{X}$ reads $\mathrm{P}_{\mathcal{X}}(X)=\frac{1}{2}\begin{pmatrix}A+S-L\\ A-S+L\end{pmatrix}$. If $K=[\mathrm{P}_{\Omega},\mathrm{P}_{\Omega}]$, where $\mathrm{P}_{\Omega}$ is the binary mask defined in (6), then for the partially observed case we have

$$\mathrm{P}_{\mathcal{X}}(X) = \begin{pmatrix} S \\ L \end{pmatrix} + \frac{1}{2}\begin{pmatrix} \mathrm{P}_{\Omega}[A - S - L] \\ \mathrm{P}_{\Omega}[A - S - L] \end{pmatrix}.$$

For the projection $\mathrm{P}_{\mathcal{Y}}$ which contains a low-rank projection and sparsity projection, we refer to (Dutta et al., 2018a) for more details.
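The inertial iteration of Algorithm 1, $Y_{k+1}=\mathrm{P}_{\mathcal{Y}}\big(Z_{a,k}-\gamma(Z_{b,k}-\mathrm{P}_{\mathcal{X}}(Z_{b,k}))\big)$, can be sketched as follows for the fully observed case. This is an illustrative sketch under stated assumptions, not the authors' implementation. The sparsity step keeps the `k` largest-magnitude entries of $S$ globally, which is a simplification of the row and column $\alpha$-sparsity rule. The function names, the default values of `gamma`, `a`, `b`, and the test data are hypothetical.

```python
import numpy as np

def proj_X(S, L, A):
    """Projection onto {(S, L): S + L = A} (fully observed case)."""
    R = 0.5 * (A - S - L)
    return S + R, L + R

def proj_Y(S, L, r, k):
    """Simplified projection onto Y: rank-r truncation of L and the
    k largest-magnitude entries of S (assumption: global top-k rule)."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    L_r = (U[:, :r] * s[:r]) @ Vt[:r]
    flat = S.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    S_k = np.zeros_like(flat)
    S_k[keep] = flat[keep]
    return S_k.reshape(S.shape), L_r

def best_pair_apg(A, r, k, gamma=1.0, a=0.3, b=0.3, iters=100):
    """Inertial proximal gradient sketch with constant a_k = a, b_k = b."""
    S, L = np.zeros_like(A), np.zeros_like(A)
    S_prev, L_prev = S.copy(), L.copy()
    for _ in range(iters):
        # Inertial points Z_a and Z_b
        Za_S, Za_L = S + a * (S - S_prev), L + a * (L - L_prev)
        Zb_S, Zb_L = S + b * (S - S_prev), L + b * (L - L_prev)
        # Gradient of the Moreau envelope at Z_b is Z_b - P_X(Z_b)
        PS, PL = proj_X(Zb_S, Zb_L, A)
        S_new, L_new = proj_Y(Za_S - gamma * (Zb_S - PS),
                              Za_L - gamma * (Zb_L - PL), r, k)
        S_prev, L_prev, S, L = S, L, S_new, L_new
    return S, L

# Illustrative data: low-rank plus sparse matrix
rng = np.random.default_rng(0)
m, n, r, k = 30, 30, 2, 40
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S0 = np.zeros(m * n)
S0[rng.choice(m * n, size=k, replace=False)] = rng.uniform(-20, 20, size=k)
A = L0 + S0.reshape(m, n)

S_hat, L_hat = best_pair_apg(A, r, k)
residual = np.linalg.norm(A - S_hat - L_hat)   # distance-type gap between the pair
```

With `gamma=1` and `a=b=0` the loop reduces to alternating projections, in line with Remark 2.1. The parameter values shown are not tuned and are not claimed to satisfy the conditions of the convergence theorems.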

2.1 Global convergence

Since the set $\mathcal{Y}$ is semi-algebraic (Bolte et al., 2010), our global convergence guarantee for Algorithm 1 is based on the Kurdyka-Łojasiewicz property.

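Kurdyka-Łojasiewicz property. Let $R:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ be a proper lsc function. For $\eta_{1},\eta_{2}$ such that $-\infty<\eta_{1}<\eta_{2}<+\infty$, define the set

$$[\eta_{1} < R < \eta_{2}] \stackrel{\mathrm{def}}{=} \{Y \in \mathbb{R}^{n} : \eta_{1} < R(Y) < \eta_{2}\}.$$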

Definition 2.2.

Function $R$ is said to have the Kurdyka-Łojasiewicz property at $\overline{Y}\in\mathrm{dom}(R)$ if there exist $\eta\in]0,+\infty]$, a neighbourhood $U$ of $\overline{Y}$ and a continuous concave function $\varphi:[0,\eta[\to\mathbb{R}_{+}$ such that

(i) $\varphi(0)=0$, $\varphi$ is $C^{1}$ on $]0,\eta[$, and for all $s\in]0,\eta[$, $\varphi^{\prime}(s)>0$;

(ii) for all $Y\in U\cap[R(\overline{Y})<R<R(\overline{Y})+\eta]$, the Kurdyka-Łojasiewicz inequality holds:

$$\varphi'\big(R(Y) - R(\overline{Y})\big)\,\mathrm{dist}\big(0, \partial R(Y)\big) \geq 1.$$

Proper lsc functions which satisfy the Kurdyka-Łojasiewicz property at each point of $\mathrm{dom}(\partial R)$ are called KL functions.

KL functions include the class of semi-algebraic functions, see (Bolte et al., 2007, 2010). For instance, the $\ell_{0}$ pseudo-norm and the rank function are KL.
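As a simple worked illustration of Definition 2.2 (an example added here, not taken from the paper), consider $R(y)=y^{2}$ on $\mathbb{R}$ at $\overline{Y}=0$, with the assumed choice $\varphi(s)=\sqrt{s}$, which is continuous and concave with $\varphi(0)=0$ and $\varphi^{\prime}(s)>0$ for $s>0$:

```latex
\begin{aligned}
&\partial R(y)=\{2y\}, \qquad
 \mathrm{dist}\big(0,\partial R(y)\big)=2|y|, \qquad
 \varphi'(s)=\frac{1}{2\sqrt{s}},\\
&\varphi'\big(R(y)-R(0)\big)\,\mathrm{dist}\big(0,\partial R(y)\big)
 =\frac{1}{2|y|}\cdot 2|y| = 1 \geq 1
 \quad\text{for all } y\neq 0 .
\end{aligned}
```

The inequality holds with equality for every $y\neq 0$, so $R$ has the Kurdyka-Łojasiewicz property at $0$ with any $\eta>0$ and any neighbourhood $U$ of $0$.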

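Global convergence. To deliver the convergence result, we rewrite (12) into the following generic form

$$\min_{Y \in \mathbb{R}^{2m \times n}} \big\{\Phi(Y) \stackrel{\mathrm{def}}{=} \mathcal{R}(Y) + \mathcal{F}(Y)\big\}, \tag{15}$$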

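where we assume that

(A.1) $\mathcal{R}:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ is proper, lower semi-continuous, and bounded from below;

(A.2) $\mathcal{F}:\mathbb{R}^{n}\to\mathbb{R}$ is convex and differentiable, and its gradient $\nabla\mathcal{F}$ is $L$-Lipschitz continuous.

Let $\nu>0$ be a constant. Define the following quantities,

$$\beta_{k} \stackrel{\mathrm{def}}{=} \frac{1 - \gamma L - a_{k} - \nu}{2\gamma}, \quad \underline{\beta} \stackrel{\mathrm{def}}{=} \liminf_{k \in \mathbb{N}} \beta_{k} \quad \text{and} \quad \alpha_{k} \stackrel{\mathrm{def}}{=} \frac{\gamma b_{k}^{2} L^{2} + \nu a_{k}}{2\nu\gamma}, \quad \overline{\alpha} \stackrel{\mathrm{def}}{=} \limsup_{k \in \mathbb{N}} \alpha_{k}.$$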

Theorem 2.3 (Global convergence).

For problem (15), assume (A.1)-(A.2) hold, and that $\Phi$ is a proper lsc KL function which is bounded from below. For Algorithm 1, choose $\nu,\gamma,a_{k},b_{k}$ such that

$$\delta \stackrel{\mathrm{def}}{=} \underline{\beta} - \overline{\alpha} > 0.$$

Then each bounded sequence $\{Y_{k}\}_{k\in\mathbb{N}}$ satisfies

(i) $\{Y_{k}\}_{k\in\mathbb{N}}$ has finite length, i.e. $\sum_{k\in\mathbb{N}}\|Y_{k}-Y_{k-1}\|<+\infty$;

(ii) there exists a critical point $Y^{\star}\in\mathrm{crit}(\Phi)$ such that $\lim_{k\to\infty}Y_{k}=Y^{\star}$;

(iii) if $\Phi$ has the KL property at a global minimizer $Y^{\star}$, then starting sufficiently close to $Y^{\star}$, any sequence $\{Y_{k}\}_{k\in\mathbb{N}}$ converges to a global minimum of $\Phi$ and satisfies (i).

The proof of the above theorem can be found in the supplementary material. We also refer to (Liang et al., 2016) and the references therein for more results on non-convex proximal gradient methods.
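The parameter condition $\delta=\underline{\beta}-\overline{\alpha}>0$ of Theorem 2.3 is easy to evaluate for constant inertial parameters $a_{k}\equiv a$, $b_{k}\equiv b$, since the liminf and limsup then coincide with the constant values. The sketch below does this with the Lipschitz constant set to $1$, as for the gradient of the Moreau envelope in (12). The function name `delta` and the sample parameter values are assumptions chosen for illustration.

```python
def delta(gamma, a, b, nu, lipschitz=1.0):
    """delta = beta - alpha for constant a_k = a and b_k = b."""
    beta = (1.0 - gamma * lipschitz - a - nu) / (2.0 * gamma)
    alpha = (gamma * b ** 2 * lipschitz ** 2 + nu * a) / (2.0 * nu * gamma)
    return beta - alpha

# A conservative choice that satisfies the condition
d_ok = delta(gamma=0.5, a=0.1, b=0.1, nu=0.2)

# gamma = 1 with no inertia (alternating projections) gives delta < 0
d_ap = delta(gamma=1.0, a=0.0, b=0.0, nu=0.1)
```

For `a = b = 0` the condition reduces to `gamma < (1 - nu) / lipschitz`, so the alternating projection setting `gamma = 1` falls outside the range covered by this sufficient condition.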

2.2 Local linear convergence

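Now we turn to the local perspective and present a local linear convergence analysis for Algorithm 1. For the constraint set $\mathcal{Y}$ defined in (8), consider the following decomposition:

$$\mathcal{Y}_{L} \stackrel{\mathrm{def}}{=} \Big\{Y = \begin{pmatrix} S \\ L \end{pmatrix} : \mathrm{rank}(L) \leq r\Big\} \quad \text{and} \quad \mathcal{Y}_{S} \stackrel{\mathrm{def}}{=} \Big\{Y = \begin{pmatrix} S \\ L \end{pmatrix} : S \text{ is } \alpha\text{-sparse}\Big\}.$$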

For the sequence $Y_{k}$ generated by (13), suppose $Y_{k}=\begin{pmatrix}S_{k}\\ L_{k}\end{pmatrix}$. It is immediate that $\mathrm{rank}(L_{k})\leq r$ holds for all $k$. For $S_{k}$, though it is always $\alpha$-sparse, the locations of its non-zero elements change over the course of the iterations. In the following, we first show that after a finite number of iterations the locations of the non-zero elements of $S_{k}$ stop changing, that is, $S_{k}$ has the same support as the limit $S^{\star}$ to which it converges, and then Algorithm 1 enters a linear convergence regime.

Support identification of $S_{k}$. Let $Y^{\star}=\begin{pmatrix}S^{\star}\\ L^{\star}\end{pmatrix}$ be a critical point of (12) to which $Y_{k}$ converges. Let $\mathcal{S}$ be the subspace spanned by the support of $S^{\star}$. Clearly, $S^{\star}\in\mathcal{S}$, and we have the result below concerning the relation between $S_{k}$ and $\mathcal{S}$.

Theorem 2.4 (Support identification).

For Algorithm 1, suppose Theorem 2.3 holds. Then $Y_{k}$ converges to a critical point $Y^{\star}$ of (12). For all $k$ large enough, we have $S_{k}\in\mathcal{S}$.

Let $S^{\star}$ be the point that $S_{k}$ converges to. The above result simply means that after a finite number of iterations, $\mathrm{supp}(S_{k})=\mathrm{supp}(S^{\star})$ holds for all $k$ large enough.

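Local linear convergence. Given a critical point $Y^{\star}$, let $X^{\star}=\mathrm{P}_{\mathcal{X}}(Y^{\star})$. We have

$$X^{\star} \in \mathcal{X} \quad \text{and} \quad S^{\star} \in \mathcal{S},\ L^{\star} \in \mathcal{Y}_{L}.$$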

Note that the first two sets, $\mathcal{X},\mathcal{S}$, are (affine) subspaces, hence smooth, and $\mathcal{Y}_{L}$ is the set of fixed-rank matrices, which is a $C^{2}$-smooth manifold (Lee, 2003). To derive the local linear rate, we need to utilize the smoothness of these sets. Let $\mathcal{M}$ be a $C^{2}$-smooth manifold and let $\mathcal{T}_{\mathcal{M}}(X)$ denote the tangent space of $\mathcal{M}$ at $X\in\mathcal{M}$. We have the following lemma, which is crucial for our local linear convergence analysis.

Lemma 2.5 (Liang et al., 2014, Lemma 5.1).

Let $\mathcal{M}$ be a $C^{2}$-smooth manifold around $X$. Then for any $X^{\prime}\in\mathcal{M}\cap\mathcal{N}$, where $\mathcal{N}$ is a neighbourhood of $X$, the projection operator $\mathrm{P}_{\mathcal{M}}(X^{\prime})$ is uniquely valued and $C^{1}$ around $X$, and thus $X^{\prime}-X=\mathrm{P}_{\mathcal{T}_{\mathcal{M}}(X)}(X^{\prime}-X)+o(\|X^{\prime}-X\|)$. If moreover, $\mathcal{M}=X+\mathcal{T}_{\mathcal{M}}(X)$ is an affine subspace, then $X^{\prime}-X=\mathrm{P}_{\mathcal{T}_{\mathcal{M}}(X)}(X^{\prime}-X)$.

Denote the tangent spaces of $\mathcal{X},\mathcal{Y}$ at $X^{\star},Y^{\star}$ as $T_{\mathcal{X}}^{X^{\star}}$ and $T_{\mathcal{Y}}^{Y^{\star}}$, respectively. We refer to the supplementary material for detailed expressions of these tangent spaces. Denote by $\mathrm{P}_{T_{\mathcal{X}}^{X^{\star}}}$ and $\mathrm{P}_{T_{\mathcal{Y}}^{Y^{\star}}}$ the projections onto the tangent spaces. Define the matrix $\mathcal{P}\stackrel{\mathrm{def}}{=}\mathrm{P}_{T_{\mathcal{Y}}^{Y^{\star}}}\big((1-\gamma)\mathrm{Id}+\gamma\mathrm{P}_{T_{\mathcal{X}}^{X^{\star}}}\big)\mathrm{P}_{T_{\mathcal{Y}}^{Y^{\star}}}$, and

$$D_{k} \stackrel{\mathrm{def}}{=} \begin{pmatrix} Y_{k} - Y^{\star} \\ Y_{k-1} - Y^{\star} \end{pmatrix} \quad \text{and} \quad \mathcal{Q} \stackrel{\mathrm{def}}{=} \begin{bmatrix} (1+a)\mathcal{P} & -a\mathcal{P} \\ \mathrm{Id} & 0 \end{bmatrix} \quad \text{with } a \in [0,1].$$

Denote by $\rho_{\mathcal{P}},\rho_{\mathcal{Q}}$ the spectral radii of $\mathcal{P},\mathcal{Q}$, respectively.

Theorem 2.6 (Local linear convergence).

For Algorithm 1, suppose Theorem 2.4 holds. Then $Y_{k}$ converges to a critical point $Y^{\star}$ of (12). Suppose $b_{k}=a_{k}\equiv a\in[0,1]$. Then there exists a $K>0$ such that for all $k\geq K$,

$$D_{k+1} = \mathcal{Q} D_{k} + o(\|D_{k}\|).$$

Moreover, if $\rho_{\mathcal{P}}<1$, then $\rho_{\mathcal{Q}}<1$ as well, and for all $k$ large enough we have $\|Y_{k}-Y^{\star}\|=O(\rho_{\mathcal{Q}}^{k})$.

Remark 2.7.

• If $T_{\mathcal{X}}^{X^{\star}}\cap T_{\mathcal{Y}}^{Y^{\star}}=\{0\}$, then it can be shown that $\rho_{\mathcal{P}}<1$.

• Given $\rho_{\mathcal{P}}$, $\rho_{\mathcal{Q}}$ can be expressed explicitly in terms of $a$ and $\rho_{\mathcal{P}}$. For the case that $a_{k}\to a\in[0,1]$ and $b_{k}\to b\in[0,1]$, we refer to (Liang, 2016, Chapter 6) for a detailed discussion of the local linear convergence analysis.

A numerical illustration of our theoretical rate estimation and practical observation is provided in the supplementary material, Section C, Figure 13.
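To make the relation between $\rho_{\mathcal{P}}$ and $\rho_{\mathcal{Q}}$ concrete, the sketch below builds $\mathcal{Q}$ as the block matrix with rows $[(1+a)\mathcal{P},\,-a\mathcal{P}]$ and $[\mathrm{Id},\,0]$ and computes its spectral radius numerically. For a symmetric $\mathcal{P}$ with eigenvalue $\eta$, the eigenvalues $\sigma$ of $\mathcal{Q}$ solve $\sigma^{2}-(1+a)\eta\sigma+a\eta=0$, which the code uses as a cross-check. The diagonal matrix `P` is a stand-in chosen for illustration, not the actual $\mathcal{P}$ built from the tangent spaces, and the function names are hypothetical.

```python
import numpy as np

def build_Q(P, a):
    """Block matrix [[(1+a)P, -aP], [Id, 0]]."""
    d = P.shape[0]
    return np.block([[(1 + a) * P, -a * P],
                     [np.eye(d), np.zeros((d, d))]])

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def rho_Q_from_eigs(etas, a):
    """Largest root modulus of sigma^2 - (1+a)*eta*sigma + a*eta over all eta."""
    rho = 0.0
    for eta in etas:
        roots = np.roots([1.0, -(1 + a) * eta, a * eta])
        rho = max(rho, np.max(np.abs(roots)))
    return rho

P = np.diag([0.9, 0.5])      # stand-in symmetric matrix with rho_P = 0.9
a = 0.3                      # constant inertial parameter

rho_P = spectral_radius(P)
rho_Q = spectral_radius(build_Q(P, a))
rho_formula = rho_Q_from_eigs(np.linalg.eigvalsh(P), a)
```

In this toy instance the inertial rate `rho_Q` is below `rho_P`, which illustrates how inertia can improve the local rate. How much is gained depends on `a` and on the spectrum of the actual $\mathcal{P}$.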

3 Numerical experiments

In this section, we extensively test our best-pair formulation on both real and synthetic data against a wide range of PCP algorithms. The first set of algorithms that we test against, e.g. iEALM and APG, determine the target rank and sparsity robustly from the given set of hyperparameters. For the second set of algorithms, e.g. RPCA gradient descent (RPCA GD), Go decomposition (GoDec), and RPCA nonconvex feasibility (RPCA NCF), the target rank and sparsity are user-inferred. Although our accelerated proximal gradient algorithm belongs to the second class, to show its effectiveness we compare it with both classes of state-of-the-art robust PCP algorithms (see Table 2 in the supplementary material) on several computer vision applications: removal of shadows and specularities from face images, background estimation or tracking from video sequences, and inlier detection from a grossly corrupted dataset (see Section A.3.1 in the supplementary material).²

² In all experiments, we use the approximate projection (Dutta et al., 2018a; Yi et al., 2016; Zhang and Yang, 2018) onto $\mathcal{Y}$ , as the exact one is expensive:

$\mathcal{T}_{\alpha}[S]\stackrel{{\scriptstyle{\mathrm{def}}}}{{=}}\{\mathrm{P}_{\Omega_{\alpha}}(S)\in\mathbb{R}^{m\times n}:(i,j)\in\Omega_{\alpha}\;{\rm if}\;|S_{ij}|\geq|S_{(i,.)}^{(\alpha n)}|\;\;{\rm and}\;|S_{ij}|\geq|S_{(.,j)}^{(\alpha m)}|\}.$

If the sparsity constraint were defined only along rows (or only along columns), the exact projection would be cheap. However, the approximate projection produces better results, so we keep it.
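The operator $\mathcal{T}_{\alpha}$ in the footnote can be sketched in NumPy as below. This is an illustrative reading, not the authors' implementation: it assumes that $S_{(i,.)}^{(\alpha n)}$ denotes the entry of row $i$ with the $\alpha n$-th largest magnitude (and analogously for columns), and it rounds $\alpha n$ and $\alpha m$ to the nearest integer, which the text does not specify. The function name `approx_sparse_projection` is hypothetical.

```python
import numpy as np

def approx_sparse_projection(S, alpha):
    """Keep S[i, j] only if its magnitude is among the largest round(alpha * n)
    entries of row i AND the largest round(alpha * m) entries of column j."""
    m, n = S.shape
    k_row = max(1, int(round(alpha * n)))  # rounding rule is an assumption
    k_col = max(1, int(round(alpha * m)))
    mag = np.abs(S)
    # Threshold = k-th largest magnitude in each row / each column.
    row_thr = np.sort(mag, axis=1)[:, n - k_row][:, None]
    col_thr = np.sort(mag, axis=0)[m - k_col, :][None, :]
    mask = (mag >= row_thr) & (mag >= col_thr)  # ties are kept
    return np.where(mask, S, 0.0)

rng = np.random.default_rng(1)
S = rng.standard_normal((20, 30))
S_proj = approx_sparse_projection(S, alpha=0.1)
print(np.count_nonzero(S_proj), "entries kept out of", S.size)
```

Because an entry must pass both the row and the column test, each row and each column of the output can end up with fewer than the allowed number of non-zeros, which is why this is an approximate rather than exact projection.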

Results on synthetic data.

The primary goal of this set of experiments is to understand the behavior of our proposed method on well-understood data and to test it against some state-of-the-art algorithms. To construct our test matrix $A$ for these experiments, we use the idea proposed by Wright et al. (Wright et al., 2009). First, we generate the low-rank matrix, $L$ , as a product of two independent full-rank matrices of size $m\times r$ with $r<m$ , whose elements are independent and identically distributed (i.i.d.) samples from the standard normal distribution $\mathcal{N}(0,1)$ . We generate the sparse matrix, $S$ , such that its elements are chosen from the interval $[-500,500]$ . We create the sparse support set by using the operator (2). Finally, we write $A$ as $A=L+S$ . We fix $m=200$ and define $\rho_{r}={\rm rank}(L)/m$ , where ${\rm rank}(L)$ varies. We choose the sparsity level $\alpha\in(0,1)$ .
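A minimal sketch of this construction in NumPy follows. Two choices are assumptions made for illustration only: the support of $S$ is drawn uniformly at random with density $\alpha$ instead of through the operator (2), and the non-zero entries are drawn uniformly from $[-500,500]$. The function name `make_test_matrix` is hypothetical.

```python
import numpy as np

def make_test_matrix(m, r, alpha, seed=0):
    """Return (A, L, S) with A = L + S, rank(L) = r and a sparse S."""
    rng = np.random.default_rng(seed)
    # Low-rank part: product of two independent m x r Gaussian factors.
    L = rng.standard_normal((m, r)) @ rng.standard_normal((m, r)).T
    # Sparse part: random support (assumption), values in [-500, 500].
    support = rng.random((m, m)) < alpha
    S = np.zeros((m, m))
    S[support] = rng.uniform(-500.0, 500.0, size=int(support.sum()))
    return L + S, L, S

A, L, S = make_test_matrix(m=200, r=10, alpha=0.05)
print(np.linalg.matrix_rank(L) / 200)  # rho_r = rank(L) / m
```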

Phase transition experiments. For each pair $(\rho_{r},\alpha)$ , we apply iEALM, RPCA NCF, and our algorithm to recover the pair $(\hat{L},\hat{S})$ . For iEALM, we set $\lambda=1/\sqrt{m}$ and use $\mu=1.25/\|A\|_{2}$ and $\rho=1.5$ , where $\|A\|_{2}$ is the spectral norm (maximum singular value) of $A$ . For a given $\varepsilon>0$ , if the recovered matrix pair $(\hat{L},\hat{S})$ satisfies the relative error bound $\tfrac{\|A-\hat{L}-\hat{S}\|_{F}}{\|{A}\|_{F}}<\varepsilon$ , then we consider the reconstruction viable. In Figure 1, we produce the phase transition diagrams to show the fraction of perfect recovery of $A$ , where white denotes success and black denotes failure. We run each experiment 5 times and plot the results. iEALM succeeds approximately below the line $\rho_{r}+\alpha\approx 0.25$ . On the other hand, the performance of our best pair RPCA is very similar to that of (Dutta et al., 2018a) when the sparsity level $\alpha$ is small, and both approaches efficiently provide a feasible reconstruction for any $\rho_{r}$ in that case. We also note that for a low sparsity level, iEALM can only provide a feasible reconstruction for $\rho_{r}\leq 0.25$ . Due to their robustness to any low-rank structure when $\alpha$ is low, RPCA NCF and best pair RPCA can prove very effective in many real-world applications. In many real-world problems, video or image data can have an arbitrary inherent low-rank structure and are generally corrupted by very sparse outliers of arbitrarily large magnitude. In those instances, RPCA NCF and our best pair RPCA could be very useful. We give further justification in the later sections.

Root mean square error measure. To validate our performance against RPCA GD of Yi et al. (Yi et al., 2016), we use a different metric, the root mean square error (RMSE). Since RPCA GD does not explicitly recover a sparse matrix, $S$ , it would be unfair to evaluate it with the same relative error. Therefore, for the true low-rank matrix, $L$ , and a low-rank recovery, $\hat{L}$ , we use the metric $\nicefrac{{\|L-\hat{L}\|_{F}}}{{\sqrt{mn}}}$ as the measure of RMSE. From Figure 2, we can conclude that our best pair RPCA has a lower RMSE than RPCA GD. Moreover, the RMSE remains unchanged as the cardinality of the support set, $\Omega$ , increases. Also, see Figure 9 in the Appendix.
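For reference, the two recovery metrics used in the synthetic experiments (the relative error of the phase transition test and the RMSE) can be written directly in NumPy. The function names are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def relative_error(A, L_hat, S_hat):
    """||A - L_hat - S_hat||_F / ||A||_F, compared against a tolerance epsilon."""
    return np.linalg.norm(A - L_hat - S_hat, "fro") / np.linalg.norm(A, "fro")

def rmse(L, L_hat):
    """||L - L_hat||_F / sqrt(m n) for an m x n low-rank target L."""
    m, n = L.shape
    return np.linalg.norm(L - L_hat, "fro") / np.sqrt(m * n)

L = np.arange(6.0).reshape(2, 3)
S = np.eye(2, 3)
A = L + S
print(relative_error(A, L, S), rmse(L, L + 1.0))
```

Note that the RMSE only involves the low-rank component, which is what makes it applicable to methods such as RPCA GD that do not return $\hat{S}$.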

Removal of shadows and specularities.

The set of images of an object under unknown pose and arbitrary lighting conditions forms a convex cone in the space of all possible images, which may have unbounded dimension (Basri and Jacobs, 2003; Belhumeur and Kriegman, 1998). However, the images under distant, isotropic lighting can be approximated by a 9-dimensional linear subspace, which is popularly referred to as the harmonic plane. We used three subjects, B11, B12, and B13, from the Extended Yale Face Database (Georghiades et al., 2001) for our simulations. We used 63 downsampled images of resolution $120\times 160$ of each subject. For APG and iEALM, we set the parameters the same as in the previous section. For RPCA GD, RPCA NCF, and our method, we set the target rank $r=9$ and the sparsity level to $0.1$ . The qualitative analysis of the recovered images in Figure 3 shows that, while RPCA GD recovers patchy and granular face images, our best pair reformulation provides a reconstruction comparable to that of iEALM, APG, and RPCA NCF.

Background estimation from video sequences.

Background estimation or moving object tracking (Bouwmans et al., 2017; Dutta, 2016; Bouwmans et al., 2016; Dutta et al., 2017a; Bouwmans, 2014; Dutta et al., 2017b, 2018b; Dutta and Richtárik, 2019; Dutta and Li, 2017) is considered one of the classic problems in computer vision and is a crucial component in human activity recognition, tracking, and video analysis from surveillance cameras. When the video is captured by a static camera, minimizing the rank of the matrix $A\in\mathbb{R}^{m\times n}$ that concatenates $n$ video frames (after converting them into vectors) yields a low-rank component, $L$ , that represents the linear subspace containing the background, and an error, $S$ , that emphasizes the foreground components. However, the desired rank is often tuned empirically, since the ideal rank-one background is often unrealistic: changing illumination, occluded foreground/background objects, reflection, and noise are typically also part of the video frames. Based on this observation, the problem can typically be cast as (4). However, as we explained, when the target rank and the sparsity level are user-inferred hyperparameters, one might use a different approach as in (Zhou and Tao, 2011; Dutta et al., 2018a; Yi et al., 2016). Additionally, there might be missing or unobserved pixels in the video, which makes the problem more complex; only a few methods, such as RPCA NCF, GRASTA (He et al., 2012), and RPCA GD, can handle that. Therefore, we tested our best pair RPCA against a wide range of methods.

In our experiments, we use two different video sequences: (i) the Basic sequence from the Stuttgart synthetic dataset (Brutzer et al., 2011), and (ii) the waving tree sequence (Toyama et al., 1999). We extensively use the Stuttgart video sequence as it is a challenging sequence that comprises both static and dynamic foreground objects and varying illumination in the background. Additionally, it comes with foreground ground truth for each frame. For iEALM and APG, we set the parameters the same as in the previous sections. For best pair RPCA, RPCA GD, RPCA NCF, and GoDec, we set $r=2$ and the target sparsity to 10%; additionally, for GoDec, we set $q=2$ . For GRASTA, we set the parameters the same as those mentioned on the authors' website (gra, 2012). The qualitative analysis of the background and foreground recovered under both full observation (Figure 4) and partial observation (Figure 5) suggests that our method recovers a visually better background and foreground compared to the other methods. Note that RPCA GD recovers a fragmentary foreground with more false positives compared to our method; moreover, RPCA GD, GRASTA, iEALM, and APG cannot remove the static foreground object. We provide a detailed quantitative evaluation of our best pair RPCA with respect to the $\varepsilon$ -proximity metric $d_{\varepsilon}(X,Y)$ as in (Dutta et al., 2018a) and the mean structural similarity index measure (SSIM) of (Wang et al., 2004) in recovering the foreground objects in Figures 10 and 11 in the Appendix.
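The data matrix used for background estimation is obtained by vectorizing each frame and stacking the vectors as columns. A small sketch of this step, assuming grayscale frames stored as an array of shape `(n, h, w)` (the storage layout and function names are assumptions for illustration), is:

```python
import numpy as np

def frames_to_matrix(frames):
    """Stack n frames of shape (h, w) as the columns of an (h*w) x n matrix."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[0]
    return frames.reshape(n, -1).T

def matrix_to_frames(A, frame_shape):
    """Inverse of frames_to_matrix: column j becomes frame j."""
    h, w = frame_shape
    return A.T.reshape(-1, h, w)

# A perfectly static scene gives a rank-one data matrix.
background = np.random.default_rng(0).random((3, 4))
static_video = np.repeat(background[None, :, :], 5, axis=0)
A_static = frames_to_matrix(static_video)
print(A_static.shape, np.linalg.matrix_rank(A_static))
```

The rank-one outcome for a static scene is the idealized case mentioned above; illumination changes and noise raise the effective rank, which is why a small target rank such as $r=2$ is tuned empirically.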

Supplementary Material

This supplementary material is organized as follows: extra supporting numerical experiments are reported in Section A; the proof of the global convergence result of Algorithm 1 is provided in Section B; the proof of local linear convergence and a numerical example are provided in Section C. Lastly, Section D provides a comprehensive table listing all baselines we compare to.

Appendix A Extra Experiments

In this section, we empirically study convergence properties of Algorithm 1 on synthetic, well-understood data. In particular, we examine its sensitivity to user-specified parameters $\gamma,a_{k},b_{k}$ , target sparsity level $\alpha$ , target rank $r$ and lastly the sensitivity to initialization. Moreover, we provide extra phase transition diagrams and both quantitative and qualitative results on the inlier detection problem.

A.1 Sensitivity to the choice of $\gamma,a_{k},b_{k}$

In this experiment, we compare different choices of algorithm parameters $\gamma,a_{k},b_{k}$ on instances of (9) with various target sparsity levels $\alpha$ and target ranks $r$ . In each experiment, we make sure that a solution exists: we generate random matrices $\tilde{L},\tilde{S}$ (with independent entries ${\cal N}(0,1)$ ), project them onto the low-rank and sparse constraint sets respectively to obtain $\hat{L},\hat{S}$ , and set $A=\hat{L}+\hat{S}$ . For simplicity we consider only $a_{k}=b_{k}=a$ and $m=n=100$ . Figure 6 shows the result. We see that the parameter choice $\gamma=1.1,a_{k}=b_{k}=\frac{1}{2}$ is the most reliable.
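The feasible test instances described above can be generated along the following lines. The rank projection is the usual truncated SVD; for the sparse set, the sketch keeps the globally largest entries in magnitude, which is a simplification assumed here for brevity rather than the row- and column-wise rule used in the experiments. All function names are hypothetical.

```python
import numpy as np

def project_rank(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def project_sparse(M, alpha):
    """Keep the round(alpha * M.size) largest-magnitude entries (global top-k)."""
    k = int(round(alpha * M.size))
    flat = M.ravel()
    out = np.zeros(M.size)
    if k > 0:
        idx = np.argpartition(np.abs(flat), M.size - k)[M.size - k:]
        out[idx] = flat[idx]
    return out.reshape(M.shape)

def feasible_instance(m, n, r, alpha, seed=0):
    """A = L_hat + S_hat with L_hat of rank r and S_hat sparse, so a solution exists."""
    rng = np.random.default_rng(seed)
    L_hat = project_rank(rng.standard_normal((m, n)), r)
    S_hat = project_sparse(rng.standard_normal((m, n)), alpha)
    return L_hat + S_hat, L_hat, S_hat

A, L_hat, S_hat = feasible_instance(100, 100, r=5, alpha=0.05)
print(np.linalg.matrix_rank(L_hat), np.count_nonzero(S_hat))
```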

A.2 Sensitivity to the choice of $r,\alpha$

In this experiment, we examine how sensitive Algorithm 1 is to the correct choice of the target sparsity level $\alpha$ and the target rank $r$ .

In each experiment, we generate random matrices $\tilde{L},\tilde{S}$ (with independent entries ${\cal N}(0,1)$ ), project them onto $\hat{r}$ -low rank and $\hat{\alpha}$ -sparse constraint set respectively to obtain $\hat{L},\hat{S}$ and set $A=\hat{L}+\hat{S}$ . Then, we run Algorithm 1 with various choices of $r,\alpha$ and report the results. For simplicity we consider only $\gamma=1.1,a_{k}=b_{k}=\frac{1}{2}$ (from the previous experiment) and $m=n=100$ . Figure 7 shows the result. We can see that if sparsity level is underestimated, the method converges very slowly. Moreover, the method is more sensitive to the correct choices of target sparsity than target rank. The last take-away from this experiment is that over-estimation of target parameters usually leads to slightly slower convergence.

A.3 Sensitivity to the choice of the starting point

In the last experiment, we examine how the starting point influences the convergence rate. For each problem instance, we perform 50 independent runs of Algorithm 1 and report the best, worst and median performance.

For simplicity, we consider only problems with known target rank and sparsity – we generate random matrices $\tilde{L},\tilde{S}$ (with independent entries ${\cal N}(0,1)$ ), project them onto low rank and sparse constraint set respectively to obtain $\hat{L},\hat{S}$ and set $A=\hat{L}+\hat{S}$ . Further, we set $a_{k}=b_{k}=0.5$ , $\gamma=1.1$ and $m=n=100$ . Figure 8 shows the result. We can see that the convergence speed of Algorithm 1 is, in most cases, not influenced significantly by the starting point. Thus, the non-convex nature of the problem is surprisingly not causing any issues. Lastly, the convergence rate of Algorithm 1 is faster for small values of $\alpha,r$ , which is often the most interesting case in terms of the practical application.

A.3.1 Inlier detection

Historically, PCA and RPCA are used in detecting the inliers and the outliers from a composite dataset. We combined 400 random, grayscale, downsampled ( $20\times 20$ pixels) natural images from the BACKGROUND/Google folder of the Caltech101 database (Fei-Fei et al., 2007) with the Extended Yale Face Database to construct the data set. The inliers are the grayscale images of faces (of the same resolution) under different illuminations, while the 400 random natural images serve as outliers. The goal is to consider a low-dimensional model and to project the inliers onto a 9-dimensional linear subspace where the images of the same face lie. Goes et al. (Goes et al., 2014) designed seven algorithms to explicitly find a low-rank subspace. To this end, they used the classical SGD, an incremental approach, and mirror descent algorithms to find the 9-dimensional subspace. In contrast, we split the dataset, $A$ , into a rank-9 component $L$ and expect the outliers to be in the sparse set, $S$ . Once we find $L$ , we find a basis of $L$ via orthogonalization and project the faces onto it. In Figure 12, we show the qualitative results of our experiments.³

³ The code and datasets for the experiments in Section A.3.1 were obtained from https://github.com/jwgoes/RSPCA.

As proposed in (Goes et al., 2014), we use the normalized error term $\|P_{L}-P_{L^{*}}\|_{F}/{3\sqrt{2}}$ , where $L$ is the subspace fitted by PCA to the set of inliers and $L^{*}$ is the subspace fitted by the different algorithms. Note that the metric is expected to lie between 0 and 1, where smaller is better. We refer to Table 1 for our quantitative results.
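The normalization constant can be understood from the identity $\|P_{L}-P_{L^{*}}\|_{F}^{2}=2d-2\,\mathrm{Trace}(P_{L}P_{L^{*}})$ for two $d$-dimensional subspaces, whose maximum $2d$ equals $18=(3\sqrt{2})^{2}$ when $d=9$. A short NumPy sketch of the metric, assuming both subspaces are given by full-column-rank basis matrices of the same dimension (function names are illustrative), is:

```python
import numpy as np

def subspace_projector(basis):
    """Orthogonal projector onto the column span of `basis`."""
    q, _ = np.linalg.qr(basis)
    return q @ q.T

def normalized_subspace_error(basis_a, basis_b):
    """||P_a - P_b||_F / sqrt(2 d); equals ||P_a - P_b||_F / (3 sqrt(2)) when d = 9."""
    d = basis_a.shape[1]
    diff = subspace_projector(basis_a) - subspace_projector(basis_b)
    return np.linalg.norm(diff, "fro") / np.sqrt(2 * d)

I = np.eye(20)
same = normalized_subspace_error(I[:, :9], I[:, :9])         # identical subspaces
orthogonal = normalized_subspace_error(I[:, :9], I[:, 9:18])  # orthogonal subspaces
print(same, orthogonal)
```

Identical subspaces give an error of zero and mutually orthogonal ones give the maximal value of one.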

Appendix B Proof of the global convergence

For convenience, define $\Delta_{k}\stackrel{{\scriptstyle{\mathrm{def}}}}{{=}}{|\kern-1.125pt|}Y_{k}-Y_{k-1}{|\kern-1.125pt|}$ .

Lemma B.1.

For the update of $Y_{k+1}$ in (13), given any $k\in\mathbb{N}$ , define

[TABLE]

Then, we have $G_{k+1}\in\partial\Phi(Y_{k+1})$ , and

[TABLE]

Proof.

From the definition of proximity operator and the update of $Y_{k+1}$ (13), we have $Z_{a,k}-\gamma\nabla\mathcal{F}(Z_{b,k})-Y_{k+1}\in\gamma\partial\mathcal{R}(Y_{k+1})$ . Adding $\gamma\nabla\mathcal{F}(Y_{k+1})$ to both sides, we obtain

[TABLE]

Applying further the triangle inequality together with the Lipschitz continuity of $\nabla\mathcal{F}$ , we get

[TABLE]

which concludes the proof. ∎

Lemma B.2.

For Algorithm 1, given the parameters $\gamma,a_{k},b_{k}$ , the following inequality holds:

[TABLE]

Proof.

Define the function

[TABLE]

It can be shown that the update of $Y_{k+1}$ in (13) is equivalent to

[TABLE]

which implies that $\mathcal{L}_{k}(Y_{k+1})\leq\mathcal{L}_{k}(Y_{k})$ , that is,

[TABLE]

Therefore, we get

[TABLE]

Since $\nabla\mathcal{F}$ is $L$ -Lipschitz, then

[TABLE]

For the inner product $\langle Y_{k}-Y_{k+1},\,Y_{k}-Y_{k-1}\rangle$ , applying the Pythagorean relation $2\langle c_{1}-c_{2},\,c_{1}-c_{3}\rangle={|\kern-1.125pt|}c_{1}-c_{2}{|\kern-1.125pt|}^{2}+{|\kern-1.125pt|}c_{1}-c_{3}{|\kern-1.125pt|}^{2}-{|\kern-1.125pt|}c_{2}-c_{3}{|\kern-1.125pt|}^{2}$ , we get

[TABLE]

Using further Young’s inequality with $\nu>0$ we obtain

[TABLE]

Combining the above $3$ inequalities with (21) yields

[TABLE]

which leads to

[TABLE]

Owing to the definition of $\underline{\beta}$ and $\mkern 1.5mu\overline{\mkern-1.5mu{\alpha}\mkern-1.5mu}\mkern 1.5mu$ we conclude the proof. ∎

Define the product space $\mathcal{H}\stackrel{{\scriptstyle{\mathrm{def}}}}{{=}}\mathbb{R}^{n}\times\mathbb{R}^{n}$ and ${Z}_{k}=(Y_{k},Y_{k-1})\in\mathcal{H}$ . Then, given ${Z}_{k}$ , define the function

[TABLE]

which is a KL function if $\Phi$ is. Denote by $\mathcal{C}_{Y_{k}},\mathcal{C}_{{Z}_{k}}$ the sets of cluster points of the sequences $\{Y_{k}\}_{k\in\mathbb{N}}$ and $\{{Z}_{k}\}_{k\in\mathbb{N}}$ respectively, and let $\mathrm{crit}(\Psi)=\{Z=(Y,Y)\in\mathcal{H}:Y\in\mathrm{crit}(\Phi)\}$ .

Lemma B.3.

For Algorithm 1, choose $\nu,\gamma,a_{k},b_{k}$ such that (17) holds. If $\Phi$ is bounded from below, then

(i) $\sum_{k\in\mathbb{N}}\Delta_{k}^{2}<+\infty$ ;

(ii) The sequence $\Psi({Z}_{k})$ is monotonically decreasing and convergent;

(iii) The sequence $\Phi(Y_{k})$ is convergent.

Proof.

Define $\delta=\underline{\beta}-\mkern 1.5mu\overline{\mkern-1.5mu{\alpha}\mkern-1.5mu}\mkern 1.5mu>0$ . From Lemma B.2, we have

[TABLE]

Let $Y_{-1}=Y_{0}$ and sum the above inequality over $k$ :

[TABLE]

which means, as $\Phi(Y_{0})$ is finite and $\Phi$ is bounded from below,

[TABLE]

From Lemma B.2, by pairing terms on both sides of (19), we get

[TABLE]

Since we assume $\underline{\beta}-\mkern 1.5mu\overline{\mkern-1.5mu{\alpha}\mkern-1.5mu}\mkern 1.5mu>0$ , the sequence $\Psi({Z}_{k})$ is monotonically non-increasing. The convergence of $\Phi(Y_{k})$ is straightforward. ∎

Lemma B.4.

For Algorithm 1, choose $\nu,\gamma,a_{k},b_{k}$ such that (17) holds. If $\Phi$ is bounded from below and $\{Y_{k}\}_{k\in\mathbb{N}}$ is bounded, then $Y_{k}$ converges to a critical point of $\Phi$ .

Proof.

Since $\{Y_{k}\}_{k\in\mathbb{N}}$ is bounded, there exists a subsequence $\{Y_{k_{j}}\}_{j\in\mathbb{N}}$ and a cluster point $\overline{Y}$ such that $Y_{k_{j}}\to\overline{Y}$ as $j\to\infty$ . Next we show that $\Phi(Y_{k_{j}})\to\Phi(\overline{Y})$ and that $\overline{Y}$ is a critical point of $\Phi$ .

Since $\mathcal{R}$ is lsc, then $\liminf_{j\to\infty}\mathcal{R}(Y_{k_{j}})\geq\mathcal{R}(\overline{Y})$ . From (20), we have $\mathcal{L}_{k_{j}-1}(Y_{k_{j}})\leq\mathcal{L}_{k_{j}-1}(\overline{Y})$ and thus

[TABLE]

Taking the limit of the above inequality and using $\Delta_{k}^{2}\to 0$ , $Y_{k_{j}}\to\overline{Y}$ , we get $\limsup_{j\to\infty}\mathcal{R}(Y_{k_{j}})\leq\mathcal{R}(\overline{Y})$ . As a result, $\lim_{j\to\infty}\mathcal{R}(Y_{k_{j}})=\mathcal{R}(\overline{Y})$ . Since $\mathcal{F}$ is continuous, $\mathcal{F}(Y_{k_{j}})\to\mathcal{F}(\overline{Y})$ , hence $\Phi(Y_{k_{j}})\to\Phi(\overline{Y})$ .

Furthermore, owing to Lemma B.1, $G_{k_{j}}\in\partial\Phi(Y_{k_{j}})$ , and by (i) of Lemma B.3 we have $G_{k_{j}}\to 0$ as $j\to\infty$ . Therefore, as $j\to\infty$ , we have

[TABLE]

Hence $0\in\partial\Phi(\overline{Y})$ , i.e. $\overline{Y}$ is a critical point. ∎

Proof of Theorem 2.3.

Putting together the above lemmas, we draw the following useful conclusions:

(C.1) Denote $\delta=\underline{\beta}-\mkern 1.5mu\overline{\mkern-1.5mu{\alpha}\mkern-1.5mu}\mkern 1.5mu$ , then $\Psi({Z}_{k+1})+\delta\Delta_{k+1}^{2}\leq\Psi({Z}_{k})$ ;

(C.2) Define

[TABLE]

then we have $W_{k}\in\partial\Psi({Z}_{k})$ . Owing to Lemma B.1, there exists a $\sigma>0$ such that ${|\kern-1.125pt|}W_{k}{|\kern-1.125pt|}\leq\sigma(\Delta_{k}+\Delta_{k-1})$ ;

(C.3) If $Y_{k_{j}}$ is a subsequence such that $Y_{k_{j}}\to\overline{Y}$ , then $\Psi({Z}_{k})\to\Psi(\overline{Z})$ where $\overline{Z}=(\overline{Y},\overline{Y})$ ;

(C.4) $\mathcal{C}_{{Z}_{k}}\subseteq\mathrm{crit}(\Psi)$ ;

(C.5) $\lim_{k\to\infty}\mathrm{dist}({Z}_{k},\mathcal{C}_{{Z}_{k}})=0$ ;

(C.6) $\mathcal{C}_{{Z}_{k}}$ is non-empty, compact and connected;

(C.7) $\Psi$ is finite and constant on $\mathcal{C}_{{Z}_{k}}$ .

Next we prove the claims of Theorem 2.3.

(i)

Consider a critical point of $\Phi$ , $\overline{Y}\in\mathrm{crit}(\Phi)$ , such that $\overline{Z}=(\overline{Y},\overline{Y})\in\mathcal{C}_{{Z}_{k}}$ . Then owing to (C.3), we have $\Psi({Z}_{k})\to\Psi(\overline{Z})$ .

Suppose there exists $K$ such that $\Psi(Z_{K})=\Psi(\overline{Z})$ . Then, the descent property (C.1) implies that $\Psi({Z}_{k})=\Psi(\overline{Z})$ holds for all $k\geq K$ . Thus, ${Z}_{k}$ is constant for $k\geq K$ , hence has finite length.

On the other hand, suppose that $\psi_{k}\stackrel{{\scriptstyle{\mathrm{def}}}}{{=}}\Psi({Z}_{k})-\Psi(\overline{Z})>0$ . Owing to (C.6), (C.7) and Definition 2.2, the KL property of $\Psi$ implies that there exist $\varepsilon,\eta$ and a concave function $\varphi$ , and

[TABLE]

such that for all $Z\in\mathcal{U}$ :

[TABLE]

Let $k_{1}\in\mathbb{N}$ be such that $\Psi({Z}_{k})<\Psi(\overline{Z})+\eta$ holds for all $k\geq k_{1}$ . Owing to (C.5), there exists another $k_{2}\in\mathbb{N}$ such that $\mathrm{dist}({Z}_{k},\mathcal{C}_{{Z}_{k}})<\varepsilon$ holds for all $k\geq k_{2}$ . Let $K=\max\{k_{1},k_{2}\}$ . Then ${Z}_{k}\in\mathcal{U}$ holds for all $k\geq K$ . Furthermore, using (26), we have for $k\geq K$

[TABLE]

Note that since $\varphi$ is concave, $\varphi^{\prime}$ is decreasing. As $\Psi({Z}_{k})$ is decreasing too, we have

[TABLE]

From (C.2), since $\mathrm{dist}(0,\partial\Psi({Z}_{k}))\leq{|\kern-1.125pt|}W_{k}{|\kern-1.125pt|}$ , we get

[TABLE]

Moreover, (C.1) yields $\Psi({Z}_{k})-\Psi({Z}_{k+1})\geq\delta\Delta_{k+1}^{2}$ and thus

[TABLE]

which yields

[TABLE]

Taking the square root of both sides and applying Young’s inequality we further obtain

[TABLE]

Summing up both sides over $k$ , and using $Y_{0}=Y_{-1}$ , we get

[TABLE]

which establishes the finite length property of $Y_{k}$ .

(ii)

The convergence of the sequence follows from the fact that $\{Y_{k}\}_{k\in\mathbb{N}}$ is a Cauchy sequence, hence convergent. Owing to Lemma B.4, there exists a critical point $Y^{\star}\in\mathrm{crit}(\Phi)$ such that $\lim_{k\to\infty}Y_{k}=Y^{\star}$ .

(iii)

We now turn to prove local convergence to a global minimizer. Note that if $Y^{\star}$ is a global minimizer of $\Phi$ , then ${Z}^{\star}$ is a global minimizer of $\Psi$ . Let $r>\rho>0$ such that ${\mathds{B}}_{r}({Z}^{\star})\subset\mathcal{U}$ and $\eta<\delta(r-\rho)^{2}$ . Suppose that the initial point $Y_{0}$ is chosen such that following conditions hold,

[TABLE]

The descent property (C.1) of $\Psi$ together with (29) implies that for any $k\in\mathbb{N}$ , $\Psi({Z}^{\star})\leq\Psi({Z}_{k+1})\leq\Psi({Z}_{k})\leq\Psi(Z_{0})<\Psi({Z}^{\star})+\eta$ , and

[TABLE]

Therefore, given any $k\in\mathbb{N}$ , if we have $Y_{k}\in{\mathds{B}}_{\rho}(Y^{\star})$ , then

[TABLE]

which means that $Y_{k+1}\in{\mathds{B}}_{r}(Y^{\star})$ .

For any $k\in\mathbb{N}$ , define the following partial sum $p_{k}\stackrel{{\scriptstyle{\mathrm{def}}}}{{=}}\sum_{j=k-2}^{k-1}\sum_{i=1}^{j}\Delta_{i}$ . Note that $p_{k}=0$ for $k=1$ , and $\lim_{k\to+\infty}p_{k}=\ell$ . Next we prove the following claims through induction: for $k\in\mathbb{N}$

[TABLE]

From (31) we have

[TABLE]

Applying the triangle inequality we then obtain

[TABLE]

which means $Y_{1}\in{\mathds{B}}_{\rho}(Y^{\star})$ . Now, taking $\kappa=1$ in (28) yields, for any $k\in\mathbb{N}$ ,

[TABLE]

Let $k=1$ . Since $Y_{0}=Y_{-1}$ , we have

[TABLE]

Therefore, (33) and (34) hold for $k=1$ .

Now assume that they hold for some $k>1$ . Using the triangle inequality and (34),

[TABLE]

As $\varphi(\psi)\geq 0$ and $\varphi^{\prime}(\psi)>0$ for $\psi\in]0,\eta[$ , and in view of (30), we arrive at

[TABLE]

whence we deduce that (33) holds at $k+1$ . Now, taking (36) at $k+1$ gives

[TABLE]

Adding both sides of (37) and (34) we get

[TABLE]

meaning that (34) holds at $k+1$ . This concludes the induction proof.

In summary, the above result shows that if we start close enough to $Y^{\star}$ (so that (29)-(30) hold), then the sequence $\{Y_{k}\}_{k\in\mathbb{N}}$ remains in the neighbourhood ${\mathds{B}}_{\rho}(Y^{\star})$ and thus converges to a critical point $\overline{Y}$ owing to Lemma B.4. Moreover, $\Psi({Z}_{k})\to\Psi(\overline{Z})\geq\Psi({Z}^{\star})$ by virtue of (C.3). Now we need to show that $\Psi(\overline{Z})=\Psi({Z}^{\star})$ . Suppose that $\Psi(\overline{Z})>\Psi({Z}^{\star})$ . As $\Psi$ has the KL property at ${Z}^{\star}$ , we have

[TABLE]

But this is impossible since $\varphi^{\prime}(s)>0$ for $s\in]0,\eta[$ , and $\mathrm{dist}\big{(}{0,\partial\Psi(\overline{Z})}\big{)}=0$ as $\overline{Z}$ is a critical point. Hence we have $\Psi(\overline{Z})=\Psi({Z}^{\star})$ , which means $\Phi(\overline{Y})=\Phi(Y^{\star})$ , i.e. the cluster point $\overline{Y}$ is actually a global minimizer. This concludes the proof. ∎

Appendix C Proof of local linear convergence

Before presenting the proof of local linear convergence, in Figure 13 below we compare the theoretical rate estimate with practical observation. The problem size is $32\times 32$ , which is kept small since larger sizes make the rate estimation very slow. It can be observed that our theoretical rate estimate is tight, as the red line and the black one are parallel to each other.
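On a semi-logarithmic plot, a sequence of the form $C\rho^{k}$ appears as a straight line of slope $\log\rho$, so two parallel lines correspond to equal linear rates. One simple way to extract the observed rate from a recorded error sequence is a least-squares fit of the logarithm of the error; the sketch below is an illustrative procedure, not necessarily the one used to produce Figure 13, and the function name is hypothetical.

```python
import numpy as np

def estimate_linear_rate(errors):
    """Fit log(errors[k]) ~ k * log(rho) + c and return the estimated rho.
    `errors` must be strictly positive, e.g. the tail of ||Y_k - Y*||."""
    errors = np.asarray(errors, dtype=float)
    k = np.arange(errors.size)
    slope, _ = np.polyfit(k, np.log(errors), 1)
    return float(np.exp(slope))

# Synthetic check on an exactly geometric sequence.
k = np.arange(30)
rate = estimate_linear_rate(3.0 * 0.8 ** k)
print(rate)
```

In practice only the tail of the sequence should be used for the fit, since the linear regime is guaranteed only for $k$ large enough.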

Since we are in the non-convex setting, we need the notion of prox-regularity. A lower semi-continuous function $\mathcal{R}$ is $r$ -prox-regular at $\bar{x}\in\mathrm{dom}(\mathcal{R})$ for $\bar{v}\in\partial\mathcal{R}(\bar{x})$ if $\exists r>0$ such that $\mathcal{R}(x^{\prime})>\mathcal{R}(x)+\langle v,\,x^{\prime}-x\rangle-\frac{1}{2r}{|\kern-1.125pt|}x-x^{\prime}{|\kern-1.125pt|}^{2}$ $\forall x,x^{\prime}$ near $\bar{x}$ , $\mathcal{R}(x)$ near $\mathcal{R}(\bar{x})$ and $v\in\partial\mathcal{R}(x)$ near $\bar{v}$ .

To prove Theorem 2.4, we rely on the concept of partial smoothness. Let $\mathcal{M}\subset\mathbb{R}^{n}$ be a $C^{2}$ -smooth submanifold, and let $\mathcal{T}_{\mathcal{M}}(x)$ denote the tangent space of $\mathcal{M}$ at any point $x\in\mathcal{M}$ .

Definition C.1.

The function $\mathcal{R}:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ is $C^{2}$ -partly smooth at $\bar{x}\in\mathcal{M}$ relative to $\mathcal{M}$ for $\bar{v}\in\partial\mathcal{R}(\bar{x})\neq\emptyset$ if $\mathcal{M}$ is a $C^{2}$ -submanifold around $\bar{x}$ , and

(i) (Smoothness): $\mathcal{R}$ restricted to $\mathcal{M}$ is $C^{2}$ around $\bar{x}$ ;

(ii) (Regularity): $\mathcal{R}$ is regular at all $x\in\mathcal{M}$ near $\bar{x}$ and $\mathcal{R}$ is $r$ -prox-regular at $\bar{x}$ for $\bar{v}$ ;

(iii) (Sharpness): $\mathcal{T}_{\mathcal{M}}(\bar{x})=\mathrm{par}(\partial\mathcal{R}(\bar{x}))^{\perp}$ ;

(iv) (Continuity): The set-valued mapping $\partial\mathcal{R}$ is continuous at $\bar{x}$ relative to $\mathcal{M}$ .

We denote the class of partly smooth functions at $x$ relative to $\mathcal{M}$ for $v$ as $\mathrm{PSF}_{x,v}(\mathcal{M})$ . Partial smoothness was first introduced in (Lewis, 2003) and its directional version stated here is due to (Lewis and Zhang, 2013; Drusvyatskiy and Lewis, 2013). Prox-regularity is sufficient to ensure that the partly smooth submanifolds are locally unique (Lewis and Zhang, 2013, Corollary 4.12), (Drusvyatskiy and Lewis, 2013, Lemma 2.3 and Proposition 10.12).

Proof of Theorem 2.4.

First, we have:

– $\mathcal{Y}_{L}$ is the set of fixed-rank matrices, hence it is partly smooth.

– Since $\mathcal{S}$ is a subspace, it is partly smooth at $S^{\star}$ relative to any $W\in(\mathcal{S})^{\bot}$ .

Under the conditions of Theorem 2.3, there exists a critical point $Y^{\star}$ such that $Y_{k}\to Y^{\star}$ and $\Phi(Y_{k})\to\Phi(Y^{\star})$ .

The convergence properties of $\{Y_{k}\}_{k\in\mathbb{N}}$ (Theorem 2.3) entail ${|\kern-1.125pt|}Z_{a,k}-Y_{k}{|\kern-1.125pt|}\to 0$ and ${|\kern-1.125pt|}Z_{b,k}-Y^{\star}{|\kern-1.125pt|}\to 0$ . In turn,

[TABLE]

Altogether, this shows that the conditions of (Lewis and Zhang, 2013, Theorem 4.10) or (Drusvyatskiy and Lewis, 2013, Proposition 10.12) are fulfilled on $\mathcal{R}$ at $Y^{\star}$ for $-\nabla\mathcal{F}(Y^{\star})$ , and the identification result follows, that is

[TABLE]

for all $k$ large enough, and we conclude the proof. ∎

Tangent space $T_{\mathcal{X}}^{X^{\star}}$

Given $X^{\star}\in\mathcal{X}$ , the tangent space is simply given by $KX=0$ . Let $E$ be the kernel of $K$ ; then the projection operator onto $KX=0$ reads

[TABLE]

Tangent space of $\mathcal{Y}_{L}$

Let $\mathbb{M}=M_{m,n}(\mathbb{R})$ be the space of $m\times n$ matrices with the classical inner product $\langle A,\,B\rangle=\mathrm{Trace}(A^{T}B)$ . The set of matrices with fixed rank $r$ ,

[TABLE]

is a smooth manifold around any matrix $L\in\mathcal{Y}_{L}$ . Given $L^{\star}$ , with the help of the singular value decomposition $L=U\Sigma V^{T}$ , the tangent space at $L$ to $\mathcal{Y}_{L}$ is

[TABLE]

Let $U=[u_{1},u_{2},\cdots,u_{m}]$ , $V=[v_{1},v_{2},\cdots,v_{n}]$ and let $\Sigma$ be the diagonal matrix with the singular values written in decreasing order.

Denote

[TABLE]

then $\mathcal{L}$ forms a basis of $\mathcal{T}$ and $\mathrm{dim}(\mathcal{L})=mn-r^{2}$ . Therefore, define

[TABLE]

and

[TABLE]

then $\mathrm{P}_{T_{\mathcal{Y}_{L}}^{L^{\star}}}$ is the explicit form of the projection operator onto the subspace $T_{\mathcal{Y}_{L}}^{L^{\star}}$ .
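As a complement to the basis-based construction, the standard matrix-form expression for the orthogonal projection onto the tangent space of the fixed-rank manifold, $Z\mapsto P_{U}Z+ZP_{V}-P_{U}ZP_{V}$ with $P_{U},P_{V}$ the projectors onto the leading $r$ left and right singular subspaces, is easy to check numerically. The sketch below uses this textbook formula; it is offered as a sanity check under that assumption and may differ in presentation from the vectorized operator defined in this section. The function name is hypothetical.

```python
import numpy as np

def tangent_projection_fixed_rank(L, r, Z):
    """Project Z onto the tangent space, at the rank-r matrix L, of the
    manifold of rank-r matrices: P_U Z + Z P_V - P_U Z P_V."""
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    U_r, V_r = U[:, :r], Vt[:r, :].T
    P_U = U_r @ U_r.T
    P_V = V_r @ V_r.T
    return P_U @ Z + Z @ P_V - P_U @ Z @ P_V

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
Z = rng.standard_normal((m, n))
PZ = tangent_projection_fixed_rank(L, r, Z)

# Idempotence and orthogonality of the residual, as expected of a projector.
print(np.allclose(tangent_projection_fixed_rank(L, r, PZ), PZ))
print(np.isclose(np.sum((Z - PZ) * PZ), 0.0))
```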

Tangent space of $\mathcal{S}$

Given $S^{\star}\in\mathcal{S}$ , denote the tangent space as $T_{\mathcal{S}}^{S^{\star}}$ . Let $\mathrm{vec}(S^{\star})$ be the vector form of $S^{\star}$ , then we have

[TABLE]

Finally, we have

[TABLE]

Proof of Theorem 2.6.

From (13), when $a_{k},b_{k}\equiv 0$ , we have that

[TABLE]

Let $Y^{\star}$ be a critical point that $Y_{k}$ converges to, then

[TABLE]

Denote $X_{k}=\mathrm{P}_{\mathcal{X}}(Y_{k})$ and $X^{\star}=\mathrm{P}_{\mathcal{X}}(Y^{\star})$ , we have

[TABLE]

Taking the difference of the above two equations and using Lemma 2.5, we get

[TABLE]

which means

[TABLE]

Note that $\mathcal{P}$ is symmetric positive semi-definite, hence all its eigenvalues are real and lie in $[0,1]$ .

Now, assume that $b_{k}=a_{k}\equiv a$ , then we have from (13)

[TABLE]

Following the derivation of $Y_{k+1}-Y^{\star}$ above, we get

[TABLE]

Together with the definition of $D_{k}$ and the fact that $o({|\kern-1.125pt|}Y_{k}-Y^{\star}{|\kern-1.125pt|})=o({|\kern-1.125pt|}D_{k}{|\kern-1.125pt|})$ , we obtain

[TABLE]

Owing to (Liang, 2016, Chapter 6), if $\rho_{{}_{\mathcal{P}}}<1$ , then $\rho_{{}_{\mathcal{Q}}}<1$ as well, and the linear convergence result follows. ∎
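The non-inertial part of this argument can be illustrated on a toy linear problem. The form of the iteration used below, $Y_{k+1}=\mathrm{P}_{\mathcal{Y}}\big((1-\gamma)Y_{k}+\gamma\mathrm{P}_{\mathcal{X}}(Y_{k})\big)$, is inferred from the definition of $\mathcal{P}$ and is an assumption of this sketch; the two sets are replaced by random linear subspaces of $\mathbb{R}^{12}$ meeting only at the origin, so that $Y^{\star}=0$ and the linearization is exact. Helper names are hypothetical.

```python
import numpy as np

def orth_projector(basis):
    """Orthogonal projector onto the column span of `basis`."""
    q, _ = np.linalg.qr(basis)
    return q @ q.T

def relaxed_alternating_projection(P_X, P_Y, y0, gamma, iters):
    """Iterate y <- P_Y((1 - gamma) y + gamma P_X y), recording ||y_k||."""
    y = P_Y @ y0
    norms = [np.linalg.norm(y)]
    for _ in range(iters):
        y = P_Y @ ((1.0 - gamma) * y + gamma * (P_X @ y))
        norms.append(np.linalg.norm(y))
    return np.array(norms)

rng = np.random.default_rng(0)
n, gamma = 12, 1.0
P_X = orth_projector(rng.standard_normal((n, 4)))
P_Y = orth_projector(rng.standard_normal((n, 5)))

# Spectral radius of P = P_Y ((1 - gamma) Id + gamma P_X) P_Y.
P = P_Y @ ((1.0 - gamma) * np.eye(n) + gamma * P_X) @ P_Y
rho_P = np.max(np.abs(np.linalg.eigvalsh((P + P.T) / 2)))

norms = relaxed_alternating_projection(P_X, P_Y, rng.standard_normal(n), gamma, 40)
print(rho_P, norms[-1] / norms[0])
```

Since every iterate lies in the second subspace, each step is a multiplication by the symmetric matrix $\mathcal{P}$, so the distance to the limit contracts by at least the factor $\rho_{{}_{\mathcal{P}}}$ per iteration in this toy setting.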

Appendix D Table of baseline methods

Bibliography

gra (2012) https://sites.google.com/site/hejunzz/grasta, 2012.
Basri and Jacobs (2003) R. Basri and D. Jacobs. Lambertian reflection and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(3):218–233, 2003.
Bauschke and Combettes (2011) H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2011.
Belhumeur and Kriegman (1998) P. N. Belhumeur and D. J. Kriegman. What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision, 28(3):245–260, 1998.
Bolte et al. (2007) J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
Bolte et al. (2010) J. Bolte, A. Daniilidis, O. Ley, and L. Mazet. Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.
Bouwmans (2014) T. Bouwmans. Traditional and recent approaches in background modeling for foreground detection: An overview. Computer Science Review, 11–12:31–66, 2014.
Bouwmans and Zahzah (2014) T. Bouwmans and E.-H. Zahzah. Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance. Computer Vision and Image Understanding, 122:22–34, 2014.
