Best Pair Formulation & Accelerated Scheme for Non-convex Principal Component Pursuit
Aritra Dutta, Filip Hanzely, Jingwei Liang, Peter Richt\'arik

TL;DR
This paper formulates robust principal component analysis as a best pair problem and introduces an accelerated proximal gradient method with proven convergence and superior performance in experiments.
Contribution
It is the first to formulate RPCA as a best pair problem and develops an accelerated scheme with theoretical convergence guarantees.
Findings
The proposed algorithm outperforms baseline methods in experiments.
Global convergence and local linear rate are established for the scheme.
Numerical results demonstrate superior efficiency on real and synthetic data.
Abstract
The best pair problem aims to find a pair of points that minimize the distance between two disjoint sets. In this paper, we formulate the classical robust principal component analysis (RPCA) as the best pair; which was not considered before. We design an accelerated proximal gradient scheme to solve it, for which we show global convergence, as well as the local linear rate. Our extensive numerical experiments on both real and synthetic data suggest that the algorithm outperforms relevant baseline algorithms in the literature.
| Metric | SGD | R-SGD1 | R-SGD2 | Inc | R-Inc | MD | R-MD | RPCA-F | Best pair | SVT |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7 | 0.86 | 4.66 | 0.77 | 0.72 | 0.67 | 0.67 | 0.78 | 0.76 | 0.79 |
| Algorithm | Abbreviation | Appearing in Experiment | Reference | ||
|---|---|---|---|---|---|
|
iEALM | Fig. 1, 3, 4 | (Lin et al., 2010) | ||
| Accelerated Proximal Gradient | APG | Fig. 3, 4 | (Wright et al., 2009) | ||
| Singular Value Thresholding | SVT | Table 1 | (Cai et al., 2010) | ||
|
GRASTA | Fig. 5 | (He et al., 2012) | ||
| Go Decomposition | GoDec | Fig. 4, 10 | (Zhou and Tao, 2011) | ||
| Robust PCA Gradient Descent | RPCA GD | Fig. 2, 3, 4, 5, 9, 10, 11 | (Yi et al., 2016) | ||
| Robust PCA Nonconvex Feasibility | RPCA NCF | Fig. 1, 3, 4, 5, 10, 11 , 12 | (Dutta et al., 2018a) | ||
| Robust stochastic PCA Algorithms |
|
Fig. 12, Table 1 | (Goes et al., 2014) |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Best Pair Formulation & Accelerated Scheme for Non-convex Principal Component Pursuit
Aritra Dutta
KAUST
Thuwal, KSA
[email protected] Filip Hanzely
KAUST
Thuwal, KSA
Jingwei Liang
University of Cambridge
Cambridge, UK
Peter Richtárik
KAUST
Thuwal, KSA
Abstract
The best pair problem aims to find a pair of points that minimize the distance between two disjoint sets. In this paper, we formulate the classical robust principal component analysis (RPCA) as the best pair; which was not considered before. We design an accelerated proximal gradient scheme to solve it, for which we show global convergence, as well as the local linear rate. Our extensive numerical experiments on both real and synthetic data suggest that the algorithm outperforms relevant baseline algorithms in the literature.
1 Introduction
Let be a given matrix, the generalized low-rank recovery model can be written as
[TABLE]
where is a loss function, is a suitable regularizer, and is a balancing parameter. By an appropriate choice of the loss function and the regularizer, (1) can express a wide range of low-rank approximation problems of matrices. For example, by setting , and — the characteristic function (10) of the set , (1), specializes to:
[TABLE]
which is a best approximation formulation of the classical principal component analysis (PCA). The solution to problem (2) is given by: where is a singular value decomposition (SVD) of and is the hard-thresholding operator that keeps the largest singular values. Although PCA is vastly used and a successful designing tool in different engineering applications, it can only handle the presence of uniformly distributed noise and is rather sensitive to sparse outliers in the data matrix (Lin et al., 2010; Wright et al., 2009; Candès et al., 2011). To overcome this shortcoming and to deal with sparse errors, (Chandrasekaran et al., 2011; Candès et al., 2011) replaced the Frobenius norm in (2) by the pseudo norm, and introduced the celebrated principal component pursuit (PCP) problem:
[TABLE]
However, the above problem is non-convex and NP-hard. One of the most commonly used, tractable surrogate reformulations of (3) is replacing the rank function with nuclear norm and pseudo norm with -norm (Cai et al., 2010; Recht et al., 2010). Exploiting this idea, Robust PCA (RPCA) was introduced as a convex surrogate of the PCP problem (Wright et al., 2009; Lin et al., 2010; Candès et al., 2011):
[TABLE]
It was shown in (Chandrasekaran et al., 2011; Candès et al., 2011) that under a rank-sparsity incoherence assumption, problem (3) can be provably solved via (4), as the solutions of them lie close to each other with high probability.
Besides (4), there are other formulations of RPCA. One of the most popular way is to introduce an auxiliary variable, , and add an additional constraint , which yields:
[TABLE]
This constrained formulation enables several avenues to solve RPCA, such as, the exact and inexact augmented Lagrangian method of multipliers by Lin et al. (Lin et al., 2010), accelerated proximal gradient method (Wright et al., 2009), alternating direction method (Yuan and Yang, 2013), alternating projection with intermediate denoising (Netrapalli et al., 2014), dual approach (Lin et al., 2009), and SpaRCS (Waters et al., 2011), manifold optimization by Yi et al. (Yi et al., 2016) and Zhang and Yang (Zhang and Yang, 2018), are a few popular ones. We refer to (Bouwmans and Zahzah, 2014) for a comprehensive review of RPCA algorithms.
For the discussion above, is fully observed with no data missing. One can consider that is partially observed, that is, there exists a projection operator (or simply a Bernoulli binary mask) on the set of observed data entries and is defined by
[TABLE]
The partial observed version of (5) reads
[TABLE]
Besides (5) and (7), other tractable reformulations of (3) still exist. For example, if the rank and target sparsity is user-inferred then it is common practice to relax the equality constraint in (5) and consider it in the objective function as a penalty. This, together with explicit constraints on the target rank, , and target sparsity level, , (user-inferred hyperparameters), leads to the GoDec formulation (Zhou and Tao, 2011). One can also extend the above model to the case of partially observed data that leads to a more general class of problems that is commonly known as the robust matrix completion (RMC) problem (Chen et al., 2011; Tao and Yang, 2011; Cherapanamjeri et al., 2017b, a) that contains the variant proposed in (Zhou and Tao, 2011) as a special case. With , the matrix completion (MC) problem is also a special case of the RMC problem (Candès and Plan, 2009; Jain et al., 2013; Cai et al., 2010; Jain and Netrapalli, 2015; Candès and Recht, 2009; Keshavan et al., 2010; Candès and Tao, 2010; Mareček et al., 2017; Wen et al., 2012). Lastly, when the whole matrix is observed, the RMC problem is nothing but (5).
Recently, (Dutta et al., 2018a) reformulated (3) as a non-convex feasibility problem, which does not require any objective function, convex relaxation, or surrogate convex constraints. Rather, it exploits the following idea: the solution to the PCP problem lies in the intersection of two sets—one convex and one non-convex, if one considers both the target rank and the target sparsity as hyperparameters. Let X=\Big{(}{{\footnotesize\begin{matrix}S\\ L\end{matrix}}}\Big{)}\in\mathbb{R}^{2m\times n} and where is the identity operator of , define
[TABLE]
Note that is convex and is non-convex111The -sparsity constraint on means that for , each row and column of contains no more than and number of non-zero entries, respectively. This is slightly more complicated than directly applying constraint. However, it often works better in practice.. Given the sets, Dutta et al. (Dutta et al., 2018a) reformulated (3) as non-convex feasibility problem:
[TABLE]
Note that if we replace the in with Bernoulli binary matrix, then we obtain the reformulation of PCP problem with partial observation.
1.1 Formulation and Contributions
In this paper we consider reformulating the feasibility problem (8) as a best pair problem. Given two sets , the best pair problems aims to find a pair of points such that they have the closest distance, that is a the solution of the problem below:
[TABLE]
When the intersection of and is non-empty, that is , (9) reduces to the feasibility problem, with . Given a set , define its characteristic function by
[TABLE]
Then (9) can be equivalently written as
[TABLE]
Observe that for a given , problem (11) becomes which is the Moreau envelope (Bauschke and Combettes, 2011) of of index :
[TABLE]
As a result, we can simplify (11) to the case of only ,
[TABLE]
For the rest of the paper, we focus on (12) and our main contributions are summarised below:
- •
New formulation and a new algorithm for non-convex PCP. We reformulate the non-convex set feasibility formulation of RPCA to a best pair problem. Although our formulation was inspired by formulation (8) from (Dutta et al., 2018a), to the best of our knowledge, we are the first to formulate and solve RPCA via the best pair. To this end, we design a fast and efficient algorithm—an accelerated proximal gradient method—to solve it.
- •
Theoretical convergence guarantees. Both global and local convergence analysis of the scheme are provided. Globally, we show that our algorithm converges to a critical point. If the algorithm additionally starts sufficiently close to the optimum, we show that it converges to a global minimizer. Locally, our algorithm enjoys a fast linear rate, which we can sharply estimate. We owe this novelty to our best pair formulation. In contrast, the non-convex projection RPCA from (Dutta et al., 2018a) or GoDec (Zhou and Tao, 2011) can only guarantee a local linear convergence.
- •
Numerical experiments and applications to real-world problems. We apply the proposed method to several well-tested applications in computer vision. Our extensive experiments on both real and synthetic data suggest that our algorithm matches or outperforms relevant baseline algorithms in fractions of their execution time. Additionally, in the supplementary material, we provide empirical validity of the hyperparameters sensitivity of our approach.
1.2 Notations
Throughout the paper, is the set of non-negative integers. For a nonempty closed convex set , denote the orthogonal projector onto . Let be a lower semi-continuous (lsc) function, its domain is defined as , and it is said to be proper if . We need the following notions from variational analysis, see e.g. (Rockafellar and Wets, 1998) for details. Given , the Fréchet subdifferential of at , is the set of vectors that satisfies . If , then . The limiting-subdifferential (or simply subdifferential) of at , written as , is defined as . Denote . Both and ) are closed, with convex and (Rockafellar and Wets, 1998, Proposition 8.5). Since is lsc, it is (subdifferentially) regular at if and only if (Rockafellar and Wets, 1998, Corollary 8.11). A necessary condition for to be a minimizer of is . The set of critical points of is .
2 An accelerated proximal gradient method
In this section, we describe a gradient-based optimization method for solving (12). Denote the projection operators onto and , respectively. Since is a non-empty closed convex set, its characteristic function is proper convex and lower semi-continuous. Owing to (Bauschke and Combettes, 2011), the Moreau envelope is convex differentiable with gradient reads
[TABLE]
which is -Lipschitz continuous. Clearly, (12) admits a “non-smooth + smooth” structure, and in literature one prevailing algorithm to apply is the proximal gradient method (Lions and Mercier, 1979), a.k.a. Forward–Backward splitting. In this paper, we consider an accelerated version of the method, see Algorithm 1, which is based on inertial technique.
Remark 2.1**.**
**
- •
If we choose and , Algorithm 1 becomes the Backward–Backward splitting, which is the method of alternating projections for the considered feasibility problem (8). Therefore, we recover the method from (Dutta et al., 2018a) as a special case.
- •
From (8) to (12), we can also consider the Moreau envelope of the non-convex set , that is
[TABLE]
which also works well in practice.
- •
Algorithm 1 is a special cases of the multi-step inertial proximal gradient descent method considered in (Liang et al., 2016) for general non-convex composite optimization.
Note that the two projection operators are very easy to compute. Given X=\Big{(}{{\footnotesize\begin{matrix}S\\ L\end{matrix}}}\Big{)}, since is an affine subspace, the projection of onto reads \mathrm{P}_{\mathcal{X}}(X)={{\frac{\raisebox{0.87495pt}{\small1}}{\raisebox{-1.45833pt}{\small2}}}}\Big{(}{{\footnotesize\begin{matrix}A+S-L\\ A-S+L\end{matrix}}}\Big{)}. If where is the binary mask defined in (6), then for the partial observed case, we have
[TABLE]
For the projection which contains a low-rank projection and sparsity projection, we refer to (Dutta et al., 2018a) for more details.
2.1 Global convergence
Since set is semi-algebraic (Bolte et al., 2010), our global convergence guarantees of Algorithm 1 is based on Kurdyka-Łojasiewicz property.
Kurdyka-Łojasiewicz property. Let be a proper lsc function. For such that , define the set
[TABLE]
Definition 2.2**.**
Function is said to have the Kurdyka-Łojasiewicz property at if there exists , a neighbourhood of and a continuous concave function such that
- (i)
, is on , and for all , ; 2. (ii)
for all , the Kurdyka-Łojasiewicz inequality holds
[TABLE]
Proper lsc functions which satisfy the Kurdyka-Łojasiewicz property at each point of are called KL functions.
KL functions include the class of semi-algebraic functions, see (Bolte et al., 2007, 2010). For instance, the pseudo-norm and the rank function are KL.
Global convergence. To deliver the convergence result, we rewrite (12) into the following generic form
[TABLE]
where we assume that
- (A.1)
is proper lower semi-continuous, and bounded from below; 2. (A.2)
is convex differentiable and its gradient is -Lipschitz continuous.
Let be a constant. Define the following quantities,
[TABLE]
Theorem 2.3** (Global convergence).**
For problem (15), assume \reftagform@A.1-\reftagform@A.2 hold, and that is a proper lsc KL function which is bounded from below. For Algorithm 1, choose such that
[TABLE]
Then each bounded sequence satisfies
- (i)
* has finite length, i.e. ;* 2. (ii)
There exists a critical point such that . 3. (iii)
If has the KL property at a global minimizer , then starting sufficiently close from , any sequence converges to a global minimum of and satisfies \reftagform@i.
The proof of the above theorem can be found in the supplementary material. We also refer to (Liang et al., 2016) and the reference therein for more results on non-convex proximal gradient method.
2.2 Local linear convergence
Now we turn to the local perspective and present a local linear convergence analysis for Algorithm 1. For the constraint set define in (8), consider the following decomposition of it
[TABLE]
For the sequence generated by (13), suppose Y_{k}=\Big{(}{{\footnotesize\begin{matrix}S_{k}\\ L_{k}\end{matrix}}}\Big{)}. It is immediate that holds for all . For , though it is always -sparse, the locations of non-zero elements change along the course of iteration. In the following, we first show that after a finite number of iterations the locations of non-zero elements of stop changing, that is will have the same support as that of to which converges, and then Algorithm 1 enters a linear convergence regime.
Support identification of . Let Y^{\star}=\Big{(}{{\footnotesize\begin{matrix}S^{\star}\\ L^{\star}\end{matrix}}}\Big{)} be a critical point of (12) to which converges. Let be the subspace extended by the support of . Clearly, and we have the result below concerning the relation between and .
Theorem 2.4** (Support identification).**
*For Algorithm 1, suppose Theorem 2.3 holds. Then converges to a critical point of (12). For all large enough, we have . *
Let be the point that converges to, the above result simply means that after finite number of iterations, holds for all large enough.
Local linear convergence. Given a critical point , let , we have
[TABLE]
Note that the first two sets, are (affine) subspaces, hence smooth, and is the set of fixed-rank matrices which is -smooth manifold (Lee, 2003). To derive the local linear rate, we need to utilize the smoothness of these sets. Let be a -smooth manifold and let the tangent space of at , we have the following lemma which is crucial for our local linear convergence analysis.
Lemma 2.5** ((Liang et al., 2014, Lemma 5.1)).**
Let be a -smooth manifold around . Then for any , where is a neighbourhood of , the projection operator is uniquely valued and around , and thus . If moreover, is an affine subspace, then .
Denote the tangent spaces of at as and , respectively. We refer to the supplementary material for detailed expressions of these tangent spaces. Denote and the projections onto the tangent spaces. Define the matrix \mathcal{P}\stackrel{{\scriptstyle{\mathrm{def}}}}{{=}}\mathrm{P}_{T_{\mathcal{Y}}^{Y^{\star}}}\big{(}{(1-\gamma)\mathrm{Id}+\gamma\mathrm{P}_{T_{\mathcal{X}}^{X^{\star}}}}\big{)}\mathrm{P}_{T_{\mathcal{Y}}^{Y^{\star}}}, and
[TABLE]
Denote the spectral radiuses of , respectively.
Theorem 2.6** (Local linear convergence).**
For Algorithm 1, suppose Theorem 2.4 holds. Then converges to a critical point of (12). Suppose , there exists a such that for all ,
[TABLE]
*Moreover, if , then so is , and for all large enough we have . *
Remark 2.7**.**
**
- •
If , then it can be shown that .
- •
Given , can be expressed explicitly in terms of and . For the case that and , we refer to (Liang, 2016, Chapter 6) for detailed discussion on the local linear convergence analysis.
An numerical illustration on our theoretical rate estimation and practical observation is provided in the supplementary material Section C-Figure 13.
3 Numerical experiments
In this section, we extensively tested our best-pair formulation on both real and synthetic data against a vast genre of PCP algorithms. The first set of algorithms that we tested against, e.g. iEALM and APG, determine the target rank and sparsity robustly from the given set of hyperparameters. On the other hand, for the second set of algorithms, e.g. RPCA gradient descent (RPCA GD), Go decomposition (GoDec), and RPCA nonconvex feasibility (RPCA NCF), the target rank and sparsity are user-inferred. Although our accelerated proximal gradient algorithm belongs to the second class, to show its effectiveness, we compare it with both classes of state-of-the-art robust PCP algorithms (see Table 2 in the supplementary material) on several computer vision applications—removal of shadows and specularities from face images, Background estimation or tracking from video sequences, and inlier detection from a grossly corrupted dataset (see Section A.3.1 in the supplementary material)222In all experiments, we use the approximate projection (Dutta et al., 2018a; Yi et al., 2016; Zhang and Yang, 2018) onto as the exact one is expensive:
If the sparsity constraint was defined only along rowsc (or only columns), the exact projection would be cheap. However, the approximate projection produces better results, thus we stick with it..
Results on synthetic data.
The primary goal of these set of experiments is to understand the behavior of our proposed method on some well-understood data and to test against some state-of-the-art algorithms. To construct our test matrix , for these experiments, we used the idea proposed by Wright et al. (Wright et al., 2009). First, we generate the low-rank matrix, , as a product of two independent full-rank matrices of size with such that elements are independent and identically distributed (i.i.d.) and sampled from a normal distribution—. We generate the sparse matrix, , such that its elements are chosen from the interval . We create the sparse support set by using the operator (2). Finally, we write as . We fix and define , where varies. We choose the sparsity level .
Phase transition experiments. For each pair of , we apply iEALM, RPCA NCF, and our algorithm to recover the pair . For iEALM, we set and use and , where is the spectral norm (maximum singular value) of . For a given , if the recovered matrix pair , satisfies the relative error then we consider the construction is viable. In Figure 1, we produce the phase transition diagrams to show the fraction of perfect recovery of , where white denotes success and black denotes failure. We run the experiments for 5 times and plot the results. The success of iEALM is approximately below the line On the other hand, we note that the performance of our best pair RPCA is almost similar to that of (Dutta et al., 2018a), when the sparsity level is small and both approaches can efficiently provide a feasible reconstruction for any in that case. We also note that for low sparsity level, iEALM can only provide a feasible reconstruction for . Due to their robustness to any low-rank structure when is low, RPCA NCF and best pair RPCA can be proved to be very effective in many real-world applications. In many real-world problems, involving the video/image data can ideally have any inherent low-rank structure and are generally corrupted by very sparse outliers of arbitrary large magnitudes. In those instances, RPCA NCF and our best pair RPCA could be very useful. We show more justification in the later section.
Root mean square error measure. To validate our performance against RPCA GD of Yi et al. (Yi et al., 2016), we use a different metric—root mean square error (RMSE). Since RPCA GD does not explicitly recover a sparse matrix, , it is unjustified to test it against the same relative error. Therefore, for the true low-rank, , and a low-rank recovery, , we use the metric as the measure of RMSE. From Figure 2, we can conclude that our best pair RPCA has less RMSE compare to that of RPCA GD. Moreover, the RMSE remains unaltered as the cardinality of support set, increases. Also, see Figure 9 in the Appendix.
Removal of shadows and specularities.
Set of images of an object under unknown pose and arbitrary lighting conditions, form a convex cone in the space of all possible images which may have unbounded dimension (Basri and Jacobs, 2003; Belhumeur and Kriegman, 1998). However, the images under distant, isotropic lighting can be approximated by a 9-dimensional linear subspace which is popularly referred to as the harmonic plane. We used three subjects B11,B12, and B13 from the Extended Yale Face Database (Georghiades et al., 2001) for our simulations. We used 63 downsampled images of resolution of of each subject. For APG and iEALM, we set the parameters the same as in the previous section. For RPCA GD, RPCA NCF, and our method, we set target rank and sparsity level to . The qualitative analysis on the recovered images from Figure 3 shows while RPCA GD recovers patchy and granular face images, our best pair reformulation provides comparable reconstruction to that of iEALM, APG, and RPCA NCF.
Background estimation from video sequences.
Background estimation or moving object tracking (Bouwmans et al., 2017; Dutta, 2016; Bouwmans et al., 2016; Dutta et al., 2017a; Bouwmans, 2014; Dutta et al., 2017b, 2018b; Dutta and Richtárik, 2019; Dutta and Li, 2017) is considered as one of the classic problems in computer vision and is used as a crucial component in human activity recognition, tracking, and video analysis from surveillance cameras. When the video is captured by a static camera, minimizing the rank of the matrix , that concatenates video frames (after converting them into vectors) represents the structure of the linear subspace, , that contains the background and an error, , that emphasizes the foreground components. However, the exact desired rank is often tuned empirically, as the ideal rank-one background is often unrealistic as the changing illumination, occluded foreground/background objects, reflection, and noise are typically also a part of the video frames. Based on the above observation, we note that the problem can be cast typically as (4). However, as we explained in some cases, when the target rank and the sparsity level is user-inferred hyperparameters, one might use a different approach as in (Zhou and Tao, 2011; Dutta et al., 2018a; Yi et al., 2016) as well. Additionally, there might be missing/unobserved pixels in the video and that makes the problem more complex and only a few methods, such as RPCA NCF, GRASTA (He et al., 2012), RPCA GD remedy to that. Therefore, we tested our best pair RPCA to a wide range of methods. In our experiments, we use two different video sequences: (i) the Basic sequence from Stuttgart synthetic dataset (Brutzer et al., 2011), (ii) the waving tree sequence (Toyama et al., 1999). We extensively use the Stuttgart video sequence as it is a challenging sequence that comprises both static and dynamic foreground objects and varying illumination in the background. Additionally, it comes with foreground ground truth for each frame. For iEALM and APG, we set the parameters the same as in the previous sections. For Best pair RPCA, RPCA GD, RPCA NCF, and GoDec, we set , target sparsity 10% and additionally, for GoDec, we set . For GRATSA, we set the parameters the same as those mentioned in the authors’ website (gra, 2012). The qualitative analysis on the background and foreground recovered on both, full observation (in Figure 4) and partial observation (in Figure 5), suggest that our method recovers a visually better quality background and foreground compare to the other methods. Note that, RPCA GD recovers a fragmentary foreground with more false positives compare to our method; moreover, RPCA GD, GRASTA, iEALM, and APG cannot remove the static foreground object. We provide a detailed quantitative evaluation of our best pair RPCA with respect to the -proximity metric– as in (Dutta et al., 2018a) and the mean structural similarity index measure (SSIM) by (Wang et al., 2004) in recovering the foreground objects in Figures 10 and 11 in Appendix.
Supplementary Material
The organization of this supplementary material is: extra supporting numerical experiments are reported in Section A; Proofs for the global convergence result of Algorithm 1 is provided in Section B; The proof of local linear convergence and a numerical example are provided in Section C. Lastly, we provide a comprehensive table to list all baselines we compare to in Section D.
Appendix A Extra Experiments
In this section, we empirically study convergence properties of Algorithm 1 on synthetic, well-understood data. In particular, we examine its sensitivity to user-specified parameters , target sparsity level , target rank and lastly the sensitivity to initialization. Moreover, we provide extra phase transition diagrams and both quantitative and qualitative results on the inlier detection problem.
A.1 Sensitivity to the choice of
In this experiment, we compare different choices of algorithm parameters on instances of (9) with various target sparsity level and target rank . In each experiment, we make sure that the solution exists; we generate random matrices (with independent entries ), project them onto low rank and sparse constraint set respectively to obtain and set . For simplicity we consider only and . Figure 6 shows the result. We see that parameter choice is the most reliable.
A.2 Sensitivity to the choice of
In this experiment, we examine how sensitive is Algorithm 1 on the correct choice of the target sparsity level and the target rank .
In each experiment, we generate random matrices (with independent entries ), project them onto -low rank and -sparse constraint set respectively to obtain and set . Then, we run Algorithm 1 with various choices of and report the results. For simplicity we consider only (from the previous experiment) and . Figure 7 shows the result. We can see that if sparsity level is underestimated, the method converges very slowly. Moreover, the method is more sensitive to the correct choices of target sparsity than target rank. The last take-away from this experiment is that over-estimation of target parameters usually leads to slightly slower convergence.
A.3 Sensitivity to the choice of the starting point
In the last experiment, we examine how the starting point influences the convergence rate. For each problem instance, we perform 50 independent runs of Algorithm 1 and report the best, worst and median performance.
For simplicity, we consider only problems with known target rank and sparsity – we generate random matrices (with independent entries ), project them onto low rank and sparse constraint set respectively to obtain and set . Further, we set , and . Figure 8 shows the result. We can see that the convergence speed of Algorithm 1 is, in most cases, not influenced significantly by the starting point. Thus, the non-convex nature of the problem is surprisingly not causing any issues. Lastly, the convergence rate of Algorithm 1 is faster for small values of , which is often the most interesting case in terms of the practical application.
A.3.1 Inlier detection
Historically, PCA and RPCA are used in detecting the inliers and the outliers from a composite dataset. We infused 400 random, grayscale, downsampled ( pixels) natural images from the BACKGROUND/Google folder of the Caltech101 database (Fei-Fei et al., 2007) with the Yale Extended Face Database to construct the data set. The inliers are the grayscale images of faces (of the same resolution) under different illuminations while the 400 random natural images serve as outliers. The goal is to consider a low-dimensional model and to project the inliers to a 9-dimensional linear subspace where the images of the same face lie. Goes et al. in (Goes et al., 2014) designed seven algorithms to explicitly find a low-rank subspace. To this end, Goes et al. used the classical SGD, an incremental approach, and mirror descent algorithms to find the 9-dimensional subspace. However, we split the dataset, , into a 9-dimensional low-rank subspace and expect the outliers to be in the sparse set, . Once we find , we find the basis of via orthogonalization and project the faces on it. In Figure 12, we show the qualitative results of our experiments333The codes and datasets for experiments in Section A.3.1 are obtained from https://github.com/jwgoes/RSPCA.
As proposed in (Goes et al., 2014), we use the normalized error term , where is subspace fitted by the PCA to the set of inliers and be the subspace fitted by different algorithms. Note that, the metric is expected to lie between 0 and 1 where the smaller is the better. We refer to Table 1 for our quantitative results.
Appendix B Proof of the global convergence
For convenience, define .
Lemma B.1**.**
For the update of in (13), given any , define
[TABLE]
Then, we have , and
[TABLE]
- Proof.
From the definition of proximity operator and the update of (13), we have . Adding to both sides, we obtain
[TABLE]
Applying further the triangle inequality together with the Lipschitz continuity of , we get
[TABLE]
which concludes the proof. ∎
Lemma B.2**.**
For Algorithm 1, given the parameters , the following inequality holds:
[TABLE]
- Proof.
Define the function
[TABLE]
It can be shown that the update of in (13) is equivalent to
[TABLE]
which means that , which means
[TABLE]
Therefore, we get
[TABLE]
Since is -Lipschitz, then
[TABLE]
For the inner product , applying the Pythagorean relation , we get
[TABLE]
Using further Young’s inequality with we obtain
[TABLE]
Combining the above inequalities with (21) yields
[TABLE]
which leads to
[TABLE]
Owing to the definition of and we conclude the proof. ∎
Define the product space and . Then given , define the function
[TABLE]
which is is a KL function if is. Denote the set of cluster points of sequences and respectively, and .
Lemma B.3**.**
For Algorithm 1, choose such that (17) holds. If is bounded from below, then
- (i)
; 2. (ii)
The sequence is monotonically decreasing and convergent; 3. (iii)
The sequence is convergent.
- Proof.
Define , from Lemma B.2, we have
[TABLE]
Let and the above inequality over :
[TABLE]
which means, as is bounded,
[TABLE]
From Lemma B.2, by pairing terms on both sides of (19), we get
[TABLE]
Since we assume , hence is monotonically non-increasing. The convergence of is straightforward. ∎
Lemma B.4**.**
For Algorithm 1, choose such that (17) holds. If is bounded from below and is bounded, then converges to a critical point of .
- Proof.
Since is bounded, there exists a subsequence and cluster point such that as . Next we show that and that is a critical point of .
Since is lsc, then . From (20), we have and thus
[TABLE]
Taking the limit of the above inequality and using , , we get . As a result, . Since is continuous, then , hence .
Furthermore, owing to Lemma B.1, , and (i) of Lemma B.3 we have as . Therefore, as , we have
[TABLE]
Hence , i.e. is a critical point. ∎
- Proof of Theorem 2.3.
Putting together the above lemmas, we draw the following useful conclusions:
- (C.1)
Denote , then ; 2. (C.2)
Define
[TABLE]
then we have . Owing to Lemma B.1, there exists a such that ; 3. (C.3)
if is a subsequence such that , then where . 4. (C.4)
; 5. (C.5)
; 6. (C.6)
is non-empty, compact and connected; 7. (C.7)
is finite and constant on .
Next we prove the claims of Theorem 2.3.
- (i)
Consider a critical point of , , such that . Then owing to \reftagform@C.3, we have .
Suppose there exists such that . Then, the descent property \reftagform@C.1 implies that holds for all . Thus, is constant for , hence has finite length.
On the other hand, suppose that . Owing to \reftagform@C.6, \reftagform@C.7 and Definition 2.2, the KL property of implies that there exist and a concave function , and
[TABLE]
such that for all :
[TABLE]
Let be such that holds for all . Owing to \reftagform@C.5, there exists another such that holds for all . Let . Then holds for all . Furthermore using (26), we have for
[TABLE]
Note that since is concave, is decreasing. As is decreasing too, we have
[TABLE]
From \reftagform@C.1, since , we get
[TABLE]
Moreover, \reftagform@C.2 yields and thus
[TABLE]
which yields
[TABLE]
Taking the square root of both sides and applying Young’s inequality we further obtain
[TABLE]
Summing up both sides over , and using , we get
[TABLE]
which concludes the finite length property of . 2. (ii)
Then the convergence of the sequence follows from the fact that is a Cauchy sequence, hence convergent. Owing to Lemma B.4, there exists a critical point such that . 3. (iii)
We now turn to prove local convergence to a global minimizer. Note that if is a global minimizer of , then is a global minimizer of . Let such that and . Suppose that the initial point is chosen such that following conditions hold,
[TABLE]
The descent property \reftagform@C.1 of together with (29) imply that for any , , and
[TABLE]
Therefore, given any , if we have , then
[TABLE]
which means that .
For any , define the following partial sum . Note that for , and . Next we prove the following claims through induction: for
[TABLE]
From (31) we have
[TABLE]
Applying the triangle inequality we then obtain
[TABLE]
which means . Now, taking in (28) yields, for any ,
[TABLE]
Let . Since , we have
[TABLE]
Therefore, (33) and (34) hold for .
Now assume that they hold for some . Using the triangle inequality and (34),
[TABLE]
As and for , and in view of (30), we arrive at
[TABLE]
whence we deduce that (33) holds at . Now, taking (36) at gives
[TABLE]
Adding both sides of (37) and (34) we get
[TABLE]
meaning that (34) holds at . This concludes the induction proof.
In summary, the above result shows that if we start close enough from (so that (29)-(30) hold), then the sequence will remain in the neighbourhood and thus converges to a critical point owing to Lemma B.4. Moreover, by virtue of \reftagform@C.3. Now we need to show that . Suppose that . As has the KL property at , we have
[TABLE]
But this is impossible since for , and \mathrm{dist}\big{(}{0,\partial\Psi(\overline{Z})}\big{)}=0 as is a critical point. Hence we have , which means , i.e. the cluster point is actually a global minimizer. This concludes the proof. ∎
Appendix C Proof of local linear convergence
Before presenting the proof for local linear convergence, in Figure 13 below we provide the comparison of theoretical estimation and practical observation. The size of the problem is , which is small as larger size will make the rate estimation very slow. It can be observed that our theoretical rate estimation is very tight given that the red line and the black one are parallel to each other.
Since we are in the non-convex setting, we need the prox-regularity of the non-convexity. A lower semi-continuous function is -prox-regular at for if such that near , near and near .
To prove Theorem 2.4, we rely on a so-called partial smoothness concept. Let be a -smooth submanifold, let the tangent space of at any point .
Definition C.1**.**
The function is -partly smooth at relative to for if is a -submanifold around , and
- (i)
(Smoothness): restricted to is around ; 2. (ii)
(Regularity): is regular at all near and is -prox-regular at for ; 3. (iii)
(Sharpness): ; 4. (iv)
(Continuity): The set-valued mapping is continuous at relative to .
We denote the class of partly smooth functions at relative to for as . Partial smoothness was first introduced in (Lewis, 2003) and its directional version stated here is due to (Lewis and Zhang, 2013; Drusvyatskiy and Lewis, 2013). Prox-regularity is sufficient to ensure that the partly smooth submanifolds are locally unique (Lewis and Zhang, 2013, Corollary 4.12), (Drusvyatskiy and Lewis, 2013, Lemma 2.3 and Proposition 10.12).
- Proof of Theorem 2.4.
First we have
- –
is a the set of fixed-rank matrices, hence it is partly smooth.
- –
Since is a subspace, hence it is partly smooth at relative to any .
Under the conditions of Theorem 2.3, there exists a critical point such that and .
Convergence properties of (Theorem 2.3) entails and . In turn,
[TABLE]
Altogether, this shows that the conditions of (Lewis and Zhang, 2013, Theorem 4.10) or (Drusvyatskiy and Lewis, 2013, Proposition 10.12) are fulfilled on at for , and the identification result follows, that is
[TABLE]
for all large enough, and we conclude the proof. ∎
Tangent space
Given , the tangent space simply reads . Let be the kernel of , then we have the projection operator onto reads
[TABLE]
Tangent space of
Let be the space of matrices with the classical inner product . The set of matrices with fixed rank ,
[TABLE]
is a smooth manifold around any matrix . Given , with the help of the singular value decomposition , the tagent space at to is
[TABLE]
Let , and be diagonal matrix with singular value written in decreasing order.
Denote
[TABLE]
then forms the basis of and , there for define
[TABLE]
and
[TABLE]
then is the explicit form of the projection operator of projecting onto subspace .
Tangent space of
Given , denote the tangent space as . Let be the vector form of , then we haves
[TABLE]
Finally, we have
[TABLE]
- Proof of Theorem 2.6.
From (13), when , we have thats
[TABLE]
Let be a critical point that converges to, then
[TABLE]
Denote and , we have
[TABLE]
Consider the difference of the above two equations, owing to Lemma 2.5, we get
[TABLE]
which means
[TABLE]
Note that is symmetric positive semi-definite, hence all its eigenvalues are real and lie in .
Now, assume that , then we have from (13)
[TABLE]
Follow the derivation of above, we get
[TABLE]
Plus the definition of and the fact that , we obtain
[TABLE]
Owing to (Liang, 2016, Chapter 6), if , then so is , and the linear convergence result follows. ∎
Appendix D Table of baseline methods
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1gra (2012) 2012. https://sites.google.com/site/hejunzz/grasta.
- 2Basri and Jacobs (2003) R. Basri and D. Jacobs. Lambertian reflection and linear subspaces. IEEE Transaction on Pattern Analysis and Machine Intelligence , 25(3):218–233, 2003.
- 3Bauschke and Combettes (2011) H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces . Springer, 2011.
- 4Belhumeur and Kriegman (1998) P. N. Belhumeur and D. J. Kriegman. What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision , 28(3):245–260, 1998.
- 5Bolte et al. (2007) J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization , 17(4):1205–1223, 2007.
- 6Bolte et al. (2010) J. Bolte, A. Daniilidis, O. Ley, and L. Mazet. Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society , 362(6):3319–3363, 2010.
- 7Bouwmans (2014) T. Bouwmans. Traditional and recent approaches in background modeforeground detection: An overview. Computer Science Review , 11–12:31 – 66, 2014.
- 8Bouwmans and Zahzah (2014) T. Bouwmans and E.-H. Zahzah. Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance. Computer Vision and Image Understanding , 122:22–34, 2014.
