A Dictionary-Based Generalization of Robust PCA with Applications to Target Localization in Hyperspectral Imaging

Sirisha Rambhatla; Xingguo Li; Jineng Ren; and Jarvis Haupt

arXiv:1902.08304·cs.LG·July 1, 2020

TL;DR

This paper introduces a convex demixing method for decomposing data matrices into low-rank and dictionary-sparse components, enabling effective target localization in hyperspectral images by leveraging spectral signatures.

Contribution

It presents a unified theoretical framework for dictionary-based robust PCA accommodating both undercomplete and overcomplete dictionaries, with analysis of recovery conditions.

Findings

01. Successful recovery of constituent matrices under mild conditions.
02. Effective target localization in hyperspectral imaging using spectral signatures.
03. Experimental validation demonstrating the approach's advantages.

Tables

Table 1. TABLE I: Entry-wise sparsity model for the Indian Pines Dataset. Simulation results are presented for our proposed approach (D-RPCA(E)), robust-PCA based approach on transformed data ${\mathbf{D}}^{\dagger}{\mathbf{M}}$ (RPCA$^{\dagger}$), matched filtering (MF) on original data ${\mathbf{M}}$, and matched filtering on transformed data ${\mathbf{D}}^{\dagger}{\mathbf{M}}$ (MF$^{\dagger}$), across dictionary elements $d$, and the regularization parameter for initial dictionary learning procedure $\rho$; see (31). Threshold selects columns with column-norm greater than threshold such that AUC is maximized. For each case, the best performing metrics are reported in bold for readability. Further, “$*$” denotes the case where ROC curve was “flipped” (i.e. classifier output was inverted to achieve the best performance).

Table 2. (a) Learned dictionary, $d=4$

$d$	$\rho$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
4	0.01	D-RPCA(E)	0.300	0.979	0.023	0.989
		RPCA^†	0.650	0.957	0.049	0.974
		MF_∗	N/A	0.957	0.036	0.994
		MF ${}^{†}_{*}$	N/A	0.914	0.104	0.946
	0.1	D-RPCA(E)	0.800	0.989	0.017	0.997
		RPCA^†	0.800	0.989	0.014	0.997
		MF	N/A	0.989	0.016	0.998
		MF^†	N/A	0.989	0.010	0.998
	0.5	D-RPCA(E)	0.600	0.968	0.031	0.991
		RPCA^†	0.600	0.935	0.067	0.988
		MF	N/A	0.548	0.474	0.555
		MF ${}^{†}_{*}$	N/A	0.849	0.119	0.939

Table 3. (b) Learned dictionary, $d=10$

$d$	$\rho$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
10	0.01	D-RPCA(E)	0.600	0.935	0.060	0.972
		RPCA^†	0.700	0.978	0.023	0.990
		MF_∗	N/A	0.624	0.415	0.681
		MF ${}_{*}^{†}$	N/A	0.569	0.421	0.619
	0.1	D-RPCA(E)	0.500	0.968	0.029	0.993
		RPCA^†	0.500	0.871	0.144	0.961
		MF_∗	N/A	0.688	0.302	0.713
		MF^†	N/A	0.527	0.469	0.523
	0.5	D-RPCA(E)	1.000	0.978	0.031	0.996
		RPCA^†	2.200	0.849	0.113	0.908
		MF	N/A	0.807	0.309	0.781
		MF ${}_{*}^{†}$	N/A	0.527	0.465	0.539

Table 4. (c) Dictionary by sampling voxels, $d=15$

$d$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
15	D-RPCA(E)	0.300	0.989	0.021	0.998
	RPCA^†	3.000	0.849	0.146	0.900
	MF	N/A	0.957	0.085	0.978
	MF^†	N/A	0.796	0.217	0.857

Table 5. (d) Average performance

Method	TPR Mean	TPR St.Dev.	FPR Mean	FPR St.Dev.	AUC Mean	AUC St.Dev.
D-RPCA(E)	0.972	0.019	0.030	0.014	0.991	0.009
RPCA^†	0.919	0.061	0.079	0.055	0.959	0.040
MF	0.796	0.179	0.234	0.187	0.814	0.178
MF^†	0.739	0.195	0.258	0.192	0.775	0.207

Table 6. TABLE II: Column-wise sparsity model and Indian Pines Dataset. Simulation results are presented for the proposed approach (D-RPCA(C)), Outlier Pursuit (OP) based approach on transformed data (OP$^{\dagger}$), matched filtering (MF) on original data ${\mathbf{M}}$, and matched filtering on transformed data ${\mathbf{D}}^{\dagger}{\mathbf{M}}$ (MF$^{\dagger}$), across dictionary elements $d$, and the regularization parameter for initial dictionary learning step $\rho$; see (31). Threshold selects columns with column-norm greater than threshold such that AUC is maximized. For each case, the best performing metrics are reported in bold for readability. Further, “$*$” denotes the case where ROC curve was “flipped” (i.e. classifier output was inverted to achieve the best performance).

Table 7. (a) Learned dictionary, $d=4$

$d$	$\rho$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
4	0.01	D-RPCA(C)	0.905	0.989	0.014	0.998
		OP^†	0.895	0.989	0.015	0.998
		MF_∗	N/A	0.656	0.376	0.611
		MF ${}^{†}_{*}$	N/A	0.624	0.373	0.639
	0.1	D-RPCA(C)	0.805	0.989	0.013	0.998
		OP ${}_{*}^{†}$	1.100	0.720	0.349	0.682
		MF_∗	N/A	0.742	0.256	0.780
		MF^†	N/A	0.828	0.173	0.905
	0.5	D-RPCA(C)	1.800	0.989	0.010	0.998
		OP^†	1.300	0.989	0.012	0.998
		MF	N/A	0.548	0.474	0.556
		MF ${}^{†}_{*}$	N/A	0.849	0.146	0.939

Table 8. (b) Learned dictionary, $d=10$

$d$	$\rho$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
10	0.01	D-RPCA(C)	0.800	0.946	0.016	0.993
		OP^†	1.300	0.946	0.060	0.988
		MF_∗	N/A	0.946	0.060	0.987
		MF ${}_{*}^{†}$	N/A	0.527	0.468	0.511
	0.1	D-RPCA(C)	0.550	0.979	0.029	0.997
		OP^†	0.800	0.893	0.112	0.928
		MF_∗	N/A	0.688	0.302	0.714
		MF^†	N/A	0.527	0.470	0.523
	0.5	D-RPCA(C)	1.400	0.989	0.037	0.997
		OP^†	0.800	0.807	0.148	0.847
		MF	N/A	0.807	0.309	0.781
		MF ${}_{*}^{†}$	N/A	0.527	0.468	0.539

Table 9. (c) Dictionary by sampling voxels, $d=15$

$d$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
15	D-RPCA(C)	0.800	0.989	0.018	0.998
	OP^†	2.200	0.882	0.126	0.900
	MF	N/A	0.957	0.085	0.978
	MF^†	N/A	0.796	0.217	0.857

Table 10. (d) Average performance

Method	TPR Mean	TPR St.Dev.	FPR Mean	FPR St.Dev.	AUC Mean	AUC St.Dev.
D-RPCA(C)	0.981	0.016	0.020	0.010	0.997	0.002
OP^†	0.889	0.099	0.117	0.115	0.906	0.114
MF	0.763	0.151	0.266	0.149	0.772	0.166
MF^†	0.668	0.151	0.331	0.148	0.702	0.192

Table 11. TABLE III: Entry-wise sparsity model and Pavia University Dataset. Simulation results are presented for the proposed approach (D-RPCA(E)), robust-PCA based approach on transformed data (RPCA$^{\dagger}$), matched filtering (MF) on original data ${\mathbf{M}}$, and matched filtering on transformed data ${\mathbf{D}}^{\dagger}{\mathbf{M}}$ (MF$^{\dagger}$), across dictionary elements $d$, and the regularization parameter for initial dictionary learning step $\rho$; see (31). Threshold selects columns with column-norm greater than threshold such that AUC is maximized. For each case, the best performing metrics are reported in bold for readability. Further, “$*$” denotes the case where ROC curve was “flipped” (i.e. classifier output was inverted to achieve the best performance).

Table 12. (a) Learned dictionary, $d=30$

$d$	$\rho$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
30	0.01	D-RPCA(E)	0.150	0.989	0.015	0.992
		RPCA^†	0.700	0.849	0.146	0.925
		MF	N/A	0.929	0.073	0.962
		MF^†	N/A	0.502	0.498	0.498
	0.1	D-RPCA(E)	0.050	0.982	0.019	0.992
		RPCA^†	3.000	0.638	0.374	0.664
		MF	N/A	0.979	0.053	0.986
		MF^†	N/A	0.620	0.381	0.660
	0.5	D-RPCA(E)	0.080	0.982	0.019	0.992
		RPCA^†	2.500	0.635	0.381	0.671
		MF	N/A	0.980	0.159	0.993
		MF ${}^{†}_{*}$	N/A	0.555	0.447	0.442

Table 13. (b) Dictionary by sampling voxels, $d=60$

$d$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
60	D-RPCA(E)	0.060	0.986	0.016	0.995
	RPCA^†	1.000	0.799	0.279	0.793
	MF	N/A	0.980	0.011	0.994
	MF^†	N/A	0.644	0.355	0.700

Table 14. TABLE IV: Column-wise sparsity model and Pavia University Dataset. Simulation results for the proposed approach (D-RPCA(C)), Outlier Pursuit (OP) based approach (OP$^{\dagger}$), matched filtering (MF) on original data ${\mathbf{M}}$, and matched filtering on transformed data ${\mathbf{D}}^{\dagger}{\mathbf{M}}$ (MF$^{\dagger}$), across dictionary elements $d$, and the regularization parameter for initial dictionary learning step $\rho$; see (31). Threshold selects columns with column-norm greater than threshold such that AUC is maximized. For each case, the best performing metrics are reported in bold for readability. Further, “$*$” denotes the case where ROC curve was “flipped” (i.e. classifier output was inverted to achieve the best performance).

Table 15. (a) Learned dictionary, $d=30$

$d$	$\rho$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
30	0.01	D-RPCA(C)	0.065	0.990	0.015	0.991
		OP^†	0.800	0.7581	0.3473	0.705
		MF	N/A	0.929	0.073	0.962
		MF^†	N/A	0.502	0.50	0.498
	0.1	D-RPCA(C)	0.070	0.996	0.022	0.994
		OP^†	0.100	0.989	0.3312	0.904
		MF	N/A	0.979	0.053	0.986
		MF^†	N/A	0.62	0.3814	0.66
	0.5	D-RPCA(C)	0.035	0.983	0.017	0.995
		OP^†	0.200	0.940	0.264	0.887
		MF	N/A	0.980	0.160	0.993
		MF ${}^{†}_{*}$	N/A	0.555	0.447	0.442

Table 16. (b) Dictionary by sampling voxels, $d=60$

$d$	Method	Threshold	TPR (best operating point)	FPR (best operating point)	AUC
60	D-RPCA(C)	0.020	0.993	0.022	0.994
	OP^†	0.250	0.963	0.264	0.907
	MF	N/A	0.980	0.011	0.994
	MF^†	N/A	0.644	0.355	0.700

Table 17. TABLE V : Summary of important notation and parameters

Matrices
$𝐌 \in ℝ^{n \times m}$	The data matrix
$𝐋 \in ℝ^{n \times m}$	The low-rank matrix with rank- $r$ and singular value decomposition $𝐋 = 𝐔 𝚺 𝐕^{⊤}$
$𝐃 \in ℝ^{n \times d}$	The known dictionary either thin ( $d \leq n$ ) or fat ( $d > n$ )
$𝐒 \in ℝ^{d \times m}$	The sparse component with the following properties –(1) in case of entry-wise sparsity: $s_{e}$ non-zero entries and when $d > n$ has at most $k$ non-zeros per column, and (2) in case of column-wise sparsity: $s_{c}$ non-zero columns
Regularization Parameters
$λ_{e} \in ℝ$	The regularization parameter for the entry-wise sparsity case
$λ_{c} \in ℝ$	The regularization parameter for the column sparsity case
Subspaces
$ℒ$	The set of matrices which span the same column or row space as $𝐋$, i.e., $ℒ := \{𝐔𝐖_{1}^{⊤} + 𝐖_{2}𝐕^{⊤} : 𝐖_{1} \in ℝ^{m \times r}, 𝐖_{2} \in ℝ^{n \times r}, 𝐖_{1} \neq 0 \text{ or } 𝐖_{2} \neq 0\}$.
$𝒮_{e}$	The set of matrices with the same support as $𝐒$ (for the entry-wise sparse case).
$𝒮_{c}$	The set of matrices with the same column support as $𝐒$ (for the column-wise sparse case).
$𝒟$	The set of matrices whose columns span the subspace spanned by columns of $𝐃$, i.e., $𝒟 := \{𝐙 = 𝐃𝐇 : 𝐇 \in 𝒮_{e} \text{ or } 𝐇 \in 𝒮_{c}\}$
$𝒰$	The column space of $𝐋$
$𝒱$	The row space of $𝐋$
Index Sets
$ℐ_{𝒮_{e}}$	Support of matrix $𝐒$ (entry-wise case)
$ℐ_{𝒮_{c}}$	Column support of matrix $𝐒$ (the outliers)
$ℐ_{𝐋}$	Index set of the inliers (column-wise case)
Projection
$𝒫_{𝒢} (\cdot)$	Projection operator corresponding to any subspace $𝒢$
$𝐏_{𝐆}$	Projection matrix corresponding to the operator $𝒫_{𝒢} (\cdot)$
Parameters for analysis
$μ$	The incoherence parameter between the low-rank component and the dictionary, defined as $μ := \max_{𝐙 \in 𝒟 \setminus \{𝟎_{d \times m}\}} \frac{‖𝒫_{ℒ}(𝐙)‖_{F}}{‖𝐙‖_{F}}$
$γ_{𝐕}$	Defined as $γ_{𝐕} := \max_{i} ‖𝐏_{𝐕}𝐞_{i}‖^{2}$
$γ_{𝐔}$	Defined as $γ_{𝐔} := \max_{i} \frac{‖𝐏_{𝐔}𝐃𝐞_{i}‖^{2}}{‖𝐃𝐞_{i}‖^{2}}$
$β_{𝐔}$	Defined as $β_{𝐔} := \max_{‖𝐮‖ = 1} \frac{‖(𝐈 - 𝐏_{𝐔})𝐃𝐮‖^{2}}{‖𝐃𝐮‖^{2}}$
$ξ_{e}$	Defined as $ξ_{e} := ‖𝐃^{⊤}𝐔𝐕^{⊤}‖_{\infty}$
$ξ_{c}$	Defined as $ξ_{c} := ‖𝐃^{⊤}𝐔𝐕^{⊤}‖_{\infty, 2}$
$α_{ℓ}$	Lower generalized frame bound
$α_{u}$	Upper generalized frame bound
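
The analysis parameters in the table above can be evaluated numerically for a given pair of low-rank component and dictionary. The following is a minimal NumPy sketch, not the authors' code. It assumes a thin, full-column-rank dictionary, takes the frame bounds over all vectors (the squared extreme singular values of the dictionary), and reads the $\infty,2$ norm as the largest column $\ell_{2}$ norm. The function name `analysis_parameters` and the problem sizes are hypothetical.

```python
import numpy as np

def analysis_parameters(L, D, r):
    """Evaluate gamma_V, gamma_U, xi_e, xi_c and frame bounds (thin D assumed)."""
    # Rank-r SVD of the low-rank part: L = U diag(s) V^T.
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T

    # gamma_V = max_i ||P_V e_i||^2, i.e. the largest squared row norm of V.
    gamma_V = np.max(np.sum(V**2, axis=1))

    # gamma_U = max_i ||P_U D e_i||^2 / ||D e_i||^2 over dictionary columns.
    # Since U has orthonormal columns, ||P_U x|| = ||U^T x||.
    PU_D = U.T @ D
    gamma_U = np.max(np.sum(PU_D**2, axis=0) / np.sum(D**2, axis=0))

    # xi_e: largest absolute entry of D^T U V^T.
    # xi_c: largest column l2 norm of D^T U V^T (assumed meaning of the inf,2 norm).
    G = D.T @ U @ V.T
    xi_e = np.max(np.abs(G))
    xi_c = np.max(np.linalg.norm(G, axis=0))

    # Frame bounds over all v: alpha_l ||v||^2 <= ||D v||^2 <= alpha_u ||v||^2.
    sv = np.linalg.svd(D, compute_uv=False)
    return {"gamma_V": gamma_V, "gamma_U": gamma_U, "xi_e": xi_e,
            "xi_c": xi_c, "alpha_l": sv[-1]**2, "alpha_u": sv[0]**2}

# Hypothetical example sizes.
rng = np.random.default_rng(0)
n, m, d, r = 20, 30, 5, 3
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
D = rng.standard_normal((n, d))
params = analysis_parameters(L, D, r)
```

Small values of `gamma_U` and `xi_e` indicate that the dictionary columns are close to orthogonal to the column space of the low-rank part, which is the regime the incoherence conditions favor.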

Equations

$\mathbf{M} = \mathbf{L} + \mathbf{D}\mathbf{S},$

$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_{*} + \lambda_{e}\|\mathbf{S}\|_{1} \quad \text{s.t.} \quad \mathbf{M} = \mathbf{L} + \mathbf{D}\mathbf{S},$

$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_{*} + \lambda_{c}\|\mathbf{S}\|_{1,2} \quad \text{s.t.} \quad \mathbf{M} = \mathbf{L} + \mathbf{D}\mathbf{S},$

$\mathbf{D}^{\dagger}\mathbf{M} = \mathbf{D}^{\dagger}\mathbf{L} + \mathbf{S}.$

$\alpha_{\ell}\|\mathbf{v}\|_{2}^{2} \leq \|\mathbf{D}\mathbf{v}\|_{2}^{2} \leq \alpha_{u}\|\mathbf{v}\|_{2}^{2},$

$\mathbf{L} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top},$

$\mathcal{L} := \{\mathbf{U}\mathbf{W}_{1}^{\top} + \mathbf{W}_{2}\mathbf{V}^{\top},\ \mathbf{W}_{1} \in \mathbb{R}^{m \times r},\ \mathbf{W}_{2} \in \mathbb{R}^{n \times r}\}.$

$\mathcal{D} := \{\mathbf{D}\mathbf{H}\},$

$\mathcal{P}_{\mathcal{U}}(\mathbf{X}) = \mathbf{P}_{\mathbf{U}}\mathbf{X}$ and $\mathcal{P}_{\mathcal{V}}(\mathbf{X}) = \mathbf{X}\mathbf{P}_{\mathbf{V}},$

$\mathcal{P}_{\mathcal{L}}(\mathbf{X}) = \mathbf{P}_{\mathbf{U}}\mathbf{X} + \mathbf{X}\mathbf{P}_{\mathbf{V}} - \mathbf{P}_{\mathbf{U}}\mathbf{X}\mathbf{P}_{\mathbf{V}},$

$\mathcal{P}_{\mathcal{L}^{\perp}}(\mathbf{X}) = (\mathbf{I} - \mathbf{P}_{\mathbf{U}})\mathbf{X}(\mathbf{I} - \mathbf{P}_{\mathbf{V}}).$

$\mu := \max_{\mathbf{Z} \in \mathcal{D} \setminus \{\mathbf{0}\}} \frac{\|\mathcal{P}_{\mathcal{L}}(\mathbf{Z})\|_{F}}{\|\mathbf{Z}\|_{F}}.$

$\beta_{\mathbf{U}} := \max_{\|\mathbf{u}\| = 1} \frac{\|(\mathbf{I} - \mathbf{P}_{\mathbf{U}})\mathbf{D}\mathbf{u}\|^{2}}{\|\mathbf{D}\mathbf{u}\|^{2}},$

(a) $\gamma_{\mathbf{U}} := \max_{i} \frac{\|\mathbf{P}_{\mathbf{U}}\mathbf{D}\mathbf{e}_{i}\|^{2}}{\|\mathbf{D}\mathbf{e}_{i}\|^{2}}$ and (b) $\gamma_{\mathbf{V}} := \max_{i} \|\mathbf{P}_{\mathbf{V}}\mathbf{e}_{i}\|^{2}.$

$\xi_{e} := \|\mathbf{D}^{\top}\mathbf{U}\mathbf{V}^{\top}\|_{\infty}$ and $\xi_{c} := \|\mathbf{D}^{\top}\mathbf{U}\mathbf{V}^{\top}\|_{\infty,2}$

$\lambda_{e}^{\min} := \frac{1 + C_{e}}{1 - C_{e}}\,\xi_{e}$ and $\lambda_{e}^{\max} := \frac{\alpha_{\ell}(1 - \mu) - r\alpha_{u}\mu}{s_{e}},$

$C_{e} := \frac{c}{\alpha_{\ell}(1 - \mu)^{2} - c},$

$c := \begin{cases} c_{t} = \frac{\alpha_{u}\left((1 + 2\gamma_{\mathbf{U}})(\min(s_{e},d) + s_{e}\gamma_{\mathbf{V}}) + 2\gamma_{\mathbf{V}}\min(s_{e},m)\right)}{2} - \frac{\alpha_{\ell}\left(\min(s_{e},d) + s_{e}\gamma_{\mathbf{V}}\right)}{2}, & \text{for } d \leq n, \\ c_{f} = \frac{\alpha_{u}\left((1 + 2\gamma_{\mathbf{U}})(k + s_{e}\gamma_{\mathbf{V}}) + 2\gamma_{\mathbf{V}}\min(s_{e},m)\right)}{2} - \frac{\alpha_{\ell}\left(k + s_{e}\gamma_{\mathbf{V}}\right)}{2}, & \text{for } d > n \end{cases}$

$\gamma_{\mathbf{U}} \leq \begin{cases} \frac{(1 - \mu)^{2} - 2s_{e}\gamma_{\mathbf{V}}}{2s_{e}(1 + \gamma_{\mathbf{V}})}, & \text{for } s_{e} \leq \min(d, s_{e}^{\max}), \\ \frac{(1 - \mu)^{2} - 2s_{e}\gamma_{\mathbf{V}}}{2(d + s_{e}\gamma_{\mathbf{V}})}, & \text{for } d < s_{e} \leq s_{e}^{\max}; \end{cases}$

$\gamma_{\mathbf{U}} \leq \frac{(1 - \mu)^{2} - 2s_{e}\gamma_{\mathbf{V}}}{2(k + s_{e}\gamma_{\mathbf{V}})}.$

r <

$\lambda_{c}^{\min} := \frac{\xi_{c} + r s_{c}\alpha_{u}\mu C_{c}}{1 - s_{c}C_{c}}$ and $\lambda_{c}^{\max} := \frac{\alpha_{\ell}(1 - \mu) - r\alpha_{u}\mu}{s_{c}}.$

$r < \left(\frac{\alpha_{\ell}}{\alpha_{u}}\,\frac{1 - \mu}{\mu} - \frac{\xi_{c}}{\alpha_{u}\mu}\,s_{c}\right)^{2}.$

$\boldsymbol{\Gamma} = \mathbf{U}\mathbf{V}^{\top} + (\mathbf{I} - \mathbf{P}_{\mathbf{U}})\mathbf{X}(\mathbf{I} - \mathbf{P}_{\mathbf{V}}),$

$\mathcal{P}_{\mathcal{S}_{e}}(\mathbf{D}^{\top}\mathbf{U}\mathbf{V}^{\top}) + \mathcal{P}_{\mathcal{S}_{e}}(\mathbf{D}^{\top}(\mathbf{I} - \mathbf{P}_{\mathbf{U}})\mathbf{X}(\mathbf{I} - \mathbf{P}_{\mathbf{V}})) = \lambda_{e}\,\mathrm{sign}(\mathbf{S}_{0}).$

$\mathbf{B}_{\mathcal{S}_{e}} := \lambda_{e}\,\mathrm{sign}(\mathbf{S}_{0}) - \mathcal{P}_{\mathcal{S}_{e}}(\mathbf{D}^{\top}\mathbf{U}\mathbf{V}^{\top}),$

$\mathrm{vec}(\mathbf{Z}) = [(\mathbf{I} - \mathbf{P}_{\mathbf{V}}) \otimes \mathbf{D}^{\top}(\mathbf{I} - \mathbf{P}_{\mathbf{U}})]\,\mathrm{vec}(\mathbf{X}).$

$\mathrm{vec}(\mathbf{X}) = \mathbf{A}_{\mathcal{S}_{e}}^{\top}(\mathbf{A}_{\mathcal{S}_{e}}\mathbf{A}_{\mathcal{S}_{e}}^{\top})^{-1}\mathbf{b}_{\mathcal{S}_{e}}.$

∥ P_{L^{⊥}} (Γ) ∥
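
The projection operators onto the subspace of the low-rank part and onto its orthogonal complement, listed above, can be checked numerically. The sketch below is an illustration in NumPy under assumed, hypothetical problem sizes. It builds the two projection matrices from the SVD of a random rank-$r$ matrix and defines both operators exactly as written in the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 10, 2                       # hypothetical sizes
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# Orthonormal bases of the column space (U) and row space (V) of L.
U, _, Vt = np.linalg.svd(L, full_matrices=False)
U, V = U[:, :r], Vt[:r, :].T
P_U, P_V = U @ U.T, V @ V.T              # projection matrices

def proj_L(X):
    """P_L(X) = P_U X + X P_V - P_U X P_V."""
    return P_U @ X + X @ P_V - P_U @ X @ P_V

def proj_L_perp(X):
    """P_{L-perp}(X) = (I - P_U) X (I - P_V)."""
    return (np.eye(n) - P_U) @ X @ (np.eye(m) - P_V)

X = rng.standard_normal((n, m))
residual = X - (proj_L(X) + proj_L_perp(X))   # should vanish up to round-off
```

Expanding $(\mathbf{I}-\mathbf{P}_{\mathbf{U}})\mathbf{X}(\mathbf{I}-\mathbf{P}_{\mathbf{V}})$ shows why the two pieces sum to $\mathbf{X}$: the three terms of the first operator cancel against the cross terms of the second.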

Full text

Sirisha Rambhatla, Xingguo Li, Jineng Ren and Jarvis Haupt

Department of Electrical and Computer Engineering,

University of Minnesota – Twin Cities, Minneapolis, MN-55455

{rambh002, lixx1661, renxx282, jdhaupt}@umn.edu.

Sirisha Rambhatla†, Xingguo Li‡, Jineng Ren†, and Jarvis Haupt†

†Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, 55455, USA, e-mail: {rambh002, renxx282, jdhaupt}@umn.edu, respectively. ‡Computer Science Department, Princeton University, Princeton, NJ 08540, USA, email: [email protected]. The work was done when S. Rambhatla was at the University of Minnesota-Twin Cities. This work was supported by the DARPA YFA, Grant N66001-14-1-4047. Preliminary versions appeared in the proceedings of the 2016 IEEE Global Conference on Signal & Information Processing (GlobalSIP), 2017 Asilomar Conference on Signals, Systems, & Computers, and the 2018 IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP).

Abstract

We consider the decomposition of a data matrix assumed to be a superposition of a low-rank matrix and a component which is sparse in a known dictionary, using a convex demixing method. We consider two sparsity structures for the sparse factor of the dictionary sparse component, namely entry-wise and column-wise sparsity, and provide a unified analysis, encompassing both undercomplete and the overcomplete dictionary cases, to show that the constituent matrices can be successfully recovered under some relatively mild conditions on incoherence, sparsity, and rank. We leverage these results to localize targets of interest in a hyperspectral (HS) image based on their spectral signature(s) using the a priori known characteristic spectral responses of the target. We corroborate our theoretical results and analyze target localization performance of our approach via experimental evaluations and comparisons to related techniques.

Index Terms:

Low-rank, dictionary learning, target localization, Robust PCA, hyperspectral imaging, sparse representation.

I Introduction

Leveraging the structure of a given dataset is at the heart of machine learning and data analysis tasks. A priori knowledge about the structure often makes the problem well-posed, leading to improvements in the solutions. Perhaps the most common structure encountered in practice is approximate low-rankness of the dataset, which is exploited by the popular principal component analysis (PCA) [1]. The low-rank structure encapsulates the model assumption that the data in fact spans a lower-dimensional subspace than the ambient dimension of the data. However, in a number of applications, the data may not be inherently low-rank, but may be decomposed as a superposition of a low-rank component and a component which has a sparse representation in a known dictionary. This scenario is encountered in target identification applications in hyperspectral (HS) imaging [2, 3], where the a priori knowledge of the target signatures (dictionary) can be leveraged for localization.

Hyperspectral (HS) imaging is an imaging modality which senses the intensities of the reflected electromagnetic waves (responses) corresponding to different wavelengths of the electromagnetic spectrum, many of which are invisible to the human eye. As the spectral response associated with an object/material depends on its composition, HS imaging can be used to identify target objects/materials via their characteristic spectra or signature responses, also referred to as endmembers.

Typical applications of HS imaging range from monitoring agricultural use of land, catchment areas of rivers and water bodies, food processing, surveillance, and climate science, to detecting various minerals and chemicals, and even the presence of life-sustaining compounds on distant planets; see [4, 5, 6], and references therein for details. However, these spectral signatures are often highly correlated, which makes it difficult to detect regions of interest.

In this work, we present two techniques for target localization in HS images by posing it as a matrix demixing task. Here, we first analyze a matrix demixing problem where a data matrix ${\mathbf{M}}\in\mathbb{R}^{n\times m}$ is assumed to be formed via a superposition of a low-rank component ${\mathbf{L}}\in\mathbb{R}^{n\times m}$ of rank- $r$ for $r<\min(n,m)$ , and a dictionary sparse part ${\mathbf{DS}}\in\mathbb{R}^{n\times m}$ . Here, the matrix ${\mathbf{D}}\in\mathbb{R}^{n\times d}$ is an a priori known dictionary, and ${\mathbf{S}}\in\mathbb{R}^{d\times m}$ is an unknown sparse coefficient matrix. Specifically, we will study the following model for ${\mathbf{M}}$ :

${\mathbf{M}}={\mathbf{L}}+{\mathbf{D}}{\mathbf{S}},\qquad(1)$

and identify the conditions under which components ${\mathbf{L}}$ and ${\mathbf{S}}$ can be recovered given ${\mathbf{M}}$ and ${\mathbf{D}}$ by solving appropriate convex formulations. We then leverage these theoretical results for the target localization task in HS images; see Section VI.

We consider the demixing problem described above for two different sparsity models on the matrix ${\mathbf{S}}$ . First, we consider a case where ${\mathbf{S}}$ has at most $s_{e}$ total non-zero entries (entry-wise sparse case), and second where ${\mathbf{S}}$ has $s_{c}$ non-zero columns (column-wise sparse case). To this end, we develop the conditions under which solving

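$\min_{{\mathbf{L}},{\mathbf{S}}}~\|{\mathbf{L}}\|_{*}+\lambda_{e}\|{\mathbf{S}}\|_{1}\quad\text{s.t.}\quad{\mathbf{M}}={\mathbf{L}}+{\mathbf{D}}{\mathbf{S}},\qquad\text{(D-RPCA(E))}$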

for the entry-wise sparsity case, and

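$\min_{{\mathbf{L}},{\mathbf{S}}}~\|{\mathbf{L}}\|_{*}+\lambda_{c}\|{\mathbf{S}}\|_{1,2}\quad\text{s.t.}\quad{\mathbf{M}}={\mathbf{L}}+{\mathbf{D}}{\mathbf{S}},\qquad\text{(D-RPCA(C))}$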

for the column-wise sparse case, will recover ${\mathbf{L}}$ and ${\mathbf{S}}$ for regularization parameters $\lambda_{e}\geq 0$ and $\lambda_{c}\geq 0$, respectively, given the data ${\mathbf{M}}$ and the dictionary ${\mathbf{D}}$. The known dictionary ${\mathbf{D}}$ here can be overcomplete (fat, i.e., $d>n$) or undercomplete (thin, i.e., $d\leq n$). Here, “D-RPCA” refers to “Dictionary based Robust Principal Component Analysis”, while “E” and “C” indicate the entry-wise and column-wise sparsity patterns, respectively. In addition, $\|\cdot\|_{*}$, $\|\cdot\|_{1}$, and $\|\cdot\|_{1,2}$ refer to the nuclear norm, the $\ell_{1}$-norm of the vectorized matrix, and the $\ell_{1,2}$ norm (sum of column $\ell_{2}$ norms), which serve as convex surrogates that promote low rank, entry-wise sparsity, and column-wise sparsity, respectively.
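
The three norms above each admit a closed-form proximal operator, which is what makes the two convex programs practical to solve with proximal or splitting methods. The following NumPy sketch shows these standard operators together with the two objective values. It is offered as an illustration only: the solver used by the authors is not specified in this passage, and the function names (`svt`, `soft_threshold`, `column_shrink`, `objective`) are hypothetical.

```python
import numpy as np

def svt(X, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(X, tau):
    """Prox of tau * entry-wise l1 norm: shrink every entry toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def column_shrink(X, tau):
    """Prox of tau * l_{1,2} norm: shrink each column's l2 norm by tau."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return X * scale

def objective(L, S, lam, column_wise=False):
    """Objective of D-RPCA(E) (default) or D-RPCA(C) (column_wise=True)."""
    penalty = np.linalg.norm(S, axis=0).sum() if column_wise else np.abs(S).sum()
    return np.linalg.norm(L, ord="nuc") + lam * penalty
```

Entry-wise shrinkage zeroes individual small coefficients, whereas column shrinkage zeroes entire columns at once, which mirrors the difference between the two sparsity models.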

These two types of sparsity patterns capture different structural properties of the dictionary sparse component. The entry-wise sparsity model allows individual data points to span low-dimensional subspaces, while still allowing the dataset as a whole to span the entire space. In the column-wise sparsity setting, by contrast, the component ${\mathbf{DS}}$ is itself column-wise sparse, since a zero column of ${\mathbf{S}}$ yields a zero column of ${\mathbf{DS}}$. As a result, this model effectively captures structured (dictionary dependent) corruptions in the otherwise low-rank structured columns of ${\mathbf{M}}$. Note that the non-zero columns of ${\mathbf{S}}$ are not restricted to be sparse in the column-wise sparsity model.
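
To make the two sparsity models concrete, the sketch below generates synthetic data under each of them with NumPy. All sizes and sparsity levels are hypothetical choices for illustration, and the Gaussian construction of the low-rank part, the dictionary, and the coefficients is an assumption rather than the experimental setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, r = 30, 40, 5, 3            # hypothetical sizes, thin dictionary (d <= n)
s_e, s_c = 12, 4                     # hypothetical sparsity levels

# Rank-r component and a dictionary with unit-norm columns.
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
D = rng.standard_normal((n, d))
D /= np.linalg.norm(D, axis=0)

# Entry-wise model: s_e non-zero entries placed anywhere in S.
S_e = np.zeros((d, m))
idx = rng.choice(d * m, size=s_e, replace=False)
S_e.flat[idx] = rng.standard_normal(s_e)

# Column-wise model: s_c non-zero columns, each allowed to be dense.
S_c = np.zeros((d, m))
cols = rng.choice(m, size=s_c, replace=False)
S_c[:, cols] = rng.standard_normal((d, s_c))

M_e = L + D @ S_e                    # data under the entry-wise model
M_c = L + D @ S_c                    # data under the column-wise model

# The dictionary sparse part inherits the column support of S_c.
outlier_cols = np.flatnonzero(np.linalg.norm(D @ S_c, axis=0))
```

In the column-wise case only the `s_c` columns listed in `outlier_cols` deviate from the low-rank structure, while in the entry-wise case many columns of `M_e` can each carry a small perturbation.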

I-A Background

A wide range of problems can be expressed in the form described in (1). Perhaps the most celebrated of these is principal component analysis (PCA) [1], which can be viewed as a special case of (1) with the matrix ${\mathbf{D}}$ set to zero. In the absence of ${\mathbf{L}}$, the problem reduces to that of sparse recovery [7, 8, 9]; see [10] and references therein for an overview of related works. Further, the popular framework of Robust PCA [11, 12] tackles the case where the dictionary is the identity, i.e., ${\mathbf{D}}={\mathbf{I}}$ for an identity matrix ${\mathbf{I}}$, while Outlier Pursuit (OP) [13] considers ${\mathbf{D}}={\mathbf{I}}$ with a column-wise sparse ${\mathbf{S}}$; see also [14, 15, 16, 17, 18, 19, 20, 21, 22].

The model in (1) is also closely related to the one in [23], which explores the overcomplete dictionary setting with applications to network traffic anomaly detection. However, the analysis therein applies to the case where ${\mathbf{D}}$ is overcomplete with orthogonal rows, and the coefficient matrix ${\mathbf{S}}$ has a small number of non-zero elements per row and column, which may be restrictive assumptions in some applications. To this end, in recent works we extend the analysis of [23] to the case where the dictionary has more rows than columns, i.e., is thin, while removing the orthogonality constraint for both the thin and the fat dictionary cases, for the entry-wise sparsity [24] and column-wise sparsity [3] cases, respectively.

In particular, the entry-wise case of (1) is propitious in a number of applications. For example, it can be used for target identification in hyperspectral imaging [2, 3], and in topic modeling applications to identify documents with certain properties, along similar lines as [25]. Further, in source separation tasks, a variant of this model was used for singing voice separation in [26, 27]. In addition, we can also envision source separation tasks where ${\mathbf{L}}$ is not low-rank, but can in turn be modeled as being sparse in a known [28] or unknown [29] dictionary. In the column-wise setting, model (1) is also closely related to outlier identification [13, 18, 19, 30], which is motivated by a number of contemporary “big data” applications. Here, the non-zero columns of the sparse matrix ${\mathbf{S}}$ (known as “outliers”) can be used to identify malicious responses in collaborative filtering applications [31], find anomalous patterns in network traffic [32], or estimate visually salient regions of images [33, 34, 35]; see also [36].

In Section VI we also analyze and demonstrate the application of the model shown in (1) for a hyperspectral (HS) demixing task. HS image analysis using sparse recovery-based techniques was explored in [37, 38, 39, 40]. Applications of compressive sampling have been explored in [41, 42], while [43] analyzes the case where HS images are noisy and incomplete. Further, in a recent work [44], the authors study a case where ${\mathbf{L}}$ is absent and the sparse matrix ${\mathbf{S}}$ is also low-rank for the demixing task (1). However, the techniques discussed above focus on identifying all materials in a given HS image. Although sparsity-based target detection was considered in [45, 46, 47, 48], these approaches use training samples from both the background and the targets for detection, and possess no recovery guarantees. In target localization, by contrast, the task is to identify only specific target(s) in a given HS image, while the background may be unknown/irrelevant. As a result, there is a need for techniques which localize targets based on their a priori known spectral signatures; see also [49] and [50].

I-B Our Contributions

As described above, we propose and analyze a dictionary based generalization of robust PCA as shown in (1). Here, we consider two distinct sparsity patterns of ${\mathbf{S}}$ , i.e., entry-wise and column-wise sparse ${\mathbf{S}}$ , arising from different structural assumptions on the dictionary sparse component. Our specific contributions are summarized below.

Entry-wise case: We make the following contributions towards guaranteeing the recovery of ${\mathbf{L}}$ and ${\mathbf{S}}$ via the convex optimization problem in D-RPCA(E). First, we analyze the thin case (i.e., $d\leq n$), where we assume that the matrix ${\mathbf{S}}$ has at most $s_{e}=\mathcal{O}(\tfrac{m}{r})$ non-zero elements globally, i.e., $\|{\mathbf{S}}\|_{0}\leq s_{e}$, where $\|\cdot\|_{0}$ represents the number of non-zero entries in ${\mathbf{S}}$. Next, for the fat case, we first extend the analysis presented in [23] to eliminate the orthogonality constraint on the rows of the dictionary ${\mathbf{D}}$. Further, we relax the sparsity constraints required by [23] on rows and columns of the sparse coefficient matrix ${\mathbf{S}}$, to study the case when $\|{\mathbf{S}}\|_{0}\leq s_{e}$ with at most $k=\mathcal{O}(d/\log(n))$ non-zero elements per column [24]. Hence, we provide a unified analysis for both the thin and the fat case, making the model (1) amenable to a wide range of applications.

Column-wise case: We propose and analyze a dictionary based generalization of Outlier Pursuit (OP) [13], wherein the coefficient matrix ${\mathbf{S}}$ admits a column sparse structure whose non-zero columns are referred to as “outliers”; see also [3]. Note that in this case there is an inherent ambiguity regarding the recovery of the true component pair $({\mathbf{L}},{\mathbf{S}})$ corresponding to the low-rank part and the dictionary sparse component, respectively. Specifically, any pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ satisfying ${\mathbf{M}}={\mathbf{L}}_{0}+{\mathbf{D}}{\mathbf{S}}_{0}={\mathbf{L}}+{\mathbf{D}}{\mathbf{S}}$, where ${\mathbf{L}}_{0}$ and ${\mathbf{L}}$ have the same column space, and ${\mathbf{S}}_{0}$ and ${\mathbf{S}}$ have identical column support, is a solution of D-RPCA(C). To this end, we develop sufficient conditions for the convex optimization task D-RPCA(C) to recover the column space of the low-rank component ${\mathbf{L}}$, while identifying the outlier columns of ${\mathbf{S}}$; see Section II-A for details. Here, the only difference between D-RPCA(C) and OP is the inclusion of the known dictionary. Next, we demonstrate the advantages of leveraging the knowledge of the dictionary via phase transitions in rank and sparsity for recovery of the outlier columns. Specifically, we show that as compared to OP, D-RPCA(C) works for potentially higher ranks of ${\mathbf{L}}$, when $s_{c}$ is a fixed proportion of $m$.

The thin dictionary case – an interesting result: [23] suggests that when the dictionary is thin, i.e., $d<n$ , one can envision a pseudo-inverse based technique wherein we pre-multiply both sides in (1) with the Moore-Penrose pseudo-inverse ${\mathbf{D}}^{\dagger}\in\mathbb{R}^{d\times n}$ , i.e., ${\mathbf{D}}^{\dagger}{\mathbf{D}}={\mathbf{I}}$ (this is not applicable for the fat case due to the non-trivial null space). This operation leads to a formulation which resembles the robust PCA (RPCA) [11, 12] model for the entry-wise case and Outlier Pursuit (OP) [13] for the column-wise case, i.e.,

$${\mathbf{D}}^{\dagger}{\mathbf{M}}={\mathbf{D}}^{\dagger}{\mathbf{L}}+{\mathbf{S}}.$$

An interesting finding of our work is that although this transformation algebraically reduces the entry-wise and column-wise sparsity cases to Robust PCA and OP settings, respectively, the specific model assumptions of Robust PCA and OP may not hold for all choices of dictionary size $d$ and rank $r$. Specifically, we find that in cases where $d<r$, this pre-multiplication may not lead to a “low-rank” ${\mathbf{D}}^{\dagger}{\mathbf{L}}$. This suggests that the notion of “low” or “high” rank is relative to the maximum possible rank of ${\mathbf{D}}^{\dagger}{\mathbf{L}}$, which in this case is $\min(d,r)$. Therefore, if $d<r$, ${\mathbf{D}}^{\dagger}{\mathbf{L}}$ can be full-rank, and the low-rank assumptions of RPCA and OP may no longer hold. As a result, these two models (the pseudo-inverse based formulation and the current work) cannot be used interchangeably for the thin dictionary case. We corroborate these findings via experimental evaluations presented in Section V. The code is made available at github.com/srambhatla/Dictionary-based-Robust-PCA, and the results are reproducible.
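The rank argument above can be checked numerically. The following is a minimal sketch, assuming random Gaussian factors for ${\mathbf{D}}$ and ${\mathbf{L}}$ (the function name and the dimensions are illustrative, not taken from the paper's code): it verifies that ${\mathbf{D}}^{\dagger}{\mathbf{D}}={\mathbf{I}}$ for a thin dictionary and that, when $d<r$, the transformed matrix ${\mathbf{D}}^{\dagger}{\mathbf{L}}\in\mathbb{R}^{d\times m}$ is full-rank.

```python
import numpy as np

def pinv_transform_ranks(n=50, m=60, d=5, r=10, seed=0):
    """Compare rank(L) with rank(D^+ L) for a thin dictionary with d < r.

    All sizes are hypothetical; D and L are random Gaussian, so they have
    full column rank and rank r, respectively, with probability one.
    """
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, d))                       # thin dictionary (d <= n)
    L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r matrix
    D_pinv = np.linalg.pinv(D)                            # Moore-Penrose pseudo-inverse, d x n
    left_inverse_ok = np.allclose(D_pinv @ D, np.eye(d))  # D^+ D = I for full column rank
    rank_L = np.linalg.matrix_rank(L)
    rank_DL = np.linalg.matrix_rank(D_pinv @ L)           # at most min(d, r)
    return left_inverse_ok, rank_L, rank_DL

left_inverse_ok, rank_L, rank_DL = pinv_transform_ranks()
```

In this configuration ${\mathbf{D}}^{\dagger}{\mathbf{L}}$ has only $d$ rows, so its rank equals its smallest dimension even though ${\mathbf{L}}$ itself is low-rank relative to $\min(n,m)$.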

Techniques for HS demixing: Building on our theoretical results, we present two techniques for target detection in a HS image, depending upon different sparsity assumptions on the matrix ${\mathbf{S}}$ . Our techniques operate by forming the dictionary ${\mathbf{D}}$ using the a priori known spectral signatures of the target of interest, and leveraging the approximate low-rank structure of the data matrix ${\mathbf{M}}$ [24, 3, 2]. We then analyze the performance of these techniques via extensive experimental evaluations on real-world demixing tasks over different datasets and dictionary choices, and compare the performance of the proposed techniques with related works.

The choice of a particular sparsity model, i.e., entry-wise or column-wise, for this task depends on the properties of the dictionary matrix ${\mathbf{D}}$. In particular, if the target signature admits a sparse representation in the dictionary, the entry-wise sparsity structure is preferred. This is likely to be the case when the dictionary is overcomplete ($n<d$) or fat, and also when the target spectral responses admit a sparse representation in the dictionary. On the other hand, the column-wise sparsity structure is amenable to cases where the representation can use all columns of the dictionary. This potentially arises in the cases when the dictionary is undercomplete ($n\geq d$) or thin. Note that, in the column-wise sparsity case, the non-zero columns need not be sparse themselves. The applicability of these two modalities is also exhibited in our experimental analysis; see Section VI-B for further details.

Demixing Despite Correlated Signatures: Since the spectral signatures of even distinct classes are highly correlated with each other, this demixing task is particularly challenging. For instance, we plot the spectral signatures of different classes of the “Indian Pines” Dataset [51] in Fig. 1. The shaded region here shows the upper and lower ranges of different classes. For instance, in Fig. 1 we observe that the spectral signature of the “Stone-Steel” class is similar to that of class “Wheat”. This correlation between the spectral signatures of different classes results in an approximate low-rank structure of the data, captured by the low-rank component ${\mathbf{L}}$, while the dictionary-sparse component ${\mathbf{DS}}$ is used to identify the target of interest; see also Fig. 8. We specifically show that such a decomposition successfully localizes the target despite the high correlation between spectral signatures. It is worth noting that although we consider thin dictionaries ($n\geq d$) for the purposes of this demixing task, our theoretical results are also applicable for the fat case ($n<d$) [24], [3].

The rest of the paper is organized as follows. We formalize the problem and describe various considerations on the structure of the component matrices in Section II. In Section III, we present our main theorems for the entry-wise and column-wise cases along with discussion on the implication of the results, followed by an outline of the analysis in Section IV. Numerical evaluations on synthetic data are provided in Section V, while we explore the application to target localization in HS images in Section VI. Finally, we summarize our contributions and conclude this discussion in Section VII with future directions.

Notation: Given a matrix ${\mathbf{X}}$ and vector ${\mathbf{v}}$, we use $\|{\mathbf{X}}\|:=\sigma_{\max}({\mathbf{X}})$ for the spectral norm, where $\sigma_{\max}({\mathbf{X}})$ denotes the maximum singular value of the matrix, $\|{\mathbf{v}}\|_{\infty}=\underset{i}{\max}~|{\mathbf{v}}_{i}|$, $\|{\mathbf{X}}\|_{\infty}:=\underset{i,~j}{\max}~|{\mathbf{X}}_{ij}|$, $\|{\mathbf{X}}\|_{\infty,\infty}=\underset{\|{\mathbf{v}}\|_{\infty}=1}{\max}~\|{\mathbf{X}}{\mathbf{v}}\|_{\infty}=\underset{i}{\max}~\|{\mathbf{e}}^{\top}_{i}{\mathbf{X}}\|_{1}$, and $\|{\mathbf{X}}\|_{\infty,2}:=\underset{i}{\max}~\|{\mathbf{X}}{\mathbf{e}}_{i}\|$. Here, ${\mathbf{X}}_{ij}$ denotes the $(i,j)$ element of ${\mathbf{X}}$ and ${\mathbf{e}}_{i}$ denotes the canonical basis vector with $1$ at the $i$-th location. We also use $\|\cdot\|$ to denote the $\ell_{2}$-norm in case of vectors and spectral norm for matrices.

II Preliminaries

We start by formalizing the problem set-up and introducing the model parameters pertinent to our analysis. We begin our discussion with our notion of optimality for the two sparsity modalities; we also summarize the notation in Table V in the appendix.

II-A Optimality of the Solution Pair

For the entry-wise case, we recover the low-rank component ${\mathbf{L}}$ and the sparse coefficient matrix ${\mathbf{S}}$, given the dictionary ${\mathbf{D}}$ and data ${\mathbf{M}}$ generated according to the model described in (1). Recall that $s_{e}$ is the global sparsity, and $k$ denotes the number of non-zero entries in a column of ${\mathbf{S}}$ when the dictionary is fat.

In the column-wise sparsity setting, due to the inherent ambiguity in the model (1), as discussed in Section I-B, we can only hope to recover the column-space for the low-rank matrix and the identities of the non-zero columns for the sparse matrix. Therefore, in this case any solution in the Oracle Model (defined below) is deemed to be optimal.

Definition D.1 (Oracle Model for Column-wise Sparsity Case).

Let the pair $({\mathbf{L}},{\mathbf{S}})$ be the matrices forming the data ${\mathbf{M}}$ as per (1), and define the oracle model $\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ . Then, any pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ is in the Oracle Model $\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ , if ${\mathcal{P}}_{{\mathcal{U}}}({\mathbf{L}}_{0})={\mathbf{L}}$ , ${\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{D}}{\mathbf{S}}_{0})={\mathbf{D}}{\mathbf{S}}$ and ${\mathbf{L}}_{0}+{\mathbf{D}}{\mathbf{S}}_{0}={\mathbf{L}}+{\mathbf{D}}{\mathbf{S}}={\mathbf{M}}$ hold simultaneously, where ${\mathcal{P}}_{{\mathcal{U}}}$ and ${\mathcal{P}}_{{\mathcal{S}}_{c}}$ are projections onto the column space ${\mathcal{U}}$ of ${\mathbf{L}}$ and column support ${\mathcal{I}}_{{\mathcal{S}}_{c}}$ of ${\mathbf{S}}$ , respectively.

II-B Conditions on the Dictionary

We require that the dictionary ${\mathbf{D}}$ follows the generalized frame property (GFP) defined as follows.

Definition D.2.

A matrix ${\mathbf{D}}$ satisfies the generalized frame property (GFP), on vectors ${\mathbf{v}}\in{\mathcal{R}}$ , if for any fixed vector ${\mathbf{v}}\in{\mathcal{R}}$ where ${\mathbf{v}}\neq{\mathbf{0}}$ and some ${\mathcal{R}}$ , we have

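$$\alpha_{\ell}\|{\mathbf{v}}\|_{2}^{2}\leq\|{\mathbf{D}}{\mathbf{v}}\|_{2}^{2}\leq\alpha_{u}\|{\mathbf{v}}\|_{2}^{2},$$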

where $\alpha_{\ell}$ and $\alpha_{u}$ are the lower and upper generalized frame bounds with $0<\alpha_{\ell}\leq\alpha_{u}<\infty$ .

The GFP shown above is met as long as the vectors ${\mathbf{v}}$ are not in the null-space of the matrix ${\mathbf{D}}$ for finite $\|{\mathbf{D}}\|$. Therefore, for the thin dictionary setting $d\leq n$, for both entry-wise and column-wise cases, ${\mathcal{R}}$ can be the entire space, and GFP is satisfied as long as ${\mathbf{D}}$ has full column rank. For example, ${\mathbf{D}}$ being a frame [52] suffices; see also [53]. On the other hand, for the fat dictionary setting, we need the space ${\mathcal{R}}$ to have a union-of-subspace structure such that GFP is met for both the entry-wise and column-wise sparsity cases. Specifically, for the entry-wise sparsity case, we also require that the frame bounds $\alpha_{u}$ and $\alpha_{\ell}$ be close to each other. To this end, we assume that ${\mathbf{D}}$ satisfies the restricted isometry property (RIP) [9] of order $k=\mathcal{O}(d/\log(n))$ with a restricted isometry constant (RIC) of $\delta$ in this case, and that $\alpha_{u}=(1+\delta)$ and $\alpha_{\ell}=(1-\delta)$.
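For the thin case, the frame bounds can be read off the singular values of ${\mathbf{D}}$. The sketch below assumes the GFP bounds $\|{\mathbf{D}}{\mathbf{v}}\|_{2}^{2}$ between $\alpha_{\ell}\|{\mathbf{v}}\|_{2}^{2}$ and $\alpha_{u}\|{\mathbf{v}}\|_{2}^{2}$ (squared norms are an assumption here, chosen to agree with the RIP convention $\alpha_{u}=1+\delta$, $\alpha_{\ell}=1-\delta$), and it uses a random Gaussian dictionary purely for illustration.

```python
import numpy as np

def frame_bounds(D):
    """Tightest (alpha_l, alpha_u) over the whole space for a thin D.

    Assumes the GFP is stated with squared l2 norms, so the bounds are the
    squared smallest and largest singular values of D.
    """
    s = np.linalg.svd(D, compute_uv=False)   # singular values, descending
    return s[-1] ** 2, s[0] ** 2

rng = np.random.default_rng(0)
D = rng.standard_normal((40, 8))             # hypothetical thin dictionary (n=40, d=8)
alpha_l, alpha_u = frame_bounds(D)

# Empirically check alpha_l ||v||^2 <= ||Dv||^2 <= alpha_u ||v||^2 on random v.
V = rng.standard_normal((8, 1000))
ratios = np.sum((D @ V) ** 2, axis=0) / np.sum(V ** 2, axis=0)
gfp_holds = bool(np.all(ratios >= alpha_l - 1e-9) and np.all(ratios <= alpha_u + 1e-9))
```

A strictly positive `alpha_l` is equivalent to ${\mathbf{D}}$ having full column rank, which is the condition stated above for the thin setting.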

II-C Relevant Subspaces

We now define the subspaces relevant for our discussion. For the following discussion, let the pair $({\mathbf{L_{0}}},{\mathbf{S_{0}}})$ denote the solution to D-RPCA(E) in the entry-wise sparse case. Further, for the column-wise sparse setting, let $({\mathbf{L_{0}}},{\mathbf{S_{0}}})$ denote a solution pair in the oracle model $\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ D.1, obtained by solving D-RPCA(C). For the low-rank matrix ${\mathbf{L}}$ , let the compact singular value decomposition (SVD) be defined as

$${\mathbf{L}}={\mathbf{U\Sigma V^{\top}}},$$

where ${\mathbf{U}}\in\mathbb{R}^{n\times r}$ and ${\mathbf{V}}\in\mathbb{R}^{m\times r}$ are the left and right singular vectors of ${\mathbf{L}}$ , respectively, and ${\mathbf{\Sigma}}$ is the diagonal matrix with singular values on the diagonal. Here, matrices ${\mathbf{U}}$ and ${\mathbf{V}}$ each have orthogonal columns, and the non-negative entries ${\mathbf{\Sigma}}_{ii}=\sigma_{i}$ are arranged in descending order. We define ${\mathcal{L}}$ as the linear subspace consisting of matrices spanning the same row or column space as ${\mathbf{L}}$ , i.e., for ${\mathbf{W}}_{1}\neq 0$ or ${\mathbf{W}}_{2}\neq 0$ ,

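$${\mathcal{L}}:=\{{\mathbf{U}}{\mathbf{W}}_{1}^{\top}+{\mathbf{W}}_{2}{\mathbf{V}}^{\top},~{\mathbf{W}}_{1}\in\mathbb{R}^{m\times r},~{\mathbf{W}}_{2}\in\mathbb{R}^{n\times r}\}.$$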

Next, let ${\mathcal{S}}_{e}$ ( ${\mathcal{S}}_{c}$ for the column-wise sparsity setting) be the space spanned by $d\times m$ matrices with the same non-zero support (column support, denoted as $\rm csupp(\cdot)$ ) as ${\mathbf{S}}$ , and let ${\mathcal{I}}_{{\mathcal{S}}_{c}}$ denote the index set containing the non-zero column index set of ${\mathbf{S}}$ for the column-wise case, then we denote the space spanned by the dictionary sparse component ${\mathcal{D}}$ as

$${\mathcal{D}}:=\{{\mathbf{D}}{\mathbf{H}}\},$$

where ${\mathbf{H}}\in{\mathcal{S}}_{e}$ for entry-wise case and $\rm csupp({\mathbf{H}})\subseteq{\mathcal{I}}_{{\mathcal{S}}_{c}}$ for column-wise case. Also, we denote the corresponding complements of the spaces described above by appending ‘ $\perp$ ’. In addition, we use calligraphic ‘ ${\mathcal{P}}_{{\mathcal{G}}}(\cdot)$ ’ to denote the projection operator onto a subspace ${\mathcal{G}}$ , and ‘ ${\mathbf{P}}_{{\mathbf{G}}}$ ’ to denote the corresponding projection matrix. For instance, we define ${\mathcal{P}}_{{\mathcal{U}}}(\cdot)$ and ${\mathcal{P}}_{{\mathcal{V}}}(\cdot)$ as the projection operators corresponding to the column space ${\mathcal{U}}$ and row space ${\mathcal{V}}$ of the low-rank component ${\mathbf{L}}$ . Therefore, for a given matrix ${\mathbf{X}}\in\mathbb{R}^{n\times m}$ ,

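$${\mathcal{P}}_{{\mathcal{U}}}({\mathbf{X}})={\mathbf{P}}_{{\mathbf{U}}}{\mathbf{X}}~~\text{and}~~{\mathcal{P}}_{{\mathcal{V}}}({\mathbf{X}})={\mathbf{X}}{\mathbf{P}}_{{\mathbf{V}}},$$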

where ${\mathbf{P}}_{{\mathbf{U}}}={\mathbf{UU^{\top}}}$ and ${\mathbf{P}}_{{\mathbf{V}}}={\mathbf{VV^{\top}}}$ . With this, the projection operators onto, and orthogonal to, the subspace ${\mathcal{L}}$ are respectively defined as

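$${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{X}})={\mathbf{P}}_{{\mathbf{U}}}{\mathbf{X}}+{\mathbf{X}}{\mathbf{P}}_{{\mathbf{V}}}-{\mathbf{P}}_{{\mathbf{U}}}{\mathbf{X}}{\mathbf{P}}_{{\mathbf{V}}}$$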

and

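$${\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{X}})=({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{U}}}){\mathbf{X}}({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}}).$$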
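The projection operators of this subsection are straightforward to realize numerically. The following sketch assumes the standard forms ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{X}})={\mathbf{P}}_{{\mathbf{U}}}{\mathbf{X}}+{\mathbf{X}}{\mathbf{P}}_{{\mathbf{V}}}-{\mathbf{P}}_{{\mathbf{U}}}{\mathbf{X}}{\mathbf{P}}_{{\mathbf{V}}}$ and ${\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{X}})=({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{U}}}){\mathbf{X}}({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})$; the helper name `subspace_projectors` and the test matrices are hypothetical.

```python
import numpy as np

def subspace_projectors(L, r):
    """Return callables for P_L and P_{L-perp} built from the rank-r SVD of L."""
    n, m = L.shape
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T              # leading r left/right singular vectors
    P_U, P_V = U @ U.T, V @ V.T               # projection matrices onto col/row space
    I_n, I_m = np.eye(n), np.eye(m)

    def proj_L(X):
        # Projection onto the subspace L (assumed standard form).
        return P_U @ X + X @ P_V - P_U @ X @ P_V

    def proj_L_perp(X):
        # Projection onto the orthogonal complement of L.
        return (I_n - P_U) @ X @ (I_m - P_V)

    return proj_L, proj_L_perp

rng = np.random.default_rng(0)
n, m, r = 8, 10, 3
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r test matrix
X = rng.standard_normal((n, m))
proj_L, proj_L_perp = subspace_projectors(L, r)
```

Under these forms the two operators sum to the identity, each is idempotent, and ${\mathbf{L}}$ itself is left unchanged by `proj_L`.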

II-D Incoherence Measures and Parameters

We employ various notions of incoherence to identify the conditions under which our procedures succeed. To this end, we first define the incoherence parameter $\mu$ , which characterizes the relationship between the low-rank part ${\mathbf{L}}$ and the dictionary sparse part ${\mathbf{DS}}$ as

[TABLE]

The parameter $\mu\in[0,1]$ is the measure of degree of similarity between the low-rank part and the dictionary sparse component. Here, a larger $\mu$ implies that the dictionary sparse component is close to the low-rank part, while a small $\mu$ indicates otherwise. In addition, we also define the parameter $\beta_{{\mathbf{U}}}$ as

[TABLE]

which measures the similarity between the orthogonal complement of the column-space ${\mathcal{U}}$ and the dictionary ${\mathbf{D}}$ .

The next two measures of incoherence can be interpreted as ways to identify the cases where, for ${\mathbf{L}}$ with SVD ${\mathbf{L}}={\mathbf{U\Sigma V^{\top}}}$: (a) ${\mathbf{U}}$ resembles the dictionary ${\mathbf{D}}$, and/or (b) ${\mathbf{V}}$ resembles the sparse coefficient matrix ${\mathbf{S}}$. In these cases, the low-rank part may mimic the dictionary sparse component. To this end, similar to [23], we define the following to measure these properties respectively as

[TABLE]

Here, $\gamma_{{\mathbf{U}}}\in[0,1]$ , and achieves the upper bound when a dictionary element is exactly aligned with the column space ${\mathcal{U}}$ of ${\mathbf{L}}$ . Moreover, ${\mathbf{\gamma}}_{{\mathbf{V}}}{\in[r/m,1]}$ achieves the upper bound when the row-space of ${\mathbf{L}}$ is “spiky,” i.e., a certain row of ${\mathbf{V}}$ is $1$ -sparse, meaning that a column of ${\mathbf{L}}$ is supported by (can be expressed as a linear combination of) a column of ${\mathbf{U}}$ . The lower bound here is attained when it is “spread-out,” i.e., each column of ${\mathbf{L}}$ is a linear combination of all columns of ${\mathbf{U}}$ . In general, our recovery of the two components is easier when the incoherence parameters $\gamma_{{\mathbf{U}}}$ and ${\mathbf{\gamma}}_{{\mathbf{V}}}$ are closer to their lower bounds.
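The two extremes of $\gamma_{{\mathbf{V}}}$ can be illustrated on a small example. The sketch below assumes the form $\gamma_{{\mathbf{V}}}=\max_{i}\|{\mathbf{P}}_{{\mathbf{V}}}{\mathbf{e}}_{i}\|^{2}$, i.e., the largest squared row norm of ${\mathbf{V}}$; this form is an assumption consistent with the stated range $[r/m,1]$ and is not copied from the definition in the text. The matrices `V_spiky` and `V_spread` are hypothetical.

```python
import numpy as np

def gamma_V(V):
    """Largest squared row norm of V (V is assumed to have orthonormal columns).

    Assumed form: gamma_V = max_i ||P_V e_i||^2 = max_i ||V^T e_i||^2.
    """
    return float(np.max(np.sum(V ** 2, axis=1)))

m, r = 4, 2

# "Spiky" row space: the first row of V is 1-sparse with a unit entry.
V_spiky = np.eye(m)[:, :r]

# "Spread-out" row space: two orthonormal columns of a scaled 4x4 Hadamard matrix.
H4 = np.array([[1,  1,  1,  1],
               [1, -1,  1, -1],
               [1,  1, -1, -1],
               [1, -1, -1,  1]], dtype=float)
V_spread = H4[:, :r] / 2.0

g_spiky = gamma_V(V_spiky)
g_spread = gamma_V(V_spread)
```

Since the squared row norms of ${\mathbf{V}}$ sum to $r$, their maximum cannot fall below $r/m$; the spread-out example attains this value while the spiky one attains $1$.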

Further, for notational convenience, we define

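$$\xi_{e}:=\|{\mathbf{D}}^{\top}{\mathbf{UV}}^{\top}\|_{\infty}~~\text{and}~~\xi_{c}:=\|{\mathbf{D}}^{\top}{\mathbf{UV}}^{\top}\|_{\infty,2}.\tag{5}$$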

Here, $\xi_{e}$ is the maximum absolute entry of ${\mathbf{D}}^{\top}{\mathbf{UV}}^{\top}$ for the entry-wise case, which measures how close columns of ${\mathbf{D}}$ are to the singular vectors of ${\mathbf{L}}$ . Similarly, for the column-wise case, $\xi_{c}$ measures the closeness of columns of ${\mathbf{D}}$ to the singular vectors of ${\mathbf{L}}$ under a column-wise maximum $\ell_{2}$ -norm metric.
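Both quantities follow directly from the description above and the norm definitions in the notation. A minimal sketch, with a hypothetical helper `xi_measures` and random test data, is:

```python
import numpy as np

def xi_measures(D, L, r):
    """Compute xi_e (max absolute entry) and xi_c (max column l2-norm) of D^T U V^T."""
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    G = D.T @ U[:, :r] @ Vt[:r, :]                    # D^T U V^T, shape d x m
    xi_e = float(np.max(np.abs(G)))                   # ||.||_inf as defined in the notation
    xi_c = float(np.max(np.linalg.norm(G, axis=0)))   # ||.||_{inf,2}: largest column norm
    return xi_e, xi_c

rng = np.random.default_rng(0)
n, m, d, r = 12, 9, 4, 2
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
D = rng.standard_normal((n, d))
xi_e, xi_c = xi_measures(D, L, r)

# Removing the component of D inside the column space of L drives both measures to zero.
U = np.linalg.svd(L, full_matrices=False)[0][:, :r]
D_perp = D - U @ (U.T @ D)
xi_e_perp, xi_c_perp = xi_measures(D_perp, L, r)
```

The second computation mirrors the sanity check discussed later for a low-rank part orthogonal to the dictionary, where $\xi_{e}=0$.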

III Main Results

We present the main results corresponding to each sparsity structure of ${\mathbf{S}}$ in this section.

III-A Exact Recovery for Entry-wise Sparsity Case

Our main result establishes the existence of a regularization parameter $\lambda_{e}$ for which solving the optimization problem D-RPCA(E) will recover the components ${\mathbf{L}}$ and ${\mathbf{S}}$ exactly. To this end, we will show that such a $\lambda_{e}$ belongs to a non-empty interval $[\lambda_{e}^{\min},\lambda_{e}^{\max}]$ with $\lambda_{e}^{\min}$ and $\lambda_{e}^{\max}$ defined as

[TABLE]

where $0\leq C_{e}<1$ is a constant that captures the relationship between different model parameters, and is defined as

[TABLE]

and $c$ is defined as

[TABLE]

Given these definitions, we formalize the theorem for the entry-wise case as follows; a proof sketch is provided in Section IV-A.

Theorem 1.

Suppose ${\mathbf{M}}={\mathbf{L}}+{\mathbf{DS}}$, where ${\rm rank}({\mathbf{L}})=r$ and ${\mathbf{S}}$ has at most $s_{e}$ non-zeros, i.e., $\|{\mathbf{S}}\|_{0}\leq s_{e}\leq s_{e}^{\max}:=\tfrac{(1-\mu)^{2}}{2}\tfrac{m}{r}$. Given $\mu\in[0,1)$, $\gamma_{{\mathbf{U}}}\in[0,1]$, $\gamma_{{\mathbf{V}}}\in[r/m,1]$, $\xi_{e}$ defined in (2), (4), (5), and any $\lambda_{e}\in[\lambda_{e}^{\min},\lambda_{e}^{\max}]$ with $\lambda_{e}^{\max}>\lambda_{e}^{\min}\geq 0$ defined in (6), and assuming the dictionary ${\mathbf{D}}\in\mathbb{R}^{n\times d}$ obeys the generalized frame property D.2 with frame bounds $[\alpha_{\ell},\alpha_{u}]$, solving D-RPCA(E) will recover matrices ${\mathbf{L}}$ and ${\mathbf{S}}$ in the following cases:

$\bullet$ For $d\leq n$, ${\mathcal{R}}$ may contain the entire space and $\gamma_{{\mathbf{U}}}$ satisfies

[TABLE]

$\bullet$ For $d>n>C_{1}~k\log(n)$ for a constant $C_{1}$, ${\mathcal{R}}$ consists of all $k$-sparse vectors, and $\gamma_{{\mathbf{U}}}$ satisfies

[TABLE]

Theorem 1 establishes the sufficient conditions for the existence of $\lambda_{e}$ to guarantee recovery of $({\mathbf{L,S}})$ for both the thin and the fat cases. The conditions on $\gamma_{{\mathbf{U}}}$ dictated by (8) and (9), for the thin and fat case, respectively, arise from ensuring that $\lambda_{e}^{\min}\geq 0$ . Further, the condition $\lambda_{e}^{\min}<\lambda_{e}^{\max}$ , translates to the following sufficient condition on rank $r$ in terms of the sparsity $s_{e}$ for $\mu>0$ ,

[TABLE]

for the recovery of $({\mathbf{L,S}})$ . This relationship matches with our empirical evaluations and will be revisited in Section V-A.

For both thin and fat dictionary cases, smaller incoherence measures ($\mu$, $\gamma_{{\mathbf{V}}}$, and $\gamma_{{\mathbf{U}}}$) between the low-rank part ${\mathbf{L}}$, the dictionary ${\mathbf{D}}$, and the sparse component ${\mathbf{S}}$ are sufficient for recovery. Our theoretical results for the fat case are similar to [23] without its restrictions (e.g., orthogonality of rows and columns of ${\mathbf{D}}$, and sparsity requirements). By extending the analysis to thin dictionaries, we consider the worst-case deterministic setting, as opposed to Robust PCA analyses such as [12] which impose randomness assumptions on the components. The algorithm works beyond these constraints in practice since we consider sufficient conditions under the worst-case deterministic setting; see Section V. One sanity check is to consider the case when the low-rank part is orthogonal to the dictionary, i.e., $\mu,\gamma_{{\mathbf{U}}},\xi_{e}=0$. From (6), we see that the condition $\lambda_{e}^{\min}<\lambda_{e}^{\max}$ no longer constrains rank and sparsity, and we need $s_{e}\leq s_{e}^{\max}={\mathcal{O}}(\tfrac{m}{r})$. However, the rank and sparsity are still restricted, i.e., with increase in rank the dictionary choice may be restricted to maintain orthogonality.

III-B Exact Recovery for Column-wise Sparsity Case

Recall that we consider the oracle model in this case as described in D.1 owing to the intrinsic ambiguity in recovery of $({\mathbf{L}},{\mathbf{S}})$ ; see our discussion in Section I-B. To demonstrate its recoverability, the following lemma establishes the sufficient conditions for the existence of an optimal pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ . The proof is provided in Appendix A-B.

Lemma 2.

Given ${\mathbf{M}}$ , ${\mathbf{D}}$ , and $\left({\mathcal{L}},{{\mathcal{S}}_{c}},{\mathcal{D}}\right)$ , any pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})\in\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ satisfies ${\rm span}\{\text{col}({\mathbf{L}}_{0})\}={\mathcal{U}}$ and $\rm csupp({\mathbf{S}}_{0})={\mathcal{I}}_{{\mathcal{S}}_{c}}$ if $\mu<1$ .

Analogous to the entry-wise case, we show the existence of a non-empty interval $[\lambda_{c}^{\min},\lambda_{c}^{\max}]$ for the regularization parameter $\lambda_{c}$ , for which solving D-RPCA(C) recovers an optimal pair as per Lemma 2. Here, for a constant $C_{c}:=\tfrac{\alpha_{u}}{\alpha_{\ell}}\tfrac{1}{{(1-\mu)^{2}}}\gamma_{{\mathbf{V}}}\beta_{{\mathbf{U}}}$ , $\lambda_{c}^{\min}$ and $\lambda_{c}^{\max}$ are defined as

[TABLE]

Then, our main result for the column-wise case is as follows; a proof sketch is provided in Section IV-B.

Theorem 3.

Suppose ${\mathbf{M}}={\mathbf{L}}+{\mathbf{D}}{\mathbf{S}}$ with $({\mathbf{L}},{\mathbf{S}})$ defining the oracle model $\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ , where ${\rm rank}({\mathbf{L}})=r$ , and $|{\mathcal{I}}_{{\mathcal{S}}_{c}}|=s_{c}$ for $s_{c}\leq s_{c}^{\max}:=\tfrac{\alpha_{\ell}}{\alpha_{u}\gamma_{{\mathbf{V}}}}\cdot\tfrac{(1-\mu)^{2}}{\beta_{{\mathbf{U}}}}$ . Given $\mu\in[0,1)$ , $\beta_{{\mathbf{U}}}$ , $\gamma_{{\mathbf{V}}}\in[r/m,1]$ , $\xi_{c}$ defined in (2), (3), (4), (5), and any $\lambda_{c}\in[\lambda_{c}^{\min},\lambda_{c}^{\max}]$ , for $\lambda_{c}^{\max}>\lambda_{c}^{\min}\geq 0$ defined in (11), solving D-RPCA(C) will recover a pair of components $({\mathbf{L}}_{0},{\mathbf{S}}_{0})\in\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ , if the space ${\mathcal{R}}$ is structured such that the dictionary ${\mathbf{D}}\in\mathbb{R}^{n\times d}$ obeys the generalized frame property D.2 with frame bounds $[\alpha_{\ell},\alpha_{u}]$ , for $\alpha_{\ell}>0$ .

Theorem 3 states the conditions under which the solution to the optimization problem D-RPCA(C) will be in the oracle model defined in D.1. The condition on the column sparsity $s_{c}\leq s_{c}^{\max}$ is a result of the constraint that $\lambda_{c}^{\min}\geq 0$ . Similar to (10), requiring $\lambda^{\max}_{c}>\lambda_{c}^{\min}$ leads to the following sufficient condition on the rank $r$ in terms of the sparsity $s_{c}$ for $\mu>0$ ,

[TABLE]

For $\mu=0$ the conditions are similar to the entry-wise case, namely, that $s_{c}\leq s_{c}^{\max}$. Moreover, suppose that $\alpha_{\ell}$ and $\alpha_{u}$ are both close to $1$, which can be easily met by a tight frame when $d<n$, or a RIP type condition when $d>n$. Then, if $\tfrac{(1-\mu)^{2}}{\beta_{{\mathbf{U}}}}$ is a constant, since $\gamma_{{\mathbf{V}}}=\Theta(\tfrac{r}{m})$, we have that $s_{c}^{\max}={\mathcal{O}}(\tfrac{m}{r})$. This is of the same order as the upper bound on $s_{c}$ in Outlier Pursuit (OP) [13]. Our numerical results in Section V further show that D-RPCA(C) can be much more robust than OP, and may recover $\{{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ even when the rank of ${\mathbf{L}}$ is high and the number of outliers $s_{c}$ is a constant proportion of $m$.
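The scaling $s_{c}^{\max}={\mathcal{O}}(\tfrac{m}{r})$ can be made concrete by evaluating the expressions for $s_{c}^{\max}$ in Theorem 3 and for $C_{c}$ given before (11). The parameter values below are illustrative assumptions (a tight frame with $\alpha_{\ell}=\alpha_{u}=1$, $\mu=0$, $\beta_{{\mathbf{U}}}=1$, and $\gamma_{{\mathbf{V}}}$ at its lower bound $r/m$), not values reported in the paper.

```python
def s_c_max(alpha_l, alpha_u, gamma_V, beta_U, mu):
    """Column-sparsity bound of Theorem 3:
    s_c^max = alpha_l / (alpha_u * gamma_V) * (1 - mu)^2 / beta_U."""
    return (alpha_l / (alpha_u * gamma_V)) * (1.0 - mu) ** 2 / beta_U

def C_c(alpha_l, alpha_u, gamma_V, beta_U, mu):
    """Constant C_c = (alpha_u / alpha_l) * gamma_V * beta_U / (1 - mu)^2."""
    return (alpha_u / alpha_l) * gamma_V * beta_U / (1.0 - mu) ** 2

# Hypothetical setting: m = 1000 columns, rank r = 10, gamma_V = r / m.
m, r = 1000, 10
params = dict(alpha_l=1.0, alpha_u=1.0, gamma_V=r / m, beta_U=1.0, mu=0.0)
bound = s_c_max(**params)          # equals m / r in this idealized setting
constant = C_c(**params)           # reciprocal of the bound
```

By the two definitions, $s_{c}^{\max}=1/C_{c}$, so a larger incoherence $\mu$ or a larger ratio $\alpha_{u}/\alpha_{\ell}$ shrinks the admissible number of outlier columns.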

Remark: In essence, Theorems 1 and 3 guarantee recovery of the components as long as the incoherence parameters $\mu$, $\gamma_{{\mathbf{V}}}$, and $\gamma_{{\mathbf{U}}}$ are small. As stated in Section II-D, these parameters measure whether the low-rank component and the dictionary sparse component can be teased apart from the given data. Specifically, here $\mu$ measures how close the low-rank component is to the dictionary sparse component. Both $\gamma_{{\mathbf{U}}}$ and $\beta_{{\mathbf{U}}}$ measure how close the column space of the low-rank part ${\mathcal{U}}$ is to the dictionary ${\mathbf{D}}$, while $\gamma_{{\mathbf{V}}}$ measures whether the row space of ${\mathbf{L}}$ is sparse. These measures ensure that the components can be identified successfully. Furthermore, we see that the global sparsity in the column-wise case can be higher than in the entry-wise case.

IV Proof of Main Results

IV-A Proof of Theorem 1

We use a dual certificate construction procedure to prove the main result in Theorem 1; the proofs of all lemmata used here are given in Appendix A-A. To this end, we start by constructing a dual certificate for the convex problem shown in D-RPCA(E). Here, we first show the conditions the dual certificate needs to satisfy via the following lemma.

Lemma 4.

If there exists a dual certificate ${\mathbf{\Gamma}}\in\mathbb{R}^{n\times m}$ satisfying

[TABLE]

then the pair $({\mathbf{L}}_{0},~{}{\mathbf{S}}_{0})$ is the unique solution of D-RPCA(E).

We will now proceed with the construction of a dual certificate which satisfies the conditions (C1)-(C4) outlined in Lemma 4. Using analysis similar to [23] (Section V-B), we construct the dual certificate as

$${\mathbf{\Gamma}}={\mathbf{UV^{\top}}}+({\mathbf{I-P_{U}}}){\mathbf{X}}{\mathbf{(I-P_{V})}},$$

for arbitrary ${\mathbf{X}}\in\mathbb{R}^{n\times m}$ . The condition (C1) is readily satisfied by our choice of ${\mathbf{\Gamma}}$ . For (C2), we substitute the expression for ${\mathbf{\Gamma}}$ to arrive at

[TABLE]

Letting ${\mathbf{Z}}:={\mathbf{D^{\top}}}({\mathbf{I-P_{U}}}){\mathbf{X}}{\mathbf{(I-P_{V})}}$ and

[TABLE]

we can write (13) as ${\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{Z}})={\mathbf{B_{{\mathcal{S}}_{e}}}}$ . Further, we can vectorize the equation above as ${\mathcal{P}}_{{\mathcal{S}}_{e}}(\text{vec}({\mathbf{Z}}))=\text{vec}({\mathbf{B_{{\mathcal{S}}_{e}}}})$ . Let ${\mathbf{b}}_{{\mathcal{S}}_{e}}$ be a length $s_{e}$ vector containing elements of ${\mathbf{B}}_{{\mathcal{S}}_{e}}$ corresponding to the support of ${\mathbf{S}}_{0}$ . Now, note that $\text{vec}({\mathbf{Z}})$ can be represented in terms of a Kronecker product as follows,

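$$\text{vec}({\mathbf{Z}})=\text{vec}({\mathbf{D^{\top}}}({\mathbf{I-P_{U}}}){\mathbf{X}}{\mathbf{(I-P_{V})}})=[{\mathbf{(I-P_{V})}}\otimes{\mathbf{D^{\top}}}({\mathbf{I-P_{U}}})]\text{vec}({\mathbf{X}}).$$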

On defining ${\mathbf{A}}:={\mathbf{(I-P_{V})}}\otimes{\mathbf{D^{\top}}}({\mathbf{I-P_{U}}})\in\mathbb{R}^{md\times mn}$ , we have $\text{vec}({\mathbf{Z}})={\mathbf{A}}\text{vec}({\mathbf{X}})$ .
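The Kronecker representation rests on the identity $\text{vec}({\mathbf{A}}_{1}{\mathbf{X}}{\mathbf{B}}_{1})=({\mathbf{B}}_{1}^{\top}\otimes{\mathbf{A}}_{1})\text{vec}({\mathbf{X}})$ for column-stacking vectorization, together with the symmetry of ${\mathbf{I-P_{V}}}$. The sketch below checks it numerically; the subspaces, dictionary, and support mask are random or hypothetical and serve only as test data.

```python
import numpy as np

def vec(X):
    """Column-stacking vectorization (column-major order)."""
    return X.reshape(-1, order="F")

def build_A(D, P_U, P_V):
    """A = (I - P_V) kron D^T (I - P_U), of size (m d) x (m n)."""
    n, m = P_U.shape[0], P_V.shape[0]
    return np.kron(np.eye(m) - P_V, D.T @ (np.eye(n) - P_U))

rng = np.random.default_rng(0)
n, m, d, r = 6, 7, 4, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal basis, n x r
V, _ = np.linalg.qr(rng.standard_normal((m, r)))   # orthonormal basis, m x r
P_U, P_V = U @ U.T, V @ V.T
D = rng.standard_normal((n, d))
X = rng.standard_normal((n, m))

A = build_A(D, P_U, P_V)
Z = D.T @ (np.eye(n) - P_U) @ X @ (np.eye(m) - P_V)
identity_holds = np.allclose(vec(Z), A @ vec(X))

# Rows of A indexed by a (hypothetical) support of the d x m coefficient matrix.
mask = np.zeros((d, m), dtype=bool)
mask[0, 1] = mask[2, 5] = mask[3, 5] = True
A_S = A[vec(mask)]                                  # one row per support entry
```

Selecting rows of `A` with the vectorized support mask gives the matrix that maps $\text{vec}({\mathbf{X}})$ to the entries of ${\mathbf{Z}}$ on that support.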

Further, let ${\mathbf{A_{{\mathcal{S}}_{e}}}}\in\mathbb{R}^{s_{e}\times nm}$ denote the rows of ${\mathbf{A}}$ that correspond to the support of ${\mathbf{S}}_{0}$, and let ${\mathbf{A_{{\mathcal{S}}_{e}^{\perp}}}}$ correspond to the remaining rows of ${\mathbf{A}}$. Using these definitions and results, we have ${\mathbf{A}}_{{\mathcal{S}}_{e}}\text{vec}({\mathbf{X}})={\mathbf{b}}_{{\mathcal{S}}_{e}}$. Thus, for conditions (C1) and (C2) to be satisfied, we need

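$$\text{vec}({\mathbf{X}})={\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{e}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top})^{-1}{\mathbf{b}}_{{\mathcal{S}}_{e}}.\tag{14}$$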

Here, the following result ensures the existence of the inverse.

Lemma 5.

If $\mu<1$ and $\alpha_{\ell}>0$ , $\sigma_{\min}{({\mathbf{A}}_{{\mathcal{S}}_{e}})}$ satisfies the bound $\sigma_{\min}{({\mathbf{A}}_{{\mathcal{S}}_{e}})}\geq\sqrt{\alpha_{\ell}}(1-\mu)$ .

Now, we look at the condition (C3) $\|{\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{\Gamma}})\|<1$ . This is where our analysis departs from [23]; we write

[TABLE]

where we have used the fact that $\|({\mathbf{I-P_{U}}})\|\leq 1$ and $\|{\mathbf{(I-P_{V})}}\|\leq 1$ . Now, as ${\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{e}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top})^{-1}$ is the pseudo-inverse of ${\mathbf{A}}_{{\mathcal{S}}_{e}}$ , i.e., ${\mathbf{A}}_{{\mathcal{S}}_{e}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{e}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top})^{-1}={\mathbf{I}}$ , we have that $\|{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{e}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top})^{-1}\|=1/{\sigma_{\min}{({\mathbf{A}}_{{\mathcal{S}}_{e}})}}$ , where $\sigma_{\min}{({\mathbf{A}}_{{\mathcal{S}}_{e}})}$ is the smallest singular value of ${\mathbf{A}}_{{\mathcal{S}}_{e}}$ . Therefore, we have

[TABLE]
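The norm identity for the right pseudo-inverse used in this step can be verified on a generic full-row-rank matrix. The following sketch uses a random Gaussian matrix as a stand-in for ${\mathbf{A}}_{{\mathcal{S}}_{e}}$ (an assumption for illustration; the function name and sizes are hypothetical).

```python
import numpy as np

def right_pinv_norm_check(rows=5, cols=12, seed=0):
    """Check properties of A^T (A A^T)^{-1} for a full-row-rank A.

    Returns three booleans: (i) A x = b for x = A^T (A A^T)^{-1} b,
    (ii) the spectral norm of the pseudo-inverse equals 1 / sigma_min(A),
    (iii) ||x||_2 <= ||b||_2 / sigma_min(A).
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((rows, cols))        # full row rank with probability one
    b = rng.standard_normal(rows)
    A_pinv = A.T @ np.linalg.inv(A @ A.T)        # right pseudo-inverse, cols x rows
    x = A_pinv @ b                               # minimum-norm solution of A x = b
    sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
    solves = bool(np.allclose(A @ x, b))
    norm_ok = bool(np.isclose(np.linalg.norm(A_pinv, 2), 1.0 / sigma_min))
    bound_ok = bool(np.linalg.norm(x) <= np.linalg.norm(b) / sigma_min + 1e-12)
    return solves, norm_ok, bound_ok

checks = right_pinv_norm_check()
```

The third check is the mechanism by which a lower bound on $\sigma_{\min}({\mathbf{A}}_{{\mathcal{S}}_{e}})$, as in Lemma 5, turns a bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}$ into a bound on $\|\text{vec}({\mathbf{X}})\|_{2}$.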

The following lemma establishes an upper bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}$ .

Lemma 6.

An upper-bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}$ is given by $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}\leq\lambda_{e}\sqrt{s_{e}}+\sqrt{r\alpha_{u}}\mu$ .

Combining (15), Lemma 5, and Lemma 6, we have

[TABLE]

Now, combining (16) and the upper bound on $\lambda_{e}$ defined in (6), we have that (C3) holds. Next, we find conditions under which (C4) is satisfied by our dual certificate. For this we will bound $\|{\mathcal{P}}_{{{\mathcal{S}}_{e}}^{\perp}}({\mathbf{D^{\top}\Gamma}})\|_{\infty}$. Our analysis follows a procedure similar to that employed in deriving (16) in [23], reproduced here for completeness. First, by the definition of ${\mathbf{\Gamma}}$ and properties of the $\|\cdot\|_{\infty}$ norm, we have

[TABLE]

We now focus on simplifying the term $\|{\mathcal{P}}_{{{\mathcal{S}}_{e}}^{\perp}}({\mathbf{Z}})\|_{\infty}$ . By definition of ${\mathbf{A}}$ , and using the fact that $\text{vec}({\mathbf{Z}})={\mathbf{A}}\text{vec}({\mathbf{X}})$ , we have ${\mathcal{P}}_{{{\mathcal{S}}_{e}}^{\perp}}({\mathbf{Z}})={\mathbf{A}}_{{{\mathcal{S}}_{e}}^{\perp}}\text{vec}({\mathbf{X}})$ , which implies

[TABLE]

where we have used the result on $\text{vec}({\mathbf{X}})$ shown in (14).

Further, we can write $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{\infty}$ as

[TABLE]

Moving on, we derive an upper bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{\infty}$ .

Lemma 7.

An upper-bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{\infty}$ is given by $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{\infty}\leq\lambda_{e}+\|{\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D}}^{\top}{\mathbf{UV}}^{\top})\|_{\infty}$ .

Then, on defining ${\mathbf{Q}}:={\mathbf{A}}_{{{\mathcal{S}}_{e}}^{\perp}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{e}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top})^{-1},$ we have

[TABLE]

where we have the following bound for $\|{\mathbf{Q}}\|_{\infty,\infty}$ .

Lemma 8.

An upper-bound on $\|{\mathbf{Q}}\|_{\infty,\infty}$ is given by $\|{\mathbf{Q}}\|_{\infty,\infty}\leq C_{e}(\alpha_{u},\alpha_{\ell},\gamma_{{\mathbf{U}}},\gamma_{{\mathbf{V}}},s_{e},d,k,\mu)$ , where

[TABLE]

where $0\leq C_{e}<1$ and $c$ is defined in (7).

Combining this with (17) and Lemma 8, we have

[TABLE]

By simplifying (18), we arrive at the lower bound $\lambda_{e}^{\min}$ for $\lambda_{e}$ as in (6), from which (C4) holds. From the expressions for $\lambda^{\max}_{e}$ and $\lambda^{\min}_{e}$, we observe that we need $\lambda^{\max}_{e}>\lambda^{\min}_{e}\geq 0$ for the existence of a $\lambda_{e}$ that can recover the desired matrices. This completes the proof. ∎

Characterizing $\lambda_{e}^{\min}$ : In the previous section, we characterized the $\lambda_{e}^{\min}$ and $\lambda_{e}^{\max}$ based on the dual certificate construction procedure. For the recovery of the true pair $({\mathbf{L}},{\mathbf{S}})$ , we require $\lambda_{e}^{\max}>\lambda_{e}^{\min}\geq 0$ . Since $\xi_{e}\geq 0$ and $c\geq 0$ by definition, we need $0\leq C_{e}<1$ for $\lambda_{e}^{\min}>0$ , i.e.,

[TABLE]

Conditions for thin ${\mathbf{D}}$ : To simplify the analysis we assume, without loss of generality, that $d<m$ . Specifically, we will assume that $d\leq\tfrac{m}{\alpha r}$ , where $\alpha>1$ is a constant. With this assumption in mind, we will analyze the following cases for the global sparsity, when $s_{e}\leq d$ and $d<s_{e}\leq m$ .

Case 1: $s_{e}\leq d$ .

From (7) and (19), we have $\alpha_{\ell}(1-\mu)^{2}-2c_{t}>0$ ,

which leads to

[TABLE]

As per the GFP of D.2, we also require that ${\alpha_{u}}/{\alpha_{\ell}}\geq 1$ . Therefore we arrive at

[TABLE]

Further, since $\gamma_{{\mathbf{U}}}\geq 0$ , we require the numerator to be positive, and since the lower bound on $\gamma_{{\mathbf{V}}}\geq\tfrac{r}{m}$ , we have

[TABLE]

which also implies $s_{e}\leq m$ . Now, the condition $c_{t}\geq 0$ implies

[TABLE]

Since the R.H.S. of this inequality is upper bounded by $1$ (achieved when $\gamma_{{\mathbf{U}}}$ and $\gamma_{{\mathbf{V}}}$ are zero), this condition on $c_{t}$ is satisfied by our assumption that ${\alpha_{u}}/{\alpha_{\ell}}\geq 1$.

Case 2: $d<s_{e}\leq m$ .

Again, due to the requirement that ${\alpha_{u}}/{\alpha_{\ell}}\geq 1$ , following a similar argument as in the previous case we conclude that

[TABLE]

Conditions for fat ${\mathbf{D}}$ : To simplify the analysis, we suppose that $k<m$ . Note that in this case, we require that the coefficient matrix ${\mathbf{S}}$ has $k$ -sparse columns. Now, $c=c_{f}$ . Using similar arguments as above

[TABLE]

Characterizing $\lambda_{e}^{\max}$ : Further, the condition $\lambda_{e}^{\min}<\lambda_{e}^{\max}$ translates to a relationship between rank $r$ , and the sparsity $s_{e}$ , as shown in (10) for $s_{e}\leq s_{e}^{\max}$ .

IV-B Proof of Theorem 3

In this section we prove Theorem 3; the proofs of lemmata are provided in Appendix A-B. The Lagrangian of the nonsmooth optimization problem D-RPCA(C) is

[TABLE]

where ${\mathbf{\Lambda}}\in\mathbb{R}^{n\times m}$ is a dual variable. The subdifferentials of (20) with respect to $({\mathbf{L}},{\mathbf{S}})$ are

[TABLE]

We claim that a pair $({\mathbf{L}},{\mathbf{S}})$ is an optimal point of D-RPCA(C) if and only if the following hold by the optimality conditions:

[TABLE]

The following lemma states the optimality conditions for the optimal solution pair $({\mathbf{L}},{\mathbf{S}})$ .

Lemma 9.

*Given ${\mathbf{M}}$ and ${\mathbf{D}}$ , let $({\mathbf{L}},{\mathbf{S}})$ define the oracle model $\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ . Then any solution $({\mathbf{L}}_{0},{\mathbf{S}}_{0})\in\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ is an optimal solution pair of D-RPCA(C), if there exists a dual certificate ${\mathbf{\Gamma}}\in\mathbb{R}^{n\times m}$ that satisfies

$({\mathbf{C1}})$ ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{\Gamma}})={\mathbf{U}}{\mathbf{V}}^{\top}$ , $({\mathbf{C2}})$ ${\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{D}}^{\top}{\mathbf{\Gamma}})=\lambda_{c}{\mathbf{H}}$ , where ${\mathbf{H}}_{:,j}={\mathbf{S}}_{:,j}/\|{\mathbf{S}}_{:,j}\|_{2}$ for all $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}}$ ; ${\mathbf{0}}$ , otherwise,

$({\mathbf{C3}})$ $\|{\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{\Gamma}})\|_{2}<1$ , and $({\mathbf{C4}})$ $\|{\mathcal{P}}_{{{\mathcal{S}}_{c}}^{\perp}}({\mathbf{D}}^{\top}{\mathbf{\Gamma}})\|_{\infty,2}<\lambda_{c}$ .*

We first propose ${\mathbf{\Gamma}}$ as the dual certificate, where

[TABLE]

Hence, the condition (C1) is readily satisfied by our choice of ${\mathbf{\Gamma}}$ . Now consider the condition (C2), i.e., ${\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{D}}^{\top}{\mathbf{\Gamma}})=\lambda_{c}\tilde{{\mathbf{S}}}$ , where $\tilde{{\mathbf{S}}}_{:,j}=\tfrac{{\mathbf{S}}_{:,j}}{\|{\mathbf{S}}_{:,j}\|_{2}}$ for all $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}}$ , and ${\mathbf{0}}$ otherwise. Substituting the expression for ${\mathbf{\Gamma}}$ , we need the following condition to hold

[TABLE]

Letting ${\mathbf{Z}}:={\mathbf{D}}^{\top}\left({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{U}}}\right){{\mathbf{X}}}\left({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}}\right)$ and ${\mathbf{B}}_{{\mathcal{S}}_{c}}:=\lambda_{c}\tilde{{\mathbf{S}}}-{\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{D}}^{\top}{\mathbf{U}}{\mathbf{V}}^{\top})$ , we have ${\mathcal{P}}_{{\mathcal{S}}_{c}}\left({\mathbf{Z}}\right)={\mathbf{B}}_{{\mathcal{S}}_{c}}$ . Further, vectorizing the equation above, we have

[TABLE]

where ${\mathbf{b}}_{{\mathcal{S}}_{c}}:=\text{vec}({\mathbf{B}}_{{\mathcal{S}}_{c}})$ . Next, by letting ${\mathbf{A}}:={\mathbf{(I-P_{V})}}\otimes{\mathbf{D^{\top}}}({\mathbf{I-P_{U}}})$ , using the definition of ${\mathbf{Z}}$ and the properties of the Kronecker product we have $\text{vec}({\mathbf{Z}})={\mathbf{A}}\text{vec}({\mathbf{X}})$ . Now, let ${\mathbf{A}}_{{\mathcal{S}}_{c}}$ denote the rows of ${\mathbf{A}}$ corresponding to the non-zero rows of $\text{vec}({\mathbf{S}})$ and ${\mathbf{A}}_{{\mathcal{S}}_{c}^{\perp}}$ denote the remaining rows, then

[TABLE]

From (25) and (26), we have ${\mathbf{A}}_{{\mathcal{S}}_{c}}\text{vec}({\mathbf{X}})={\mathbf{b}}_{{\mathcal{S}}_{c}}$ . Therefore, we need the following

[TABLE]

which corresponds to the least norm solution i.e., ${\mathbf{X}}=\text{argmin}_{{\mathbf{X}}}~{}\|{{\mathbf{X}}}\|_{\rm F}$ , s.t. ${\mathbf{A}}_{{\mathcal{S}}_{c}}\text{vec}({{\mathbf{X}}})={\mathbf{b}}_{{\mathcal{S}}_{c}}$ . For this choice of ${\mathbf{X}}$ (24) is satisfied and consequently so is the condition (C2). Here, the existence of the inverse is ensured by the following.
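The vectorization step and the least-norm choice of ${\mathbf{X}}$ can be checked numerically. The following is a minimal sketch, assuming small illustrative sizes, a hypothetical support `support`, and a random stand-in `b_S` for $\text{vec}({\mathbf{B}}_{{\mathcal{S}}_{c}})$; none of these values come from the paper. It verifies the Kronecker identity $\text{vec}({\mathbf{Z}})={\mathbf{A}}\text{vec}({\mathbf{X}})$ with column-major vectorization, forms the least-norm solution, and confirms that the spectral norm of ${\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{c}}{\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top})^{-1}$ equals $1/\sigma_{\min}({\mathbf{A}}_{{\mathcal{S}}_{c}})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, r = 6, 5, 3, 1          # small illustrative sizes (assumed)
support = [3, 4]                 # hypothetical outlier column indices

# Orthonormal bases U, V and a dictionary D with unit-norm columns.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
D = rng.standard_normal((n, d))
D /= np.linalg.norm(D, axis=0)

left = D.T @ (np.eye(n) - U @ U.T)   # D^T (I - P_U), size d x n
right = np.eye(m) - V @ V.T          # I - P_V, size m x m, symmetric


def vec(mat):
    """Stack the columns of a matrix (column-major vectorization)."""
    return mat.reshape(-1, order="F")


# vec(left @ X @ right) = (right^T kron left) vec(X); right is symmetric.
A = np.kron(right, left)             # size (d*m) x (n*m)
X_test = rng.standard_normal((n, m))
kron_identity_ok = np.allclose(vec(left @ X_test @ right), A @ vec(X_test))

# Rows of A matching the entries of S that lie in the support columns.
rows = np.array([j * d + i for j in support for i in range(d)])
A_S = A[rows]
b_S = rng.standard_normal(len(rows))  # stand-in for vec(B_S)

# Least-norm solution x = A_S^T (A_S A_S^T)^{-1} b_S.
x = A_S.T @ np.linalg.solve(A_S @ A_S.T, b_S)
X = x.reshape(n, m, order="F")

sigma_min = np.linalg.svd(A_S, compute_uv=False).min()
pinv_norm = np.linalg.norm(A_S.T @ np.linalg.inv(A_S @ A_S.T), 2)
```

The explicit Kronecker matrix is only practical at toy sizes; it is used here to make the row selection visible rather than as a recommended implementation.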

Lemma 10.

If $\mu<1$ and $\alpha_{\ell}>0$ , the minimum singular value of ${\mathbf{A}}_{{\mathcal{S}}_{c}}$ is bounded away from $0$ and is given by $\sqrt{\alpha_{\ell}}(1-\mu)$ .

Upon the existence of such ${\mathbf{X}}$ as defined in (27), (C3) is satisfied if the following condition holds

[TABLE]

From (27), this condition translates to

[TABLE]

Now, since $\|{\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{c}}{\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top})^{-1}\|=1/\sigma_{\min}({\mathbf{A}}_{{\mathcal{S}}_{c}})$ (see the analogous analysis for the entry-wise case), we need

[TABLE]

Now, using Lemma 10 and the following bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{c}}\|_{2}$ ,

Lemma 11.

An upper-bound on $\|{\mathbf{b}}_{{\mathcal{S}}_{c}}\|_{2}$ is given by $\lambda_{c}\sqrt{s_{c}}+\sqrt{r\alpha_{u}}\mu$ .

we have that the condition (C3) holds if

[TABLE]

which is satisfied by our choice of $\lambda_{c}^{\max}$ (11). Now, for the condition (C4) we need the following condition to hold true:

[TABLE]

Note that, here $\|{\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{D}}^{T}{\mathbf{U}}{\mathbf{V}}^{T})\|_{\infty,2}\leq\xi_{c}$ . Further, the following result establishes an upper-bound on $\|{\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{Z}})\|_{\infty,2}$ .

Lemma 12.

An upper bound on $\|{\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{Z}})\|_{\infty,2}$ is given by $(\lambda_{c}s_{c}+\sqrt{r\alpha_{u}s_{c}}\mu)C_{c}.$

In light of this, the condition (C4) implies that,

[TABLE]

To this end, if we let $C_{c}:=\tfrac{\alpha_{u}}{\alpha_{\ell}(1-\mu)^{2}}\gamma_{{\mathbf{V}}}\beta_{{\mathbf{U}}}$ , (C4) is satisfied by $\lambda_{c}^{\min}$ defined in (11). This completes the proof. ∎

Characterizing $\lambda_{c}^{\min}$ : From (11), we need $\lambda_{c}^{\min}:=\tfrac{\xi_{c}+\sqrt{rs_{c}\alpha_{u}}\mu C_{c}}{1-s_{c}C_{c}}\geq 0$ , where $C_{c}:=\tfrac{\alpha_{u}}{\alpha_{\ell}(1-\mu)^{2}}\gamma_{{\mathbf{V}}}\beta_{{\mathbf{U}}}\geq 0$ . Then from $s_{c}C_{c}<1$ , we require $s_{c}<s_{c}^{\max}:=\tfrac{\alpha_{\ell}(1-\mu)^{2}}{\alpha_{u}\gamma_{{\mathbf{V}}}\beta_{{\mathbf{U}}}}$ .

Characterizing $\lambda_{c}^{\max}$ : Since we need $\lambda_{c}^{\min}<\lambda_{c}^{\max}$ , substituting the expressions for $\lambda_{c}^{\min}$ and $\lambda_{c}^{\max}$ , and using the fact that $s_{c}C_{c}<1$ , we arrive at (12).
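The expressions for $C_{c}$ , $s_{c}^{\max}$ and $\lambda_{c}^{\min}$ stated above are simple enough to evaluate directly. The sketch below does so; the function name and the numerical inputs in the usage line are hypothetical and chosen only for illustration.

```python
import math


def column_wise_bounds(xi_c, r, s_c, alpha_l, alpha_u, mu, gamma_V, beta_U):
    """Evaluate C_c, s_c^max and lambda_c^min from the expressions in the text."""
    C_c = alpha_u * gamma_V * beta_U / (alpha_l * (1.0 - mu) ** 2)
    # s_c^max = alpha_l (1 - mu)^2 / (alpha_u gamma_V beta_U) = 1 / C_c.
    s_c_max = math.inf if C_c == 0 else 1.0 / C_c
    if s_c * C_c >= 1.0:
        raise ValueError("need s_c < s_c^max so that 1 - s_c * C_c > 0")
    lam_min = (xi_c + math.sqrt(r * s_c * alpha_u) * mu * C_c) / (1.0 - s_c * C_c)
    return C_c, s_c_max, lam_min


# Hypothetical parameter values, for illustration only.
C_c, s_c_max, lam_min = column_wise_bounds(
    xi_c=0.1, r=2, s_c=5, alpha_l=0.9, alpha_u=1.1, mu=0.1, gamma_V=0.05, beta_U=0.5
)
```

Raising an error when $s_{c}C_{c}\geq 1$ mirrors the requirement $s_{c}<s_{c}^{\max}$ ; outside that regime the expression for $\lambda_{c}^{\min}$ is no longer meaningful.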

V Numerical Simulations on Synthetic Data

In this section, we empirically evaluate the properties of D-RPCA(E) and D-RPCA(C) via phase transition in rank and sparsity, and compare their performance to related techniques, and to the behavior predicted by Theorem 1 and Theorem 3 in (10) and (12), respectively.

V-A Entry-Wise Sparsity Case

Experimental Set-up: We employ the accelerated proximal gradient (APG) algorithm outlined in Algorithm 1 to solve the optimization problem D-RPCA(E). For these evaluations, we fix $n=m=100$ , and generate the low-rank part ${\mathbf{L}}$ by outer product of two column normalized random matrices of sizes $n\times r$ and $m\times r$ , with entries drawn from the standard normal distribution. In addition, we choose $s_{e}$ non-zero locations of the sparse component ${\mathbf{S}}$ randomly, and draw the values at these non-zero entries from the Rademacher distribution, and the dictionary ${\mathbf{D}}$ from the standard normal distribution with normalized columns. We then run $10$ Monte-Carlo trials for each pair of rank and sparsity, and for each of these, we scan across $100$ values of $\lambda_{e}$ in the range of $[\lambda_{e}^{\min},\lambda_{e}^{\max}]$ to find the best pair of $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ to compile the results. For ease of computation we run on modest values of $n$ and $m$ . In the resulting phase transition plots, the white and dark regions correspond to correct recovery and failure, respectively.
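The data generation described above can be sketched as follows. This is an illustrative reading of the set-up, not the authors' code: the function name, the default values of `d`, `r` and `s_e`, and the seed are assumptions.

```python
import numpy as np


def make_entrywise_instance(n=100, m=100, d=5, r=3, s_e=20, seed=0):
    """Generate M = L + D S with a rank-r L and an s_e-sparse Rademacher S."""
    rng = np.random.default_rng(seed)

    # Low-rank part: product of two column-normalized Gaussian factors.
    A = rng.standard_normal((n, r))
    A /= np.linalg.norm(A, axis=0)
    B = rng.standard_normal((m, r))
    B /= np.linalg.norm(B, axis=0)
    L = A @ B.T

    # Dictionary with unit-norm Gaussian columns.
    D = rng.standard_normal((n, d))
    D /= np.linalg.norm(D, axis=0)

    # Sparse coefficients: s_e random locations with +/-1 values.
    S = np.zeros((d, m))
    flat_idx = rng.choice(d * m, size=s_e, replace=False)
    rows, cols = np.unravel_index(flat_idx, (d, m))
    S[rows, cols] = rng.choice([-1.0, 1.0], size=s_e)

    return L + D @ S, L, S, D


M, L, S, D = make_entrywise_instance()
```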

Discussion: Phase transition in rank and sparsity averaged over $10$ trials for dictionaries of sizes $d=5$ (thin) and $d=150$ (fat), are shown in Fig. 2 and Fig. 3, respectively. We note from Fig. 2 that indeed the empirical relationship between rank and sparsity for the recovery of $({\mathbf{L_{0},S_{0}}})$ has the same trend as predicted by (10) in Section III for $s_{e}\leq s_{e}^{\max}$ . Here, the parameters corresponding to the predicted trend (shown in red) have been hand-tuned for best fit. In fact, as shown in Fig. 3, this trend continues for sparsity levels much greater than $s_{e}^{\max}$ . This can be potentially attributed to the worst case deterministic analysis considered here.

Further, Fig. 4 shows the results of RPCA† (in green, shows the area where at least one of the $10$ Monte-Carlo simulations succeeds) in comparison to the results obtained by D-RPCA(E) for $d=5$ and $d=50$ . We observe that D-RPCA(E) outperforms RPCA† across the board. In fact, we notice that the RPCA† technique only succeeds when $r<d$ . We believe that this is because when $d<r$ the component ${\mathbf{D}}^{\dagger}{\mathbf{L}}$ is not low-rank (full-rank in this case) w.r.t. the maximum potential rank of ${\mathbf{D}}^{\dagger}{\mathbf{L}}$ . As a result, the model assumptions of the robust PCA problem do not apply; see Section I-B. In contrast, the proposed framework of D-RPCA(E) can handle these cases effectively (see Fig. 4) since ${\mathbf{L}}$ is low-rank irrespective of the dictionary size. This highlights the applicability of our approach to cases where $d<r$ , and simultaneous recovery of the low-rank component in one-shot.

V-B Column-wise Sparsity Case

We now present phase transition in rank $r$ and number of outliers $s_{c}$ to evaluate the performance of D-RPCA(C). In particular, we compare with Outlier Pursuit (OP) [13] that solves D-RPCA(C) with ${\mathbf{D}}={\mathbf{I}}$ , and OP† to demonstrate that the a priori knowledge of the dictionary provides superior recovery properties.

Experimental Set-up:

Again, we employ a variant of the APG algorithm outlined in Algorithm 1 to solve the optimization problem D-RPCA(C). We set $n=100$ , $m=1000$ , and for each pair of $r$ and $s_{c}$ we run $10$ Monte-Carlo trials for $r\in\{5,10,15\dots,100\}$ and $s_{c}\in\{50,100,150,\dots,900\}$ . For our experiments, we form ${\mathbf{L}}=[{\mathbf{U}}{\mathbf{V}}^{\top}~{}|~{}{\mathbf{0}}_{n\times s_{c}}]\in\mathbb{R}^{n\times m}$ , where ${\mathbf{U}}\in\mathbb{R}^{n\times r}$ , ${\mathbf{V}}\in\mathbb{R}^{(m-s_{c})\times r}$ have i.i.d. ${\mathcal{N}}(0,1)$ entries, which are then normalized column-wise. Next, we generate ${\mathbf{S}}=[{\mathbf{0}}_{d\times(m-s_{c})}~{}|~{}{\mathbf{W}}]\in\mathbb{R}^{d\times m}$ where each entry of ${\mathbf{W}}\in\mathbb{R}^{d\times s_{c}}$ is i.i.d. ${\mathcal{N}}(0,1)$ . Also, the known dictionary ${\mathbf{D}}\in\mathbb{R}^{n\times d}$ is formed by normalizing the columns of a random matrix with i.i.d. ${\mathcal{N}}(0,1)$ entries. For each method, we scan through $100$ values of the regularization parameter $\lambda_{c}\in[\lambda_{c}^{\min},\lambda_{c}^{\max}]$ to find a solution pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ with the best precision, i.e. $\rm(True~{}Positives/(True~{}Positives+False~{}Positives))$ . We declare an experiment successful if it achieves a precision of $0.99$ or higher. Here, we threshold the column norms at $2\times 10^{-3}$ before we evaluate the precision.
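A minimal sketch of this set-up and of the precision criterion is given below. The function names, the default $d$ , $r$ , $s_{c}$ and the seed are illustrative assumptions; the threshold $2\times 10^{-3}$ and the definition of precision follow the text.

```python
import numpy as np


def make_columnwise_instance(n=100, m=1000, d=50, r=5, s_c=50, seed=0):
    """Generate M = L + D S where the last s_c columns of S are outliers."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, r))
    U /= np.linalg.norm(U, axis=0)
    V = rng.standard_normal((m - s_c, r))
    V /= np.linalg.norm(V, axis=0)
    L = np.hstack([U @ V.T, np.zeros((n, s_c))])      # [U V^T | 0]

    W = rng.standard_normal((d, s_c))
    S = np.hstack([np.zeros((d, m - s_c)), W])        # [0 | W]

    D = rng.standard_normal((n, d))
    D /= np.linalg.norm(D, axis=0)
    return L + D @ S, L, S, D


def outlier_precision(S_hat, true_outliers, threshold=2e-3):
    """Precision TP / (TP + FP) of columns whose norm exceeds the threshold."""
    flagged = np.linalg.norm(S_hat, axis=0) > threshold
    tp = np.count_nonzero(flagged & true_outliers)
    fp = np.count_nonzero(flagged & ~true_outliers)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```

A trial would count as successful when `outlier_precision` returns at least $0.99$ for the recovered coefficient matrix.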

Discussion: Fig. 5 (a)-(c) shows the phase transition in rank $r$ and column-sparsity $s_{c}$ for the outlier identification performance (in terms of precision) of OP for $d=50$ , D-RPCA(C) for $d=50$ (and OP† in green, marking the region where precision is greater than [math]), and D-RPCA(C) for $d=150$ , respectively. We observe that the a priori knowledge of the dictionary ${\mathbf{D}}$ significantly boosts the performance of D-RPCA(C) as compared to OP. This showcases the superior outlier identification properties of the proposed technique D-RPCA(C). Further, similar to the entry-wise case, we note that the pseudo-inverse based technique OP† (in green) fails when $r>d$ . For the $d=150$ case the proposed technique D-RPCA(C) is able to identify the outlier columns with high precision, meaning that our technique succeeds even when the outlier columns are not sparse.

VI Evaluation of Real-World Dataset: Target Localization in Hyperspectral Imaging

An HS sensor records the response of a region to different frequencies of the electromagnetic spectrum. As a result, each HS image $\mathbf{I}\in\mathbb{R}^{h\times w\times n}$ can be viewed as a data-cube formed by stacking $n$ matrices of size $h\times w$ , as shown in Fig. 6. Here, $n$ is determined by the number of channels or frequency bands across which measurements of the reflectances are made. Therefore, each volumetric element, or voxel, of an HS image is a vector of length $n$ corresponding to the response of the material across the $n$ measurement channels.

HS images (when represented as a matrix) are approximately low-rank since a particular scene is composed of only a limited type of objects/materials [54]. For instance, while imaging an agricultural area, one would expect to record responses from materials like biomass, farm vehicles, roads, houses, water bodies, and so on. Moreover, the spectra of complex materials can be assumed to be a linear mixture of the constituent materials [54, 55], i.e. the received HS responses can be viewed as being generated by a linear mixture model [43]. For the target localization task at hand, this approximate low-rank structure is used to decompose a given HS image into a low-rank part, and a component that is sparse in a known dictionary – a dictionary sparse part – wherein the dictionary is composed of the spectral signatures of the target of interest. We consider the thin dictionary setting for the rest of this discussion, since often we aim to localize targets based on a few a priori known spectral signatures, although a similar analysis applies for the fat case; see Section III and [24].

Formally, let $\mathbf{M}\in\mathbb{R}^{n\times m}$ , where $m=hw$ be formed by unfolding the HS image $\mathbf{I}$ , such that, each column of $\mathbf{M}$ corresponds to a voxel of the data-cube. We then model $\mathbf{M}$ as a superposition of a low-rank component $\mathbf{L}\in\mathbb{R}^{n\times m}$ with rank $r$ , and a dictionary-sparse component, $\mathbf{DS}$ , i.e.,

$$\mathbf{M}=\mathbf{L}+\mathbf{D}\mathbf{S}.$$

Here, $\mathbf{D}\in\mathbb{R}^{n\times d}$ represents an a priori known dictionary composed of appropriately normalized characteristic responses of the material/object (or the constituents of the material), we wish to localize, and $\mathbf{S}\in\mathbb{R}^{d\times m}$ refers to the sparse coefficient matrix (also referred to as abundances in the literature). Note that $\mathbf{D}$ can also be constructed by learning a dictionary based on the known spectral signatures of a target; see [56, 57, 58, 59, 60].
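The unfolding of the data-cube into the matrix $\mathbf{M}$ can be sketched as below. The row-major ordering of voxels (pixel $(i,j)$ mapped to column $iw+j$ ) is an assumption made for illustration; any fixed ordering works as long as the same one is used to fold per-voxel scores back into an image.

```python
import numpy as np


def unfold_cube(cube):
    """Map an h x w x n data-cube to an n x (h*w) matrix, one voxel per column."""
    h, w, n = cube.shape
    return cube.reshape(h * w, n).T


def fold_scores(scores, h, w):
    """Map a length-(h*w) vector of per-voxel scores back to an h x w image."""
    return np.asarray(scores).reshape(h, w)


# Tiny illustrative cube (values are arbitrary).
cube = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
M = unfold_cube(cube)
```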

We now discuss the implementation specifics corresponding to the target localization task. We begin by presenting the algorithm used to solve the optimization problems D-RPCA(E) and D-RPCA(C), before discussing the experimental details.

VI-A Algorithmic Considerations

The optimization problems of interest, D-RPCA(E) and D-RPCA(C), for the entry-wise and column-wise case, respectively, are convex but non-smooth. To solve for the components of interest, we adopt the accelerated proximal gradient (APG) algorithm, as shown in Algorithm 1. Here we present a unified APG-based algorithm that covers both sparsity cases and both dictionary cases, which includes the case considered by [23].

VI-A1 Discussion of Algorithm 1

For the optimization problem of interest, we solve an unconstrained problem by transforming the equality constraint to a least-square term which penalizes the fit. In particular, we will accomplish the demixing task by solving the following via the APG-based Algorithm 1.

[TABLE]

for the entry-wise sparsity case, and

[TABLE]

for the column-wise sparsity case.
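To make the structure of such a solver concrete, the following is a minimal sketch of a plain proximal gradient iteration for the entry-wise case. It is not Algorithm 1: it omits the acceleration and any continuation on $\nu$ , and it assumes the penalized objective has the form $\nu\|{\mathbf{L}}\|_{*}+\nu\lambda_{e}\|{\mathbf{S}}\|_{1}+\tfrac{1}{2}\|{\mathbf{M}}-{\mathbf{L}}-{\mathbf{D}}{\mathbf{S}}\|_{\rm F}^{2}$ , with step size $1/(1+\|{\mathbf{D}}\|^{2})$ . All function names are hypothetical.

```python
import numpy as np


def soft(Y, tau):
    """Entry-wise soft thresholding."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)


def svt(Y, tau):
    """Singular value thresholding."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def objective(M, D, L, S, nu, lam):
    """Assumed penalized objective for the entry-wise case."""
    fit = 0.5 * np.linalg.norm(M - L - D @ S, "fro") ** 2
    return fit + nu * np.linalg.norm(L, "nuc") + nu * lam * np.abs(S).sum()


def prox_grad_entrywise(M, D, nu, lam, iters=200):
    """Plain (non-accelerated) proximal gradient; returns L, S and the objective history."""
    L = np.zeros_like(M)
    S = np.zeros((D.shape[1], M.shape[1]))
    step = 1.0 / (1.0 + np.linalg.norm(D, 2) ** 2)   # 1 / Lipschitz constant
    history = []
    for _ in range(iters):
        R = L + D @ S - M                  # gradient w.r.t. L is R, w.r.t. S is D^T R
        L_next = svt(L - step * R, step * nu)
        S_next = soft(S - step * (D.T @ R), step * nu * lam)
        L, S = L_next, S_next
        history.append(objective(M, D, L, S, nu, lam))
    return L, S, history
```

For the column-wise case, the same loop would apply with the entry-wise thresholding of ${\mathbf{S}}$ replaced by a column-norm thresholding.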

We note that although for the HS application at hand, the thin dictionary case with ( $n\geq d$ ) might be more useful in practice, Algorithm 1 allows for the use of fat dictionaries ( $n<d$ ) as well. Specifically, the APG algorithm requires that the gradient of the smooth part,

[TABLE]

of the convex objectives shown in (29) and (30) is Lipschitz continuous with minimum Lipschitz constant $L_{f}$ . Now, since the gradient $\nabla f({\mathbf{L}},{\mathbf{S}})$ with respect to $\begin{bmatrix}{\mathbf{L}}&{\mathbf{S}}\end{bmatrix}^{\top}$ is given by

[TABLE]

we have that the gradient $\nabla f$ is Lipschitz continuous as

[TABLE]

for all $({\mathbf{L}}_{1},{\mathbf{S}}_{1}),({\mathbf{L}}_{2},{\mathbf{S}}_{2})$ in the domain of $f$ , where

[TABLE]
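The Lipschitz property can be checked numerically. The sketch below assumes the smooth part is $f({\mathbf{L}},{\mathbf{S}})=\tfrac{1}{2}\|{\mathbf{M}}-{\mathbf{L}}-{\mathbf{D}}{\mathbf{S}}\|_{\rm F}^{2}$ ; under that assumption the gradient is linear in $({\mathbf{L}},{\mathbf{S}})$ and its smallest Lipschitz constant is the squared spectral norm of $[{\mathbf{I}}~{\mathbf{D}}]$ , which equals $1+\|{\mathbf{D}}\|^{2}$ . This constant is derived from the assumed form of $f$ and may be written differently in the original expression.

```python
import numpy as np


def grad_f(M, D, L, S):
    """Gradient of 0.5 * ||M - L - D S||_F^2 with respect to (L, S)."""
    R = L + D @ S - M
    return R, D.T @ R


def lipschitz_constant(D):
    """Squared spectral norm of [I D], i.e. 1 + ||D||_2^2."""
    return 1.0 + np.linalg.norm(D, 2) ** 2


def lipschitz_ratio(M, D, pair1, pair2):
    """||grad f(x1) - grad f(x2)||_F / ||x1 - x2||_F for two points x = (L, S)."""
    g1, g2 = grad_f(M, D, *pair1), grad_f(M, D, *pair2)
    num = np.sqrt(sum(np.linalg.norm(a - b, "fro") ** 2 for a, b in zip(g1, g2)))
    den = np.sqrt(sum(np.linalg.norm(a - b, "fro") ** 2 for a, b in zip(pair1, pair2)))
    return num / den
```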

The update of the low-rank component and the sparse matrix ${\mathbf{S}}$ for the entry-wise case both involve a soft thresholding step, $\mathcal{S}_{\tau}(.)$ , where for a matrix ${\mathbf{Y}}$ , $\mathcal{S}_{\tau}({\mathbf{Y}}_{ij})$ is defined as

$$\mathcal{S}_{\tau}({\mathbf{Y}}_{ij})=\text{sgn}({\mathbf{Y}}_{ij})\max(|{\mathbf{Y}}_{ij}|-\tau,~0).$$

In case of the low-rank part we apply this function to the singular values (therefore referred to as singular value thresholding) [61], while for the update of the dictionary sparse component, we apply it to the sparse coefficient matrix ${\mathbf{S}}$ .
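Both thresholding operations can be sketched in a few lines. The function names are illustrative; the operations themselves are the standard entry-wise soft thresholding and singular value thresholding.

```python
import numpy as np


def soft_threshold(Y, tau):
    """Shrink every entry of Y toward zero by tau; entries with |Y_ij| <= tau become 0."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)


def singular_value_threshold(Y, tau):
    """Apply soft thresholding to the singular values of Y and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt
```

Since singular values are non-negative, thresholding them reduces the rank whenever some of them fall below $\tau$ , which is what drives the low-rank update.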

The low-rank update step remains the same as for the entry-wise case. For the update of the column-wise case, we threshold the columns of ${\mathbf{S}}$ based on their column norms, i.e., for a column ${\mathbf{Y}}_{j}$ of a matrix ${\mathbf{Y}}$ , the column-norm based soft-thresholding function, $\mathcal{C}_{\tau}(.)$ is defined as

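$$\mathcal{C}_{\tau}({\mathbf{Y}}_{j})={\mathbf{Y}}_{j}\max\left(1-\tfrac{\tau}{\|{\mathbf{Y}}_{j}\|_{2}},~0\right).$$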
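A minimal sketch of the column-norm based thresholding follows, written as the standard proximal operator of the sum of column $\ell_{2}$ norms: columns with norm at most $\tau$ are set to zero and the remaining columns are shrunk by $\tau$ in norm. The function name is illustrative, and $\tau\geq 0$ is assumed.

```python
import numpy as np


def column_soft_threshold(Y, tau):
    """Zero out columns with norm <= tau; shrink the norm of the others by tau."""
    norms = np.linalg.norm(Y, axis=0)
    scale = np.zeros_like(norms)
    keep = norms > tau
    scale[keep] = 1.0 - tau / norms[keep]
    return Y * scale          # broadcasts one scale factor per column
```

Unlike the entry-wise operator, a surviving column keeps its direction, so individual columns of ${\mathbf{S}}$ are not forced to be sparse.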

VI-A2 Parameter Selection

We adopt a grid-search strategy over the range of admissible values to find the best values of the regularization parameters.

Selecting parameters for the entry-wise case: The choice of parameters $\nu$ and $\lambda_{e}$ in Algorithm 1 is based on the optimality conditions of the optimization problem shown in (29). As presented in [23], the range of parameters $\nu$ and $\nu\lambda_{e}$ associated with the low-rank part ${\mathbf{L}}$ and the sparse coefficient matrix ${\mathbf{S}}$ , respectively, lie in $\nu\in\{0,\|{\mathbf{M}}\|\}$ and $\nu\lambda_{e}\in\{0,\|{\mathbf{D}}^{\top}{\mathbf{M}}\|_{\infty}\}$ , i.e., for Algorithm 1 $\nu_{0}=\|{\mathbf{M}}\|$ .

These ranges are derived using the optimization problem shown in (29). Specifically, we find the largest values of these regularization parameters which yield a $({\mathbf{0}},{\mathbf{0}})$ solution for the pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ by analyzing the optimality conditions of (29). This value of the regularization parameter then defines the upper bound on the range. For instance, the optimality condition for $\lambda_{*}:=\nu$ and $\lambda_{1}:=\nu\lambda_{e}$ , is given by

[TABLE]

where the sub-differential set $\partial_{\mathbf{L}}\|{\mathbf{L}}\|_{*}$ is defined as

[TABLE]

Therefore, for a zero solution pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ we have that

[TABLE]

which yields the condition that $\|{\mathbf{M}}\|\leq\lambda_{*}$ . Therefore, the maximum value of $\lambda_{*}$ which drives the low-rank part to an all-zero solution is $\|{\mathbf{M}}\|$ . Similarly, the optimality condition for the dictionary sparse component to choose $\lambda_{1}$ is given by

[TABLE]

where the sub-differential set $\partial_{\mathbf{S}}\|{\mathbf{S}}\|_{1}$ is defined as

[TABLE]

Again, for a zero solution pair $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ we need that

[TABLE]

which implies that $\|{\mathbf{D}}^{\top}{\mathbf{M}}\|_{\infty}\leq\lambda_{1}$ , i.e. the maximum value of $\lambda_{1}$ that drives the dictionary sparse part to zero is $\|{\mathbf{D}}^{\top}{\mathbf{M}}\|_{\infty}$ .

Selecting parameters for the column-wise case: Again, the choice of parameters $\nu$ and $\lambda_{c}$ is derived from the optimization problem shown in (30). In this case, the range of parameters $\nu$ and $\nu\lambda_{c}$ associated with the low-rank part ${\mathbf{L}}$ and the sparse coefficient matrix ${\mathbf{S}}$ , respectively, lie in $\nu\in\{0,\|{\mathbf{M}}\|\}$ and $\nu\lambda_{c}\in\{0,\|{\mathbf{D}}^{\top}{\mathbf{M}}\|_{\infty,2}\}$ , i.e., for Algorithm 1 $\nu_{0}=\|{\mathbf{M}}\|$ . The ranges of the regularization parameters are derived using an analysis similar to that of the entry-wise case, with the optimality conditions for (30) used instead of those for (29).
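The upper ends of these ranges are inexpensive to compute. The sketch below evaluates them and checks the zero-solution conditions derived above, reading $\|\cdot\|$ as the spectral norm, $\|\cdot\|_{\infty}$ as the largest absolute entry, and $\|\cdot\|_{\infty,2}$ as the largest column $\ell_{2}$ norm; the function names are hypothetical.

```python
import numpy as np


def upper_bounds(M, D):
    """Return (||M||, ||D^T M||_inf, ||D^T M||_{inf,2})."""
    G = D.T @ M
    return np.linalg.norm(M, 2), np.abs(G).max(), np.linalg.norm(G, axis=0).max()


def zero_pair_is_optimal(M, D, nu, nu_lam, column_wise=False):
    """Check the optimality conditions of (L, S) = (0, 0).

    The zero pair is optimal when ||M|| <= nu and the relevant norm of
    D^T M (entry-wise max, or max column norm) is at most nu * lambda.
    """
    G = D.T @ M
    dual = np.linalg.norm(G, axis=0).max() if column_wise else np.abs(G).max()
    return bool(np.linalg.norm(M, 2) <= nu and dual <= nu_lam)
```

A grid search would then scan values below these bounds, since any larger value returns the trivial all-zero solution.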

VI-B Experimental Evaluation

We now evaluate the performance of the proposed technique on real-world HS data. We begin by introducing the datasets (available via http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes) used for these simulations, following which we describe the experimental set-up and present the results.

Data

Indian Pines Dataset: We first consider the “Indian Pines” dataset [51], which was collected over the Indian Pines test site in North-western Indiana in June 1992 using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) [63] sensor, a popular choice for collecting HS images for various remote sensing applications. This dataset consists of spectral reflectances across $224$ bands in the wavelength range $400-2500$ nm from a scene which is composed mostly of agricultural land along with two major dual lane highways, a rail line and some built structures, as shown in Fig. 7(a). The dataset is further processed by removing the bands corresponding to water absorption, which results in an HS data-cube with dimensions $\{145\times 145\times 200\}$ , as visualized in Fig. 6. Here, $n=200$ , $h=w=145$ , and therefore $m=hw=145\times 145$ . This modified dataset is available as the “corrected Indian Pines” dataset [51], with the ground-truth containing $16$ classes; it is henceforth referred to as the “Indian Pines Dataset”. We form the data matrix ${\mathbf{M}}\in\mathbb{R}^{n\times m}$ by stacking each voxel of the image side-by-side, which results in a $\{200\times 145^{2}\}$ data matrix ${\mathbf{M}}$ . We will analyze the performance of the proposed technique for the identification of the stone-steel towers (class $16$ in the dataset), shown in Fig. 7(a), constituting $93$ voxels.

Pavia University Dataset: Acquired using Reflective Optics System Imaging Spectrometer (ROSIS) sensor, the Pavia University Dataset [62] consists of spectral reflectances across $103$ bands (in the range $430-860$ nm) of an urban landscape over northern Italy. The selected subset of the scene, a $\{201\times 131\times 103\}$ data-cube, mainly consists of buildings, roads, painted metal sheets and trees, as shown in Fig. 7(b). Note that class- $3$ corresponding to “Gravel” is not present in the selected data-cube considered here. For our demixing task, we will analyze the localization of target class $5$ , corresponding to the painted metal sheets, which constitutes $707$ voxels in the scene. Note that for this dataset $h=201$ , $w=131$ , $m=hw=201\times 131$ and $n=103$ .

Further, in Fig. 8 we show the decay of singular values of the Indian Pines and the Pavia University datasets. We note that indeed the presence of a limited number of materials makes these datasets approximately low-rank.

Dictionary: We form the known dictionary ${\mathbf{D}}$ two ways: 1) where a (thin) dictionary is learned based on the voxels by solving (31), and 2) when the dictionary is formed by randomly sampling voxels from the target class. This is to emulate the ways in which we can arrive at the dictionary corresponding to a target – 1) where the exact signatures are not available, and/or there is noise, and 2) where we have access to the exact signatures of the target, respectively.

In our experiments for case 1), we learn a dictionary using the target class data ${\mathbf{Y}}\in\mathbb{R}^{n\times p}$ by alternating between updating the sparse coefficients via FISTA [64] and dictionary via the Newton method [65], approximately solving the following optimization problem [56, 57, 58, 59].

[TABLE]

For case 2), the columns of the dictionary are set as the known data voxels of the target class. Specifically, instead of learning a dictionary based on a target class of interest, we set it as the exact signatures observed previously. Note that for this case, the dictionary is not normalized at this stage since the specific normalization depends on the particular demixing problem of interest, discussed shortly. In practice, we can store the un-normalized dictionary ${\mathbf{D}}$ (formed from the voxels), consisting of actual signatures of the target material, and can normalize it after the HS image has been acquired.

Experimental Setup

Normalization: For normalizing the data, we divide each element of the data matrix ${\mathbf{M}}$ by $\|{\mathbf{M}}\|_{\infty}$ to preserve the inter-voxel scaling. For the dictionary, in the learned dictionary case, i.e., case 1), the dictionary already has unit-norm columns. Further, when the dictionary is formed from the data directly, i.e., for case 2), we divide each element of ${\mathbf{D}}$ by $\|{\mathbf{M}}\|_{\infty}$ , and then normalize the columns of ${\mathbf{D}}$ , such that they are unit-norm.
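The normalization steps can be sketched as follows, reading $\|{\mathbf{M}}\|_{\infty}$ as the largest absolute entry of ${\mathbf{M}}$ . The function names are illustrative.

```python
import numpy as np


def normalize_data(M):
    """Divide every entry by the largest absolute entry, preserving relative voxel scales."""
    return M / np.abs(M).max()


def normalize_voxel_dictionary(D_raw, M):
    """Case 2): scale raw target voxels like the data, then make the columns unit-norm.

    The global scalar cancels once the columns are normalized; the first
    step is kept only to mirror the order described in the text.
    """
    D = D_raw / np.abs(M).max()
    return D / np.linalg.norm(D, axis=0)
```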

Dictionary selection for the Indian Pines Dataset: For the learned dictionary case, we evaluate the performance of the aforementioned techniques for both entry-wise and column-wise settings for two dictionary sizes, $d=4$ and $d=10$ , for three values of the regularization parameter $\rho$ , used for the initial dictionary learning step, i.e., $\rho=0.01,~{}0.1$ and $0.5$ . Here, the parameter $\rho$ controls the sparsity during the initial dictionary learning step (31). For the case when dictionary is selected from the voxels directly, we randomly select $15$ voxels from the target class- $16$ to form our dictionary.

Dictionary selection for the Pavia University Dataset: Here, for the learned dictionary case, we evaluate the performance of the aforementioned techniques for both entry-wise and column-wise settings for a dictionary of size $d=30$ for three values of the regularization parameter $\rho$ , used for the initial dictionary learning step, i.e., $\rho=0.01,~{}0.1$ and $0.5$ . Further, we randomly select $60$ voxels from the target class- $5$ , when the dictionary is formed from the data voxels.

Comparison with matched filtering (MF)-based approaches: In addition to the robust PCA-based and OP-based techniques introduced in Section I-B, we also compare the performance of our techniques with two MF-based approaches. These MF-based techniques are agnostic to our model assumptions, i.e., entry-wise or column-wise sparsity cases. Therefore, the following description applies to both sparsity cases.

For the first MF-based technique, referred to as MF, we form the inner-product of the column-normalized data matrix ${\mathbf{M}}$ , denoted as ${\mathbf{M}}_{n}$ , with the dictionary ${\mathbf{D}}$ , i.e., ${\mathbf{D^{\top}M}}_{n}$ , and select the maximum absolute inner-product per column. For the second MF-based technique, MF†, we perform matched filtering on the pseudo-inversed data ${\mathbf{\widetilde{M}=D^{\dagger}M}}$ . Here, the matched filtering corresponds to finding maximum absolute entry for each column of the column-normalized ${\mathbf{\widetilde{{\mathbf{M}}}}}$ . Next, in both cases we scan through $1000$ threshold values between $(0,1]$ to generate the results.

Performance Metrics: We evaluate the performance of these techniques via the receiver operating characteristic (ROC) plots. ROC plots are a staple for classification performance analysis of a binary classifier in machine learning; see also [66]. Specifically, an ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), where a higher TPR (close to $1$ ) and a lower FPR (close to $0$ ) indicate that the classifier detects all the elements in the class while rejecting those outside the class.

A natural metric to gauge good performance is the area under the curve (AUC) metric. It indicates the area under the ROC curve, which is maximized when TPR $=1$ and FPR $=0$ , therefore, a higher AUC is preferred. Here, an AUC of $0.5$ indicates that the performance of the classifier is roughly as good as a coin flip on average. As a result, if a classifier has an AUC $<0.5$ , one can improve the performance by simply inverting the result of the classifier. This effectively means that AUC is evaluated after “flipping” the ROC curve. In other words, this means that the classifier is good at rejecting the class of interest, and taking the complement of the classifier decision can be used to identify the class of interest.
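The AUC and the effect of “flipping” can be illustrated with a short sketch. It uses the rank-based (Mann-Whitney) form of the AUC, which equals the area under the ROC curve, rather than any particular routine used in the experiments; the function names are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_auc(scores, labels):
    """Return (AUC after optional inversion, whether the classifier output was inverted)."""
    auc = roc_auc(scores, labels)
    flipped = auc < 0.5
    return (1.0 - auc if flipped else auc), flipped
```

Inverting the classifier output maps an AUC of $a$ to $1-a$ , so a classifier that consistently rejects the target class becomes a useful detector once flipped.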

In our experiments, MF-based techniques often exhibit this phenomenon. Specifically, when the dictionary contains element(s) which resemble the average behavior of the spectral signatures, the inner-product between the normalized data columns and these dictionary elements may be higher as compared to other distinguishing dictionary elements. Since MF-based techniques rely on the maximum inner-product between the normalized data columns and the dictionary, and since the spectral signatures of even distinct classes are highly correlated (see, for instance, Fig. 1), MF-based approaches can in these cases effectively reject the class of interest. This leads to an AUC $<0.5$ . Therefore, as discussed above, we invert the result of the classifier (indicated as $(\cdot)_{*}$ in the tables) to report the best performance. If using MF-based techniques, this issue can potentially be resolved in practice by removing the dictionary elements which tend to resemble the average behavior of the spectral signatures.

Parameter Setup for the Algorithms

Entry-wise sparsity case: We evaluate and compare the performance of the proposed method D-RPCA(E) with RPCA† (described in Section I-B), MF, and MF†. Specifically, we evaluate the performance of these techniques via the receiver operating characteristic (ROC) plot for the Indian Pines dataset and the Pavia University dataset, with the results shown in Table I(a)-(d) and Table III(a)-(c), respectively.

For the proposed technique, we employ the accelerated proximal gradient (APG) algorithm shown in Algorithm 1 and discussed in Section VI-A to solve the optimization problem shown in D-RPCA(E). Similarly, for RPCA† we employ the APG algorithm with transformed data matrix $\widetilde{{\mathbf{M}}}$ , while setting ${\mathbf{D=I}}$ .

With reference to selection of tuning parameters for the APG solver for (D-RPCA(E)) (RPCA†, respectively), we choose $v=0.95$ , $\nu=\|\mathbf{M}\|$ ( $\nu=\|{\mathbf{\widetilde{M}}}\|$ ), $\bar{\nu}=10^{-4}$ , and scan through $100$ values of $\lambda_{e}$ in the range $\lambda_{e}\in(0,{\|{\mathbf{D^{\top}M}}\|_{\infty}}/{\|{\mathbf{M}}\|}]$ ( $\lambda_{e}\in(0,{\|{\mathbf{\widetilde{M}}}{\|_{\infty}}/{\|{\mathbf{\widetilde{M}}}\|}}]$ ), to generate the ROCs. We threshold the resulting estimate of the sparse part ${\mathbf{S}}\in\mathbb{R}^{d\times m}$ based on its column norm. We choose the threshold such that the AUC metric is maximized for both cases (D-RPCA(E) and RPCA†).
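The scan over $\lambda_{e}$ and the column-norm thresholding of the estimate can be sketched as follows. Even spacing of the grid is an assumption, since the spacing is not specified above, and the function names are illustrative.

```python
import numpy as np


def lambda_e_grid(M, D, num=100):
    """num values in (0, ||D^T M||_inf / ||M||], evenly spaced (assumed spacing)."""
    upper = np.abs(D.T @ M).max() / np.linalg.norm(M, 2)
    return np.linspace(0.0, upper, num + 1)[1:]   # drops 0, keeps the upper end


def flag_target_columns(S_hat, threshold):
    """Mark voxels whose coefficient column has l2 norm above the threshold."""
    return np.linalg.norm(S_hat, axis=0) > threshold
```

For the pseudo-inverse baseline the same grid would be built with `M` replaced by the transformed data and `D` by the identity.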

Column-wise sparsity case: For this case, we evaluate and compare the performance of the proposed method D-RPCA(C) with OP† (as described in Section I-B), MF, and MF†. The results for the Indian Pines dataset and the Pavia University dataset are shown in Table II(a)-(d) and Table IV(a)-(c), respectively. As in the entry-wise sparsity case, we employ the accelerated proximal gradient (APG) algorithm presented in Algorithm 1 to solve the optimization problem shown in D-RPCA(C). Similarly, for OP† we employ the APG with transformed data matrix $\widetilde{{\mathbf{M}}}$ , while setting ${\mathbf{D=I}}$ . For the tuning parameters for the APG solver for (D-RPCA(C)) (OP†, respectively), we choose $v=0.95$ , $\nu=\|\mathbf{M}\|$ ( $\nu=\|{\mathbf{\widetilde{M}}}\|$ ), $\bar{\nu}=10^{-4}$ , and scan through $100$ values of $\lambda_{c}$ in the range $\lambda_{c}\in(0,{\|{\mathbf{D^{\top}M}}\|_{\infty,2}}/{\|{\mathbf{M}}\|}]$ ( $\lambda_{c}\in(0,{\|{\mathbf{\widetilde{M}}}{\|_{\infty,2}}/{\|{\mathbf{\widetilde{M}}}\|}}]$ ), to generate the ROCs. We threshold the resulting estimate of the sparse part ${\mathbf{S}}\in\mathbb{R}^{d\times m}$ based on its column norm.

Analysis: Tables I and III and Tables II and IV show the ROC characteristics and the classification performance of the proposed techniques D-RPCA(E) and D-RPCA(C), respectively, for the two datasets under consideration, under various choices of the dictionary ${\mathbf{D}}$ and regularization parameter $\rho$ for (31). We note that both proposed techniques D-RPCA(E) and D-RPCA(C) on average outperform the competing techniques, emerging as the most reliable techniques across different dictionary choices; see Tables I(d), III(c), II(d), and IV(c).

Further, the performance of D-RPCA(C) is slightly better than D-RPCA(E). This can be attributed to the fact that the column-wise sparsity model does not require the columns of ${\mathbf{S}}$ to be sparse themselves. As alluded to in Section I-B, this allows for higher flexibility in the choice of the dictionary elements for the thin dictionary case.

In addition, we see that the matched filtering-based techniques (and even the OP†-based technique for $d=4$ and $\rho=0.1$ in Table II) exhibit a “flip,” or inversion, of the ROC curve. As described in Section VI-B, this phenomenon indicates that a classifier is better at rejecting the target class. In the case of the MF-based techniques, this is a result of a dictionary that contains an element resembling the average behavior of the spectral responses. A similar phenomenon is at play in the case of OP† for $d=4$ and $\rho=0.1$ in Table II. Specifically, here the inversion indicates that the dictionary is capable of representing the columns of the data ${\mathbf{M}}$ effectively, which leads to an increase in the corresponding column norms in their representation $\widehat{{\mathbf{M}}}$ . Coupled with the fact that the component ${\mathbf{L}}$ is no longer low-rank for this thin dictionary case (see our discussion in Section I-B), this results in rejection of the target class. On the other hand, our techniques D-RPCA(E) and D-RPCA(C) do not suffer from this issue. Moreover, note that across all the experiments, the thresholds for RPCA† and OP† are higher than those of their D-RPCA counterparts. This can also be attributed to the pre-multiplication by the pseudo-inverse of the dictionary ${\mathbf{D}}^{\dagger}$ , which increases column norms based on the leading singular values of ${\mathbf{D}}$ . Therefore, using D-RPCA(E) when the target spectral response admits a sparse representation, and D-RPCA(C) otherwise, yields consistent and superior results compared to related techniques.
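To make the “flip” concrete: when the scores rank background above target, the AUC falls below $0.5$, and inverting the classifier output yields $1-\text{AUC}$. The sketch below is illustrative only; the function name is hypothetical and the rank-based AUC formula is a standard choice rather than something specified in the text.

```python
import numpy as np

def auc_with_flip(scores, labels):
    """Return (AUC after optional inversion, whether the ROC was flipped)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    auc = (wins + 0.5 * ties) / (pos.size * neg.size)
    # Negating the scores swaps wins and losses, so the flipped AUC is 1 - auc.
    flipped = bool(auc < 0.5)
    return (1.0 - auc if flipped else auc), flipped
```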

There are other interesting recovery results which warrant our attention. Fig. 9 shows the low-rank and the dictionary sparse component recovered by D-RPCA(E) for two different values of $\lambda_{e}$ , for the case where we form the dictionary by randomly sampling the voxels (Table I(c)) for the Indian Pines Dataset [51]. Interestingly, we recover the rail tracks/roads running diagonally on the top-right corner, along with some low-density housing; see Fig. 9(f). This is because the signatures we seek (stone-steel) are similar to the signatures of the materials used in these structures. This further corroborates the applicability of the proposed approach in detecting the presence of particular spectral signatures as long as they are appropriately distinct.

VII Discussion

We analyze a dictionary-based generalization of Robust PCA, and use it for target localization in a hyperspectral (HS) image from the a priori known spectral signature of the material of interest. Here, we consider a case where the acquired data can be modeled as a superposition of a low-rank component and a dictionary sparse component, and analyze this model under two distinct sparsity modalities – entry-wise and column-wise, respectively for both thin and fat dictionary cases.

Our analysis shows that, contrary to existing intuition, in the thin dictionary case premultiplication by the pseudo-inverse of the dictionary may not reduce the problem to that of Robust PCA. To this end, we theoretically analyze the thin dictionary case, extend the analysis of the fat dictionary case, and also analyze the column-wise sparsity case. As a result, our results are, to the best of our knowledge, the most general for this model and facilitate its use in practical settings. Here, we consider a worst-case analysis in the deterministic setting; analysis of this model with additional randomness assumptions on the constituent factors is left to future work. Additionally, recent results on non-convex low-rank matrix estimation formulations [67, 68] may potentially lead to computationally efficient algorithms by replacing the expensive SVD step.

In this work, we also leverage our theoretical results for a target localization task in hyperspectral imaging to demonstrate the applicability of the proposed approach on real-world demixing tasks. Here, we show how the entry-wise and column-wise sparsity modalities can be used to detect targets depending on the dictionary structure. Future work on this thread will aim to further exploit local similarities (potentially by group sparsity constraints) in HS images to improve localization.

Overall, our algorithm-agnostic theoretical guarantees, together with the analysis of the corresponding application to target detection in HS images using the proposed dictionary-based generalization of Robust PCA, open up future theory-backed explorations of the model in various target detection applications.

Appendix A Proofs of Intermediate results

A-A Proofs for Entry-wise Case

We present the details of the proofs in this section for the entry-wise case. We first start by deriving the optimality conditions.

Proof of Lemma 4.

Let $\{{\mathbf{L}}_{0},{\mathbf{S}}_{0}\}$ be a solution of the problem posed above. Notice that this pair is not necessarily unique. For example, as shown in proof of Lemma 2 in [23], $\{{\mathbf{L}}_{0}+{\mathbf{DH}},{\mathbf{S}}_{0}-{\mathbf{H}}\}$ , with arbitrary ${\mathbf{H}}$ , is another feasible solution of the problem satisfying the optimality conditions (derived in this section).

We begin by writing the Lagrangian, ${\mathcal{F}}({\mathbf{L}},{\mathbf{S}},{\mathbf{\Lambda}})$ , for the given problem as follows.

[TABLE]

where ${\mathbf{\Lambda}}\in\mathbb{R}^{n\times m}$ are the Lagrange multipliers.

Let the singular value decomposition (SVD) of ${\mathbf{L}}_{0}$ be represented as ${\mathbf{U\Sigma V^{\top}}}$ . Then the sub-differential set of $\|{\mathbf{L}}\|_{*}$ can be represented as

[TABLE]

as shown in [69]. Also, the subdifferential set corresponding to $\|{\mathbf{S}}\|_{1}$ is given by

[TABLE]
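For readers working from this text alone, the two subdifferential sets referred to above have the following standard forms. This is a sketch in the notation of this proof, assuming ${\mathcal{P}}_{{\mathcal{L}}}$ and ${\mathcal{P}}_{{\mathcal{S}}_{e}}$ denote the projections onto the subspace associated with ${\mathbf{L}}_{0}$ and onto the support of ${\mathbf{S}}_{0}$, as used in the remainder of the proof; the original displays may differ in presentation.

```latex
\partial\|\mathbf{L}\|_{*}\big|_{\mathbf{L}_{0}}
  = \big\{\mathbf{U}\mathbf{V}^{\top} + \mathbf{W} \;:\;
      \|\mathbf{W}\| \le 1,\;
      \mathcal{P}_{\mathcal{L}}(\mathbf{W}) = \mathbf{0}\big\},
\qquad
\partial\|\mathbf{S}\|_{1}\big|_{\mathbf{S}_{0}}
  = \big\{\mathrm{sign}(\mathbf{S}_{0}) + \mathbf{F} \;:\;
      \|\mathbf{F}\|_{\infty} \le 1,\;
      \mathcal{P}_{\mathcal{S}_{e}}(\mathbf{F}) = \mathbf{0}\big\}.
```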

Using these results, we write the sub-differential of the Lagrangian with respect to ${\mathbf{L}}$ and ${\mathbf{S}}$ at $\{{\mathbf{L}}_{0},{\mathbf{S}}_{0}\}$ as

[TABLE]

Then optimality conditions are

[TABLE]

which implies that the dual solution ${\mathbf{\Lambda}}$ must obey the following,

[TABLE]

Our aim here is to find the conditions on ${\mathbf{W}}$ and ${\mathbf{F}}$ such that the pair $\{{\mathbf{L}}_{0},~{}{\mathbf{S}}_{0}\}$ is a unique solution to the problem at hand.

Using these conditions, we see that ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{\Lambda}})={\mathbf{UV^{\top}}}$ and ${\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}\Lambda}})=\lambda_{e}\text{sign}({\mathbf{S}}_{0})$ ; these correspond to conditions (C1) and (C2), respectively. Now consider a feasible solution $\{{\mathbf{L_{0}+DH}},{\mathbf{S_{0}-H}}\}$ for a non-zero ${\mathbf{H}}\in\mathbb{R}^{d\times m}$ . Now by duality of norms

[TABLE]

We can choose ${\mathbf{W}}:={\mathcal{P}}_{{\mathcal{L}}^{\perp}}(\tilde{{\mathbf{W}}})$ which implies $\|{\mathbf{W}}\|\leq 1$ and ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{W}})={\mathbf{0}}$ and

[TABLE]

Further, let ${\mathbf{F}}$ , with $\|{\mathbf{F}}\|_{\infty}=1$ and ${\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{F}})={\mathbf{0}}$ , be such that

[TABLE]

where ${\mathbf{F}}_{ij}$ denotes the $(i,j)^{\text{th}}$ element of ${\mathbf{F}}$ . Then, we arrive at the following simplification for $\langle{\mathbf{F}},~{}{\mathbf{H}}\rangle$ by duality of norms,

[TABLE]

We first write the sub-gradient optimality condition,

[TABLE]

Next, we use the relationships derived above to simplify the following term:

[TABLE]

We now simplify $\langle{\mathcal{P}}_{{\mathcal{L}}}({\mathbf{\Lambda}}),~{}{\mathbf{DH}}\rangle-\langle{\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}\Lambda}}),~{}{\mathbf{H}}\rangle$ using Hölder’s inequality.

[TABLE]

Finally, we simplify the optimality condition shown in (A-A),

[TABLE]

Here, we note that if $\|{\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{\Lambda}})\|<1$ and $\|{\mathcal{P}}_{{\mathcal{S}}_{e}^{\perp}}({\mathbf{D^{\top}\Lambda}})\|_{\infty}<\lambda_{e}$ , then the pair $\{{\mathbf{L}}_{0},{\mathbf{S}}_{0}\}$ is the unique solution of the problem. Consequently, these are the required necessary conditions (C3) and (C4), respectively. ∎
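The four conditions obtained in this proof can be checked numerically for a candidate dual variable ${\mathbf{\Lambda}}$. The sketch below is an illustration under stated assumptions: the function names and the tolerance are hypothetical, ${\mathbf{U}}$ and ${\mathbf{V}}$ are assumed to have orthonormal columns, and ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{X}})={\mathbf{P_{U}X}}+{\mathbf{XP_{V}}}-{\mathbf{P_{U}XP_{V}}}$, which is consistent with the expression for ${\mathcal{P}}_{{\mathcal{L}}^{\perp}}$ used in the proof of Lemma 5.

```python
import numpy as np

def proj_L(X, U, V):
    """P_L(X) = P_U X + X P_V - P_U X P_V, with P_U = U U^T and P_V = V V^T."""
    PU, PV = U @ U.T, V @ V.T
    return PU @ X + X @ PV - PU @ X @ PV

def check_certificate_entrywise(Lam, D, U, V, S0, lam_e, tol=1e-8):
    """Evaluate conditions (C1)-(C4) for a candidate dual variable Lam."""
    support = S0 != 0
    DtL = D.T @ Lam
    # (C1): P_L(Lam) = U V^T
    c1 = np.allclose(proj_L(Lam, U, V), U @ V.T, atol=tol)
    # (C2): D^T Lam equals lam_e * sign(S0) on the support of S0
    c2 = np.allclose(DtL[support], lam_e * np.sign(S0[support]), atol=tol)
    # (C3): spectral norm of P_{L^perp}(Lam) is strictly below 1
    c3 = np.linalg.norm(Lam - proj_L(Lam, U, V), 2) < 1
    # (C4): largest off-support entry of D^T Lam is strictly below lam_e
    off = np.abs(DtL[~support])
    c4 = (off.max() if off.size else 0.0) < lam_e
    return {"C1": bool(c1), "C2": bool(c2), "C3": bool(c3), "C4": bool(c4)}
```

Such a checker only verifies a given ${\mathbf{\Lambda}}$; constructing one is the subject of the lemmas that follow.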

Proof of Lemma 5.

First, note that we need ${\mathbf{A}}_{{\mathcal{S}}_{e}}$ to have full row rank, i.e., its smallest singular value should be greater than zero. To this end, we first derive a lower bound on the smallest singular value, $\sigma_{\min}{({\mathbf{A}}_{{\mathcal{S}}_{e}})}$ , of ${\mathbf{A}}_{{\mathcal{S}}_{e}}$ as follows:

[TABLE]

Now, using the definition of ${\mathbf{A}}^{\top}$ and properties of Kronecker products namely, transpose and vectorization of product of three matrices, we have

[TABLE]

Now, since $({\mathbf{I}}-{\mathbf{P_{U}}}){\mathbf{DH}}({\mathbf{I}}-{\mathbf{P_{V}}})={\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{DH}})$ ,

[TABLE]
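The Kronecker-product step above rests on the identity $\text{vec}({\mathbf{PXQ}})=({\mathbf{Q}}^{\top}\otimes{\mathbf{P}})\text{vec}({\mathbf{X}})$ with column-major vectorization. A small numerical check, using random matrices and arbitrary illustrative dimensions, and taking ${\mathbf{A}}={\mathbf{(I-P_{V})}}\otimes{\mathbf{D^{\top}}}({\mathbf{I-P_{U}}})$ as defined in the proof of Lemma 8:

```python
import numpy as np

def vec(X):
    """Column-major (Fortran-order) vectorization."""
    return X.reshape(-1, order="F")

rng = np.random.default_rng(0)
n, d, m, r = 6, 4, 5, 2                      # illustrative sizes only
D = rng.standard_normal((n, d))
H = rng.standard_normal((d, m))
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
PU, PV = U @ U.T, V @ V.T

A = np.kron(np.eye(m) - PV, D.T @ (np.eye(n) - PU))

lhs = A.T @ vec(H)                                        # A^T vec(H)
rhs = vec((np.eye(n) - PU) @ D @ H @ (np.eye(m) - PV))    # vec(P_{L^perp}(DH))
print(np.allclose(lhs, rhs))
```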

Using the GFP, we have the following lower bound:

[TABLE]

Further, simplifying using properties of the projection operator, the reverse triangle inequality and the definition of $\mu$ ,

[TABLE]

Therefore, we note that if $\mu<1$ and $\alpha_{\ell}>0$ , ${\mathbf{A}}_{{\mathcal{S}}_{e}}$ has full row rank, and the lower bound on the smallest singular value is given by $\sqrt{\alpha_{\ell}}(1-\mu)$ . ∎

Proof of Lemma 6.

We begin with the definition of ${\mathbf{b}}_{{\mathcal{S}}_{e}}$ . Since $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}=\|{\mathbf{B}}_{{\mathcal{S}}_{e}}\|_{{\rm F}}$ and ${\mathbf{B_{{\mathcal{S}}_{e}}}}:=\lambda_{e}\text{sign}({\mathbf{S}}_{0})-\mathcal{P}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}UV^{\top}}})$ ,

[TABLE]

Now for an upper bound on $\|{\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}UV^{\top}}})\|_{{\rm F}}$ we start by analyzing $\|{\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}UV^{\top}}})\|_{{\rm F}}^{2}$ ,

[TABLE]

Using properties of the inner products and using the fact that ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{UV^{\top}}})={\mathbf{UV^{\top}}}$ ,

[TABLE]

Further simplifying using the Cauchy-Schwarz inequality and the definition of $\mu$ , we have

[TABLE]

Now, since $\|{\mathbf{UV^{\top}}}\|_{{\rm F}}=\sqrt{r}$ and using the GFP we have $\|{\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}UV^{\top}}})\|_{{\rm F}}\leq\mu\sqrt{r\alpha_{u}}$ . Therefore, an upper bound for $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}$ is given by $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{2}\leq\lambda_{e}\sqrt{s_{e}}+\sqrt{r\alpha_{u}}\mu$ . ∎
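Two norm identities behind this bound, $\|{\mathbf{UV^{\top}}}\|_{{\rm F}}=\sqrt{r}$ and $\|\text{sign}({\mathbf{S}}_{0})\|_{{\rm F}}=\sqrt{s_{e}}$, are easy to confirm numerically. The dimensions and random factors below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, r, s_e = 8, 7, 5, 3, 6              # illustrative sizes only

# Orthonormal bases standing in for the singular vectors of L_0.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))

# A d x m matrix with exactly s_e non-zero entries.
S0 = np.zeros((d, m))
rows, cols = np.unravel_index(
    rng.choice(d * m, size=s_e, replace=False), S0.shape)
S0[rows, cols] = rng.uniform(1.0, 2.0, size=s_e)

fro_UV = np.linalg.norm(U @ V.T, "fro")          # equals sqrt(r)
fro_sign = np.linalg.norm(np.sign(S0), "fro")    # equals sqrt(s_e)
```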

Proof of Lemma 7.

Since $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{\infty}=\|{\mathbf{B}}_{{\mathcal{S}}_{e}}\|_{\infty}$ and ${\mathbf{B_{{\mathcal{S}}_{e}}}}:=\lambda_{e}\text{sign}({\mathbf{S}}_{0})-\mathcal{P}_{{\mathcal{S}}_{e}}({\mathbf{D^{\top}UV^{\top}}})$ , we have the upper bound $\|{\mathbf{b}}_{{\mathcal{S}}_{e}}\|_{\infty}\leq\lambda_{e}+\|{\mathcal{P}}_{{\mathcal{S}}_{e}}({\mathbf{D}}^{\top}{\mathbf{UV}}^{\top})\|_{\infty}$ . ∎

Proof of Lemma 8.

We begin by simplifying the quantity of interest as follows:

[TABLE]

Now, we derive appropriate bounds on the numerator and the denominator of (A-A) separately. Consider the numerator $\|{\mathbf{A}}_{{{\mathcal{S}}_{e}}^{\perp}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}\|_{\infty,\infty}$ . Here, we are interested in the maximum $\ell_{1}$ -norm of the rows of ${\mathbf{A}}_{{{\mathcal{S}}_{e}}^{\perp}}{\mathbf{A}}_{{\mathcal{S}}_{e}}^{\top}$ , i.e.,

[TABLE]

Let ${\mathcal{I}}_{{\mathcal{S}}_{e}}$ refer to the support of ${\mathbf{S}}_{0}$ , and $\bar{{\mathcal{I}}}_{{\mathcal{S}}_{e}}$ to its complement. Then, the expression can be written in terms of ${\mathcal{I}}_{{\mathcal{S}}_{e}}$ and $\bar{{\mathcal{I}}}_{{\mathcal{S}}_{e}}$ :

[TABLE]

Now, ${\mathbf{A}}$ is defined as ${\mathbf{(I-P_{V})}}\otimes{\mathbf{D^{\top}}}({\mathbf{I-P_{U}}})$ , therefore using the property of the product of two Kronecker products and product of projection matrices, ${\mathbf{AA}}^{\top}$ can be written as

[TABLE]
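The structure of ${\mathbf{AA}}^{\top}$ follows from the mixed-product rule $({\mathbf{P}}\otimes{\mathbf{Q}})({\mathbf{R}}\otimes{\mathbf{T}})={\mathbf{PR}}\otimes{\mathbf{QT}}$ together with the idempotence of projections. The check below uses random data and illustrative dimensions; it also confirms that each entry of a Kronecker product is a product of entries of its factors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, r = 6, 4, 5, 2                      # illustrative sizes only
D = rng.standard_normal((n, d))
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
PU, PV = U @ U.T, V @ V.T

left = np.eye(m) - PV                        # m x m projection
A = np.kron(left, D.T @ (np.eye(n) - PU))    # A = (I - P_V) kron D^T (I - P_U)

inner = D.T @ (np.eye(n) - PU) @ D           # d x d
AAt = A @ A.T
structured = np.kron(left, inner)            # (I - P_V) kron D^T (I - P_U) D

# Entry at block (j2, l2), position (j1, l1) inside the block.
j1, j2, l1, l2 = 1, 3, 2, 0
entry = AAt[j2 * d + j1, l2 * d + l1]
product = left[j2, l2] * inner[j1, l1]
```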

We are interested in the $\{\ell,j\}$ entry of ${\mathbf{AA}}^{\top}$ . Since, ${\mathbf{AA}}^{\top}$ has a Kronecker product structure, an entry of ${\mathbf{AA}}^{\top}$ is given by the product of elements of the matrices in the Kronecker product, therefore

[TABLE]

where $g(j_{1},j_{2},\ell_{1},\ell_{2})$ is given by

[TABLE]

Now, consider $g(j_{1},j_{2},\ell_{1},\ell_{2})$ , which can be simplified as

[TABLE]

Since trace is invariant under cyclic permutations, we have

[TABLE]

Denote $x:={\mathbf{e}}_{\ell_{1}}^{\top}{\mathbf{D^{\top}}}({\mathbf{I-P_{U}}}){\mathbf{D}}{\mathbf{e}}_{j_{1}}$ and $y:={\mathbf{e}}_{j_{2}}^{\top}{\mathbf{P_{V}}}{\mathbf{e}}_{\ell_{2}}$ , then we have

[TABLE]

Now, the following upper bound on $g(j_{1},j_{2},\ell_{1},\ell_{2})$ can be evaluated by squaring both sides and simplifying

[TABLE]

First consider $x$ , which can be written as $x=x\mathbbm{1}_{\{j_{1}=\ell_{1}\}}+x\mathbbm{1}_{\{j_{1}\neq\ell_{1}\}}$ . Here, $x\mathbbm{1}_{\{j_{1}=\ell_{1}\}}$ can be upper bounded as shown below using the GFP

[TABLE]

Further, we can derive an upper bound on $x\mathbbm{1}_{\{j_{1}\neq\ell_{1}\}}$ using the parallelogram law for inner products as follows.

[TABLE]

Therefore, we have

[TABLE]

Now, consider $\sqrt{\mathbbm{1}_{\{j_{2}=\ell_{2}\}}+y^{2}}$ , since $y={\mathbf{e}}_{j_{2}}^{\top}{\mathbf{P_{V}}}{\mathbf{P_{V}}}{\mathbf{e}}_{\ell_{2}}$ , and further, since $\sqrt{a^{2}+b^{2}}<(a+b)\text{~{}for~{}}a>0\text{~{}and~{}}b>0$ , we have $\sqrt{\mathbbm{1}_{\{j_{2}=\ell_{2}\}}+y^{2}}\leq\mathbbm{1}_{\{j_{2}=\ell_{2}\}}+\gamma_{{\mathbf{V}}}.$ Now, substituting in (35), i.e., the expression for $g(j_{1},j_{2},\ell_{1},\ell_{2})$ , we have,

[TABLE]

and finally substituting in (34) and noting that since $j_{1},j_{2}\in\bar{{\mathcal{I}}}_{{\mathcal{S}}_{e}}$ and $\ell_{1},\ell_{2}\in{\mathcal{I}}_{{\mathcal{S}}_{e}}$ , $\mathbbm{1}_{\{j_{1}=\ell_{1}\}}\mathbbm{1}_{\{j_{2}=\ell_{2}\}}=0$ ,

[TABLE]

Now, for ${\mathbf{A}}_{0}\in\mathbb{R}^{d\times m}$ , the maximum number of non-zeros per row is $\text{min}(s_{e},m)$ , while those in a column are $\text{min}(s_{e},d)$ for the thin case and $\text{min}(s_{e},k)$ for the fat case. Then we have

[TABLE]

Here, the constant $c$ is as defined in (7). Now, to bound the denominator of (A-A), we have

[TABLE]

We proceed to bound $|1-\|{\mathbf{e}}_{j}^{\top}{\mathbf{A}}\|^{2}|$ . For this, we derive a lower bound on $\|{\mathbf{e}}_{j}^{\top}{\mathbf{A}}\|^{2}$ . Note that ${\mathbf{e}}_{j}^{\top}{\mathbf{A}}$ selects the $j$ -th row of ${\mathbf{A}}$ , which has a Kronecker product structure. Therefore,

[TABLE]

Therefore, since $\mu<1$ and $\alpha_{\ell}>0$ , then if $\alpha_{\ell}\leq\tfrac{1}{(1-\mu)^{2}}$ , we have $|1-\|{\mathbf{e}}_{j}^{\top}{\mathbf{A}}\|^{2}|\leq 1-\alpha_{\ell}(1-\mu)^{2}$ . The analysis for deriving an upper bound for the second term in (A-A) closely follows that used in (37), as shown below

[TABLE]

Combining these results, we have the following bound for

[TABLE]

Finally, substituting these results in (A-A) we have $\|{\mathbf{Q}}\|_{\infty,\infty}\leq C_{e}:=\tfrac{c}{\alpha_{\ell}(1-\mu)^{2}-c}$ , where $c$ is given by (7). ∎

A-B Proofs for Column-wise Case

Proof of Lemma 2.

We show that for any $({\mathbf{L}}_{0},{\mathbf{S}}_{0})\in\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ , if ${\rm span}\{\text{col}({\mathbf{L}}_{0})\}={\mathcal{U}}$ and $\rm csupp({\mathbf{D}}{\mathbf{S}}_{0})={\mathcal{I}}_{{\mathcal{S}}_{c}}$ do not hold simultaneously, then $\mu=1$ .

Let ${\mathbf{L}}+{\mathbf{DS}}={\mathbf{M}}$ , as per our model shown in (1). Now, let $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ be any other pair in our Oracle Model $\{{\mathbf{M}},{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ ,

[TABLE]

for some ${\mathbf{\Delta}}_{1}$ and ${\mathbf{\Delta}}_{2}$ , then we have that ${\mathbf{\Delta}}_{1}+{\mathbf{\Delta}}_{2}={\mathbf{0}}$ . This implies that $\rm csupp({\mathbf{\Delta}}_{1})\in{\mathcal{S}}_{c}$ . Further, this implies that ${\mathbf{L}}$ and ${\mathbf{L}}_{0}$ at least match in the columns indexed by the inliers, i.e., ${\mathcal{P}}_{{\mathcal{I}}_{{\mathbf{L}}}}({\mathbf{L}})={\mathcal{P}}_{{\mathcal{I}}_{{\mathbf{L}}}}({\mathbf{L}}_{0})$ , and we have

[TABLE]

Therefore, $\rm csupp({\mathbf{D}}{\mathbf{S}}_{0})\subseteq{\mathcal{I}}_{{\mathcal{S}}_{c}}$ . Specifically, this implies that there may exist a $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}}$ for which ${\mathbf{D}}{\mathbf{S}}_{:,j}-({\mathbf{\Delta}}_{1})_{:,j}=0$ , which will imply that ${\mathcal{P}}_{{\mathcal{U}}^{\perp}}({\mathbf{D}}{\mathbf{S}}_{:,j})=0$ . This condition implies that $\mu=1$ . Therefore, we require ${\rm span}\{\text{col}({\mathbf{L}}_{0})\}={\mathcal{U}}$ and $\rm csupp({\mathbf{D}}{\mathbf{S}}_{0})={\mathcal{I}}_{{\mathcal{S}}_{c}}$ to hold simultaneously for $\mu<1$ . ∎

Proof of Lemma 9.

Let $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ be an optimal solution pair of (D-RPCA(C)). From the optimality conditions (22) and (23), we seek ${\mathbf{\Lambda}}$ such that

[TABLE]

Now consider a feasible solution $\{{\mathbf{L_{0}+D\Delta}},{\mathbf{S_{0}-\Delta}}\}$ for a non-zero ${\mathbf{\Delta}}\in\mathbb{R}^{d\times m}$ . Then by the optimality of $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ using the subgradient inequality, we have

[TABLE]

Let $G({\mathbf{\Delta}})=\langle{\mathbf{U}}{\mathbf{V}}^{\top}+{\mathbf{W}},{\mathbf{D}}{\mathbf{\Delta}}\rangle-\lambda_{c}\langle{\mathbf{H}}+{\mathbf{F}},{\mathbf{\Delta}}\rangle$ . We will show that if (q1)-(q4) hold, then $G({\mathbf{\Delta}})>0$ , which proves the optimality of $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ . Rewrite $G({\mathbf{\Delta}})$ as

[TABLE]

Let ${\mathbf{W}}$ , with $\|{\mathbf{W}}\|=1$ and ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{W}})={\mathbf{0}}$ , then by duality of norms,

[TABLE]

Further, let ${\mathbf{F}}$ , with $\|{\mathbf{F}}\|_{\infty,2}=1$ and ${\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{F}})={\mathbf{0}}$ , be such that

[TABLE]

where ${\mathbf{F}}_{{:,j}}$ denotes the $j^{\text{th}}$ column of ${\mathbf{F}}$ . Then, we arrive at the following simplification for $\langle{\mathbf{F}},~{}{\mathbf{\Delta}}\rangle$ by duality of norms,

[TABLE]

Since ${\mathcal{P}}_{{\mathcal{L}}}({\mathbf{\Lambda}})={\mathbf{UV}}^{\top}$ and ${\mathcal{P}}_{{\mathcal{S}}_{c}}({\mathbf{D}}^{\top}{\mathbf{\Lambda}})=\lambda_{c}{\mathbf{H}}$ by optimality conditions of (39),

[TABLE]

where we use Hölder’s inequality in the last step.

Combining (40), (41), (42), and (44), we have

[TABLE]

Since we have an arbitrary ${\mathbf{\Delta}}$ with ${\mathbf{\Delta}}\neq{\mathbf{0}}$ and $({\mathbf{L}}_{0}+{\mathbf{D}}{\mathbf{\Delta}},{\mathbf{S}}_{0}-{\mathbf{\Delta}})\notin\{{\mathcal{U}},{\mathcal{I}}_{{\mathcal{S}}_{c}}\}$ , $\|{\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{D}}{\mathbf{\Delta}})\|_{*}=\|{\mathcal{P}}_{{{\mathcal{S}}_{c}}^{\perp}}({\mathbf{\Delta}})\|_{1,2}=0$ does not hold. Therefore, to ensure the uniqueness of the solution $({\mathbf{L}}_{0},{\mathbf{S}}_{0})$ , we need $\|{\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{\Lambda}})\|<1$ and $\|{\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{D}}^{\top}{\mathbf{\Lambda}})\|_{\infty,2}<\lambda_{c}$ . Hence, any dual certificate which obeys the conditions (C1)-(C4) guarantees optimality of the solution. ∎

Proof of Lemma 10.

We begin by writing the definition of $\sigma_{\min}({\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top})$ as

[TABLE]

By the definition of ${\mathbf{A}}$ and using the property of Kronecker product for multiplication by a vector we have

[TABLE]

Further $\left({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{U}}}\right){\mathbf{D}}{\mathbf{H}}\left({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}}\right)={\mathcal{P}}_{{\mathcal{L}}^{\perp}}({\mathbf{DH}})$ , and we can write that expression above as follows

[TABLE]

Here (i) is due to the GFP condition D.2 and the reverse triangle inequality, and (ii) from the incoherence property in (2). ∎

Proof of Lemma 11.

We start by using the correspondence between the vector ${\mathbf{b}}_{{\mathcal{S}}_{c}}$ and the matrix ${\mathbf{B}}_{{\mathcal{S}}_{c}}$ , i.e.,

[TABLE]

Now, since $\tilde{{\mathbf{S}}}_{:,j}={\mathbf{S}}_{:,j}/\|{\mathbf{S}}_{:,j}\|_{2}$ for all $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}}$ ; and is ${\mathbf{0}}$ otherwise (i.e., when $j\notin{\mathcal{I}}_{{\mathcal{S}}_{c}}$ ), using triangle inequality, we have

[TABLE]

Since we have

[TABLE]

where (i) is from subspace incoherence property and (ii) is from the GFP D.2. Combining (45) and (46), we have

[TABLE]

∎

Proof of Lemma 12.

We begin by analyzing the quantity of interest – $\|{\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{Z}})\|_{\infty,2}$ , i.e., we are interested in the maximum column norm of the matrix ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{Z}})$ . Note that ${\mathbf{Z}}$ is defined as

[TABLE]

and we have $\text{vec}({\mathbf{Z}})={\mathbf{A}}\text{vec}({\mathbf{X}})$ . Further, we have that

[TABLE]

Now, observe that the columns of matrix ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{Z}})$ appear as blocks of size $n\times 1$ in the vector ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}\left(\text{vec}({\mathbf{Z}})\right)$ . Moreover, the elements of vector ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}\left(\text{vec}({\mathbf{Z}})\right)$ are formed by the inner products between the rows of the Kronecker product structured matrix ${\mathbf{A}}_{{\mathcal{S}}_{c}^{\perp}}$ and $\text{vec}({\mathbf{X}})$ . Therefore, to identify a column of ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}\left({\mathbf{Z}}\right)$ we need to focus on the interaction between the corresponding rows of ${\mathbf{A}}_{{\mathcal{S}}_{c}^{\perp}}$ and $\text{vec}({\mathbf{X}})$ .

Consider the Kronecker product structured matrix ${\mathbf{A}}_{{\mathcal{S}}_{c}^{\perp}}$ . Since the rows in ${\mathbf{A}}_{{\mathcal{S}}_{c}^{\perp}}$ correspond to all rows outside the column support ${\mathcal{S}}_{c}$ , this corresponds to selecting those rows of $m\times m$ matrix $({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})$ which correspond to ${\mathcal{S}}_{c}^{\perp}$ , which we denote by $({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})_{{\mathcal{S}}_{c}^{\perp}}$ i.e.,

[TABLE]

For simplicity of the upcoming analysis, we denote the matrix $({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})$ as

[TABLE]

Using this notation, the $j$ -th block of vector ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}\left(\text{vec}({\mathbf{Z}})\right)$ (which is also the $j$ -th column of ${\mathcal{P}}_{{\mathcal{S}}_{c}^{\perp}}\left({\mathbf{Z}}\right)$ ), can be written as

[TABLE]

for some $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}^{\perp}}$ . Now, further since $\text{vec}({\mathbf{X}}):={\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top}({\mathbf{A}}_{{\mathcal{S}}_{c}}{\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top})^{-1}\text{vec}({\mathbf{B}}_{{\mathcal{S}}_{c}})$ , therefore we are interested in maximum $2$ -norm of

[TABLE]

for some $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}^{\perp}}$ . Note that ${\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top}$ itself is a Kronecker product structured matrix given by

[TABLE]

Using the mixed product rule for Kronecker products we have

[TABLE]

for some $j\in{\mathcal{I}}_{{\mathcal{S}}_{c}^{\perp}}$ . Further, since for two matrices ${\mathbf{A}}$ and ${\mathbf{B}}$ , $\|{\mathbf{A}}\otimes{\mathbf{B}}\|=\|{\mathbf{A}}\|\|{\mathbf{B}}\|$ , we have

[TABLE]

where we also use the fact that $v_{j,:}={{\mathbf{e}}_{j}^{\top}}({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})_{{\mathcal{S}}_{c}^{\perp}}$ . We will now proceed to bound the first term in (47). Note that

[TABLE]

Now, each term in the summation can be bounded as

[TABLE]

This implies $\|{{\mathbf{e}}_{j}^{\top}}({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})_{{\mathcal{S}}_{c}^{\perp}}({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{V}}})_{{\mathcal{S}}_{c}}^{\top}\|\leq\sqrt{s_{c}}\gamma_{{\mathbf{V}}}$ . Further, note that $\|({\mathbf{A}}_{{\mathcal{S}}_{c}}{\mathbf{A}}_{{\mathcal{S}}_{c}}^{\top})^{-1}\|\leq\|{\mathbf{A}}_{{\mathcal{S}}_{c}}^{-1}\|^{2}=\tfrac{1}{\sigma_{\min}({\mathbf{A}}_{{\mathcal{S}}_{c}})^{2}}$ . Substituting this into (47), for a $j\in{\mathcal{S}}_{c}^{\perp}$ , we have

[TABLE]
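Two norm facts used in the steps above can be confirmed numerically: the spectral norm of a Kronecker product factorizes, and for a matrix ${\mathbf{A}}$ with full row rank, $\|({\mathbf{AA}}^{\top})^{-1}\|=1/\sigma_{\min}({\mathbf{A}})^{2}$ (for a non-square ${\mathbf{A}}$, the inverse in the text is presumably to be read as a pseudo-inverse). The matrices below are random stand-ins with arbitrary sizes.

```python
import numpy as np

def spec(X):
    """Spectral norm (largest singular value)."""
    return np.linalg.norm(X, 2)

rng = np.random.default_rng(3)

# ||P kron Q|| = ||P|| ||Q||
P = rng.standard_normal((3, 4))
Q = rng.standard_normal((5, 2))
kron_norm = spec(np.kron(P, Q))
factor_norm = spec(P) * spec(Q)

# ||(A A^T)^{-1}|| = 1 / sigma_min(A)^2 for a wide matrix with full row rank
A = rng.standard_normal((4, 9))
inv_norm = spec(np.linalg.inv(A @ A.T))
sigma_min = np.linalg.svd(A, compute_uv=False).min()
```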

We can further write $\|{\mathbf{D}}^{\top}({\mathbf{I}}-{\mathbf{P}}_{{\mathbf{U}}}){\mathbf{D}}\|$ as follows

[TABLE]

Substituting this result in (48), using Lemma 10 and Lemma 11,

[TABLE]

∎

Bibliography

[1] I. Jolliffe, Principal component analysis, Wiley Online Library, 2002.
[2] S. Rambhatla, X. Li, and J. Haupt, “Target-based hyperspectral demixing via generalized robust PCA,” in 51st Asilomar Conference on Signals, Systems, and Computers, ACSSC, 2017, pp. 420–424.
[3] X. Li, J. Ren, S. Rambhatla, Y. Xu, and J. Haupt, “Robust PCA via dictionary based outlier pursuit,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[4] M. Borengasser, W. S. Hungate, and R. Watkins, Hyperspectral remote sensing: principles and applications, CRC Press, 2007.
[5] B. Park and R. Lu, Hyperspectral imaging technology in food and agriculture, Springer, 2015.
[6] D. Rolnick, P. L. Donti, L. H. Kaack, et al., “Tackling climate change with machine learning,” arXiv preprint arXiv:1906.05433, 2019.
[7] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
[8] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2845–2862, 2001.
