Robust subspace clustering by Cauchy loss function

Xuelong Li; Quanmao Lu; Yongsheng Dong; and Dacheng Tao

arXiv:1904.12274·cs.CV·April 30, 2019


TL;DR

This paper introduces a robust subspace clustering method using the Cauchy loss function to effectively handle noisy data, outperforming existing methods on real datasets.

Contribution

It proposes a novel subspace clustering approach based on Cauchy loss, addressing noise influence and proving the grouping effect theoretically.

Findings

01. Outperforms several existing clustering methods on five real datasets.

02. Uses Cauchy loss to suppress large noise in data.

03. Theoretically proves the grouping effect of the method.


Tables

Table I. The Contrast Index (CI) (%) of the similarity matrices obtained by different methods.

Method	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
CI	30.39	23.89	25.46	30.30	16.80	32.39	38.12

Table II. Statistics of five data sets.

dataset	size	dimensionality	# of classes
Hopkins 155	59	296	2 or 3
USPS	9298	256	10
C-Cube	57646	3120	52
FEI	700	768	50
Extended Yale B	2414	1024	68

Table III. The best parameter for each method on different databases.

dataset	SSC ( $λ$ )	LRR ( $λ$ )	LSR ( $λ$ )	CASS ( $λ$ )	MoG ( $λ$ )	NSSC ( $λ$ )	Ours ( $λ$ , $c$ )
Hopkins 155	0.0001	1000	0.001	0.0001	10	10	(0.0001, 0.5)
USPS	0.5	0.5	0.5	0.5	1000	10	(1,0.1)
C-Cube	0.5	0.5	1	0.5	1000	10	(0.5,0.1)
FEI	0.1	0.001	0.1	0.5	1000	10	(0.01,0.01)
Extended Yale B	0.5	1	0.01	0.001	10	10	(0.1,0.1)

Table IV. The clustering results of different algorithms on the Hopkins 155 database. The best results are in bold font.

Accuracy (%)
k	Statistic	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
2 motions	Average	87.80	83.40	96.47	96.14	92.01	98.03	88.76	97.81
2 motions	Median	88.10	83.83	99.67	99.54	99.64	100.00	90.23	100.00
3 motions	Average	77.22	74.88	90.38	90.66	89.67	94.25	78.46	95.03
3 motions	Median	80.42	75.45	94.57	92.34	91.43	97.66	79.90	99.17
Total	Average	85.55	81.48	95.08	94.96	91.55	97.21	86.37	97.21
Total	Median	85.86	80.84	99.41	99.06	97.76	99.71	88.50	100.00

Normalized Mutual Information (%)
k	Statistic	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
2 motions	Average	53.96	40.09	86.53	79.60	70.42	86.10	57.37	87.24
2 motions	Median	44.11	27.96	96.43	94.92	96.19	100.00	57.88	100.00
3 motions	Average	49.69	43.33	80.19	76.01	77.67	83.21	50.22	86.61
3 motions	Median	47.93	46.90	80.14	76.86	79.47	89.17	47.34	95.41
Total	Average	53.26	40.96	85.17	78.85	72.14	85.50	55.78	87.12
Total	Median	45.14	34.65	94.42	92.19	85.59	96.96	56.23	100.00

Table V. The clustering results of different algorithms on the USPS database. The best results are in bold font.

k	Accuracy
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	80.67	82.67	83.33	85.33	73.33	70.66	90.00	92.67
6 subjects	75.00	82.77	83.89	80.00	70.00	62.22	81.67	87.78
7 subjects	77.14	80.00	75.24	80.95	73.81	58.10	81.90	83.33
8 subjects	78.75	79.85	76.25	79.17	71.25	54.17	82.08	86.25
9 subjects	77.78	80.00	69.63	80.74	75.56	55.93	80.00	85.56
10 subjects	73.00	67.67	70.00	76.33	71.00	55.67	77.00	81.33
k	Normalized Mutual Information
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	66.10	71.47	72.57	69.00	66.76	45.52	76.86	82.86
6 subjects	60.69	63.86	73.84	65.74	62.42	46.37	70.64	77.56
7 subjects	64.45	64.88	64.49	68.90	63.92	42.24	74.02	74.63
8 subjects	68.48	72.22	66.69	70.29	64.03	42.13	74.96	80.07
9 subjects	67.28	68.63	67.37	71.49	66.23	46.87	74.31	78.87
10 subjects	63.26	59.93	63.95	68.87	62.48	45.54	69.93	74.86

Table VI. The clustering results of different algorithms on the C-Cube database. The best results are in bold font.

k	Accuracy
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
10 subjects	32.50	46.50	43.00	45.50	22.50	33.00	14.50	51.00
20 subjects	26.25	44.00	32.25	33.25	27.75	26.25	21.00	37.50
30 subjects	24.33	32.50	29.67	34.33	27.67	24.67	28.83	35.17
40 subjects	22.87	30.50	28.00	28.75	25.86	24.50	28.62	32.37
50 subjects	23.70	29.50	28.10	32.30	26.60	26.70	25.70	32.40
k	Normalized Mutual Information
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
10 subjects	30.22	36.86	37.56	43.70	17.26	28.32	10.45	46.61
20 subjects	35.16	44.36	42.97	41.86	36.53	29.04	29.28	45.19
30 subjects	37.96	40.46	44.96	45.72	41.31	35.24	42.67	48.51
40 subjects	40.77	40.50	47.27	47.26	43.77	40.87	45.19	48.79
50 subjects	43.43	44.50	49.76	51.80	47.75	45.14	45.59	51.58

Table VII. The clustering results of different algorithms on the FEI database. The best results are in bold font.

k	Accuracy
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	81.43	90.00	81.43	88.57	95.71	80.00	84.29	98.57
10 subjects	65.00	71.43	70.71	72.14	80.00	66.43	70.00	85.71
15 subjects	68.57	80.00	69.05	65.23	78.57	62.38	71.90	82.38
20 subjects	66.79	73.93	71.43	70.00	75.36	65.36	71.01	72.50
30 subjects	64.52	76.19	59.29	65.48	67.86	66.67	66.43	69.29
40 subjects	61.07	77.14	57.86	64.46	65.36	63.75	66.07	66.07
k	Normalized Mutual Information
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	80.51	82.33	77.65	81.95	93.24	69.24	76.57	96.77
10 subjects	70.03	70.21	76.47	74.20	77.58	63.70	73.02	89.44
15 subjects	79.85	79.40	77.90	69.72	83.74	66.26	78.59	85.85
20 subjects	79.49	77.74	81.34	74.72	81.93	71.80	75.14	80.64
30 subjects	77.37	81.64	76.76	75.27	79.99	75.65	78.59	81.22
40 subjects	78.29	83.48	76.54	76.08	79.06	76.59	78.67	80.48

Table VIII. The clustering results of different algorithms on the Extended Yale B database. The best results are in bold font.

k	Accuracy
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	24.06	78.75	80.63	84.36	84.06	85.00	88.44	95.00
8 subjects	15.63	60.74	60.55	75.78	72.46	83.59	58.01	83.59
10 subjects	13.59	60.47	60.62	66.09	75.00	62.78	49.69	80.31
k	Normalized Mutual Information
k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	1.24	69.51	64.39	73.10	73.17	69.22	78.20	90.65
8 subjects	0.69	56.78	55.68	69.27	66.90	76.78	52.72	78.00
10 subjects	1.20	58.66	56.15	57.81	72.50	62.78	47.69	77.37

Table IX. Computation time of different algorithms on the FEI dataset as a function of the number of subjects.

k	Kmeans	SSC	LRR	LSR	CASS	MoG	NSSC	Ours
5 subjects	0.03	38.27	1.09	0.04	2.27	11.38	0.13	0.32
10 subjects	0.09	84.68	1.22	0.11	10.77	72.01	0.22	0.51
15 subjects	0.18	149.33	1.71	0.17	31.13	292.68	0.36	0.84
20 subjects	0.29	239.44	2.39	0.26	60.34	728.12	0.56	1.40
30 subjects	0.63	495.24	3.87	0.57	159.61	3247.61	1.13	2.45
40 subjects	1.10	893.98	6.08	1.29	321.43	12637.33	2.22	5.25

Equations

$$\min_{\mathbf{Z},\mathbf{E}} \varphi(\mathbf{E}) + \delta(\mathbf{Z}) \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},$$

$$\min_{\mathbf{Z},\mathbf{E}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\left\|\mathbf{Z}\right\|_{0} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0},$$

$$\min_{\mathbf{Z},\mathbf{E}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\left\|\mathbf{Z}\right\|_{1} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0}.$$

$$\min_{\mathbf{Z}} \mathrm{rank}(\mathbf{Z}) \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ}.$$

$$\min_{\mathbf{Z}} \left\|\mathbf{E}\right\|_{21} + \lambda\left\|\mathbf{Z}\right\|_{*} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E}.$$

$$\min_{\mathbf{Z}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\left\|\mathbf{Z}\right\|_{F}^{2} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E}.$$

$$\min_{\mathbf{Z},\mathbf{E}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\sum_{i=1}^{n}\left\|\mathbf{X}\,\mathrm{diag}(\mathbf{z}_{i})\right\|_{*} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},$$

$$\min_{\mathbf{Z},\mathbf{E},\boldsymbol{\pi},\boldsymbol{\Sigma}} -\sum_{i=1}^{n}\ln\left(\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(\mathbf{e}_{i}\mid\mathbf{0},\boldsymbol{\Sigma}_{k})\right) + \lambda\left\|\mathbf{Z}\right\|_{F}^{2} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0},\ \pi_{k}\geq 0,\ \boldsymbol{\Sigma}_{k}\in\mathcal{S}^{+},\ \sum_{k=1}^{K}\pi_{k} = 1,$$

$$\min \sum_{i}\rho(r_{i}).$$

$$\psi(x) = \frac{\partial\rho(x)}{\partial x},$$

$$\rho(x) = \log\left(1 + (x/c)^{2}\right)$$

$$\psi(x) = \frac{2x}{x^{2} + c^{2}},$$

$$\sum_{i=1}^{n}\log\left(1 + \frac{\left\|\mathbf{x}_{i} - \mathbf{X}\mathbf{z}_{i}\right\|_{2}^{2}}{c^{2}}\right),$$

$$\min_{\mathbf{Z}} \sum_{i=1}^{n}\log\left(1 + \frac{\left\|\mathbf{x}_{i} - \mathbf{X}\mathbf{z}_{i}\right\|_{2}^{2}}{c^{2}}\right) + \lambda\left\|\mathbf{Z}\right\|_{F}^{2},$$

$$\min_{\mathbf{Z}} \log\left(1 + \frac{\left\|\mathbf{X} - \mathbf{XZ}\right\|_{F}^{2}}{c^{2}}\right) + \lambda\left\|\mathbf{Z}\right\|_{F}^{2}.$$

$$\min_{\mathbf{Z}} \mathcal{J} = \log\left(1 + \frac{\left\|\mathbf{X} - \mathbf{XZ}\right\|_{F}^{2}}{c^{2}}\right) + \lambda\left\|\mathbf{Z}\right\|_{F}^{2}.$$

$$\frac{-2\mathbf{X}^{T}(\mathbf{X} - \mathbf{XZ})}{c^{2} + \left\|\mathbf{X} - \mathbf{XZ}\right\|_{F}^{2}} + 2\lambda\mathbf{Z} = \mathbf{0},$$

$$\left(\frac{\mathbf{X}^{T}\mathbf{X}}{c^{2} + \left\|\mathbf{X} - \mathbf{XZ}\right\|_{F}^{2}} + \lambda\mathbf{I}\right)\mathbf{Z} = \frac{\mathbf{X}^{T}\mathbf{X}}{c^{2} + \left\|\mathbf{X} - \mathbf{XZ}\right\|_{F}^{2}}.$$

$$\left\{\begin{aligned} &\mathbf{Z} = Q\left(Q\mathbf{X}^{T}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{T}\mathbf{X} \\ &Q = \frac{1}{c^{2} + \left\|\mathbf{R}\right\|_{F}^{2}} \\ &\mathbf{R} = \mathbf{X} - \mathbf{XZ} \end{aligned}\right.,$$

$$CI = \frac{S_{D}}{S_{D} + S_{ND}} = \frac{S_{D}}{\left\|\mathbf{W}\right\|_{1}},$$

$$\min_{\mathbf{z}} \log\left(1 + \frac{\left\|\mathbf{x} - \mathbf{X}\mathbf{z}\right\|_{2}^{2}}{c^{2}}\right) + \lambda\left\|\mathbf{z}\right\|_{2}^{2}.$$

$$\frac{\hat{z}^{i} - \hat{z}^{j}}{\left\|\mathbf{x}\right\|_{2}} \leq \frac{1}{\lambda c^{2}}\sqrt{2(1 - r)},$$

$$L(\mathbf{z}) = \log\left(1 + \frac{\left\|\mathbf{x} - \mathbf{X}\mathbf{z}\right\|_{2}^{2}}{c^{2}}\right) + \lambda\left\|\mathbf{z}\right\|_{2}^{2}.$$

$$\left.\frac{\partial L(\mathbf{z})}{\partial\mathbf{z}}\right|_{\mathbf{z} = \hat{\mathbf{z}}} = \mathbf{0}.$$

$$\frac{-2\mathbf{x}_{i}^{T}(\mathbf{x} - \mathbf{X}\hat{\mathbf{z}})}{c^{2} + \left\|\mathbf{x} - \mathbf{X}\hat{\mathbf{z}}\right\|_{2}^{2}} + 2\lambda\hat{z}^{i} = 0,$$

$$\frac{-2\mathbf{x}_{j}^{T}(\mathbf{x} - \mathbf{X}\hat{\mathbf{z}})}{c^{2} + \left\|\mathbf{x} - \mathbf{X}\hat{\mathbf{z}}\right\|_{2}^{2}} + 2\lambda\hat{z}^{j} = 0,$$

$$\hat{z}^{i} - \hat{z}^{j} = \frac{(\mathbf{x}_{i}^{T} - \mathbf{x}_{j}^{T})(\mathbf{x} - \mathbf{X}\hat{\mathbf{z}})}{\lambda\left(c^{2} + \left\|\mathbf{x} - \mathbf{X}\hat{\mathbf{z}}\right\|_{2}^{2}\right)} \leq \frac{(\mathbf{x}_{i}^{T} - \mathbf{x}_{j}^{T})(\mathbf{x} - \mathbf{X}\hat{\mathbf{z}})}{\lambda c^{2}}.$$

$$\log\left(1 + \frac{\left\|\mathbf{x} - \mathbf{X}\hat{\mathbf{z}}\right\|_{2}^{2}}{c^{2}}\right) \leq \log\left(1 + \frac{\left\|\mathbf{x} - \mathbf{X}\hat{\mathbf{z}}\right\|_{2}^{2}}{c^{2}}\right) + \lambda\left\|\hat{\mathbf{z}}\right\|_{2}^{2} = L(\hat{\mathbf{z}}) \leq L(\mathbf{0}) = \log\left(1 + \frac{\left\|\mathbf{x}\right\|_{2}^{2}}{c^{2}}\right).$$

$$\frac{\hat{z}^{i} - \hat{z}^{j}}{\left\|\mathbf{x}\right\|_{2}} \leq \frac{1}{\lambda c^{2}}\sqrt{2(1 - r)}.$$

$$\min_{\mathbf{z}_{1},\ldots,\mathbf{z}_{n}} \mathcal{J}(\mathbf{Z}) = \log\left(1 + \frac{\sum_{i=1}^{n}\left\|\mathbf{x}_{i} - \mathbf{X}\mathbf{z}_{i}\right\|_{2}^{2}}{c^{2}}\right) + \lambda\sum_{i=1}^{n}\left\|\mathbf{z}_{i}\right\|_{2}^{2},$$



Full text

Robust Subspace Clustering by Cauchy Loss Function

Xuelong Li, Quanmao Lu, Yongsheng Dong, and Dacheng Tao

This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1107400, in part by the National Natural Science Foundation of China under Grants 61871470, 61761130079, U1604153, and 61301230, in part by the Program for Science and Technology Innovation Talents in Universities of Henan Province under Grant 19HASTIT026, and in part by the Training Program for the Young-Backbone Teachers in Universities of Henan Province under Grant 2017GGJS065. (Corresponding author: Yongsheng Dong.)

X. Li, Q. Lu, and Y. Dong are with the Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, P. R. China. Y. Dong is also with the School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, Henan, P. R. China. D. Tao is with the UBTech Sydney Artificial Intelligence Institute and the School of Information Technologies in the Faculty of Engineering and Information Technologies, The University of Sydney, Darlington NSW 2008, Australia.

Abstract

Subspace clustering is the problem of exploring the low-dimensional subspaces of high-dimensional data. State-of-the-art approaches follow the model of spectral clustering based methods. These methods pay much attention to learning the representation matrix in order to construct a suitable similarity matrix, and they overlook the influence of the noise term on subspace clustering. However, real data are always contaminated by noise, and the noise usually has a complicated statistical distribution. To alleviate this problem, we propose in this paper a subspace clustering method based on the Cauchy loss function (CLF). In particular, it uses CLF to penalize the noise term in order to suppress the large noise mixed in real data. This works because CLF’s influence function has an upper bound, which alleviates the influence of a single sample, especially a sample with large noise, on estimating the residuals. Furthermore, we theoretically prove the grouping effect of our proposed method, which means that highly correlated data can be grouped together. Finally, experimental results on five real datasets reveal that our proposed method outperforms several representative clustering methods.

Index Terms:

Subspace clustering, Cauchy loss function, noise suppression, grouping effect, similarity matrix.

I Introduction

Subspace clustering, an important clustering analysis technique, has gained much attention in recent years and has numerous applications in image processing and computer vision, e.g. image representation [1], motion segmentation [2], saliency detection [3] and image clustering [4, 5]. It aims to explore the low-dimensional structure lying in high-dimensional data. In particular, conventional PCA [6] can be regarded as a special subspace clustering method that finds a single low-dimensional subspace of the high-dimensional data. In practice, however, data are usually drawn from multiple low-dimensional subspaces, and each subspace may have a different dimension. For example, the trajectories of different moving objects usually belong to different affine subspaces, and face images of individuals under varying pose may lie in different linear subspaces. Motivated by these observations, subspace clustering seeks the low-dimensional subspaces of the raw data and clusters the data into groups, with each group fitting one subspace. The subspace clustering problem is formally defined as follows:

Definition 1.

(Subspace clustering) Let ${\bf{X}}=[{{\bf{X}}_{1}},...,{{\bf{X}}_{k}}]=[{{\bf{x}}_{1}},...,{{\bf{x}}_{n}}]\in{\mathbb{R}^{d\times n}}$ be a set of sufficiently sampled data vectors, where $d$ represents the feature dimension and $n$ is the number of data points. Assume that the data are drawn from a union of $k$ subspaces $\{{S_{i}}\}_{i=1}^{k}$ , and let ${\bf{X}}_{i}$ be a collection of $n_{i}$ points drawn from the subspace $S_{i}$ , with $n=\sum\nolimits_{i=1}^{k}{{n_{i}}}$ . The task of subspace clustering is to segment the data according to the underlying subspaces they are drawn from.
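To make the setting of Definition 1 concrete, the following sketch generates synthetic data from a union of random linear subspaces. It is only an illustration: the function name `sample_union_of_subspaces`, the Gaussian sampling, and the chosen dimensions are assumptions and do not come from the paper.

```python
import numpy as np

def sample_union_of_subspaces(d, dims, counts, seed=0):
    """Draw counts[i] points from a random dims[i]-dimensional subspace S_i of R^d.

    Hypothetical helper: returns X (d x n, columns are points) and ground-truth labels.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for i, (dim, n_i) in enumerate(zip(dims, counts)):
        # Orthonormal basis of a random subspace S_i (reduced QR of a Gaussian matrix).
        basis, _ = np.linalg.qr(rng.standard_normal((d, dim)))
        # X_i: n_i points with random coefficients inside S_i.
        blocks.append(basis @ rng.standard_normal((dim, n_i)))
        labels += [i] * n_i
    return np.hstack(blocks), np.array(labels)

# k = 2 subspaces of dimension 2 and 3 in R^10, with n = 20 + 30 points.
X, labels = sample_union_of_subspaces(d=10, dims=[2, 3], counts=[20, 30])
print(X.shape)
```

Each block of columns has rank equal to the dimension of its subspace, which is the structure that subspace clustering tries to recover without access to `labels`.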

In the past two decades, many advances have been made to improve the performance of subspace clustering [7, 8, 9, 10, 11, 12, 13]. The methods can be roughly divided into four categories: algebraic methods [14, 15], iterative methods [16, 17], statistical methods [18, 19] and spectral clustering based methods [20, 21, 22, 23, 24]. Most recently, spectral clustering based methods have shown excellent performance in many applications. In general, spectral clustering based methods consist of two main steps. First, a similarity (or affinity) matrix is constructed to represent the similarity between the samples in the raw data. Second, a spectral clustering algorithm is employed to divide the raw data into $k$ groups based on the learned similarity matrix. Building a proper similarity matrix therefore plays a decisive role in subspace clustering, and most spectral clustering based models were proposed to construct a more effective similarity matrix.

In existing methods, the similarity matrix is generally constructed using a self-expression model, which regards the data itself as a dictionary to learn a representation matrix [25, 26]. Such a self-expression model assumes that each sample can be well represented by the points in the same subspace, so that the learned representation matrix captures the similarity between the samples in the raw data. Ideally, the learned representation matrix should be block-diagonal [27, 28], which means that the affinities between samples from different clusters are all zero. Because real data usually contain noise, a loss function is employed to deal with the noise. The general model of spectral clustering based methods can then be formulated as

$$\min_{\mathbf{Z},\mathbf{E}} \varphi(\mathbf{E}) + \delta(\mathbf{Z}) \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E}, \tag{1}$$

where $\bf{X}$ is the original data matrix, $\bf{Z}$ is the representation matrix and $\bf{E}$ represents the noise matrix. The functions $\varphi(\bf{E})$ and $\delta(\bf{Z})$ are designed to restrict $\bf{E}$ and $\bf{Z}$ respectively. In many works, $\varphi(\bf{E})$ and $\delta(\bf{Z})$ are two properly chosen norms. For example, Sparse Subspace Clustering (SSC) [22] uses the $\ell_{1}$ norm to regularize the matrix $\bf{Z}$ in order to find the sparsest representation of each point, and chooses the Frobenius norm to deal with the noise term $\bf{E}$ . Unlike SSC, Low-Rank Representation (LRR) [29] employs the nuclear norm to regularize the matrix $\bf{Z}$ in order to capture the correlation structure of the data, and uses the $\ell_{21}$ norm to describe the matrix $\bf{E}$ . Based on SSC and LRR, many works [30, 31, 32, 33, 34, 35] design different regularizations for the representation matrix $\bf{Z}$ and choose a simple norm on the noise matrix $\bf{E}$ .

Note that previous works mainly focus on choosing a proper norm to regularize the representation matrix and ignore the influence of the noise term on subspace clustering. However, real data are always contaminated by unknown noise, and the noise usually has a complicated statistical distribution [29, 36, 37]. Without a proper model for the noise, the learned representation matrix may fail to capture the similarity between samples, which can lead to an unreliable subspace clustering result. Handling the noise is therefore a difficult task with a significant influence on subspace clustering. Although existing methods choose different norms to handle the noise, each norm can only deal with a specific kind of noise. For example, the $\ell_{1}$ norm is suitable for entry-wise corruptions, the $\ell_{21}$ norm is for sample-specific corruptions and the Frobenius norm tackles Gaussian noise. Besides, Li et al. [36] tried to describe the noise using Mixture of Gaussian Regression (MoG Regression). Although it showed its superiority in comparison experiments, it is sensitive to the number of Gaussian components and has a high computational cost.

To alleviate the effect of noise on subspace clustering, we propose in this paper a subspace clustering method that uses the Cauchy loss function (CLF) to suppress the noise term. Compared with the conventional $\ell_{1}$ or $\ell_{2}$ loss, the influence function of CLF has an upper bound. It can therefore alleviate the influence of a single sample, especially a sample with large noise, on estimating the residuals. As a result, CLF depends less on the distribution of the noise and is more robust to it. Because our work mainly focuses on the noise term, we simply use the Frobenius norm to regularize the representation matrix. Furthermore, we prove the grouping effect of our method, which means that highly correlated data can be grouped together. Experimental results on real datasets show the effectiveness of our proposed method.

I-A Paper Contributions and Organization

Our work has the following three main contributions.

1. We propose a robust subspace clustering method based on the Cauchy loss function (CLF). Specifically, CLF penalizes points with large noise without making a specific assumption about the distribution of the noise. Our method is therefore more robust to the different kinds of noise found in real data.

2. The grouping effect of our method is theoretically proved, which means the method can preserve the local structure of the raw data. Highly correlated points can therefore be grouped together in the low-dimensional subspace.

3. We verify our method on different real applications, including motion segmentation and image clustering. The experimental results show that our method achieves better performance than several representative methods.

The rest of this paper is arranged as follows. Related work is introduced in Section II. Section III gives the problem formulation and the whole framework of our subspace clustering algorithm. In Section IV, we prove the grouping effect of our method, which is a very useful property for subspace clustering, and then analyze the convergence of our optimization algorithm. The experimental results on real databases are presented in Section V. Finally, the paper is briefly concluded in Section VI.

II Related Work

Considering that our proposed method is a kind of spectral clustering based method, we mainly review the most recent and related works. Throughout the paper, we use the non-bold letters, bold lower case letters and bold upper case letters to represent scalars, vectors and matrices respectively.

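Sparse Subspace Clustering (SSC) [22], one of the first spectral clustering based methods, aims to find the sparsest representation of each point in terms of all other points in a union of subspaces by solving the following problem:

$$\min_{\mathbf{Z},\mathbf{E}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\left\|\mathbf{Z}\right\|_{0} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0}, \tag{2}$$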

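where $\lambda>0$ is a weighting factor that balances the two terms. The constraint $diag(\bf{Z})=0$ prevents the solution $\bf{Z}$ from being the identity matrix, that is, a point cannot be used to reconstruct itself. As is well known, finding such a sparse representation is an NP-hard problem, so SSC uses the $\ell_{1}$ norm to approximate the $\ell_{0}$ norm. The final objective function is given below:

$$\min_{\mathbf{Z},\mathbf{E}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\left\|\mathbf{Z}\right\|_{1} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0}. \tag{3}$$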

SSC assumes that a point can be reconstructed using only a few points in the same subspace. When the data are drawn from independent subspaces, SSC can divide the points into their subspaces. For real data, however, the representation matrix of SSC may be too sparse to capture the relationship between points in the same subspace. Based on SSC, Wang and Xu [38] proposed a modified version, named Noisy Sparse Subspace Clustering (NSSC), to deal with noisy data.

Low-Rank Representation (LRR) [29] was proposed to capture the correlation structure of the data by finding a low-rank representation of the samples instead of a sparse one. The original problem of LRR is formulated as

$$\min_{\mathbf{Z}} \mathrm{rank}(\mathbf{Z}) \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ}. \tag{4}$$

The above optimization problem is hard to solve due to the discrete nature of the rank function, so LRR adopts the nuclear norm as a surrogate of the rank function. Furthermore, LRR uses the $\ell_{21}$ norm on the noise term to improve its robustness to noise and outliers. The subspace clustering problem becomes

$$\min_{\mathbf{Z}} \left\|\mathbf{E}\right\|_{21} + \lambda\left\|\mathbf{Z}\right\|_{*} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E}. \tag{5}$$

However, there is no theoretical analysis about the importance of low rank property of the representation matrix $\bf{Z}$ for subspace clustering. Besides, the solution $\bf{Z}^{*}$ may be very dense and far from block-diagonal.

Least Squares Regression (LSR) [27] employs the Frobenius norm to handle the representation matrix and the noise matrix simultaneously. The corresponding optimization problem is defined as

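$$\min_{\mathbf{Z}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\left\|\mathbf{Z}\right\|_{F}^{2} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E}. \tag{6}$$

Note that the above problem can be solved efficiently. The main contribution of LSR is that it encourages a grouping effect, which groups highly correlated data together.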
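One way to see why the LSR problem is cheap to solve: substituting $\mathbf{E}=\mathbf{X}-\mathbf{XZ}$ and setting the gradient with respect to $\mathbf{Z}$ to zero gives the linear system $(\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})\mathbf{Z}=\mathbf{X}^{T}\mathbf{X}$. The sketch below solves that system with NumPy. The function name `lsr_representation` is an assumption made for illustration, and this is not claimed to be the implementation used in [27].

```python
import numpy as np

def lsr_representation(X, lam):
    """Closed-form minimizer of ||X - X Z||_F^2 + lam * ||Z||_F^2 (columns of X are samples)."""
    n = X.shape[1]
    G = X.T @ X                                   # Gram matrix X^T X
    # Solve (G + lam * I) Z = G instead of forming an explicit inverse.
    return np.linalg.solve(G + lam * np.eye(n), G)

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((4, 6))              # hypothetical toy data
Z_demo = lsr_representation(X_demo, lam=0.5)
print(Z_demo.shape)
```

For $\lambda>0$ the system matrix is positive definite, so the solution is unique.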

In order to balance the sparsity and low rank property of the representation matrix, Correlation Adaptive Subspace Segmentation (CASS) [30] was proposed to optimize the problem

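$$\min_{\mathbf{Z},\mathbf{E}} \left\|\mathbf{E}\right\|_{F}^{2} + \lambda\sum_{i=1}^{n}\left\|\mathbf{X}\,\mathrm{diag}(\mathbf{z}_{i})\right\|_{*} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E}, \tag{7}$$

where ${{{\left\|{{\bf{X}}diag({{\bf{z}}_{i}})}\right\|}_{*}}}$ is the trace lasso, whose definition can be found in [30]. Because it takes the data correlation into account, CASS can adaptively interpolate between SSC and LSR.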

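Mixture of Gaussian Regression (MoG Regression) [36], the method most closely related to our work, uses a mixture of Gaussians to describe the noise term and tries to solve the following problem

$$\min_{\mathbf{Z},\mathbf{E},\boldsymbol{\pi},\boldsymbol{\Sigma}} -\sum_{i=1}^{n}\ln\left(\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(\mathbf{e}_{i}\mid\mathbf{0},\boldsymbol{\Sigma}_{k})\right) + \lambda\left\|\mathbf{Z}\right\|_{F}^{2} \quad \text{s.t.} \quad \mathbf{X} = \mathbf{XZ} + \mathbf{E},\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0},\ \pi_{k}\geq 0,\ \boldsymbol{\Sigma}_{k}\in\mathcal{S}^{+},\ \sum_{k=1}^{K}\pi_{k} = 1, \tag{8}$$

where ${\pi_{k}}$ is the mixing weight, ${{\bf{e}}_{i}}$ is the $i$ -th column of the noise matrix $\bf{E}$ (each Gaussian component has zero mean), ${{\bf{\Sigma}}_{k}}$ is the covariance matrix and $K$ denotes the number of Gaussian components. Although MoG Regression performs better than the single Gaussian model, it is only an extended version of the single Gaussian model and is sensitive to the number of Gaussian components. Additionally, solving the above problem has a high computational cost.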

III Subspace Clustering by CLF

In this paper, we propose a new spectral clustering based method to alleviate the influence of noise on subspace clustering. In particular, we employ the Cauchy loss function (CLF) to suppress the noise. Next we give the details of our objective function and the framework of our subspace clustering method.

III-A Problem Formulation

In statistics, M-estimators are a broad class of estimators that are obtained as the minima of sums of functions. Let $r_{i}$ denote the residual between the $i$ -th data point and its estimated value, and let $\rho(r_{i})$ be a symmetric, positive-definite function with a unique minimum at zero. An M-estimator solves the following problem:

$$\min \sum_{i}\rho(r_{i}). \tag{9}$$

The influence function of $\rho$ -function is defined as:

$$\psi(x) = \frac{\partial\rho(x)}{\partial x}, \tag{10}$$

which is used to measure the effect of changing a point of the sample on the value of the parameter estimation.

We illustrate different estimators and their influence functions in Fig. 1. For the $l_{2}$ estimator (least squares) with $\rho(x)=x^{2}$ , the influence function is $\psi(x)=2x$ . From Fig. 1, we can see that the influence of a sample on the estimate grows linearly as the error increases, which means the $l_{2}$ estimator is not robust to noise. Although the $l_{1}$ estimator (least absolute deviation) with $\rho(x)=\left|x\right|$ can alleviate the effect of a large error, its influence function has no cut-off [39, 40]. For a robust estimator, the influence function should not be sensitive to an increase of the error. CLF has this property. It is defined as

$$\rho(x) = \log\left(1 + (x/c)^{2}\right) \tag{11}$$

with influence function

$$\psi(x) = \frac{2x}{x^{2} + c^{2}}, \tag{12}$$

where $c$ is a constant. Note that CLF’s influence function is bounded, with $\left|\psi(x)\right|\leq 1/c$ (attained at $\left|x\right|=c$ ), and its value tends to zero as the error increases.
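The contrast between the three influence functions can be checked numerically. The sketch below evaluates $\psi$ for the $l_2$, $l_1$ and Cauchy losses on a few residuals; each $\psi$ is obtained by differentiating the corresponding $\rho$ (constant factors do not change the qualitative behaviour). The function names and the sample residuals are assumptions made for illustration.

```python
import numpy as np

def psi_l2(x):
    """Influence function of rho(x) = x**2: grows without bound."""
    return 2.0 * x

def psi_l1(x):
    """Influence function of rho(x) = |x| (for x != 0): constant magnitude, no decay."""
    return np.sign(x)

def psi_cauchy(x, c=1.0):
    """Influence function of rho(x) = log(1 + (x / c)**2): bounded by 1 / c, decays to 0."""
    return 2.0 * x / (x**2 + c**2)

# Hypothetical residuals, from small errors to a gross outlier.
residuals = np.array([0.1, 1.0, 10.0, 100.0])
for name, psi in [("l2", psi_l2), ("l1", psi_l1), ("cauchy", psi_cauchy)]:
    print(name, psi(residuals))
```

For the Cauchy loss the value peaks at $|x|=c$ and then shrinks, so a sample with a very large residual contributes less to the estimate than a moderately noisy one.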

Considering CLF is robust to the noise, we use CLF to penalize the noise term which is defined as

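$$\sum_{i=1}^{n}\log\left(1 + \frac{\left\|\mathbf{x}_{i} - \mathbf{X}\mathbf{z}_{i}\right\|_{2}^{2}}{c^{2}}\right), \tag{13}$$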

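where $\bf{X}$ is the data matrix, and ${\bf{z}}_{i}$ denotes the representation vector of the $i$ -th data point ${\bf{x}}_{i}$ . As stated before, we simply use the Frobenius norm to regularize the representation matrix, both to isolate the influence of the noise model on subspace clustering and to keep the problem easy to solve. The corresponding model can be formulated as

$$\min_{\mathbf{Z}} \sum_{i=1}^{n}\log\left(1 + \frac{\left\|\mathbf{x}_{i} - \mathbf{X}\mathbf{z}_{i}\right\|_{2}^{2}}{c^{2}}\right) + \lambda\left\|\mathbf{Z}\right\|_{F}^{2}, \tag{14}$$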

where $\lambda$ is a weight factor that balances the two terms. Problem (14) can be solved with an iterative algorithm for each data point separately, but this is not an efficient way to obtain the representation matrix. To reduce the time complexity while keeping the valuable robustness property, we revise (14) and give the final objective function

[TABLE]

Note that this formulation treats the representation matrix $\bf{Z}$ as a whole. Therefore, we can optimize the representation matrix directly with an iterative process.

III-B Optimization

For problem (15), we adopt the Iteratively Re-weighted Residuals (IRR) method to find the solution. Given the data matrix $\bf{X}$, formula (15) can be rewritten as

[TABLE]

Setting the derivative of $\mathcal{J}$ with respect to $\bf{Z}$ to zero, we have

[TABLE]

which is equivalent to

[TABLE]

Then we can obtain the solution

[TABLE]

where $\bf{R}$ is the residual between the data matrix and the corrected matrix, and $Q$ is the weight function used to reduce the effect of the noise. Note that $Q$ depends on the representation matrix $\bf{Z}$, so $\bf{Z}$ is updated iteratively until convergence. The whole procedure for solving problem (15) is described in Algorithm 1.
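The iteration can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' Algorithm 1: it assumes that (15) reads $\min_{\bf{Z}} \log(1+\|{\bf{X}}-{\bf{XZ}}\|_{F}^{2}/c^{2})+\lambda\|{\bf{Z}}\|_{F}^{2}$, that ${\bf{R}}={\bf{X}}-{\bf{XZ}}$, and that $Q=2/(c^{2}+\|{\bf{R}}\|_{F}^{2})$ is a scalar weight, which gives the update ${\bf{Z}}=(Q{\bf{X}}^{T}{\bf{X}}+2\lambda{\bf{I}})^{-1}Q{\bf{X}}^{T}{\bf{X}}$. These forms are our reading of the surrounding text (the displayed equations are not reproduced here); the function names, the zero initialization and the stopping rule are hypothetical.

```python
import numpy as np

def cauchy_objective(X, Z, lam, c):
    """Assumed objective: log(1 + ||X - XZ||_F^2 / c^2) + lam * ||Z||_F^2."""
    R = X - X @ Z
    return np.log1p(np.sum(R ** 2) / c ** 2) + lam * np.sum(Z ** 2)

def irr_solve(X, lam=0.1, c=1.0, max_iter=100, tol=1e-8):
    """Fixed-point iteration Z <- (Q G + 2*lam*I)^{-1} Q G with G = X^T X."""
    n = X.shape[1]
    G = X.T @ X
    Z = np.zeros((n, n))                      # hypothetical initialization
    history = [cauchy_objective(X, Z, lam, c)]
    for _ in range(max_iter):
        R = X - X @ Z                         # residual under the current Z
        Q = 2.0 / (c ** 2 + np.sum(R ** 2))   # scalar weight, small when the residual is large
        Z_new = np.linalg.solve(Q * G + 2.0 * lam * np.eye(n), Q * G)
        history.append(cauchy_objective(X, Z_new, lam, c))
        converged = np.linalg.norm(Z_new - Z) <= tol * max(1.0, np.linalg.norm(Z))
        Z = Z_new
        if converged:
            break
    return Z, history

# Toy data: 30 unit-norm points in 20 dimensions (columns are samples).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 30))
X /= np.linalg.norm(X, axis=0, keepdims=True)
Z, history = irr_solve(X, lam=0.1, c=1.0)
print("iterations:", len(history) - 1)
print("objective: %.4f -> %.4f" % (history[0], history[-1]))
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse; the per-iteration cost is still dominated by the $n\times n$ solve.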

III-C Subspace Clustering Algorithm via CLF

In this section, we give the framework of our proposed subspace clustering algorithm, which is outlined in Algorithm 2. We first use Algorithm 1 to find the representation matrix ${\bf{Z}}^{*}$. The similarity matrix is then defined as ${\bf{W}}=(\left|{{{\bf{Z}}^{*}}}\right|+|{{{({{\bf{Z}}^{*}})}^{T}}}|)/2$, where ${({{\bf{Z}}^{*}})^{T}}$ is the transpose of ${\bf{Z}}^{*}$. Finally, Normalized Cuts [41], a spectral clustering algorithm [42], is employed to group the data points into $k$ clusters based on the similarity matrix.
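The two post-processing steps can be illustrated as follows. The similarity matrix follows the definition above; the embedding step is a generic normalized-Laplacian spectral embedding that we substitute for Normalized Cuts [41] for brevity, so it should be read as an illustration of the idea rather than the exact procedure of Algorithm 2. All function names are hypothetical.

```python
import numpy as np

def similarity_from_representation(Z):
    """W = (|Z| + |Z^T|) / 2, which is symmetric and non-negative."""
    A = np.abs(Z)
    return (A + A.T) / 2.0

def spectral_embedding(W, k):
    """Row-normalized eigenvectors for the k smallest eigenvalues of the
    symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    U = vecs[:, :k]
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

# Toy representation matrix with two diagonal blocks (two subspaces).
rng = np.random.default_rng(0)
Z = np.zeros((6, 6))
Z[:3, :3] = rng.uniform(0.5, 1.0, (3, 3))
Z[3:, 3:] = -rng.uniform(0.5, 1.0, (3, 3))    # signs are removed by |.|
W = similarity_from_representation(Z)
U = spectral_embedding(W, k=2)
print(np.round(U, 3))
```

For an exactly block-diagonal ${\bf{W}}$, rows of the embedding that belong to the same block coincide, so any simple clustering of the rows (for example Kmeans) recovers the blocks.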

To illustrate the structure of the learned similarity matrix, Fig. 2 shows the similarity matrices of 10 subjects on the USPS dataset derived by SSC, LRR, LSR, CASS, MoG Regression, NSSC and our proposed method. For simplicity, we use MoG to denote MoG Regression. USPS is a popular handwritten digit database for clustering analysis. From Fig. 2, we can see that all the methods give an approximately block-diagonal matrix. The similarity matrices obtained by SSC and CASS are sparse and similar to each other, which means that CASS gives a large weight to the sparsity of the representation matrix. NSSC also gives a very sparse similarity matrix. However, the points in the same cluster then lack high correlation, which can degrade the performance of subspace clustering. In contrast, the similarity matrices learned by LRR, LSR, MoG Regression and our method are dense and give high similarity to the samples within the same cluster. Furthermore, we define a Contrast Index (CI) to quantitatively measure the difference between the diagonal blocks and the non-diagonal blocks of the similarity matrix. The corresponding formulation is

[TABLE]

where $S_{D}$ and $S_{ND}$ denote the sums of the elements in the diagonal and non-diagonal blocks, respectively. Table I gives the CI of the similarity matrices obtained by the different methods. MoG gives the lowest CI (16.80%), which is consistent with Fig. 2. Our method gives the highest CI (38.12%), ahead of NSSC (32.39%), which indicates that our proposed model has a greater ability to group correlated data together.

IV Theoretical Analysis

In this section, we prove that our proposed method has the grouping effect, i.e., it groups highly correlated data together, and then analyze the convergence of our optimization algorithm.

IV-A The Grouping Effect

Theorem 1.

Let ${\bf{x}}\in\mathbb{R}^{d}$ be a data point, $\bf{X}$ the normalized data matrix and $\lambda$ a parameter, and let ${\bf{\hat{z}}}$ be the optimal solution to the following problem (in vector form):

[TABLE]

Then we have

[TABLE]

where $r={\bf{x}}_{i}^{T}{{\bf{x}}_{j}}$ is the sample correlation, ${{{\hat{z}}}^{i}}$ and ${{{\hat{z}}}^{j}}$ are the $i$-th and $j$-th entries of the vector ${\bf{\hat{z}}}$, and ${\bf{x}}_{i}$ and ${\bf{x}}_{j}$ are the $i$-th and $j$-th columns of $\bf{X}$.

Proof.

Let

[TABLE]

Since ${\bf{\hat{z}}}=\mathop{\arg\min}\limits_{\bf{z}}L({\bf{z}})$ , we have

[TABLE]

This gives

[TABLE]

Equations (25) and (26) give

[TABLE]

Since each column of $\bf{X}$ is normalized, ${\left\|{{{\bf{x}}_{i}}-{{\bf{x}}_{j}}}\right\|_{2}}=\sqrt{2(1-r)}$, where $r={{\bf{x}}_{i}}^{T}{{\bf{x}}_{j}}$. Since $\bf{\hat{z}}$ is the optimal solution of problem (21), we deduce

[TABLE]

Thus ${\left\|{{\bf{x}}-{\bf{X\hat{z}}}}\right\|_{2}}\leq{\left\|{\bf{x}}\right\|_{2}}$ . Finally, we obtain

[TABLE]

∎

As stated in Theorem 1, if ${\bf{x}}_{i}$ and ${\bf{x}}_{j}$ are highly correlated, the value of $r$ is close to 1, which means that the difference between ${\hat{z}}^{i}$ and ${\hat{z}}^{j}$ is almost 0. Then ${\bf{x}}_{i}$ and ${\bf{x}}_{j}$ can be grouped into the same cluster. Note that Theorem 1 gives the grouping effect for one point (vector form). For the matrix form, the corresponding grouping effect can be proved by a procedure similar to the proof of Theorem 1.
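The grouping effect can be observed on a small synthetic example. The sketch below assumes that the vector-form problem is $\min_{\bf{z}} \log(1+\|{\bf{x}}-{\bf{Xz}}\|_{2}^{2}/c^{2})+\lambda\|{\bf{z}}\|_{2}^{2}$ and solves it with the same kind of reweighted fixed-point iteration as in Section III-B; the objective form, the helper names and the synthetic data are all our assumptions. It prints, for two pairs of columns, the sample correlation $r$ and the gap between the corresponding coefficients.

```python
import numpy as np

def irr_vector(x, X, lam=0.1, c=1.0, n_iter=200):
    """Reweighted fixed-point iteration for the assumed vector-form problem."""
    n = X.shape[1]
    G, b = X.T @ X, X.T @ x
    z = np.zeros(n)
    for _ in range(n_iter):
        r = x - X @ z
        q = 2.0 / (c ** 2 + r @ r)            # scalar weight from the residual
        z = np.linalg.solve(q * G + 2.0 * lam * np.eye(n), q * b)
    return z

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
d = 15
base = unit(rng.standard_normal(d))
X = np.stack([unit(rng.standard_normal(d)) for _ in range(8)], axis=1)
X[:, 0] = base
X[:, 1] = base                                          # identical to column 0, r = 1
X[:, 2] = unit(base + 0.05 * rng.standard_normal(d))    # r close to 1
x = unit(base + 0.1 * rng.standard_normal(d))

z = irr_vector(x, X)
r01, r02 = X[:, 0] @ X[:, 1], X[:, 0] @ X[:, 2]
print("r = %.4f, |z0 - z1| = %.2e" % (r01, abs(z[0] - z[1])))
print("r = %.4f, |z0 - z2| = %.2e" % (r02, abs(z[0] - z[2])))
```

Identical columns receive identical coefficients because the linear system solved in each step is invariant under swapping them.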

IV-B Convergence Analysis

We employ Weiszfeld's method [43] to analyze the convergence of Algorithm 1. Formula (16) is equivalent to

[TABLE]

where ${\bf{z}}_{i}$ is the representation vector of ${\bf{x}}_{i}$. The solution $\bf{Z}$ in (19) can be rewritten as

[TABLE]

The main idea of Weiszfeld's method is to globally approximate $\cal J$ using a sequence of quadratic functions [44]. After obtaining the solution ${\bf{Z}}^{k}$, we can define an upper bound of ${\cal J}({\bf{z}}_{i})$ as $\phi({\bf{z}}_{i};{\bf{z}}_{i}^{k})$, where ${\cal J}({\bf{z}}_{i})$ is obtained by fixing the other variables in ${\cal J}({\bf{Z}})$. The function $\phi({\bf{z}}_{i};{\bf{z}}_{i}^{k})$ should satisfy the following conditions:

[TABLE]

Then $\phi({\bf{z}}_{i};{\bf{z}}_{i}^{k})$ has the form

[TABLE]

with symmetric matrix $C({\bf{z}}_{i}^{k})$

[TABLE]

Then the convergence of Algorithm 1 can be guaranteed by the following theorem.

Theorem 2.

The IRR algorithm in Algorithm 1 guarantees that the objective function value of (16) is monotonically non-increasing over the iterations, i.e. ${\cal J}({{\bf{Z}}^{k+1}})\leq{\cal J}({{\bf{Z}}^{k}})$, until it converges.

Proof.

Suppose that $\phi({\bf{z}}_{i};{\bf{z}}_{i}^{k})$ is locally convex with respect to ${\bf{z}}_{i}$ and has a local minimizer. Letting ${\bf{z}}_{i}^{k+1}$ be this minimizer, we get

[TABLE]

Substituting for ${\cal J}^{\prime}({\bf{z}}_{i}^{k})$ , we can obtain the update rule in formula (31).

By appropriately choosing ${\bf{z}}_{i}^{k}$ near ${\bf{z}}_{i}$ , we have ${\cal J}({\bf{z}}_{i})\leq\phi({\bf{z}}_{i};{\bf{z}}_{i}^{k})$ which implies that

[TABLE]

Equations (35) and (36) give

[TABLE]

So we have ${\cal J}({\bf{z}}_{i}^{k+1})\leq{\cal J}({\bf{z}}_{i}^{k})$. Based on (30), we can easily deduce

[TABLE]

∎
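A compact way to see why a reweighting step of this kind cannot increase the objective is the concavity of the logarithm. The derivation below assumes that (16) has the form ${\cal J}({\bf{Z}})=\log(1+\|{\bf{X}}-{\bf{XZ}}\|_{F}^{2}/c^{2})+\lambda\|{\bf{Z}}\|_{F}^{2}$ and that $Q$ is the scalar weight $2/(c^{2}+\|{\bf{R}}\|_{F}^{2})$; both are our reading of the text, and the symbols $u$ and $u_{k}$ are introduced only for this illustration. It is meant as a sanity check on Theorem 2, not as a replacement for the proof above.

```latex
% Assumption: u(Z) = \|X - XZ\|_F^2, \mathcal{J}(Z) = \log(1 + u(Z)/c^2) + \lambda \|Z\|_F^2,
% and u_k = u(Z^k).
\begin{align*}
\log\Bigl(1+\tfrac{u}{c^{2}}\Bigr)
  &\le \log\Bigl(1+\tfrac{u_k}{c^{2}}\Bigr) + \frac{u-u_k}{c^{2}+u_k}
  && \text{(tangent of a concave function at } u_k\text{)} \\
\phi(Z;Z^{k})
  &:= \log\Bigl(1+\tfrac{u_k}{c^{2}}\Bigr) + \frac{u(Z)-u_k}{c^{2}+u_k} + \lambda\|Z\|_F^{2}
  \;\ge\; \mathcal{J}(Z),
  \qquad \phi(Z^{k};Z^{k}) = \mathcal{J}(Z^{k}) \\
Z^{k+1}
  &= \arg\min_{Z}\,\phi(Z;Z^{k})
  = \bigl(Q^{k}X^{T}X + 2\lambda I\bigr)^{-1} Q^{k}X^{T}X,
  \qquad Q^{k} = \frac{2}{c^{2}+u_k} \\
\mathcal{J}(Z^{k+1})
  &\le \phi(Z^{k+1};Z^{k}) \le \phi(Z^{k};Z^{k}) = \mathcal{J}(Z^{k})
\end{align*}
```

Under these assumptions the surrogate is a ridge-regression objective, so each step has a closed-form minimizer and the chain of inequalities holds for every iterate.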

V Experimental Verification and Analysis

In this section, we verify the effectiveness of our proposed method on five real databases: the Hopkins 155 motion segmentation database [45], USPS [46], C-Cube [47, 48], FEI and the Extended Yale B database [49]. Our method is compared with the traditional Kmeans, SSC [22], LRR [29], LSR [27], CASS [30], MoG Regression [36] and NSSC [38]. SSC, LRR, LSR, CASS, MoG Regression and NSSC are representative subspace clustering methods, which are introduced in Section II. For a fair comparison with the previous methods, we adopt the same preprocessing for all databases: we use PCA to reduce the dimension of the original data while keeping nearly 98 percent of the energy. Besides, the parameters of each method are manually tuned to achieve its best performance. Finally, we employ the clustering accuracy (AC) [50, 51] and the normalized mutual information (NMI) [52, 53] to evaluate the subspace clustering results. The experimental results show that our method achieves better performance than the other state-of-the-art methods.
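The preprocessing step can be sketched as follows. We assume that "energy" means the fraction of the total variance captured by the retained principal components and that samples are stored as columns; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def pca_keep_energy(X, energy=0.98):
    """Project the columns of X (d x n) onto the fewest principal
    components whose cumulative variance ratio reaches `energy`."""
    Xc = X - X.mean(axis=1, keepdims=True)          # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)      # cumulative energy
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return U[:, :k].T @ Xc, k

# Toy data: 200 samples that lie close to a 3-dimensional subspace of R^50.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 200))
X += 0.01 * rng.standard_normal(X.shape)
Y, k = pca_keep_energy(X, energy=0.98)
print("kept", k, "of", X.shape[0], "dimensions; projected shape:", Y.shape)
```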

V-A Data sets

We first describe the five real data sets used in the experiments.

• The first data set is the Hopkins 155 motion segmentation database. It consists of 155 video sequences, of which 120 have two motions and 35 contain three motions (each motion corresponds to a subspace). For each video, feature trajectories have been extracted for clustering. The number of feature trajectories per video ranges from 39 to 550. Each video can be regarded as a subspace clustering task, so there are 155 subspace segmentation tasks in total.

• The second is the USPS database, which is one of the standard data sets for handwritten digit recognition [54]. It contains 9298 images of handwritten digits from 0 to 9. The size of each image is $16\times 16$. To reduce the memory consumption in our experiments, we randomly select 30 images for each digit to construct a subset with 300 samples.

• The third is the C-Cube cursive character data set, which contains both the upper and lower case of the 26 letters. It has 57646 character images, and the average dimension of all images is about 3120. For each subject, we randomly select 20 images to form a subset for our experiments. Each image is then normalized to a $24\times 24$ pixel array and reshaped into a vector.

• The fourth data set is the FEI part 1 database, a subset of the whole FEI database. It contains 700 images of 50 subjects, and each subject has 14 images captured from a wide range of views.

• The fifth data set is the Extended Yale B database, which is a popular data set for image clustering [55, 56, 57]. It consists of 2414 frontal face images of 38 subjects, and each subject has about 64 frontal face images with different pose, angle and illumination conditions. In our experiments, we construct three subspace clustering tasks based on the first 5, 8 and 10 subjects, and each subject has 64 face images.

Fig. 3 gives some samples of these five data sets. From Fig. 3(e), we can see that Extended Yale B is a tough database for subspace clustering due to its large noise, so it lets us further verify the effectiveness of our method in handling noise. Table II gives the statistics of these databases. For the Hopkins 155 database, the size and dimensionality values are averages over all videos, and each video has 2 or 3 classes.

V-B Evaluation Criterion

The clustering results are evaluated by comparing the obtained label of each subspace clustering method with the groundtruth. The clustering accuracy (AC) and the normalized mutual information (NMI), as two popular metrics, are employed to measure the clustering performance.

Given an obtained label vector ${\bf{o}}_{i}$ and the corresponding groundtruth label vector ${\bf{g}}_{i}$, the AC is calculated by

[TABLE]

where ${{\bf{o}}_{i}}^{\prime}=map({{\bf{o}}_{i}})$. Here $map({\bf{o}}_{i})$ is the permutation mapping function that takes ${\bf{g}}_{i}$ as the reference vector and maps each element in ${\bf{o}}_{i}$ to the equivalent label in ${\bf{g}}_{i}$, which solves the correspondence problem between the two label vectors. The Kuhn-Munkres algorithm can be used to find the best mapping.
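A small example makes the role of the mapping concrete. The sketch below assumes that AC is the fraction of samples whose mapped label equals the groundtruth label, and it finds the best mapping by exhaustive search over permutations instead of the Kuhn-Munkres algorithm. This is only practical for a handful of clusters; for larger problems an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. The function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(obtained, groundtruth):
    """Best-match accuracy over all one-to-one label mappings (brute force)."""
    if len(obtained) != len(groundtruth):
        raise ValueError("label vectors must have the same length")
    pred = sorted(set(obtained))
    true = sorted(set(groundtruth))
    # Pad with dummy targets so every predicted label gets a distinct image,
    # even when there are more predicted clusters than groundtruth classes.
    targets = true + [None] * max(0, len(pred) - len(true))
    best = 0
    for perm in permutations(targets, len(pred)):
        mapping = dict(zip(pred, perm))
        hits = sum(mapping[o] == g for o, g in zip(obtained, groundtruth))
        best = max(best, hits)
    return best / len(groundtruth)

# Same partition with permuted label names: every sample matches after mapping.
print(clustering_accuracy([1, 1, 0, 0, 2, 2], [0, 0, 1, 1, 2, 2]))
# One sample placed in the wrong cluster.
print(clustering_accuracy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```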

Mutual Information (MI), a symmetric measure that quantifies the information shared between two statistical distributions, provides a degree of agreement between two clustering results. Let $c_{p}$ be a cluster obtained from the groundtruth ${\bf{g}}_{i}$ and $c^{\prime}_{q}$ a cluster obtained from our clustering result ${\bf{o}}_{i}$. Then the corresponding MI is defined as follows:

[TABLE]

where $k$ and $k^{\prime}$ denote the number of clusters in the groundtruth and in our clustering result, respectively, $n_{p}$ is the number of points in cluster $c_{p}$, $n^{\prime}_{q}$ is the number of points in cluster $c^{\prime}_{q}$, and $n_{pq}$ denotes the number of points shared by $c_{p}$ and $c^{\prime}_{q}$. To obtain a normalized version of MI that ranges from 0 to 1, we define the NMI metric as

[TABLE]

where $H(\cdot)$ denotes the entropy function.
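The two quantities can be computed directly from the cluster counts. The sketch below assumes the count-based form $MI=\sum_{p}\sum_{q}\frac{n_{pq}}{n}\log\frac{n\,n_{pq}}{n_{p}n^{\prime}_{q}}$ and normalizes by the larger of the two entropies; the normalization is an assumption on our part (dividing by the geometric mean of the entropies is another common choice), and the function names are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(groundtruth, obtained):
    """MI between two labelings, computed from the contingency counts."""
    n = len(groundtruth)
    n_p = Counter(groundtruth)                    # cluster sizes in the groundtruth
    n_q = Counter(obtained)                       # cluster sizes in the result
    n_pq = Counter(zip(groundtruth, obtained))    # shared points per cluster pair
    return sum((m / n) * log(n * m / (n_p[p] * n_q[q]))
               for (p, q), m in n_pq.items())

def entropy(labels):
    n = len(labels)
    return -sum((m / n) * log(m / n) for m in Counter(labels).values())

def nmi(groundtruth, obtained):
    """MI divided by max(H(groundtruth), H(obtained)); assumed normalization."""
    denom = max(entropy(groundtruth), entropy(obtained))
    return mutual_information(groundtruth, obtained) / denom if denom > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition, different label names
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # statistically independent partitions
```

Unlike AC, NMI needs no label mapping, because it depends only on how the two partitions overlap.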

V-C Parameter Selection

Our proposed method has two essential parameters: the weight factor $\lambda$ and the constant $c$. We conduct comparison experiments to choose the best parameters for each method on all databases. To reduce the memory consumption in our experiments, we only use the first five videos of the Hopkins 155 database to choose the appropriate parameters. For the USPS, C-Cube, FEI and Extended Yale B databases, we use the first five subjects of each database to select the parameters. Besides, we set the range of both $\lambda$ and $c$ to $[10^{-4},10^{4}]$.

Fig. 4 shows the performance of the different methods as the parameter $\lambda$ varies. For NSSC, the optimization method usually fails to reach a locally optimal solution when $\lambda<1$, whereas it performs well when $\lambda=10$, so we fix $\lambda=10$ for NSSC on all datasets. Our method gives its best performance on the Hopkins 155 database when $\lambda=10^{-4}$; hence, we fix $\lambda=10^{-4}$ for our method on Hopkins 155. For the USPS database, our method obtains a better performance than the other methods when $\lambda$ is larger than 0.01 and gives the largest CI when $\lambda=1$. For C-Cube, our proposed method gives the best clustering result when $\lambda=0.5$. For the FEI and Extended Yale B databases, our method is effective when $\lambda$ is around 0.01. Compared with the other methods, MoG gives a stable performance on these five databases with respect to $\lambda$, although its clustering accuracy is poor on the USPS and FEI databases. For the USPS and Extended Yale B databases, the curves of LRR and CASS both fluctuate more strongly. Because Kmeans has no parameter, its accuracy curve is a straight line. Note that Kmeans performs very poorly on the Extended Yale B database.

For the parameter $c$, the comparison methods have no such parameter, so their curves are straight lines. Our method gives its best performance when $c$ is smaller than 1 on the Hopkins 155, USPS, C-Cube and Extended Yale B databases. For Extended Yale B in particular, the accuracy of our method is almost 100 percent. For FEI, the accuracy of our method is highest when $c=0.01$. Therefore, our method is able to achieve the best performance on all databases. Note that when the value of $c$ is larger than 0 or 1, the performance of our method tends to decrease rapidly. From our objective function (15), we can see that when $c$ increases, the noise term becomes very small in all situations, which directly reduces the ability of our objective function to suppress large noise. Hence, using the Cauchy loss function for the noise term is an effective way to reduce the influence of noise on subspace clustering. The best parameters of each method for the experiments on all databases are listed in Table III.

V-D Experimental results

Tables IV, V, VI, VII and VIII give the experimental results of the different methods on the Hopkins 155, USPS, C-Cube, FEI and Extended Yale B databases, respectively. From Table IV, we can see that MoG and our method give the best average accuracy over all videos, but the corresponding NMI of MoG is lower than that of our proposed method. For the 3-motion case, our method gives the best results on both AC and NMI. The medians of our method for the 2-motion and total cases reach 100 percent, which shows the superiority of our proposed method. Although the accuracy of MoG is slightly higher than that of our method for 2 motions, our method gives better clustering results when all cases are considered together.

For the USPS data set, our method outperforms the other algorithms in all cases. In particular, for 5 subjects the accuracy of our method is more than 7 percent better than the second best result. Note that Kmeans performs better than CASS and MoG on the USPS database, which suggests that the handwritten digit data may lack a subspace structure. Even so, our method still shows its effectiveness on this data set.

For C-Cube, SSC shows good performance for 20 subjects based on AC, and LSR gives the highest NMI for 50 subjects. However, our method outperforms the other methods in eight out of ten total cases. In particular, the AC value of our method is more than 4 percent higher than the second best result.

From Table VII, we can see that SSC outperforms the other methods for 30 and 40 subjects, and CASS gives the best performance with 20 subjects. These results can be attributed to the subspace-preserving property of sparsity. For the remaining cases, our method achieves the best clustering results. In particular, the accuracy of our method is nearly 99 percent for 5 subjects.

Table VIII shows the clustering results on the Extended Yale B database. Our method achieves the best results on all three clustering tasks, with MoG reaching the same accuracy as our method for 8 subjects. For 5 subjects, the accuracy of our method is higher than the second best result by 10 percent, which is a significant improvement. Note that Kmeans performs very poorly on the Extended Yale B database, which means that the performance of Kmeans is easily influenced by the noise in the data. As stated in Section V-A, the Extended Yale B database contains large noise. Therefore, this experiment further verifies the effectiveness of our method in handling noise.

In summary, our proposed method is more robust to noise and outperforms the other state-of-the-art methods on all databases. This verifies that our method is capable of finding the underlying subspace structure and clustering the data points into their subspaces.

V-E Computational Complexity Analysis

As shown in Algorithm 1, the computational cost of our iterative algorithm depends on the computation of $\bf{Z}$, $Q$ and $\bf{R}$. The main cost for $\bf{Z}$ is the computation of ${\left({{Q}^{t+1}}{{{\bf{X}}^{T}}{\bf{X}}+2\lambda{\bf{I}}}\right)^{-1}}$, which is $\mathbf{O}({n^{3}})$. For $Q$, the cost is the computation of $\left\|{{{\bf{R}}^{t+1}}}\right\|_{F}^{2}$, which is $\mathbf{O}({n^{2}})$. The computational cost for $\bf{R}$ is $\mathbf{O}(d{n^{2}})$. Therefore, the overall time complexity of our optimization method is $\mathbf{O}(t{n^{3}}+td{n^{2}})$, where $n$ is the number of data points, $d$ is their dimensionality and $t$ denotes the number of iterations.

Furthermore, we give the computation time of the different algorithms. Due to space limitations, we only report the running time of all compared methods on the FEI data set, which is shown in Table IX. Note that the results are based on the code implemented by the respective authors. The calculations are performed on an Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz with 16.00GB memory and a 64-bit Windows 7 operating system. It can be seen that the computation time of LSR is lower than that of the other subspace clustering methods, because LSR obtains a closed-form solution directly without iterating. In contrast, SSC, CASS and MoG consume more time than the other methods. For MoG in particular, the computation time increases drastically with the number of subjects. As for LRR, NSSC and our method, their computational cost is moderate in all situations.

VI Conclusion

In this paper, we propose a robust subspace clustering method based on the Cauchy loss function (CLF). To this end, we use the CLF to penalize the noise term in order to suppress the large noise mixed into real data. Because the CLF's influence function has an upper bound, it can alleviate the influence of a single sample, especially a sample with large noise, on estimating the residuals. Furthermore, we theoretically prove the grouping effect of our proposed method and present its convergence analysis. Finally, experimental results on five real datasets show that our proposed method outperforms several representative methods.

Bibliography

[1] W. Hong, J. Wright, K. Huang, and Y. Ma, “Multiscale hybrid linear models for lossy image representation,” IEEE Trans. Image Process., vol. 15, pp. 3655–3671, Dec. 2006.
[2] J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate,” in Proc. Eur. Conf. Comput. Vis., Graz, Austria, May 2006, pp. 94–106.
[3] C. Lang, G. Liu, J. Yu, and S. Yan, “Saliency detection by multitask sparsity pursuit,” IEEE Trans. Image Process., vol. 21, pp. 1327–1338, Mar. 2012.
[4] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman, “Clustering appearances of objects under varying illumination conditions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Madison, WI, Jun. 2003, pp. 11–18.
[5] G. Cui, X. Li, and Y. Dong, “Subspace clustering guided convex nonnegative matrix factorization,” Neurocomputing, vol. 292, pp. 38–48, 2018.
[6] L. I. Smith, “A tutorial on principal components analysis,” Inform. Fusion, vol. 51, no. 3, pp. 219–226, 2002.
[7] X. Peng, H. Tang, L. Zhang, Z. Yi, and S. Xiao, “A unified framework for representation-based subspace clustering of out-of-sample and large-scale data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, pp. 2499–2512, Dec. 2016.
[8] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
