Ultra-Scalable Spectral Clustering and Ensemble Clustering

Dong Huang; Chang-Dong Wang; Jian-Sheng Wu; Jian-Huang Lai; Chee-Keong Kwoh

arXiv:1903.01057·cs.LG·March 6, 2019

TL;DR

This paper introduces ultra-scalable spectral and ensemble clustering algorithms designed for extremely large datasets, achieving high efficiency and robustness with nearly linear complexity, suitable for resource-limited environments.

Contribution

The paper presents two novel algorithms, U-SPEC and U-SENC, that significantly improve scalability and robustness of spectral clustering for large-scale data.

Findings

01

Capable of clustering ten-million-level datasets on standard PCs

02

Nearly linear time and space complexity achieved

03

Demonstrated robustness and scalability on various large datasets

Abstract

This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SPEC clusterers, a new bipartite graph is constructed between objects and…

Tables (16)

Table 1. TABLE I: Summary of notations

$𝒳$	A dataset of $N$ objects
$x_{i}$	The $i$ -th object in $𝒳$
$N$	Number of objects in $𝒳$
$d$	Dimension
$t$	Number of iterations in the $k$ -means method
$k$	Number of clusters in the clustering result
$p^{'}$	Number of candidate representatives
$p$	Number of representatives
$ℛ$	The set of representatives
$r_{i}$	The $i$-th representative in $ℛ$
$ℛ𝒞$	The set of rep-clusters
$rc_{i}$	The $i$-th rep-cluster in $ℛ𝒞$
$y_{i}$	Center of $rc_{i}$
$z_{1}$	Number of rep-clusters in $ℛ𝒞$
$z_{2}$	Average number of objects in each rep-cluster
$K$	Number of nearest representatives
$K^{'}$	Candidate neighborhood size around a representative
$Dist(x_{i}, rc_{j})$	Distance between object $x_{i}$ and rep-cluster $rc_{j}$
$G$	A bipartite graph between $𝒳$ and $ℛ$
$B$	Cross-affinity matrix of graph $G$
$b_{i j}$	The $(i, j)$ -th entry of $B$
$E$	Full affinity matrix of graph $G$
$L$	Graph Laplacian of graph $G$
$D$	Degree matrix of graph $G$
$u_{i}$	The $i$ -th eigenvector of graph $G$
$γ_{i}$	The $i$ -th eigenvalue of graph $G$
$G_{ℛ}$	A small graph with $ℛ$ as the node set
$E_{ℛ}$	Affinity matrix of graph $G_{ℛ}$
$L_{ℛ}$	Graph Laplacian of graph $G_{ℛ}$
$D_{ℛ}$	Degree matrix of graph $G_{ℛ}$
$v_{i}$	The $i$ -th eigenvector of graph $G_{ℛ}$
$λ_{i}$	The $i$ -th eigenvalue of graph $G_{ℛ}$
$D_{𝒳}$	Diagonal matrix with its $(i, i)$-th entry being the sum of the $i$-th row of $B$
$T$	Transition probability matrix
$Π$	The ensemble of $m$ base clusterings
$π^{i}$	The $i$ -th base clustering in $Π$
$m$	Number of base clusterings in $Π$
U-SPEC_i	The clusterer to generate the $i$ -th base clustering
$ℛ^{i}$	The set of representatives in U-SPEC_i
$r_{j}^{i}$	The $j$-th representative in $ℛ^{i}$
$k^{i}$	Number of clusters in $π^{i}$
$k_{min}$	Minimum number of clusters in a base clustering
$k_{max}$	Maximum number of clusters in a base clustering
$τ$	Random variable in $[0, 1]$
$𝒞$	Set of all clusters in $Π$
$C_{i}$	The $i$ -th cluster in $𝒞$
$k_{c}$	Number of clusters in $𝒞$
$\tilde{G}$	A bipartite graph between $𝒳$ and $𝒞$
$\tilde{B}$	Cross-affinity matrix of graph $\tilde{G}$
${\tilde{b}}_{i j}$	The $(i, j)$-th entry of $\tilde{B}$
${\tilde{u}}_{i}$	The $i$-th eigenvector of graph $\tilde{G}$
${\tilde{D}}_{𝒳}$	Diagonal matrix with its $(i, i)$-th entry being the sum of the $i$-th row of $\tilde{B}$
$G_{𝒞}$	A small graph with $𝒞$ as the node set
$E_{𝒞}$	Affinity matrix of graph $G_{𝒞}$
$L_{𝒞}$	Graph Laplacian of graph $G_{𝒞}$
$D_{𝒞}$	Degree matrix of graph $G_{𝒞}$
${\tilde{v}}_{i}$	The $i$ -th eigenvector of graph $G_{𝒞}$
${\tilde{λ}}_{i}$	The $i$ -th eigenvalue of graph $G_{𝒞}$

Table 2. TABLE II: Comparison of the time complexity of several large-scale spectral clustering methods.

Method	Representative selection	Affinity construction	Eigen-decomposition
Nyström [3]	/	$O (N p d)$	$O (N p + p^{3})$
LSC-R [4]	/	$O (N p d)$	$O (N p^{2} + p^{3})$
LSC-K [4]	$O (N p d t)$	$O (N p d)$	$O (N p^{2} + p^{3})$
U-SPEC	$O (p^{2} d t)$	$O (N p^{\frac{1}{2}} d)$	$O (N K (K + k) + p^{3})$

Table 3. TABLE III: Description of the real and synthetic datasets.

Dataset		#Object	Dimension	#Class
Real	PenDigits	10,992	16	10
	USPS	11,000	256	10
	Letters	20,000	16	26
	MNIST	70,000	784	10
	Covertype	581,012	54	7
Synthetic	TB-1M	1,000,000	2	2
	SF-2M	2,000,000	2	4
	CC-5M	5,000,000	2	3
	CG-10M	10,000,000	2	11
	Flower-20M	20,000,000	2	13

Table 4. TABLE IV: Average NMI(%) scores (over 20 runs) by our methods and the baseline spectral clustering methods (The best score in each row is in bold).

Dataset	$k$ -means	SC	ESCG	Nyström	LSC-K	LSC-R	FastESC	EulerSC	U-SPEC	U-SENC
PenDigits	66.66_±1.76	59.36_±0.00	76.41_±2.26	65.67_±1.16	79.73_±2.09	78.13_±2.20	65.31_±0.71	58.59_±0.73	80.30_±2.18	85.34_±0.91
USPS	44.11_±1.24	63.44_±0.01	48.41_±3.53	44.91_±1.28	66.86_±1.58	58.64_±1.31	41.36_±1.80	40.31_±1.91	63.47_±0.97	73.89_±1.82
Letters	34.86_±0.60	10.43_±0.50	35.80_±1.72	39.02_±0.83	43.41_±0.81	40.98_±0.93	35.92_±1.41	31.76_±0.92	42.53_±1.32	45.90_±0.58
MNIST	48.91_±2.00	74.07_±0.00	55.75_±4.62	47.78_±1.17	73.97_±1.46	62.16_±2.22	43.44_±1.85	8.93_±1.22	67.43_±1.55	75.02_±0.81
Covertype	6.17_±0.00	N/A	N/A	6.93_±0.07	6.75_±0.10	6.69_±0.12	9.15_±1.00	0.01_±0.00	6.97_±0.16	9.13_±1.21
TB-1M	25.71_±0.00	N/A	N/A	24.06_±0.01	0.10_±0.11	0.20_±0.24	24.01_±2.72	25.94_±0.01	95.86_±0.48	97.48_±0.05
SF-2M	47.34_±0.23	N/A	N/A	46.66_±0.02	66.45_±6.15	58.34_±6.92	52.03_±0.95	47.35_±2.19	75.59_±2.12	77.02_±2.32
CC-5M	0.00_±0.00	N/A	N/A	N/A	N/A	N/A	N/A	0.00_±0.00	99.87_±0.01	99.91_±0.00
CG-10M	63.20_±1.59	N/A	N/A	N/A	N/A	N/A	N/A	16.19_±0.21	78.82_±1.61	89.57_±3.96
Flower-20M	64.19_±2.56	N/A	N/A	N/A	N/A	N/A	N/A	26.61_±0.86	86.86_±2.05	92.47_±2.45
Avg. score	-	N/A	N/A	N/A	N/A	N/A	N/A	25.57	69.77	74.57
N-Avg. score	-	N/A	N/A	N/A	N/A	N/A	N/A	33.94	91.71	99.98
Avg. rank	-	5.90	6.00	5.20	3.70	4.60	5.20	6.00	2.50	1.10

Table 5. TABLE V: Average CA(%) scores (over 20 runs) by our methods and the baseline spectral clustering methods (The best score in each row is in bold).

Dataset	$k$ -means	SC	ESCG	Nyström	LSC-K	LSC-R	FastESC	EulerSC	U-SPEC	U-SENC
PenDigits	71.57_±3.12	56.44_±0.00	77.21_±3.81	71.13_±2.07	83.07_±3.21	81.82_±3.17	69.97_±1.15	65.85_±1.87	84.17_±3.26	88.56_±0.61
USPS	47.25_±2.57	62.74_±0.02	53.47_±3.94	51.09_±1.93	68.42_±2.39	60.78_±2.18	48.80_±1.76	47.79_±2.41	63.76_±1.35	78.17_±3.05
Letters	28.15_±0.97	12.42_±0.46	30.37_±1.75	32.05_±0.91	35.45_±1.34	33.86_±1.13	29.32_±1.51	28.08_±1.44	35.71_±1.47	37.74_±1.06
MNIST	58.48_±2.67	74.46_±0.00	63.32_±4.64	59.72_±1.75	79.45_±1.02	69.24_±2.75	55.93_±2.41	24.06_±1.53	74.31_±2.28	80.58_±1.75
Covertype	49.05_±0.00	N/A	N/A	49.21_±0.11	49.45_±0.16	49.32_±0.25	48.88_±0.18	48.76_±0.00	49.76_±0.35	50.73_±0.62
TB-1M	78.93_±0.00	N/A	N/A	78.04_±0.01	51.54_±1.13	52.09_±1.58	77.97_±1.52	79.04_±0.00	99.55_±0.06	99.75_±0.01
SF-2M	74.33_±2.14	N/A	N/A	69.58_±0.05	85.34_±5.70	78.26_±7.43	74.13_±0.32	76.93_±2.17	93.60_±1.00	93.46_±2.27
CC-5M	52.96_±0.00	N/A	N/A	N/A	N/A	N/A	N/A	52.96_±0.00	99.99_±0.00	99.99_±0.00
CG-10M	63.14_±2.42	N/A	N/A	N/A	N/A	N/A	N/A	32.81_±0.67	81.32_±2.00	93.99_±3.25
Flower-20M	60.85_±3.33	N/A	N/A	N/A	N/A	N/A	N/A	33.75_±0.56	88.89_±2.85	93.79_±3.21
Avg. score	-	N/A	N/A	N/A	N/A	N/A	N/A	49.00	77.11	81.68
N-Avg. score	-	N/A	N/A	N/A	N/A	N/A	N/A	62.12	94.26	99.99
Avg. rank	-	6.10	5.90	5.30	3.50	4.40	5.90	5.80	2.10	1.10

Table 6. TABLE VI: Time costs(s) of our methods and the baseline spectral clustering methods.

Dataset	$k$ -means	SC	ESCG	Nyström	LSC-K	LSC-R	FastESC	EulerSC	U-SPEC	U-SENC
PenDigits	0.06	7.37	1.63	1.98	1.25	0.49	0.73	1.47	1.01	19.13
USPS	0.32	9.56	9.63	1.92	1.70	0.75	0.94	8.20	1.59	29.17
Letters	0.72	3.85	7.74	2.69	3.89	2.88	1.86	23.39	1.44	21.44
MNIST	8.79	1,231.68	1,211.54	6.40	16.51	6.38	3.82	125.35	7.48	131.60
Covertype	13.19	N/A	N/A	33.11	101.12	53.46	19.55	116.96	14.08	174.49
TB-1M	3.25	N/A	N/A	105.15	109.23	35.92	21.79	6.27	10.47	318.29
SF-2M	31.26	N/A	N/A	226.77	254.98	102.55	51.07	80.44	27.06	658.82
CC-5M	94.76	N/A	N/A	N/A	N/A	N/A	N/A	132.35	46.65	1,726.40
CG-10M	281.84	N/A	N/A	N/A	N/A	N/A	N/A	963.29	318.93	3,603.08
Flower-20M	579.06	N/A	N/A	N/A	N/A	N/A	N/A	3,397.57	764.09	7,225.83

Table 7. TABLE VII: Average NMI(%) scores (over 20 runs) by our methods and the baseline ensemble clustering methods (The best score in each row is in bold).

Dataset	U-SPEC	EAC	WCT	KCC	PTGP	ECC	SEC	LWGP	U-SENC
PenDigits	80.30_±2.18	76.31_±2.70	77.69_±2.54	58.92_±3.47	75.58_±2.26	57.64_±4.14	47.07_±7.53	77.54_±1.87	85.34_±0.91
USPS	63.47_±0.97	59.02_±1.69	58.40_±2.15	49.24_±2.98	59.63_±1.76	48.89_±1.80	39.00_±3.83	57.55_±1.78	73.89_±1.82
Letters	42.53_±1.32	37.19_±0.50	36.59_±0.95	33.64_±1.03	38.09_±0.66	34.59_±0.68	31.81_±2.01	37.09_±0.75	45.90_±0.58
MNIST	67.43_±1.55	66.19_±1.49	65.60_±0.96	54.34_±3.38	59.93_±2.23	56.01_±2.25	34.19_±4.61	65.06_±0.95	75.02_±0.81
Covertype	6.97_±0.16	N/A	N/A	5.86_±1.84	6.42_±0.44	5.70_±0.77	5.26_±2.82	7.44_±0.31	9.13_±1.21
TB-1M	95.86_±0.48	N/A	N/A	23.36_±1.62	34.20_±2.51	26.91_±2.13	10.62_±4.64	96.80_±1.90	97.48_±0.05
SF-2M	75.59_±2.12	N/A	N/A	42.72_±7.11	45.17_±2.66	41.61_±6.01	27.05_±7.73	69.88_±4.45	77.02_±2.32
CC-5M	99.87_±0.01	N/A	N/A	33.36_±12.65	0.41_±0.86	31.62_±14.99	17.05_±6.90	98.18_±7.75	99.91_±0.00
CG-10M	78.82_±1.61	N/A	N/A	64.78_±5.08	63.75_±0.61	62.79_±4.91	49.70_±6.08	78.08_±2.43	89.57_±3.96
Flower-20M	86.86_±2.05	N/A	N/A	61.18_±2.43	67.92_±1.99	60.61_±2.37	50.37_±6.32	78.55_±2.31	92.47_±2.45
Avg. score	-	N/A	N/A	42.74	45.11	42.64	31.21	66.62	74.57
N-Avg. score	-	N/A	N/A	59.69	64.12	59.51	45.35	87.82	100.00
Avg. rank	-	5.40	5.60	4.90	3.60	5.40	6.70	2.80	1.00

Table 8. TABLE VIII: Average CA(%) scores (over 20 runs) by our methods and the baseline ensemble clustering methods (The best score in each row is in bold).

Dataset	U-SPEC	EAC	WCT	KCC	PTGP	ECC	SEC	LWGP	U-SENC
PenDigits	84.17_±3.26	81.04_±4.02	82.97_±3.17	63.33_±4.06	78.33_±2.91	62.36_±4.12	51.60_±5.93	81.96_±2.77	88.56_±0.61
USPS	63.76_±1.35	63.39_±2.76	62.72_±3.14	53.46_±3.51	62.68_±1.92	53.67_±2.21	45.38_±3.20	59.73_±3.30	78.17_±3.05
Letters	35.71_±1.47	30.28_±0.58	30.17_±1.01	26.90_±1.23	31.50_±0.89	27.53_±0.72	26.12_±1.93	30.76_±0.84	37.74_±1.06
MNIST	74.31_±2.28	73.12_±2.73	70.73_±1.76	59.86_±5.11	65.06_±2.75	61.18_±3.58	43.13_±4.88	71.98_±1.67	80.58_±1.75
Covertype	49.76_±0.35	N/A	N/A	49.54_±0.58	49.11_±0.30	49.68_±0.40	49.86_±0.94	49.50_±0.28	50.73_±0.62
TB-1M	99.55_±0.06	N/A	N/A	70.05_±1.21	82.94_±1.08	72.50_±1.48	60.12_±3.64	99.65_±0.31	99.75_±0.01
SF-2M	93.60_±1.00	N/A	N/A	67.12_±5.41	73.46_±1.76	66.90_±6.15	55.91_±5.71	88.71_±3.28	93.46_±2.27
CC-5M	99.99_±0.00	N/A	N/A	66.76_±6.24	52.96_±0.00	62.71_±5.38	61.91_±5.49	99.30_±3.07	99.99_±0.00
CG-10M	81.32_±2.00	N/A	N/A	66.96_±5.60	63.36_±1.26	64.74_±6.80	58.19_±4.69	81.95_±3.93	93.99_±3.25
Flower-20M	88.89_±2.85	N/A	N/A	57.78_±3.37	63.83_±2.34	56.69_±2.35	50.70_±5.02	81.37_±2.69	93.79_±3.21
Avg. score	-	N/A	N/A	58.18	62.32	57.80	50.29	74.49	81.68
N-Avg. score	-	N/A	N/A	72.48	77.98	72.22	63.53	90.54	100.00
Avg. rank	-	5.40	5.60	5.00	4.20	5.00	6.30	2.90	1.00

Table 9. TABLE IX: Time costs(s) of our methods and the baseline ensemble clustering methods.

Dataset	U-SPEC	EAC	WCT	KCC	PTGP	ECC	SEC	LWGP	U-SENC
PenDigits	1.01	8.89	47.01	8.97	11.94	13.56	5.27	5.46	19.13
USPS	1.59	13.11	48.45	15.87	59.71	23.53	10.15	10.25	29.17
Letters	1.44	29.60	177.11	33.91	137.46	53.04	16.06	15.58	21.44
MNIST	7.48	576.71	3,435.19	315.58	2,205.18	417.10	260.96	259.91	131.60
Covertype	14.08	N/A	N/A	954.89	7,919.02	1,482.43	712.84	685.89	174.49
TB-1M	10.47	N/A	N/A	1,308.54	1,276.82	2,100.02	1,000.30	989.10	318.29
SF-2M	27.06	N/A	N/A	2,908.34	2,493.99	4,714.16	2,160.46	2,105.82	658.82
CC-5M	46.65	N/A	N/A	6,833.38	5,027.91	11,202.43	5,130.84	5,070.21	1,726.40
CG-10M	318.93	N/A	N/A	17,344.29	11,578.11	27,492.40	10,938.88	10,700.38	3,603.08
Flower-20M	764.09	N/A	N/A	34,869.83	21,198.87	54,913.10	21,696.29	21,378.63	7,225.83

Table 10. TABLE X: Average NMI(%), CA(%), and time costs(s) over 20 runs by different methods with varying number of representatives $p$.

Table 11. TABLE XI: Average NMI(%), CA(%), and time costs(s) over 20 runs by different methods with varying number of nearest representatives $K$.

Table 12. TABLE XII: Average NMI(%), CA(%), and time costs(s) over 20 runs by different methods with varying ensemble size $m$.

Table 13. TABLE XIII: The NMI(%), CA(%), and time costs(s) by U-SPEC using different representative selection strategies (H: hybrid selection; R: random selection; K: $K$-means based selection).

Table 14. TABLE XIV: The NMI(%), CA(%), and time costs(s) by U-SENC using different representative selection strategies (H: hybrid selection; R: random selection; K: $K$-means based selection).

Table 15. TABLE XV: The NMI(%), CA(%), and time costs(s) by U-SPEC using Approximate $K$-nearest representatives against Exact $K$-nearest representatives.

Table 16. TABLE XVI: The NMI(%), CA(%), and time costs(s) by U-SENC using Approximate $K$-nearest representatives against Exact $K$-nearest representatives.

Equations (35)

\mathcal{R} = \{r_{1}, r_{2}, \dots, r_{p}\},

\mathcal{RC} = \{rc_{1}, rc_{2}, \dots, rc_{z_{1}}\},

Dist(x_{i}, rc_{j}) = \| x_{i} - y_{j} \|,

y_{j} = \frac{1}{|rc_{j}|} \sum_{r_{l} \in rc_{j}} r_{l},

B

b_{ij}

E = \begin{bmatrix} 0 & B \\ B^{\top} & 0 \end{bmatrix}.

L u = \gamma D u,

L_{\mathcal{R}} v = \lambda D_{\mathcal{R}} v.

\gamma_{i} (2 - \gamma_{i}) = \lambda_{i},

u_{i} = \begin{bmatrix} h_{i} \\ v_{i} \end{bmatrix}

h_{i} = \frac{1}{1 - \gamma_{i}} T v_{i},

\mathcal{R}^{i} = \{r_{1}^{i}, r_{2}^{i}, \dots, r_{p}^{i}\}.

k^{i} = \lfloor \tau (k_{max} - k_{min}) \rfloor + k_{min},

\Pi = \{\pi^{1}, \pi^{2}, \dots, \pi^{m}\},

\mathcal{C} = \{C_{1}, C_{2}, \dots, C_{k_{c}}\},

\tilde{G} = \{\mathcal{X}, \mathcal{C}, \tilde{B}\},

\tilde{B}

\tilde{b}_{ij}

L_{\mathcal{C}} \tilde{v} = \tilde{\lambda} D_{\mathcal{C}} \tilde{v},
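The relations listed above between the eigenpairs of the bipartite graph ($Lu = \gamma Du$) and those of the small graph on the representatives can be checked numerically. The following NumPy sketch is illustrative only and is not the authors' MATLAB code. It assumes the usual transfer cut definitions $T = D_{𝒳}^{-1}B$ and $E_{ℛ} = B^{\top}D_{𝒳}^{-1}B$, which are not spelled out in this listing, and the function name `transfer_cut_eigs` is hypothetical.

```python
import numpy as np

def transfer_cut_eigs(B, k):
    """Eigenpairs (gamma, u) of L u = gamma D u on the bipartite graph,
    recovered from the small p x p problem L_R v = lambda D_R v."""
    d_x = B.sum(axis=1)              # diagonal of D_X (row sums of B)
    T = B / d_x[:, None]             # assumed: T = D_X^{-1} B
    E_R = B.T @ T                    # assumed: E_R = B^T D_X^{-1} B
    d_r = E_R.sum(axis=1)            # diagonal of D_R
    # L_R v = lambda D_R v is solved through the symmetric matrix
    # D_R^{-1/2} E_R D_R^{-1/2}, whose eigenvalues are mu = 1 - lambda.
    s = 1.0 / np.sqrt(d_r)
    mu, W = np.linalg.eigh(s[:, None] * E_R * s[None, :])
    mu, W = mu[::-1][:k], W[:, ::-1][:, :k]   # k largest mu = k smallest lambda
    lam = 1.0 - mu
    V = s[:, None] * W
    # Root of gamma * (2 - gamma) = lambda that lies in [0, 1).
    gamma = 1.0 - np.sqrt(np.clip(1.0 - lam, 0.0, None))
    H = (T @ V) / (1.0 - gamma)      # h_i = T v_i / (1 - gamma_i)
    U = np.vstack([H, V])            # u_i = [h_i; v_i]
    return gamma, U

# Check on a small random cross-affinity matrix (N = 30 objects, p = 6).
rng = np.random.default_rng(0)
B = rng.random((30, 6)) + 0.1
gamma, U = transfer_cut_eigs(B, k=3)

N, p = B.shape
E = np.block([[np.zeros((N, N)), B], [B.T, np.zeros((p, p))]])
D = np.diag(E.sum(axis=1))
residual = (D - E) @ U - (D @ U) * gamma    # L u - gamma D u, column by column
print(np.abs(residual).max())
```

Only a $p \times p$ eigenproblem is solved here; the $(N+p)$-dimensional eigenvectors are obtained by one multiplication with $T$, which is where the efficiency of this formulation comes from.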

Taxonomy

Methodspc · Large-scale spectral clustering · Spectral Clustering

Full text

Ultra-Scalable Spectral Clustering and Ensemble Clustering

Dong Huang, Chang-Dong Wang, Jian-Sheng Wu, Jian-Huang Lai, and Chee-Keong Kwoh

D. Huang is with the College of Mathematics and Informatics, South China Agricultural University, Guangzhou, China, and also with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected].

C.-D. Wang and J.-H. Lai are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China, and also with Guangdong Key Laboratory of Information Security Technology, Guangzhou, China, and also with Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China. E-mail: [email protected], [email protected].

J.-S. Wu is with the School of Information Engineering, Nanchang University, Nanchang, China. E-mail: [email protected].

C.-K. Kwoh is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected].

Abstract

This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for $K$-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SPEC clusterers, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning ten-million-level nonlinearly-separable datasets on a PC with 64GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669.

Index Terms:

Data clustering, Large-scale clustering, Spectral clustering, Ensemble clustering, Large-scale datasets, Nonlinearly separable datasets.

1 Introduction

Data clustering is a fundamental problem in the field of data mining and machine learning [1], whose purpose is to partition a set of objects into a certain number of homogeneous groups, each referred to as a cluster. Out of the large number of clustering algorithms that have been developed, spectral clustering in recent years has been gaining increasing attention due to its promising ability in dealing with nonlinearly separable datasets [2, 3, 4, 5]. However, a critical limitation to conventional spectral clustering lies in its huge time and space complexity, which significantly restricts its application to large-scale problems.

Conventional spectral clustering typically consists of two time- and memory-consuming phases, namely, affinity matrix construction and eigen-decomposition. It generally takes $O(N^{2}d)$ time and $O(N^{2})$ memory to construct the affinity matrix, and takes $O(N^{3})$ time and $O(N^{2})$ memory to solve the eigen-decomposition problem [2], where $N$ is the data size and $d$ is the dimension. As the data size $N$ increases, the computational burden of spectral clustering grows dramatically. For example, given a dataset with one million objects, the $N\times N$ affinity matrix alone will consume 7450.58 GB of memory (with each entry in the matrix stored as a double-precision value), which prohibitively exceeds the memory capacity of a general-purpose machine, not to mention the next phase of eigen-decomposition.
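The memory figure quoted above follows from simple arithmetic. The short Python check below assumes a dense matrix with 8 bytes per entry and reads "GB" as $2^{30}$ bytes, which is the interpretation that reproduces the number in the text.

```python
# Memory needed to store a dense N x N affinity matrix of doubles.
N = 1_000_000          # number of objects (the example used in the text)
BYTES_PER_DOUBLE = 8   # IEEE 754 double precision

total_bytes = N * N * BYTES_PER_DOUBLE
gib = total_bytes / 2**30   # "GB" interpreted as 2**30 bytes

print(f"{gib:.2f} GB")      # prints "7450.58 GB"
```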

To alleviate the huge computational burden of spectral clustering, a commonly used strategy is to sparsify the affinity matrix and solve the eigen-decomposition problem with a sparse eigen-solver [2]. The matrix sparsification strategy can reduce the memory cost of storing the affinity matrix and facilitate the eigen-decomposition, but it still requires the computation of all entries in the original affinity matrix. Besides matrix sparsification, another widely-studied strategy is based on sub-matrix construction [3, 4]. The Nyström method [3] randomly selects $p$ representatives from the original dataset and builds an $N\times p$ affinity sub-matrix. Cai and Chen [4] extended the Nyström method and proposed the landmark based spectral clustering (LSC) method, which performs $k$-means on the dataset to get $p$ cluster centers as the $p$ representatives. However, these sub-matrix based spectral clustering methods [3, 4] are typically restricted by an $O(Np)$ complexity bottleneck, which has been a critical hurdle for them in dealing with extremely large-scale datasets, where a larger $p$ is often desired for achieving better approximation [4]. Moreover, the clustering results of these methods heavily rely on their one-shot approximation via the sub-matrix, which makes their clustering robustness unstable. Despite the considerable efforts that have been made in recent years [2, 3, 4, 5], enabling spectral clustering to efficiently and robustly cluster extremely large-scale datasets (which may even be nonlinearly separable) with rather limited computing resources remains a highly challenging problem.

In light of this, this paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets. Specifically, this paper proposes two novel large-scale algorithms, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a new hybrid representative selection strategy is presented to efficiently find a set of $p$ representatives, which reduces the time complexity of $k$-means based selection from $O(Npdt)$ to $O(p^{2}dt)$. Then, a fast approximation method for $K$-nearest representatives is designed to efficiently build a sparse sub-matrix with $O(Np^{\frac{1}{2}}d)$ time and $O(Np^{\frac{1}{2}})$ memory. With the sparse sub-matrix serving as the cross-affinity matrix, a bipartite graph is constructed between the dataset and the representative set. By taking advantage of the bipartite graph structure, the transfer cut [6] is utilized to solve the eigen-decomposition problem with $O(NK(K+k)+p^{3})$ time, where $k$ is the number of clusters and $K$ is the number of nearest representatives. Finally, the $k$-means discretization is adopted to construct the clustering result from a set of $k$ eigenvectors, which takes $O(Nk^{2}t)$ time. As it generally holds that $k,K\ll p\ll N$, the time and space complexity of our U-SPEC algorithm are respectively dominated by $O(Np^{\frac{1}{2}}d)$ and $O(Np^{\frac{1}{2}})$. Further, to go beyond the one-shot approximation of U-SPEC and provide better clustering robustness, the U-SENC algorithm is proposed by integrating multiple U-SPEC clusterers into a unified ensemble clustering framework, whose time and space complexity are respectively dominated by $O(Nmp^{\frac{1}{2}}d)$ and $O(Np^{\frac{1}{2}})$. Extensive experiments have been conducted on ten large-scale datasets (including five synthetic datasets and five real datasets), which have shown the superiority of our U-SPEC and U-SENC algorithms over the state-of-the-art in terms of both clustering robustness and scalability.

To summarize, the main contributions of this paper are listed as follows:

1. A hybrid representative selection strategy is proposed to strike a balance between the efficiency of random selection and the effectiveness of $k$-means based selection.

2. A fast approximation method for $K$-nearest representatives is designed, which is time- and memory-efficient for constructing the sparse affinity sub-matrix between objects and representatives.

3. A large-scale spectral clustering algorithm termed U-SPEC is developed based on efficient affinity sub-matrix construction and bipartite graph formulation. Its time and space complexity are dominated by $O(Np^{\frac{1}{2}}d)$ and $O(Np^{\frac{1}{2}})$ respectively.

4. By integrating multiple U-SPEC clusterers, a new large-scale ensemble clustering algorithm termed U-SENC is developed, which significantly enhances the robustness of U-SPEC while maintaining high scalability. Its time and space complexity are dominated by $O(Nmp^{\frac{1}{2}}d)$ and $O(Np^{\frac{1}{2}})$ respectively.

The notations that are used throughout the paper are summarized in Table I. The rest of the paper is organized as follows. The related work on large-scale spectral clustering and ensemble clustering is reviewed in Section 2. The proposed U-SPEC and U-SENC algorithms are described in Section 3. The experimental results are reported in Section 4. Finally, the paper is concluded in Section 5.

2 Related Work

In this section, we review the literature related to spectral clustering and ensemble clustering, with special emphasis on their recent large-scale extensions.

2.1 Spectral Clustering

Given a dataset of $N$ objects, conventional spectral clustering [2] first computes an $N\times N$ affinity matrix, in which each entry is the similarity of two objects under some similarity metric. Eigen-decomposition is then performed on the graph Laplacian of the affinity matrix to obtain the $k$ eigenvectors associated with the first $k$ eigenvalues. By embedding the dataset into the low-dimensional space spanned by these $k$ eigenvectors, the final clustering can be obtained via $k$-means or some other discretization technique [2].
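The embedding step of this conventional pipeline can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a Gaussian affinity with a hypothetical bandwidth `sigma`, the symmetric normalized Laplacian, and a dense eigen-solver, and the function name `spectral_embedding` is made up for this example. The resulting rows would then be passed to $k$-means or another discretization step.

```python
import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    """Embed N points using the k eigenvectors of the normalized graph
    Laplacian that have the smallest eigenvalues."""
    # Full N x N affinity matrix: O(N^2 d) time and O(N^2) memory.
    sq = np.sum(X**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-dist2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Dense eigen-decomposition: O(N^3) time; eigenvalues come back ascending.
    _, eigvecs = np.linalg.eigh(L)
    emb = eigvecs[:, :k]

    # Row-normalize so that each object lies on the unit sphere.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Two well-separated blobs: objects of the same blob get (almost) the same row.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(100.0, 0.1, (10, 2))])
emb = spectral_embedding(X, k=2)
print(emb[0] @ emb[1], emb[0] @ emb[15])
```

Both the affinity construction and the eigen-decomposition touch an $N\times N$ matrix, which is why this direct form does not scale.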

Although spectral clustering has shown promising advantages in finding clusters of arbitrary shapes from complex data, its $O(N^{3})$ time complexity and $O(N^{2})$ space complexity significantly restrict its application in large-scale tasks. To alleviate the huge computational cost, some researchers sparsified the affinity matrix by considering $K$ -nearest neighbors or $\epsilon$ -neighbors, and then solved the eigen-decomposition problem by some sparse eigen-solvers [2], which, however, still requires the computation of all the entries in the original affinity matrix.

To avoid the computation of the full affinity matrix, the sub-matrix based approximation has emerged as a powerful and efficient tool for spectral clustering [3, 4, 5]. The Nyström approximation [3] randomly selects $p$ representatives from the dataset and builds an $N\times p$ affinity sub-matrix between the $N$ objects and the $p$ representatives. The sub-matrix construction takes $O(Npd)$ time and $O(Np)$ memory, which are much lower than the costs of the full affinity matrix construction. Although the random representative selection is very efficient, it is often unstable with regard to the quality of the selected representatives (see Fig. 1). Moreover, while it has been shown that a larger $p$ is often favorable for better approximation [3], the $O(Np)$ memory cost of the sub-matrix construction can still be a critical bottleneck when dealing with very large datasets. To address the potential instability of random selection, Cai and Chen [4] proposed the LSC algorithm, which first partitions the dataset into $p$ clusters via $k$-means and then uses the $p$ cluster centers as the representatives. With the $N\times p$ sub-matrix constructed, they further sparsified it by preserving the $K$-nearest representatives for each row and zeroing out the others [4]. Despite its progress over the previous methods, there are still three computational bottlenecks in the LSC algorithm [4]. First, although the $k$-means based selection often provides a better set of representatives, it comes with the time complexity of $O(Npdt)$. Second, the calculation of all possible entries in the $N\times p$ sub-matrix is still required before the sparsification, which comes with the time complexity of $O(Npd)$. Third, the computation of the $K$-nearest representatives for all objects comes with the time complexity of $O(NpK)$. More recently, instead of using $p$ representatives, He et al. [5] used Fourier features to represent data objects in kernel space, and built an $N\times p$ sub-matrix between the $N$ objects and the $p$ selected Fourier features, upon which the efficient eigen-decomposition can be performed. The time and space complexity of the fast explicit spectral clustering (FastESC) algorithm in [5] are respectively $O(Npd+p^{3})$ and $O(Np)$, which are still restricted by the $O(Np)$ complexity bottleneck. By incorporating a newly-designed positive Euler kernel, Wu et al. [7] proposed the Euler spectral clustering (EulerSC) method and proved that EulerSC is equivalent to the weighted positive Euler $k$-means, which can be iteratively optimized with $O(Ndkt)$ time. However, EulerSC can only use the positive Euler kernel to define the pair-wise similarity, and is not feasible for the general spectral clustering formulation with other similarity metrics. Moreover, its clustering robustness heavily relies on the proper selection of the Euler kernel parameter, which is difficult to find without prior knowledge.
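The second and third bottlenecks described above can be made concrete with a small NumPy sketch of sub-matrix sparsification. It is a simplified illustration rather than the LSC implementation: the Gaussian affinity, the bandwidth `sigma`, and the function name `knn_sparsified_submatrix` are assumptions made for this example.

```python
import numpy as np

def knn_sparsified_submatrix(X, R, K, sigma=1.0):
    """Build the dense N x p affinity sub-matrix between objects X and
    representatives R, then keep only the K nearest representatives per row."""
    # Every one of the N * p distances is computed before anything is
    # discarded: O(N p d) time and O(N p) memory.
    dist2 = (np.sum(X**2, axis=1)[:, None]
             + np.sum(R**2, axis=1)[None, :]
             - 2.0 * X @ R.T)
    dist2 = np.maximum(dist2, 0.0)
    B = np.exp(-dist2 / (2.0 * sigma**2))

    # Column indices of the K smallest distances in each row (unordered).
    nearest = np.argpartition(dist2, K - 1, axis=1)[:, :K]
    mask = np.zeros(B.shape, dtype=bool)
    np.put_along_axis(mask, nearest, True, axis=1)

    # Zero out every entry that is not among the K nearest representatives.
    return np.where(mask, B, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # N = 20 objects
R = rng.normal(size=(5, 3))    # p = 5 representatives
B = knn_sparsified_submatrix(X, R, K=2)
print((B > 0).sum(axis=1))     # two non-zero entries in every row
```

The result is sparse, but the dense $N\times p$ intermediate still has to be formed first, which is exactly the cost that an approximate $K$-nearest-representative search aims to avoid.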

2.2 Ensemble Clustering

Ensemble clustering has been a popular technique in recent years, which aims to combine multiple base clusterings into a better and more robust consensus clustering [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. The existing ensemble clustering algorithms can be mainly classified into three categories.

The first category is the pair-wise co-occurrence based methods [8, 9, 21]. Fred and Jain [8] proposed the evidence accumulation clustering (EAC) method, which makes use of the co-association matrix by considering the frequency of pair-wise co-occurrence among multiple base clusterings. With the co-association matrix treated as the similarity matrix, the agglomerative clustering algorithms [1] were then performed to obtain the consensus clustering. Iam-On et al. [9] presented the weighted connected triple (WCT) method, which extends the EAC method by refining the co-association matrix via the common neighborhood information between clusters.
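As a small illustration of the co-association idea, the following pure-Python sketch counts how often each pair of objects is placed in the same cluster across the base clusterings. The toy ensemble and the function name `co_association` are made up for this example; the refinement used by WCT is not shown.

```python
def co_association(base_clusterings):
    """Return the N x N matrix whose (i, j) entry is the fraction of base
    clusterings that put objects i and j in the same cluster."""
    m = len(base_clusterings)
    n = len(base_clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]

# Three base clusterings of four objects (one label list per clustering).
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
A = co_association(ensemble)
for row in A:
    print([round(a, 2) for a in row])
```

The matrix has $N^{2}$ entries, so methods built directly on it inherit the same quadratic memory cost as the full affinity matrix in spectral clustering.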

The second category is the graph partitioning based methods [18, 22, 11, 12]. Strehl and Ghosh [18] transformed the multiple base clusterings into a hypergraph representation, based on which three graph partitioning based ensemble clustering methods were presented. Fern and Brodley [22] built a bipartite graph structure by treating both base clusters and data objects as graph nodes, and then partitioned the graph via the METIS algorithm [23].
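The bipartite structure between objects and base clusters can be represented by an incidence matrix with one row per object and one column per cluster. The pure-Python sketch below assumes binary edge weights (1 if the object belongs to the cluster, 0 otherwise); the cited methods may weight edges differently, and the function name `object_cluster_incidence` is hypothetical.

```python
def object_cluster_incidence(base_clusterings):
    """Return (B, clusters), where B[i][c] = 1 if object i belongs to
    cluster c and clusters pools the clusters of all base clusterings."""
    n = len(base_clusterings[0])
    clusters = []   # one (base clustering index, label) pair per column
    for b, labels in enumerate(base_clusterings):
        for lab in sorted(set(labels)):
            clusters.append((b, lab))
    column = {key: c for c, key in enumerate(clusters)}

    B = [[0] * len(clusters) for _ in range(n)]
    for b, labels in enumerate(base_clusterings):
        for i, lab in enumerate(labels):
            B[i][column[(b, lab)]] = 1
    return B, clusters

ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
B, clusters = object_cluster_incidence(ensemble)
print(len(clusters))   # prints 6
print(B[0])            # prints [1, 0, 1, 0, 0, 1]
```

Each row contains exactly one non-zero entry per base clustering, so the matrix has $Nm$ non-zeros rather than $N^{2}$ entries, in contrast to the co-association matrix.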

The third category is the median partition based methods [24, 17], which cast the ensemble clustering problem into an optimization problem that aims to find a median clustering (or partition) by maximizing the similarity between this clustering and the multiple base clusterings. Franek and Jiang [24] formulated the median partition problem into a Euclidean median problem and solved it by the Weiszfeld algorithm [25]. Huang et al. [17] cast the median partition problem into a binary linear programming problem and solved it by the factor graph model.

These ensemble clustering algorithms have shown their advantages in improving clustering accuracy and robustness. However, due to the efficiency bottleneck, most of them are not suitable for very large-scale applications. Recently some efforts have been made to (partially) address the scalability problem for ensemble clustering. To reduce the problem size, Huang et al. [11] exploited the microcluster representation, which maps the $N$ data objects onto $N^{\prime}$ microclusters ($N^{\prime}\ll N$). Then, the set of microclusters is treated as the set of primitive objects, based on which two novel algorithms, i.e., the probability trajectory accumulation (PTA) and the probability trajectory based graph partitioning (PTGP), are proposed. Wu et al. [10] transformed the ensemble clustering problem into a $k$-means based consensus clustering (KCC) framework, which significantly facilitated the computation of the consensus function. Liu et al. [15] proved that the spectral clustering of the co-association matrix is equivalent to an instance of weighted $k$-means clustering, and presented the spectral ensemble clustering (SEC) algorithm. While there are two phases in ensemble clustering (i.e., ensemble generation and consensus function), these algorithms [11, 10, 15] generally focus on the efficiency of the consensus function. In ensemble generation, they mostly exploited $k$-means to produce $m$ base clusterings [11, 10, 15]. Note that the time complexity of ensemble generation by $k$-means is $O(Nmdkt)$, which can still be computationally expensive when dealing with very large-scale datasets. Moreover, the performance of $k$-means may significantly deteriorate when handling nonlinearly separable datasets, which has a critical influence on the robustness of the ensemble clustering algorithms. Unlike the common practice that typically exploits multiple $k$-means clusterers as base clusterers, the proposed U-SENC algorithm integrates a diverse set of large-scale U-SPEC clusterers into a highly efficient ensemble clustering framework, which, to our knowledge, is the first to simultaneously tackle the scalability and nonlinear separability issues in both the ensemble generation and consensus function phases of ensemble clustering.

3 Proposed Framework

In this section, we describe the proposed U-SPEC and U-SENC algorithms in Sections 3.1 and 3.2, respectively.

3.1 Ultra-Scalable Spectral Clustering (U-SPEC)

To deal with extremely large-scale datasets, the proposed U-SPEC algorithm complies with the sub-matrix based formulation [3, 4] and aims to break through the efficiency bottleneck of previous algorithms via three phases. Specifically, in the first phase, we present a hybrid representative selection strategy to strike a balance between the efficiency of the random selection and the effectiveness of the $k$ -means based selection. In the second phase, we develop a coarse-to-fine method to efficiently approximate the $K$ -nearest representatives for each data object, and construct a sparse affinity sub-matrix between the $N$ objects and the $p$ representatives. In the third phase, the $N\times p$ sub-matrix is interpreted as a bipartite graph, which can be efficiently partitioned to obtain the final clustering result. These three phases of U-SPEC will be described in Sections 3.1.1, 3.1.2, and 3.1.3, respectively.

3.1.1 Hybrid Representative Selection

Let $\mathcal{X}=\{x_{1},x_{2},\cdots,x_{N}\}$ denote a dataset with $N$ objects, where $x_{i}\in\mathbb{R}^{d}$ is the $i$ -th object and $d$ is the dimension. To capture the relationship between all objects in $\mathcal{X}$ , an $N\times N$ affinity matrix can be constructed in conventional spectral clustering [2], which consumes $O(N^{2}d)$ time and $O(N^{2})$ memory and is not feasible for large-scale datasets. To avoid the computation of the full affinity matrix, the sub-matrix representation is often adopted in the literature of large-scale spectral clustering [3, 4]. The sub-matrix representation generally exploits a set of representatives to encode the overall structure of the dataset. These representatives play a crucial role in the sub-matrix representation, and can be selected by random selection [3] or $k$ -means based selection [4]. Though the random selection strategy [3] is highly efficient, it suffers from the inherent randomness and may lead to a set of low-quality representatives (see Fig. 1). To deal with the instability of random selection, the $k$ -means based selection [4] first groups the entire dataset into $p$ clusters via $k$ -means and then uses the $p$ cluster centers as the representatives. However, the $k$ -means based selection brings in an extra time cost of $O(Npdt)$ , which restricts its feasibility for very large-scale datasets.

In this paper, we propose a hybrid representative selection strategy, which is designed to find a balance between the efficiency of random selection and the effectiveness of $k$ -means based selection. The process of the hybrid representative selection strategy is illustrated in Fig. 2. Different from the $k$ -means based selection which attempts to cluster the entire dataset even when the data size $N$ is extremely large, the proposed hybrid selection strategy first randomly samples a set of $p^{\prime}$ candidate representatives such that $p<p^{\prime}\ll N$ . Then, upon the $p^{\prime}$ candidates, we perform the $k$ -means method to obtain $p$ clusters and exploit the $p$ cluster centers as the set of representatives. Empirically, the number of candidates $p^{\prime}$ is suggested to be several times larger than $p$ , e.g., $p^{\prime}=10p$ , so as to provide enough candidates while still keeping $p^{\prime}$ much smaller than $N$ in large-scale datasets. Formally, we denote the set of selected representatives as

$$\mathcal{R}=\{r_{1},r_{2},\cdots,r_{p}\},\tag{1}$$

where $r_{i}$ is the $i$ -th representative in $\mathcal{R}$ .

By introducing an intermediate stage of random pre-sampling, the computational complexity of the $k$ -means based selection is reduced from $O(Npdt)$ to $O(p^{2}dt)$ . As illustrated in Fig. 1, the set of representatives produced by the hybrid selection can better reflect the data distribution than the random selection while requiring much less computational cost than the $k$ -means based selection. To discuss this in more detail, quantitative evaluation of the performance of the proposed hybrid selection strategy against random selection and $k$ -means based selection will be provided in Section 4.6.
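The two-stage selection can be sketched in a few lines. This is a minimal illustration rather than the authors' MATLAB implementation: the helper names `kmeans` and `hybrid_select`, the plain Lloyd iteration, the fixed iteration count, and the toy data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iters=10, rng=None):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns k centers."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: group every point with its closest center.
        groups = [[] for _ in range(k)]
        for pt in points:
            j = min(range(k), key=lambda c: math.dist(pt, centers[c]))
            groups[j].append(pt)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        for j, grp in enumerate(groups):
            if grp:
                centers[j] = tuple(sum(col) / len(grp) for col in zip(*grp))
    return centers


def hybrid_select(data, p, factor=10, iters=10, seed=0):
    """Randomly pre-sample p' = factor * p candidates, then reduce them to p centers."""
    rng = random.Random(seed)
    p_prime = min(len(data), factor * p)
    candidates = rng.sample(data, p_prime)    # touches only p' of the N objects
    return kmeans(candidates, p, iters, rng)  # k-means cost scales with p', not N


# Toy usage: 2,000 points in the plane, p = 8 representatives from p' = 80 candidates.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(2000)]
reps = hybrid_select(data, p=8)
```

Because $k$-means only ever sees the $p^{\prime}$ candidates, its cost no longer depends on $N$, which is where the reduction from $O(Npdt)$ to $O(p^{2}dt)$ comes from when $p^{\prime}$ is a constant multiple of $p$.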

3.1.2 Approximation of $K$ -Nearest Representatives

With the $p$ representatives obtained, the next objective is to encode the pair-wise relationship of the entire dataset via the small set of representatives.

In the sub-matrix formulation of the Nyström algorithm [3], the construction of the $N\times p$ affinity sub-matrix between objects and representatives takes $O(Npd)$ time and $O(Np)$ memory, which is the main efficiency bottleneck of the overall algorithm [3]. Given a dataset with ten million objects and a set of one thousand representatives, the storage of the $N\times p$ sub-matrix alone takes 74.51 GB of memory, and the later manipulations of the sub-matrix require even more memory. Cai and Chen [4] proposed to sparsify the $N\times p$ affinity matrix by $K$-nearest representatives (with $K\ll p$), which, however, still requires the computation of all the distances between the $N$ objects and the $p$ representatives. Moreover, besides the calculation of the total of $Np$ entries, the sparsification step also consumes $O(NpK)$ time [4].
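The storage figure can be reproduced with a short calculation. The sketch below assumes 8-byte double-precision entries and binary gigabytes ($2^{30}$ bytes); neither is stated in the text, but together they give the quoted number. The sparse figure counts only the values of the $NK$ retained entries and ignores index overhead.

```python
N, p, K = 10_000_000, 1_000, 5
BYTES_PER_ENTRY = 8   # assumption: double-precision values
GIB = 2 ** 30         # assumption: binary gigabytes

dense_gib = N * p * BYTES_PER_ENTRY / GIB    # full N x p sub-matrix
sparse_gib = N * K * BYTES_PER_ENTRY / GIB   # values of the N*K retained entries only

print(f"dense : {dense_gib:.2f} GB")
print(f"sparse: {sparse_gib:.2f} GB (values only, index overhead excluded)")
```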

Before introducing our facilitation strategy, we first investigate the characteristics of the sparse sub-matrix between $N$ objects and $p$ representatives, where each object is only connected to its $K$ -nearest representatives. It is obvious that there are $K$ non-zero entries in each row of the matrix, and $NK$ non-zero entries in the entire matrix. Assume we have $p=1,000$ and $K=5$ , the proportion of the non-zero entries in the matrix will be $0.5\%$ . However, to exactly identify such a small proportion of useful entries via $K$ -nearest representatives, the entire matrix should first be calculated, which unfortunately consists of $99.5\%$ of intermediate entries. To break the efficiency bottleneck, the key problem here is how to significantly reduce the calculation of these intermediate entries when building the sub-matrix with $K$ -nearest representatives.

In this section, our aim is to alleviate the computational cost of the exact $K$ -nearest representative calculation [4] by designing a time- and memory-efficient approximation method. Though the $K$ -nearest representative approximation problem and the classical $K$ -nearest neighbor ( $K$ -NN) approximation problem [26, 27, 28] have some characteristics in common, they are faced with very different computational issues in actual applications. Different from the conventional $K$ -NN approximation scenarios, which mostly deal with a general graph with an $N\times N$ affinity matrix, our aim here is to find the $K$ -nearest representatives in a heavily imbalanced bipartite graph with an $N\times p$ affinity sub-matrix, where $p$ is generally far smaller than $N$ . This imbalanced nature is crucial to our $K$ -nearest representative approximation problem. On the one hand, it makes the conventional $K$ -NN approximation methods [26, 27, 28] (which are typically designed for general graphs with $N\times N$ affinity matrices) inappropriate here. On the other hand, it may as well contribute to the design of our $K$ -nearest representative approximation strategy. To take advantage of the imbalanced structure, it is intuitive to pre-process the graph on the side of the $p$ representatives and minimize the computation on the other side of the $N$ objects.

In particular, we present a new $K$ -nearest representative approximation method based on the coarse-to-fine mechanism, and build the sparse affinity sub-matrix with $O(Np^{\frac{1}{2}}d)$ complexity. The main idea of our $K$ -nearest representative approximation is to first find the nearest region, then find the nearest representative (denoted as $r_{l}$ ) in the nearest region, and finally find the $K$ -nearest representatives in the neighborhood of $r_{l}$ . To efficiently implement the approximation, two preprocessing steps are required, that is

• Pre-step 1. The set of representatives is grouped into $z_{1}$ rep-clusters via $k$-means (with $z_{1}\ll p$). The time complexity is $O(pz_{1}dt)$.

• Pre-step 2. For each representative in $\mathcal{R}$, its $K^{\prime}$-nearest neighbors are computed and stored (with $K^{\prime}>K$). The time complexity is $O(p^{2}(d+K^{\prime}))$.

In pre-step 1, each rep-cluster consists of a certain number of representatives, and can be regarded as a local region of the representative set (see Fig. 3). Formally, the obtained $z_{1}$ rep-clusters are denoted as

$$\mathcal{RC}=\{rc_{1},rc_{2},\cdots,rc_{z_{1}}\},\tag{2}$$

where $rc_{i}$ is the $i$ -th rep-cluster in $\mathcal{RC}$ . Given an object $x_{i}\in\mathcal{X}$ and a rep-cluster $rc_{j}\in\mathcal{RC}$ , their distance is defined as the distance between $x_{i}$ and the center of $rc_{j}$ . That is

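$$\mathrm{dist}(x_{i},rc_{j})=\|x_{i}-y_{j}\|,\tag{3}$$

$$y_{j}=\frac{1}{|rc_{j}|}\sum_{r_{l}\in rc_{j}}r_{l},\tag{4}$$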

where $|rc_{j}|$ denotes the number of representatives in the rep-cluster $rc_{j}$ and $\|x_{i}-y_{j}\|$ computes the Euclidean distance between two vectors $x_{i}$ and $y_{j}$ .

With the distance between objects and rep-clusters defined, for each object $x_{i}\in\mathcal{X}$ , we approximately find its $K$ -nearest representatives according to three main steps:

Step 1: Find the nearest rep-cluster of $x_{i}$, denoted as $rc_{j}$.

Step 2: Find the nearest representative of $x_{i}$ inside the rep-cluster $rc_{j}$, denoted as $r_{l}$.

Step 3: Out of $r_{l}$ and its $K^{\prime}$-nearest neighbors, find the $K$-nearest representatives of $x_{i}$.
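The two pre-steps and the three steps above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the function names (`kmeans_labels`, `preprocess`, `approx_knr`) are hypothetical, pre-step 2 uses a full sort for brevity instead of a partial selection, and empty rep-clusters are simply skipped.

```python
import math
import random


def kmeans_labels(points, k, iters=10, seed=0):
    """Lloyd's k-means; returns (centers, labels) for a list of tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(pt, centers[c])) for pt in points]
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


def preprocess(reps, K_prime, seed=0):
    """Pre-step 1 (rep-clusters) and pre-step 2 (K'-nearest representatives of each representative)."""
    z1 = max(1, math.isqrt(len(reps)))              # z1 = floor(sqrt(p))
    centers, labels = kmeans_labels(reps, z1, seed=seed)
    clusters = []
    for j in range(z1):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:                                  # skip empty rep-clusters
            clusters.append((centers[j], members))
    neighbors = []
    for i, r in enumerate(reps):
        others = [l for l in range(len(reps)) if l != i]
        others.sort(key=lambda l: math.dist(r, reps[l]))   # full sort, for brevity only
        neighbors.append(others[:K_prime])
    return clusters, neighbors


def approx_knr(x, reps, clusters, neighbors, K):
    """Steps 1-3 for one object x; returns indices of its approximate K-nearest representatives."""
    # Step 1: nearest rep-cluster, measured by the distance to its center.
    _, members = min(clusters, key=lambda c: math.dist(x, c[0]))
    # Step 2: nearest representative r_l inside that rep-cluster.
    l = min(members, key=lambda i: math.dist(x, reps[i]))
    # Step 3: K nearest among r_l and its pre-computed K'-nearest neighbors.
    pool = [l] + neighbors[l]
    return sorted(pool, key=lambda i: math.dist(x, reps[i]))[:K]


# Toy usage: p = 100 random representatives in the plane, K = 2, K' = 20.
gen = random.Random(1)
reps = [(gen.uniform(0, 10), gen.uniform(0, 10)) for _ in range(100)]
clusters, neighbors = preprocess(reps, K_prime=20)
x = (5.0, 5.0)
approx = approx_knr(x, reps, clusters, neighbors, K=2)
exact = sorted(range(len(reps)), key=lambda i: math.dist(x, reps[i]))[:2]
```

The result is an approximation: `approx` usually agrees with `exact`, but it can differ when the true nearest representative lies just outside the nearest rep-cluster.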

More details are illustrated in Fig. 3. For a dataset with $N$ objects, the time cost of step 1 is $O(Nz_{1}d)$ . The time cost of step 2 is $O(Nz_{2}d)=O(N({p}/{z_{1}})d)$ , where $z_{2}={p}/{z_{1}}$ denotes the average size of the rep-clusters. The time cost of step 3 is $O(NK^{\prime}d+NK^{\prime}K)$ . It is obvious that $z_{1}+{p}/{z_{1}}$ reaches its minimum when $z_{1}=z_{2}={p}^{\frac{1}{2}}$ . Thus, to minimize the cost, $z_{1}=\lfloor{p}^{\frac{1}{2}}\rfloor$ is used in this work, where $\lfloor\cdot\rfloor$ denotes the floor of a value. The candidate neighborhood size $K^{\prime}$ is suggested to be several times larger than $K$ , which can be set to $K^{\prime}=10K$ in practice. Then, the total time complexity of the $K$ -nearest representative approximation is $O(Nz_{1}d+N({p}/{z_{1}})d+NK^{\prime}d+NK^{\prime}K)$ , which can be re-written as $O(N(p^{\frac{1}{2}}d+Kd+K^{2}))$ . As $K\ll p\ll N$ , the dominant term in the complexity is $O(Np^{\frac{1}{2}}d)$ .
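As a quick check of the choice of $z_{1}$, the following worked sketch treats $z_{1}$ as a continuous variable (a simplifying assumption) and then plugs in the default settings used later in the experiments, $p=1000$, $K=5$, and $K^{\prime}=10K=50$:

$$f(z_{1})=z_{1}+\frac{p}{z_{1}},\qquad f^{\prime}(z_{1})=1-\frac{p}{z_{1}^{2}}=0\;\Rightarrow\;z_{1}=p^{\frac{1}{2}},\qquad f^{\prime\prime}(z_{1})=\frac{2p}{z_{1}^{3}}>0,\qquad f(p^{\frac{1}{2}})=2p^{\frac{1}{2}}.$$

$$z_{1}=\lfloor 1000^{\frac{1}{2}}\rfloor=31,\qquad z_{1}+\frac{p}{z_{1}}+K^{\prime}\approx 31+32.3+50\approx 113.$$

Under these settings, each object therefore needs roughly $113$ distance evaluations on average, compared with $p=1000$ for the exact search over all representatives.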

With the $K$ -nearest representatives of each object obtained, a sparse $N\times p$ affinity sub-matrix can thereby be constructed. In this paper, the Gaussian kernel is used as the similarity kernel. Thus the sparse affinity sub-matrix can be represented as

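$$B=\{b_{ij}\}_{N\times p},\tag{5}$$

$$b_{ij}=\begin{cases}\exp\left(-\dfrac{\|x_{i}-r_{j}\|^{2}}{2\sigma^{2}}\right),&\text{if }r_{j}\in N_{K}(x_{i}),\\[4pt]0,&\text{otherwise},\end{cases}\tag{6}$$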

where $N_{K}(x_{i})$ denotes the set of $K$ -nearest representatives of $x_{i}$ and the kernel parameter $\sigma$ is set to the average Euclidean distance between the objects and their $K$ -nearest representatives. Note that $B$ is a sparse matrix which only contains $NK$ non-zero entries.
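A minimal sketch of this construction is given below. It assumes the standard Gaussian form $\exp(-\|x_{i}-r_{j}\|^{2}/(2\sigma^{2}))$ for the non-zero entries and stores each row as a dictionary; the function name `sparse_affinity` and the toy distances are hypothetical.

```python
import math


def sparse_affinity(knr_dists):
    """knr_dists[i] maps a representative index to its Euclidean distance from object i,
    for the K nearest representatives only. Returns (rows, sigma) with rows[i] = {j: b_ij}."""
    all_d = [d for row in knr_dists for d in row.values()]
    sigma = sum(all_d) / len(all_d)   # average object-to-representative distance
    rows = [{j: math.exp(-d * d / (2 * sigma * sigma)) for j, d in row.items()}
            for row in knr_dists]
    return rows, sigma


# Toy usage: N = 3 objects, K = 2 nearest representatives each (distances are made up).
knr = [{0: 1.0, 3: 2.0}, {1: 1.0, 3: 3.0}, {0: 2.0, 2: 3.0}]
B, sigma = sparse_affinity(knr)
```

Only the $NK$ retained entries are ever computed or stored, which is what keeps the memory cost at $O(NK)$.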

3.1.3 Bipartite Graph Partitioning

The affinity sub-matrix $B$ reflects the relationship between the objects in $\mathcal{X}$ and the representatives in $\mathcal{R}$ , which can be naturally interpreted as a bipartite graph $G=\{\mathcal{X},\mathcal{R},B\}$ , where $\mathcal{X}\cup\mathcal{R}$ is the node set and $B$ is the cross-affinity matrix (as shown in Fig. 4). By taking advantage of the bipartite graph structure, the transfer cut [6] can thereby be used to efficiently partition the graph and achieve the final clustering result.

To start, if we view the graph $G$ as a general graph with $N+p$ nodes, then its full affinity matrix can be denoted as

$$E=\begin{bmatrix}\mathbf{0}&B\\B^{\top}&\mathbf{0}\end{bmatrix}.\tag{7}$$

Spectral clustering seeks to partition the graph by solving the following generalized eigen-problem [29]:

$$Lu=\gamma Du,\tag{8}$$

where $L=D-E$ is the graph Laplacian and $D\in\mathbb{R}^{(N+p)\times(N+p)}$ is the degree matrix. By treating $G$ as a general graph, it takes $O((N+p)^{3})$ time to solve the eigen-problem (8) [30], which is not computationally feasible for very large-scale datasets.

By exploiting the bipartite structure, we resort to the transfer cut [6] to reduce the eigen-problem (8) on the graph $G$ (with $N+p$ nodes) to an eigen-problem on a much smaller graph $G_{\mathcal{R}}$ (with $p$ nodes). Specifically, the graph $G_{\mathcal{R}}$ is constructed as $G_{\mathcal{R}}=\{\mathcal{R},E_{\mathcal{R}}\}$ , where $\mathcal{R}$ is the node set, $E_{\mathcal{R}}=B^{\top}{D_{\mathcal{X}}}^{-1}B$ is the affinity matrix (whose computation takes $O(NK^{2})$ time), and $D_{\mathcal{X}}\in\mathbb{R}^{N\times N}$ is a diagonal matrix with its $(i,i)$ -th entry being the sum of the $i$ -th row of $B$ . Let $L_{\mathcal{R}}=D_{\mathcal{R}}-E_{\mathcal{R}}$ be the graph Laplacian, where $D_{\mathcal{R}}\in\mathbb{R}^{p\times p}$ is the degree matrix of $G_{\mathcal{R}}$ . Then, the generalized eigen-problem on the graph $G_{\mathcal{R}}$ can be represented as

$$L_{\mathcal{R}}v=\lambda D_{\mathcal{R}}v.\tag{9}$$

It has been proved by Li et al. [6] that solving the eigen-problem (8) on the graph $G$ is equivalent to solving the eigen-problem (9) on the graph $G_{\mathcal{R}}$ . Let the first $k$ eigen-pairs for the eigen-problem (9) be denoted as $\{(\lambda_{i},v_{i})\}_{i=1}^{k}$ with $0=\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{k}<1$ , and the first $k$ eigen-pairs for the eigen-problem (8) denoted as $\{(\gamma_{i},u_{i})\}_{i=1}^{k}$ with $0=\gamma_{1}\leq\gamma_{2}\leq\cdots\leq\gamma_{k}<1$ . It has been shown that [6]

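$$\gamma_{i}(2-\gamma_{i})=\lambda_{i},\tag{10}$$

$$u_{i}=\begin{bmatrix}h_{i}\\v_{i}\end{bmatrix},\tag{11}$$

$$h_{i}=\frac{1}{1-\gamma_{i}}Tv_{i},\tag{12}$$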

where $T=D_{\mathcal{X}}^{-1}B$ is the transition probability matrix. It takes $O(p^{3})$ time to compute the first $k$ eigen-pairs for the eigen-problem (9). As $B$ is a sparse matrix with $NK$ non-zero entries, it takes $O(NK)$ time to compute $u_{i}$ from $v_{i}$ according to Eqs. (10), (11), and (12). Therefore, the total cost of computing the first $k$ eigenvectors for the eigen-problem (8) will be $O(NK^{2})+O(NKk)+O(p^{3})=O(NK(K+k)+p^{3})$ .
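The link between the two eigen-problems can be checked directly. The following is a short sketch of the argument, not a substitute for the proof in [6]. It assumes that the nodes are ordered with the $N$ objects first, so that $u=[h;v]$ with $h\in\mathbb{R}^{N}$ and $v\in\mathbb{R}^{p}$, and that $\gamma\neq 1$. Since every row of $T$ sums to one, $E_{\mathcal{R}}\mathbf{1}=B^{\top}T\mathbf{1}=B^{\top}\mathbf{1}$, so the degree of each representative is the same in $G$ and in $G_{\mathcal{R}}$ and $D=\mathrm{diag}(D_{\mathcal{X}},D_{\mathcal{R}})$. Then

$$
\begin{aligned}
(D-E)u=\gamma Du\;&\Longleftrightarrow\;Bv=(1-\gamma)D_{\mathcal{X}}h,\quad B^{\top}h=(1-\gamma)D_{\mathcal{R}}v\\
&\Longrightarrow\;h=\frac{1}{1-\gamma}Tv,\quad E_{\mathcal{R}}v=B^{\top}D_{\mathcal{X}}^{-1}Bv=(1-\gamma)^{2}D_{\mathcal{R}}v\\
&\Longrightarrow\;L_{\mathcal{R}}v=\bigl(1-(1-\gamma)^{2}\bigr)D_{\mathcal{R}}v=\gamma(2-\gamma)D_{\mathcal{R}}v.
\end{aligned}
$$

Hence $\lambda=\gamma(2-\gamma)$, or equivalently $\gamma=1-\sqrt{1-\lambda}$ for $\gamma<1$.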

With the eigen-problem solved, the obtained $k$ eigenvectors are stacked to form an $(N+p)\times k$ matrix. By treating each row of this matrix as a new feature vector, the $N$ rows corresponding to the $N$ original objects are used, upon which the $k$ -means discretization can be performed to obtain the final clustering result with $O(Nk^{2}t)$ time complexity.
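The transfer from the small graph to the full bipartite graph can be checked numerically on a toy example. The sketch below is deliberately restricted to $p=2$ representatives, where the non-trivial solution of the small eigen-problem has a closed form, so no eigen-solver is needed; the matrix values are made up, and the sign-based split at the end is a simplification that replaces the $k$-means discretization for $k=2$.

```python
import math

# Toy cross-affinity matrix B (N = 4 objects, p = 2 representatives); values are made up.
B = [[0.9, 0.1],
     [0.8, 0.2],
     [0.2, 0.7],
     [0.1, 0.9]]
N, p = len(B), 2

d_x = [sum(row) for row in B]                                  # diagonal of D_X
T = [[B[i][j] / d_x[i] for j in range(p)] for i in range(N)]   # T = D_X^{-1} B
# E_R = B^T D_X^{-1} B, the affinity matrix of the small graph G_R.
E_R = [[sum(B[i][a] * T[i][b] for i in range(N)) for b in range(p)] for a in range(p)]
d_r = [sum(row) for row in E_R]                                # diagonal of D_R

# For p = 2, the non-trivial solution of L_R v = lambda D_R v has a closed form.
w = E_R[0][1]
lam = w * (1 / d_r[0] + 1 / d_r[1])
v = [1 / d_r[0], -1 / d_r[1]]

# Transfer to the full graph: gamma (2 - gamma) = lambda and h = T v / (1 - gamma).
gamma = 1 - math.sqrt(1 - lam)
h = [sum(T[i][j] * v[j] for j in range(p)) / (1 - gamma) for i in range(N)]

# Residuals of (D - E) u = gamma D u for u = [h; v]; both should vanish up to rounding.
res_obj = max(abs((1 - gamma) * d_x[i] * h[i] - sum(B[i][j] * v[j] for j in range(p)))
              for i in range(N))
res_rep = max(abs((1 - gamma) * d_r[j] * v[j] - sum(B[i][j] * h[i] for i in range(N)))
              for j in range(p))

# With k = 2, the sign of h already separates the objects into two groups.
labels = [0 if h_i > 0 else 1 for h_i in h]
```

For realistic $p$, the closed form would be replaced by a numerical solver for the $p\times p$ generalized eigen-problem; the transfer step itself stays the same.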

3.1.4 Computational Complexity

In this section, we summarize the time and memory cost of our U-SPEC algorithm.

The hybrid representative selection takes $O(p^{2}dt)$ time. The affinity construction takes $O(N(p^{\frac{1}{2}}d+Kd+K^{2}))$ time. The eigen-decomposition takes $O(NK(K+k)+p^{3})$ time. The $k$ -means discretization takes $O(Nk^{2}t)$ time. With consideration to $k,K\ll p\ll N$ , the overall time complexity of U-SPEC is $O(N(p^{\frac{1}{2}}d+K^{2}+Kk+Kd+k^{2}t))$ , where $O(Np^{\frac{1}{2}}d)$ is the dominant term. Table II provides a comparison of time complexity of our U-SPEC algorithm against several other large-scale spectral clustering algorithms.

Besides the time cost, the memory cost of U-SPEC can be either $O(NK)$ or $O(Np^{\frac{1}{2}})$, depending on the actual implementation of the $K$-nearest representative approximation. As the $K$-nearest representative approximations for the $N$ objects are independent of each other, one strategy is to perform the approximation for the $N$ objects one after the other (i.e., in a serial processing manner), where the memory cost is dominated by the storage of the cross-affinity matrix with $NK$ non-zero entries. Another strategy is to first construct an affinity matrix between the $N$ objects and the $z_{1}=\lfloor{p}^{\frac{1}{2}}\rfloor$ rep-cluster centers and then approximate the $K$-nearest representatives for the $N$ objects in a batch processing manner. For some matrix-oriented software, such as MATLAB, it will be much faster to perform the approximation in a batch processing manner (with optimized matrix computation) than in a serial processing manner. To facilitate the matrix computation, our implementation of U-SPEC actually takes $O(Np^{\frac{1}{2}})$ memory. Similarly, the LSC algorithm [4] also has a theoretically minimum memory cost of $O(NK)$, but the implementation provided by the authors (www.cad.zju.edu.cn/home/dengcai/Data/Clustering.html) actually takes $O(Np)$ memory, which is also due to the matrix-computation consideration.

3.2 Ultra-Scalable Ensemble Clustering (U-SENC)

Starting from U-SPEC, this section proposes the U-SENC algorithm to integrate multiple U-SPEC’s into a unified ensemble clustering framework, aiming to further enhance the clustering robustness while maintaining high efficiency.

3.2.1 Ensemble Generation via Multiple U-SPEC’s

Ensemble clustering has been a popular research topic in recent years, due to its promising ability in enhancing clustering robustness by incorporating multiple base clusterers [10, 11, 14, 15, 12]. The general ensemble clustering process consists of two phases. The first phase is the ensemble generation, which involves producing a set of diverse and high-quality base clusterings. The second phase is the consensus function, which involves combining multiple base clusterings into a better and more robust consensus clustering.

In ensemble generation, the previous ensemble clustering algorithms mostly use the $k$ -means method to generate an ensemble of multiple base clusterings [10, 11, 14, 15, 12]. Though $k$ -means has the advantage of high efficiency, it typically favors spherical distribution and lacks the ability to properly partition nonlinearly separable datasets. Some researchers have exploited the spectral clustering technique in ensemble generation [31, 32], but the large computational cost of conventional spectral clustering significantly restricts its feasibility for scalable applications.

To address this, we utilize multiple instances of U-SPEC as the multiple base clusterers in our ensemble clustering framework. To generate an ensemble of $m$ base clusterings, a set of $m$ U-SPEC clusterers are required, which are denoted as U-SPEC$_{1}$, U-SPEC$_{2}$, $\cdots$, U-SPEC$_{m}$. The diversity which is highly desired in ensemble generation is incorporated from two aspects. First, the set of representatives for each base clusterer is independently obtained by the hybrid selection strategy. There are two components in hybrid selection, i.e., random pre-selection and $k$-means based post-selection, both of which are non-deterministic and can bring in diversity for the multiple base clusterers. Second, the number of clusters for each base clustering is randomly selected to further enhance the diversity. Formally, given the dataset $\mathcal{X}$, the set of $p^{\prime}$ candidate representatives for the $i$-th base clusterer (i.e., U-SPEC$_{i}$) are randomly selected from $\mathcal{X}$. Then the $k$-means is used to partition the $p^{\prime}$ candidates into $p$ clusters. After that, the $p$ cluster centers will be used as the set of $p$ representatives for U-SPEC$_{i}$, denoted as

$$\mathcal{R}^{i}=\{r^{i}_{1},r^{i}_{2},\cdots,r^{i}_{p}\}.\tag{13}$$

With the representatives obtained, the sparse affinity sub-matrix $B^{i}$ for U-SPEC$_{i}$ can be built between the dataset $\mathcal{X}$ and the representative set $\mathcal{R}^{i}$ via fast approximation of $K$-nearest representatives.

By treating $\mathcal{X}\bigcup\mathcal{R}^{i}$ as the node set and $B^{i}$ as the cross-affinity matrix, the bipartite graph $G^{i}$ is built and its first $k^{i}$ eigenvectors are then computed via transfer cut [6]. Note that the number of clusters $k^{i}$ is randomly selected as

$$k^{i}=\lfloor\tau(k_{max}-k_{min})\rfloor+k_{min},\tag{14}$$

where $\tau\in[0,1]$ is a random variable and $k_{max}$ and $k_{min}$ are respectively the upper bound and lower bound of the cluster number. Then, the obtained $k^{i}$ eigenvectors are stacked to form a new matrix, upon which the $k$-means is applied to construct the base clustering result for U-SPEC$_{i}$. With the $m$ U-SPEC clusterers, the ensemble of $m$ base clusterings can be generated, which are represented as

$$\Pi=\{\pi^{1},\pi^{2},\cdots,\pi^{m}\},\tag{15}$$

where $\pi^{i}$ denotes the $i$ -th base clustering.
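The random choice of cluster numbers can be sketched as below. The sketch assumes the selection rule takes the form $k^{i}=\lfloor\tau(k_{max}-k_{min})\rfloor+k_{min}$ with $\tau$ drawn uniformly; the function name is hypothetical, and the values $m=20$ and $[20,60]$ simply mirror the experimental settings reported later in the paper.

```python
import math
import random


def random_cluster_numbers(m, k_min, k_max, seed=0):
    """Draw one cluster number per base clusterer: floor(tau * (k_max - k_min)) + k_min."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so k_max itself is never drawn in this sketch.
    return [math.floor(rng.random() * (k_max - k_min)) + k_min for _ in range(m)]


ks = random_cluster_numbers(m=20, k_min=20, k_max=60)
```

Each of the $m$ base clusterers then runs with its own `ks[i]`, in addition to its own independently drawn representative set.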

3.2.2 Consensus Function with Bipartite Graph

Having obtained the set of multiple base clusterings, this section presents the consensus function with bipartite graph for obtaining the consensus clustering.

Each base clustering consists of a certain number of clusters. For clarity, we denote the set of clusters in the ensemble of $m$ base clusterings as

$$\mathcal{C}=\{C_{1},C_{2},\cdots,C_{k_{c}}\},\tag{16}$$

where $C_{i}$ is the $i$ -th cluster and $k_{c}$ is the total number of clusters in $\Pi$ . It is obvious that $k_{c}=\sum_{i=1}^{m}k^{i}$ .

By treating both objects and clusters as graph nodes, the bipartite graph for the ensemble $\Pi$ is defined as

$$\tilde{G}=\{\mathcal{X},\mathcal{C},\tilde{B}\},\tag{17}$$

where $\mathcal{X}\bigcup\mathcal{C}$ is the node set and $\tilde{B}$ is the cross-affinity matrix. In this bipartite graph, a (non-zero) edge exists between two nodes if and only if one node is an object and the other one is the cluster that contains it. Formally, the cross-affinity matrix is constructed as follows:

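$$\tilde{B}=\{\tilde{b}_{ij}\}_{N\times k_{c}},\tag{18}$$

$$\tilde{b}_{ij}=\begin{cases}1,&\text{if }x_{i}\in C_{j},\\0,&\text{otherwise}.\end{cases}\tag{19}$$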

Inside the same base clustering, there is no intersection between two different clusters, i.e., $\forall i^{\prime}\neq j^{\prime}$ , if $C_{i^{\prime}}\in\pi^{i}$ and $C_{j^{\prime}}\in\pi^{i}$ , then $C_{i^{\prime}}\bigcap C_{j^{\prime}}=\emptyset$ . Obviously, each object belongs to one and only one cluster in each base clustering, and thus each object belongs exactly to $m$ clusters in the ensemble of $m$ base clusterings. Therefore, there are exactly $m$ non-zero entries in each row of $\tilde{B}$ . Although the cross-affinity matrix $\tilde{B}$ is an $N\times k_{c}$ matrix, it can be stored as a sparse matrix with $O(Nm)$ memory, which corresponds to the exactly $Nm$ non-zero entries in $\tilde{B}$ . Besides the memory cost, the time cost of building the sparse matrix $\tilde{B}$ is $O(Nm)$ .
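The construction of the object-cluster matrix can be sketched as follows. This is an illustration under stated assumptions: non-zero entries are taken to be $1$, each base clustering is given as a list of 0-based labels with no empty cluster, and the function name `build_cross_affinity` is hypothetical. Only the column indices of the non-zero entries are stored.

```python
def build_cross_affinity(base_clusterings):
    """base_clusterings[i][n] is the 0-based cluster label of object n in the i-th base clustering.
    Returns (rows, k_c), where rows[n] lists the column indices of the unit entries in row n."""
    n_objects = len(base_clusterings[0])
    rows = [[] for _ in range(n_objects)]
    offset = 0                                # first global column of the current base clustering
    for labels in base_clusterings:
        for n, lab in enumerate(labels):
            rows[n].append(offset + lab)      # object n is linked to the cluster that contains it
        offset += max(labels) + 1             # number of clusters in this base clustering
    return rows, offset


# Toy ensemble: N = 5 objects, m = 3 base clusterings with 2, 3, and 2 clusters.
ensemble = [[0, 0, 1, 1, 1],
            [0, 1, 1, 2, 2],
            [1, 1, 0, 0, 0]]
rows, k_c = build_cross_affinity(ensemble)
```

Every row receives exactly $m$ entries, so both the time and the memory used by this loop grow as $O(Nm)$.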

As shown in Section 3.1.3, solving the eigen-problem for the bipartite graph $\tilde{G}$ can be equivalent to solving the eigen-problem for a much smaller graph $G_{\mathcal{C}}=\{\mathcal{C},E_{\mathcal{C}}\}$ , that is

$$L_{\mathcal{C}}\tilde{v}=\lambda D_{\mathcal{C}}\tilde{v},\tag{20}$$

where $E_{\mathcal{C}}=\tilde{B}^{\top}{\tilde{D}_{\mathcal{X}}}^{-1}\tilde{B}$ is the affinity matrix, $\tilde{D}_{\mathcal{X}}\in\mathbb{R}^{N\times N}$ is a diagonal matrix with its $(i,i)$ -th entry being the sum of the $i$ -th row of $\tilde{B}$ , ${L}_{\mathcal{C}}={D}_{\mathcal{C}}-E_{\mathcal{C}}$ is the graph Laplacian, and ${D}_{\mathcal{C}}\in\mathbb{R}^{k_{c}\times k_{c}}$ is the degree matrix of $G_{\mathcal{C}}$ .

Let $\tilde{v}_{1},\tilde{v}_{2},\cdots,\tilde{v}_{k}$ denote the first $k$ eigenvectors for the eigen-problem (20), which can be computed with a time cost of $O({k_{c}}^{3})$ . Based on the $k$ eigenvectors for $G_{\mathcal{C}}$ , the first $k$ eigenvectors (denoted as $\tilde{u}_{1},\tilde{u}_{2},\cdots,\tilde{u}_{k}$ ) for the bipartite graph $\tilde{G}$ can be computed with $O(Nm(m+k))$ time (see Eqs. (10), (11), and (12)). Finally, by stacking the $k$ eigenvectors to form a new matrix, the consensus clustering result in U-SENC can be obtained by $k$ -means discretization with $O(Nk^{2}t)$ time.

3.2.3 Computational Complexity

This section summarizes the time and memory cost of the proposed U-SENC algorithm.

The ensemble generation of the U-SENC algorithm takes $O(Nm(p^{\frac{1}{2}}d+K^{2}+Kk+Kd+k^{2}t))$ time. The consensus function of U-SENC takes $O(N(m^{2}+mk+k^{2}t)+{k_{c}}^{3})$ time. With consideration to $m,k,K\ll p\ll N$ , the dominant term of the overall time complexity of U-SENC is $O(Nmp^{\frac{1}{2}}d)$ .

Meanwhile, the memory costs of the ensemble generation and the consensus function of our U-SENC algorithm are respectively $O(Np^{\frac{1}{2}})$ and $O(Nm)$ .

4 Experiments

In this section, we conduct experiments on a variety of real and synthetic datasets to compare the proposed U-SPEC and U-SENC algorithms against several state-of-the-art spectral clustering and ensemble clustering algorithms.

All experiments are conducted in MATLAB 2016b on a PC with an Intel i5-6600 CPU and 64GB of RAM.

4.1 Datasets and Evaluation Measures

Our experiments are conducted on ten large-scale datasets (including five real datasets and five synthetic datasets), whose data sizes range from ten thousand to as large as twenty million. Specifically, the five real datasets are PenDigits [33], USPS [34], Letters [33], MNIST [34], and Covertype [33]. The five synthetic datasets are Two Bananas-1M (TB-1M), Smiling Face-2M (SF-2M), Concentric Circles-5M (CC-5M), Circles and Gaussians-10M (CG-10M), and Flower-20M. The details of the datasets are provided in Table III and Fig. 5.

To evaluate the clustering results by different algorithms, two widely used evaluation measures are adopted, namely, normalized mutual information (NMI) [18] and clustering accuracy (CA) [35]. To reduce the influence of chance, each test method is run 20 times in every experiment, and its average NMI, CA, and time cost are reported. Note that larger values of NMI and CA indicate better clustering results.
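For reference, NMI can be computed directly from two label vectors. The sketch below assumes the common geometric-mean normalization $I(A;B)/\sqrt{H(A)H(B)}$; the exact variant used in the experiments is defined in [18], and the handling of single-cluster labelings is a convention chosen for this example.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information: I(A;B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:   # a single-cluster labeling carries no information
        return 1.0 if h_a == h_b else 0.0
    return mutual / math.sqrt(h_a * h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical partitions, labels permuted
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent partitions
```

NMI is invariant to label permutations, which is why `same` reaches its maximum even though the label values differ.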

4.2 Baseline Methods and Experimental Settings

In the experiments, we first compare our algorithms against the classical $k$ -means algorithm [36] as well as seven spectral clustering algorithms (including the original algorithm and six large-scale algorithms). The baseline spectral clustering algorithms are listed as follows:

1. SC [2]: original spectral clustering.
2. ESCG [37]: efficient spectral clustering on graphs.
3. Nyström [3]: Nyström spectral clustering.
4. LSC-K [4]: landmark based spectral clustering using $k$-means based landmark selection.
5. LSC-R [4]: landmark based spectral clustering using random landmark selection.
6. FastESC [5]: fast explicit spectral clustering.
7. EulerSC [7]: Euler spectral clustering.

Besides these large-scale spectral clustering algorithms, we also compare our algorithms against seven ensemble clustering algorithms, which are listed as follows:

1. EAC [8]: evidence accumulation clustering.
2. WCT [9]: weighted connected triple method.
3. KCC [10]: $k$-means based consensus clustering.
4. PTGP [11]: probability trajectory based graph partitioning.
5. ECC [14]: entropy based consensus clustering.
6. SEC [15]: spectral ensemble clustering.
7. LWGP [12]: locally weighted graph partitioning.

There are several common parameters among the above-mentioned algorithms. In our experiments, we comply with the following experimental settings:

• The SC and ESCG methods need to take the $N\times N$ affinity matrix as input. The affinity matrix is constructed using the same Gaussian kernel as Eq. (6) with $K$-nearest neighbors.

• The U-SPEC, U-SENC, Nyström, LSC-K, and LSC-R methods have a common parameter $p$. In the experiments, $p=1000$ is used for these methods. Their performances with varying $p$ will be further evaluated in Section 4.5.1.

• The U-SPEC, U-SENC, LSC-K, and LSC-R methods have a common parameter $K$. In the experiments, $K=5$ is used. Their performances with varying $K$ will be further evaluated in Section 4.5.2.

• For the seven ensemble clustering methods, the base clusterings are generated by $k$-means as suggested by their papers [8, 9, 10, 11, 14, 15, 12]. The number of clusters in each base clustering is randomly selected in $[20,60]$. The number of base clusterings, i.e., $m$, is set to $20$. Their performances with varying $m$ will be further evaluated in Section 4.5.3.

• The true number of classes on each dataset is used as the number of clusters for all the test methods.

• Besides these common parameters, the other parameters in the baseline methods will be set as suggested by the corresponding papers.

4.3 Comparison with Spectral Clustering Methods

In this section, we compare our U-SPEC and U-SENC algorithms with several state-of-the-art large-scale spectral clustering algorithms.

As the data sizes range from ten thousand to twenty million, most of the baseline algorithms are not computationally feasible for ten-million-level datasets. Specifically, we use N/A to indicate the out-of-memory error in the results. As shown in Tables IV and V, the SC and ESCG methods are not able to handle the datasets larger than MNIST (which consists of 70,000 objects), due to the memory consumption of constructing and manipulating the $N\times N$ affinity matrix. The Nyström, LSC-K, LSC-R, and FastESC methods can at most partition a dataset with two million objects, and cannot deal with datasets larger than that. Out of the total of nine spectral clustering methods, only three methods (i.e., U-SPEC, U-SENC, and EulerSC) can deal with all of the benchmark datasets. As shown in Tables IV and V, our U-SENC and U-SPEC methods achieve the best and the second best scores, respectively, on most of the ten benchmark datasets.

In Tables IV and V, we also provide the average score, normalized average score (N-Avg. score), and average rank of each method across the ten datasets. To obtain the normalized average score, the scores in each row will first be divided by the maximum score in this row, where it is obvious that the maximum score will become $100\%$ . Then we take the average of these normalized rows as the normalized average score. Note that if a baseline method cannot process all the datasets, it will not have the average score and normalized average score information, but it will still have the average rank information. For example, if only three methods are efficient enough to process the CC-5M dataset, then all the other infeasible methods will be treated as equally ranked in the fourth position on this dataset. As shown in Tables IV and V, our U-SENC method ranks in the first position on nine out of the ten datasets, and achieves an average rank of 1.10 w.r.t. both NMI and CA. Our U-SPEC method achieves an average rank of 2.40 w.r.t. NMI and 2.00 w.r.t. CA. In terms of average score and normalized average score, our U-SENC and U-SPEC methods also significantly outperform the other methods.
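The three summary statistics can be sketched as follows. The scores and method names in the example are made up, the function name `summarize` is hypothetical, and ties among feasible methods are resolved by competition ranking, which is an assumption since the text does not say how ties are handled.

```python
def summarize(scores):
    """scores[dataset][method] is a score in percent, or None for N/A (out of memory).
    Returns, per method, the tuple (average, normalized average, average rank)."""
    methods = list(next(iter(scores.values())))
    summary = {}
    for m in methods:
        vals, norm, ranks = [], [], []
        for row in scores.values():
            feasible = {k: v for k, v in row.items() if v is not None}
            best = max(feasible.values())
            if row[m] is None:
                ranks.append(len(feasible) + 1)   # all infeasible methods share this rank
            else:
                vals.append(row[m])
                norm.append(100.0 * row[m] / best)
                ranks.append(1 + sum(v > row[m] for v in feasible.values()))
        complete = len(vals) == len(scores)       # averages only if every dataset was processed
        summary[m] = (sum(vals) / len(vals) if complete else None,
                      sum(norm) / len(norm) if complete else None,
                      sum(ranks) / len(ranks))
    return summary


# Made-up scores for two datasets and three hypothetical methods.
toy = {"D1": {"A": 80.0, "B": 60.0, "C": 40.0},
       "D2": {"A": 50.0, "B": None, "C": 100.0}}
result = summarize(toy)
```

In the toy table, method B fails on D2, so it receives no average or normalized average score, but it is still ranked third on D2, behind the two feasible methods.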

Table VI reports the time costs of different methods on the benchmark datasets. The U-SPEC shows superior efficiency on most of the datasets, especially on the datasets larger than one million. The U-SENC requires a larger time cost than U-SPEC, but it still provides better scalability than most of the baseline methods and scales well for ten-million-level datasets due to its memory efficiency. As U-SENC is a spectral clustering algorithm and also an ensemble clustering algorithm, in the following, we will further compare it with other state-of-the-art ensemble clustering algorithms.

4.4 Comparison with Ensemble Clustering Methods

In this section, we compare our algorithms with several state-of-the-art ensemble clustering algorithms.

Note that U-SPEC is not an ensemble clustering algorithm; its clustering results are provided in Tables VII, VIII, and IX for reference only. As shown in Tables VII and VIII, our U-SENC algorithm obtains the highest NMI and CA scores on all of the ten datasets. In terms of average score across the ten datasets, U-SENC achieves the best average NMI($\%$) and CA($\%$) scores of $74.57$ and $81.68$, respectively, while the second best ensemble clustering method (i.e., LWGP) only achieves average NMI($\%$) and CA($\%$) scores of $66.62$ and $74.49$, respectively. Similar advantages of U-SENC can also be observed in the normalized average scores. In terms of average rank, U-SENC obtains an average rank of 1.00 w.r.t. both NMI and CA, while the second best method obtains an average rank of 2.80 w.r.t. NMI and 2.90 w.r.t. CA.

In Table IX, the time costs of different ensemble clustering methods are provided. As can be seen in Table IX, the proposed U-SENC method has shown its advantage in efficiency over the other ensemble clustering methods, especially on the large-scale datasets whose data sizes go beyond millions.

4.5 Parameters Analysis

In this section, we evaluate the performance of our algorithms and several baseline algorithms with varying parameters. Because some important baseline methods (such as Nyström, LSC-K, and LSC-R) cannot go beyond two-million-level datasets, we perform the parameter analysis on four benchmark datasets, namely, MNIST, Covertype, TB-1M, and SF-2M, so that the influence of the parameters shared among these methods can be tested fairly. These are the four largest datasets whose sizes do not exceed two million.

4.5.1 Number of Representatives $p$

The parameter $p$ denotes the number of representatives (or landmarks), which is a common parameter in the sub-matrix based spectral clustering methods, such as Nyström, LSC-K, LSC-R, and our U-SPEC and U-SENC methods. As can be seen in Table X, a larger $p$ generally leads to better performance, but also to a higher time cost. In terms of NMI and CA, our U-SENC method consistently outperforms the other methods with varying parameter $p$ on all four datasets. LSC-K outperforms U-SPEC on the MNIST dataset, but on the other three datasets U-SPEC achieves better or significantly better NMI and CA scores than LSC-K. In terms of computational cost, the LSC-K and Nyström methods cannot deal with $p\geq 1,400$ representatives on the SF-2M dataset with two million objects. On the benchmark datasets, U-SPEC is overall the fastest method with varying parameter $p$ (as shown in Table X).

4.5.2 Number of Nearest Representatives $K$

The parameter $K$ denotes the number of nearest representatives (or landmarks), which is a common parameter in LSC-K, LSC-R, and our U-SPEC and U-SENC methods. Note that the Nyström method has no such parameter $K$; we nevertheless report its performance in Table XI as a reference point. As illustrated in Table XI, on the MNIST dataset, U-SENC and LSC-K are respectively the best and the second best methods w.r.t. NMI and CA, while U-SPEC is the third best method. On the other three benchmark datasets, U-SENC and U-SPEC are overall the best two methods w.r.t. both NMI and CA with varying parameter $K$ (as shown in Table XI).

4.5.3 Ensemble Size $m$

The parameter $m$ denotes the number of base clusterings, which is a common parameter in all of the ensemble clustering methods, including U-SENC as well as the baseline ensemble clustering methods. Note that U-SPEC is not an ensemble clustering method and has no parameter $m$; its performance is included in Table XII for reference only. As shown in Table XII, U-SENC outperforms, or even significantly outperforms, the other ensemble clustering methods w.r.t. both NMI and CA on the benchmark datasets with varying ensemble size $m$. Meanwhile, U-SENC consistently requires a lower computational cost than the other ensemble clustering methods.

4.6 Influence of Representative Selection Strategies

In this section, we compare the performances of our algorithms using different representative selection strategies. Specifically, Table XIII illustrates the performances of U-SPEC using hybrid selection (U-SPEC-H), U-SPEC using random selection (U-SPEC-R), and U-SPEC using $k$ -means based selection (U-SPEC-K), whereas Table XIV illustrates the performances of U-SENC using hybrid selection (U-SENC-H), U-SENC using random selection (U-SENC-R), and U-SENC using $k$ -means based selection (U-SENC-K). As shown in Tables XIII and XIV, the random representative selection is very efficient compared to $k$ -means based selection, but may degrade the clustering quality due to its inherent instability. The $k$ -means based selection generally leads to better clustering quality than random selection, but brings in a much larger computational cost. Compared to random selection and $k$ -means based selection, our hybrid selection strategy strikes a balance between efficiency and clustering robustness. It achieves comparable efficiency to the random selection and significantly better efficiency than the $k$ -means based selection, and also yields competitive clustering quality as compared to the $k$ -means based selection.
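The trade-off among the three strategies can be made concrete with a small sketch. It assumes that the hybrid strategy of Section 3.1.1 amounts to running $k$-means on a random candidate subset instead of on all $N$ objects; the function names, the plain Lloyd's $k$-means stand-in, and the `candidate_factor` default are hypothetical choices made for this illustration, not settings taken from this section.

```python
import numpy as np

def kmeans_centers(X, k, rng, n_iter=10):
    """Plain Lloyd's k-means returning k centers (a simple stand-in, not an optimized routine)."""
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n, k); dense, so only suitable for small demos.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the previous center if a cluster becomes empty
                centers[j] = members.mean(axis=0)
    return centers

def select_random(X, p, rng):
    """Random selection: p objects drawn uniformly without replacement."""
    X = np.asarray(X, dtype=float)
    return X[rng.choice(len(X), size=p, replace=False)]

def select_kmeans(X, p, rng):
    """k-means based selection: cluster all N objects and use the p centers."""
    return kmeans_centers(X, p, rng)

def select_hybrid(X, p, rng, candidate_factor=10):
    """Hybrid selection (assumed form): random pre-sampling, then k-means on the sample.
    candidate_factor is a hypothetical default, not a value quoted from the text."""
    X = np.asarray(X, dtype=float)
    n_candidates = min(len(X), candidate_factor * p)
    candidates = X[rng.choice(len(X), size=n_candidates, replace=False)]
    return kmeans_centers(candidates, p, rng)
```

Under this reading, the $k$-means step of the hybrid variant touches only the candidate subset, so its cost no longer grows with $N$, while the resulting representatives are still cluster centers rather than arbitrary points.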

4.7 Influence of Approximate $K$ -Nearest Neighbors

In this section, we compare our algorithms using Approximate $K$ -nearest representatives against using Exact $K$ -nearest representatives, where four variants are evaluated, i.e., U-SPEC(A), U-SPEC(E), U-SENC(A), and U-SENC(E). The purpose of using approximate $K$ -nearest representatives (see Section 3.1.2) is to alleviate the time and memory cost of the affinity sub-matrix construction while maintaining the overall clustering quality. As shown in Tables XV and XVI, using approximate $K$ -nearest representatives can achieve comparable clustering quality (w.r.t. NMI and CA) with using exact $K$ -nearest representatives while alleviating the computational cost. As our approximation of $K$ -nearest representatives reduces the time complexity from $O(Npd)$ to $O(Np^{\frac{1}{2}}d)$ , the improvement in efficiency is more significant for high-dimensional datasets, such as the MNIST dataset, whose dimension is 784. Even for the low-dimensional datasets, such as TB-1M and SF-2M, the use of approximate $K$ -nearest representatives can still consistently reduce the time cost. Besides the time efficiency, the approximate $K$ -nearest representatives also alleviate the memory burden. Specifically, on a machine with 64GB memory, the computation of conventional $K$ -nearest representatives can hardly go beyond five million objects, whereas the proposed approximation method for $K$ -nearest representatives can scale well for even ten-million-level datasets.
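As a rough worked illustration of the complexity reduction quoted above, the ratio of the two costs depends only on $p$ once constant factors are ignored. The value $p = 1000$ below is an assumed example setting chosen for the arithmetic, not a figure taken from this section.

$$\frac{Npd}{Np^{\frac{1}{2}}d} = p^{\frac{1}{2}}, \qquad p = 1000 \;\Rightarrow\; p^{\frac{1}{2}} \approx 31.6$$

$$Npd - Np^{\frac{1}{2}}d = Nd\,\bigl(p - p^{\frac{1}{2}}\bigr)$$

The relative saving is therefore independent of $N$ and $d$, whereas the absolute number of operations saved grows linearly in $d$. This is consistent with the larger gains reported for high-dimensional data such as MNIST.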

5 Conclusion

This paper proposes two large-scale clustering algorithms, termed ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC), respectively. In U-SPEC, a new hybrid representative selection strategy is designed to strike a balance between the efficiency of random selection and the effectiveness of $k$ -means based selection. Then a new approximation method for $K$ -nearest representatives is presented to efficiently construct a bipartite graph between the original data objects and the set of representatives, upon which the transfer cut can be utilized to obtain the clustering result. Starting from the U-SPEC algorithm, we further integrate multiple U-SPEC clusterers into a unified ensemble clustering framework and propose the U-SENC algorithm. Specifically, multiple U-SPEC’s are exploited in the ensemble generation phase to produce an ensemble of diverse and high-quality base clusterings. The multiple base clusterings are incorporated into a new bipartite graph, which treats both objects and base clusters as graph nodes and is then efficiently partitioned to achieve the final consensus clustering. Extensive experiments have been conducted on ten large-scale datasets, which demonstrate the scalability and robustness of our algorithms.

Acknowledgments

This project was supported by NSFC (61602189, 61876193 & 61876104), National Key Research and Development Program of China (2016YFB1001003), and Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306014).

