Ultra-Scalable Spectral Clustering and Ensemble Clustering
Dong Huang, Chang-Dong Wang, Jian-Sheng Wu, Jian-Huang Lai, Chee-Keong Kwoh

TL;DR
This paper introduces ultra-scalable spectral and ensemble clustering algorithms designed for extremely large datasets, achieving high efficiency and robustness with nearly linear complexity, suitable for resource-limited environments.
Contribution
The paper presents two novel algorithms, U-SPEC and U-SENC, that significantly improve scalability and robustness of spectral clustering for large-scale data.
Findings
Capable of clustering ten-million-level datasets on standard PCs
Nearly linear time and space complexity achieved
Demonstrated robustness and scalability on various large datasets
Abstract
This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SPEC clusterers, a new bipartite graph is constructed between objects and…
| Notation | Description |
|---|---|
| | A dataset of objects |
| | The -th object in |
| | Number of objects in |
| | Dimension |
| | Number of iterations in the $k$-means method |
| | Number of clusters in the clustering result |
| | Number of candidate representatives |
| | Number of representatives |
| | The set of representatives |
| | The -th representatives in |
| | The set of rep-clusters |
| | The -th rep-cluster in |
| | Center of |
| | Number of rep-clusters in |
| | Average number of objects in each rep-cluster |
| | Number of nearest representatives |
| | Candidate neighborhood size around a representative |
| | Distance between object and rep-cluster |
| | A bipartite graph between and |
| | Cross-affinity matrix of graph |
| | The -th entry of |
| | Full affinity matrix of graph |
| | Graph Laplacian of graph |
| | Degree matrix of graph |
| | The -th eigenvector of graph |
| | The -th eigenvalue of graph |
| | A small graph with as the node set |
| | Affinity matrix of graph |
| | Graph Laplacian of graph |
| | Degree matrix of graph |
| | The -th eigenvector of graph |
| | The -th eigenvalue of graph |
| | Diagonal matrix with its -th entry being the sum of the -th row of |
| | Transition probability matrix |
| | The ensemble of base clusterings |
| | The -th base clustering in |
| | Number of base clusterings in |
| U-SPEC$_i$ | The clusterer to generate the -th base clustering |
| | The set of representatives in U-SPEC$_i$ |
| | The -th representatives in |
| | Number of clusters in |
| | Minimum number of clusters in a base clustering |
| | Maximum number of clusters in a base clustering |
| | Random variable in |
| | Set of all clusters in |
| | The -th cluster in |
| | Number of clusters in |
| | A bipartite graph between and |
| | Cross-affinity matrix of graph |
| | The -th entry of |
| | The -th eigenvector of graph |
| | Diagonal matrix with its -th entry being the sum of the -th row of |
| | A small graph with as the node set |
| | Affinity matrix of graph |
| | Graph Laplacian of graph |
| | Degree matrix of graph |
| | The -th eigenvector of graph |
| | The -th eigenvalue of graph |
| Type | Dataset | #Object | Dimension | #Class |
|---|---|---|---|---|
| Real | PenDigits | 10,992 | 16 | 10 |
| Real | USPS | 11,000 | 256 | 10 |
| Real | Letters | 20,000 | 16 | 26 |
| Real | MNIST | 70,000 | 784 | 10 |
| Real | Covertype | 581,012 | 54 | 7 |
| Synthetic | TB-1M | 1,000,000 | 2 | 2 |
| Synthetic | SF-2M | 2,000,000 | 2 | 4 |
| Synthetic | CC-5M | 5,000,000 | 2 | 3 |
| Synthetic | CG-10M | 10,000,000 | 2 | 11 |
| Synthetic | Flower-20M | 20,000,000 | 2 | 13 |
| Dataset | $k$-means | SC | ESCG | Nyström | LSC-K | LSC-R | FastESC | EulerSC | U-SPEC | U-SENC |
|---|---|---|---|---|---|---|---|---|---|---|
| PenDigits | 66.66±1.76 | 59.36±0.00 | 76.41±2.26 | 65.67±1.16 | 79.73±2.09 | 78.13±2.20 | 65.31±0.71 | 58.59±0.73 | 80.30±2.18 | 85.34±0.91 |
| USPS | 44.11±1.24 | 63.44±0.01 | 48.41±3.53 | 44.91±1.28 | 66.86±1.58 | 58.64±1.31 | 41.36±1.80 | 40.31±1.91 | 63.47±0.97 | 73.89±1.82 |
| Letters | 34.86±0.60 | 10.43±0.50 | 35.80±1.72 | 39.02±0.83 | 43.41±0.81 | 40.98±0.93 | 35.92±1.41 | 31.76±0.92 | 42.53±1.32 | 45.90±0.58 |
| MNIST | 48.91±2.00 | 74.07±0.00 | 55.75±4.62 | 47.78±1.17 | 73.97±1.46 | 62.16±2.22 | 43.44±1.85 | 8.93±1.22 | 67.43±1.55 | 75.02±0.81 |
| Covertype | 6.17±0.00 | N/A | N/A | 6.93±0.07 | 6.75±0.10 | 6.69±0.12 | 9.15±1.00 | 0.01±0.00 | 6.97±0.16 | 9.13±1.21 |
| TB-1M | 25.71±0.00 | N/A | N/A | 24.06±0.01 | 0.10±0.11 | 0.20±0.24 | 24.01±2.72 | 25.94±0.01 | 95.86±0.48 | 97.48±0.05 |
| SF-2M | 47.34±0.23 | N/A | N/A | 46.66±0.02 | 66.45±6.15 | 58.34±6.92 | 52.03±0.95 | 47.35±2.19 | 75.59±2.12 | 77.02±2.32 |
| CC-5M | 0.00±0.00 | N/A | N/A | N/A | N/A | N/A | N/A | 0.00±0.00 | 99.87±0.01 | 99.91±0.00 |
| CG-10M | 63.20±1.59 | N/A | N/A | N/A | N/A | N/A | N/A | 16.19±0.21 | 78.82±1.61 | 89.57±3.96 |
| Flower-20M | 64.19±2.56 | N/A | N/A | N/A | N/A | N/A | N/A | 26.61±0.86 | 86.86±2.05 | 92.47±2.45 |
| Avg. score | - | N/A | N/A | N/A | N/A | N/A | N/A | 25.57 | 69.77 | 74.57 |
| N-Avg. score | - | N/A | N/A | N/A | N/A | N/A | N/A | 33.94 | 91.71 | 99.98 |
| Avg. rank | - | 5.90 | 6.00 | 5.20 | 3.70 | 4.60 | 5.20 | 6.00 | 2.50 | 1.10 |
| Dataset | $k$-means | SC | ESCG | Nyström | LSC-K | LSC-R | FastESC | EulerSC | U-SPEC | U-SENC |
|---|---|---|---|---|---|---|---|---|---|---|
| PenDigits | 71.57±3.12 | 56.44±0.00 | 77.21±3.81 | 71.13±2.07 | 83.07±3.21 | 81.82±3.17 | 69.97±1.15 | 65.85±1.87 | 84.17±3.26 | 88.56±0.61 |
| USPS | 47.25±2.57 | 62.74±0.02 | 53.47±3.94 | 51.09±1.93 | 68.42±2.39 | 60.78±2.18 | 48.80±1.76 | 47.79±2.41 | 63.76±1.35 | 78.17±3.05 |
| Letters | 28.15±0.97 | 12.42±0.46 | 30.37±1.75 | 32.05±0.91 | 35.45±1.34 | 33.86±1.13 | 29.32±1.51 | 28.08±1.44 | 35.71±1.47 | 37.74±1.06 |
| MNIST | 58.48±2.67 | 74.46±0.00 | 63.32±4.64 | 59.72±1.75 | 79.45±1.02 | 69.24±2.75 | 55.93±2.41 | 24.06±1.53 | 74.31±2.28 | 80.58±1.75 |
| Covertype | 49.05±0.00 | N/A | N/A | 49.21±0.11 | 49.45±0.16 | 49.32±0.25 | 48.88±0.18 | 48.76±0.00 | 49.76±0.35 | 50.73±0.62 |
| TB-1M | 78.93±0.00 | N/A | N/A | 78.04±0.01 | 51.54±1.13 | 52.09±1.58 | 77.97±1.52 | 79.04±0.00 | 99.55±0.06 | 99.75±0.01 |
| SF-2M | 74.33±2.14 | N/A | N/A | 69.58±0.05 | 85.34±5.70 | 78.26±7.43 | 74.13±0.32 | 76.93±2.17 | 93.60±1.00 | 93.46±2.27 |
| CC-5M | 52.96±0.00 | N/A | N/A | N/A | N/A | N/A | N/A | 52.96±0.00 | 99.99±0.00 | 99.99±0.00 |
| CG-10M | 63.14±2.42 | N/A | N/A | N/A | N/A | N/A | N/A | 32.81±0.67 | 81.32±2.00 | 93.99±3.25 |
| Flower-20M | 60.85±3.33 | N/A | N/A | N/A | N/A | N/A | N/A | 33.75±0.56 | 88.89±2.85 | 93.79±3.21 |
| Avg. score | - | N/A | N/A | N/A | N/A | N/A | N/A | 49.00 | 77.11 | 81.68 |
| N-Avg. score | - | N/A | N/A | N/A | N/A | N/A | N/A | 62.12 | 94.26 | 99.99 |
| Avg. rank | - | 6.10 | 5.90 | 5.30 | 3.50 | 4.40 | 5.90 | 5.80 | 2.10 | 1.10 |
| Dataset | $k$-means | SC | ESCG | Nyström | LSC-K | LSC-R | FastESC | EulerSC | U-SPEC | U-SENC |
|---|---|---|---|---|---|---|---|---|---|---|
| PenDigits | 0.06 | 7.37 | 1.63 | 1.98 | 1.25 | 0.49 | 0.73 | 1.47 | 1.01 | 19.13 |
| USPS | 0.32 | 9.56 | 9.63 | 1.92 | 1.70 | 0.75 | 0.94 | 8.20 | 1.59 | 29.17 |
| Letters | 0.72 | 3.85 | 7.74 | 2.69 | 3.89 | 2.88 | 1.86 | 23.39 | 1.44 | 21.44 |
| MNIST | 8.79 | 1,231.68 | 1,211.54 | 6.40 | 16.51 | 6.38 | 3.82 | 125.35 | 7.48 | 131.60 |
| Covertype | 13.19 | N/A | N/A | 33.11 | 101.12 | 53.46 | 19.55 | 116.96 | 14.08 | 174.49 |
| TB-1M | 3.25 | N/A | N/A | 105.15 | 109.23 | 35.92 | 21.79 | 6.27 | 10.47 | 318.29 |
| SF-2M | 31.26 | N/A | N/A | 226.77 | 254.98 | 102.55 | 51.07 | 80.44 | 27.06 | 658.82 |
| CC-5M | 94.76 | N/A | N/A | N/A | N/A | N/A | N/A | 132.35 | 46.65 | 1,726.40 |
| CG-10M | 281.84 | N/A | N/A | N/A | N/A | N/A | N/A | 963.29 | 318.93 | 3,603.08 |
| Flower-20M | 579.06 | N/A | N/A | N/A | N/A | N/A | N/A | 3,397.57 | 764.09 | 7,225.83 |
| Dataset | U-SPEC | EAC | WCT | KCC | PTGP | ECC | SEC | LWGP | U-SENC |
|---|---|---|---|---|---|---|---|---|---|
| PenDigits | 80.30±2.18 | 76.31±2.70 | 77.69±2.54 | 58.92±3.47 | 75.58±2.26 | 57.64±4.14 | 47.07±7.53 | 77.54±1.87 | 85.34±0.91 |
| USPS | 63.47±0.97 | 59.02±1.69 | 58.40±2.15 | 49.24±2.98 | 59.63±1.76 | 48.89±1.80 | 39.00±3.83 | 57.55±1.78 | 73.89±1.82 |
| Letters | 42.53±1.32 | 37.19±0.50 | 36.59±0.95 | 33.64±1.03 | 38.09±0.66 | 34.59±0.68 | 31.81±2.01 | 37.09±0.75 | 45.90±0.58 |
| MNIST | 67.43±1.55 | 66.19±1.49 | 65.60±0.96 | 54.34±3.38 | 59.93±2.23 | 56.01±2.25 | 34.19±4.61 | 65.06±0.95 | 75.02±0.81 |
| Covertype | 6.97±0.16 | N/A | N/A | 5.86±1.84 | 6.42±0.44 | 5.70±0.77 | 5.26±2.82 | 7.44±0.31 | 9.13±1.21 |
| TB-1M | 95.86±0.48 | N/A | N/A | 23.36±1.62 | 34.20±2.51 | 26.91±2.13 | 10.62±4.64 | 96.80±1.90 | 97.48±0.05 |
| SF-2M | 75.59±2.12 | N/A | N/A | 42.72±7.11 | 45.17±2.66 | 41.61±6.01 | 27.05±7.73 | 69.88±4.45 | 77.02±2.32 |
| CC-5M | 99.87±0.01 | N/A | N/A | 33.36±12.65 | 0.41±0.86 | 31.62±14.99 | 17.05±6.90 | 98.18±7.75 | 99.91±0.00 |
| CG-10M | 78.82±1.61 | N/A | N/A | 64.78±5.08 | 63.75±0.61 | 62.79±4.91 | 49.70±6.08 | 78.08±2.43 | 89.57±3.96 |
| Flower-20M | 86.86±2.05 | N/A | N/A | 61.18±2.43 | 67.92±1.99 | 60.61±2.37 | 50.37±6.32 | 78.55±2.31 | 92.47±2.45 |
| Avg. score | - | N/A | N/A | 42.74 | 45.11 | 42.64 | 31.21 | 66.62 | 74.57 |
| N-Avg. score | - | N/A | N/A | 59.69 | 64.12 | 59.51 | 45.35 | 87.82 | 100.00 |
| Avg. rank | - | 5.40 | 5.60 | 4.90 | 3.60 | 5.40 | 6.70 | 2.80 | 1.00 |
| Dataset | U-SPEC | EAC | WCT | KCC | PTGP | ECC | SEC | LWGP | U-SENC |
|---|---|---|---|---|---|---|---|---|---|
| PenDigits | 84.17±3.26 | 81.04±4.02 | 82.97±3.17 | 63.33±4.06 | 78.33±2.91 | 62.36±4.12 | 51.60±5.93 | 81.96±2.77 | 88.56±0.61 |
| USPS | 63.76±1.35 | 63.39±2.76 | 62.72±3.14 | 53.46±3.51 | 62.68±1.92 | 53.67±2.21 | 45.38±3.20 | 59.73±3.30 | 78.17±3.05 |
| Letters | 35.71±1.47 | 30.28±0.58 | 30.17±1.01 | 26.90±1.23 | 31.50±0.89 | 27.53±0.72 | 26.12±1.93 | 30.76±0.84 | 37.74±1.06 |
| MNIST | 74.31±2.28 | 73.12±2.73 | 70.73±1.76 | 59.86±5.11 | 65.06±2.75 | 61.18±3.58 | 43.13±4.88 | 71.98±1.67 | 80.58±1.75 |
| Covertype | 49.76±0.35 | N/A | N/A | 49.54±0.58 | 49.11±0.30 | 49.68±0.40 | 49.86±0.94 | 49.50±0.28 | 50.73±0.62 |
| TB-1M | 99.55±0.06 | N/A | N/A | 70.05±1.21 | 82.94±1.08 | 72.50±1.48 | 60.12±3.64 | 99.65±0.31 | 99.75±0.01 |
| SF-2M | 93.60±1.00 | N/A | N/A | 67.12±5.41 | 73.46±1.76 | 66.90±6.15 | 55.91±5.71 | 88.71±3.28 | 93.46±2.27 |
| CC-5M | 99.99±0.00 | N/A | N/A | 66.76±6.24 | 52.96±0.00 | 62.71±5.38 | 61.91±5.49 | 99.30±3.07 | 99.99±0.00 |
| CG-10M | 81.32±2.00 | N/A | N/A | 66.96±5.60 | 63.36±1.26 | 64.74±6.80 | 58.19±4.69 | 81.95±3.93 | 93.99±3.25 |
| Flower-20M | 88.89±2.85 | N/A | N/A | 57.78±3.37 | 63.83±2.34 | 56.69±2.35 | 50.70±5.02 | 81.37±2.69 | 93.79±3.21 |
| Avg. score | - | N/A | N/A | 58.18 | 62.32 | 57.80 | 50.29 | 74.49 | 81.68 |
| N-Avg. score | - | N/A | N/A | 72.48 | 77.98 | 72.22 | 63.53 | 90.54 | 100.00 |
| Avg. rank | - | 5.40 | 5.60 | 5.00 | 4.20 | 5.00 | 6.30 | 2.90 | 1.00 |
| Dataset | U-SPEC | EAC | WCT | KCC | PTGP | ECC | SEC | LWGP | U-SENC |
|---|---|---|---|---|---|---|---|---|---|
| PenDigits | 1.01 | 8.89 | 47.01 | 8.97 | 11.94 | 13.56 | 5.27 | 5.46 | 19.13 |
| USPS | 1.59 | 13.11 | 48.45 | 15.87 | 59.71 | 23.53 | 10.15 | 10.25 | 29.17 |
| Letters | 1.44 | 29.60 | 177.11 | 33.91 | 137.46 | 53.04 | 16.06 | 15.58 | 21.44 |
| MNIST | 7.48 | 576.71 | 3,435.19 | 315.58 | 2,205.18 | 417.10 | 260.96 | 259.91 | 131.60 |
| Covertype | 14.08 | N/A | N/A | 954.89 | 7,919.02 | 1,482.43 | 712.84 | 685.89 | 174.49 |
| TB-1M | 10.47 | N/A | N/A | 1,308.54 | 1,276.82 | 2,100.02 | 1,000.30 | 989.10 | 318.29 |
| SF-2M | 27.06 | N/A | N/A | 2,908.34 | 2,493.99 | 4,714.16 | 2,160.46 | 2,105.82 | 658.82 |
| CC-5M | 46.65 | N/A | N/A | 6,833.38 | 5,027.91 | 11,202.43 | 5,130.84 | 5,070.21 | 1,726.40 |
| CG-10M | 318.93 | N/A | N/A | 17,344.29 | 11,578.11 | 27,492.40 | 10,938.88 | 10,700.38 | 3,603.08 |
| Flower-20M | 764.09 | N/A | N/A | 34,869.83 | 21,198.87 | 54,913.10 | 21,696.29 | 21,378.63 | 7,225.83 |
[Seven figure tables: each reported NMI, CA, and time cost on MNIST, Covertype, TB-1M, and SF-2M; the plot images are not available.]
Ultra-Scalable Spectral Clustering and Ensemble Clustering
Dong Huang, Chang-Dong Wang, Jian-Sheng Wu, Jian-Huang Lai, and Chee-Keong Kwoh
D. Huang is with the College of Mathematics and Informatics, South China Agricultural University, Guangzhou, China, and also with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected].
C.-D. Wang and J.-H. Lai are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China, and also with Guangdong Key Laboratory of Information Security Technology, Guangzhou, China, and also with Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China. E-mail: [email protected], [email protected].
J.-S. Wu is with the School of Information Engineering, Nanchang University, Nanchang, China. E-mail: [email protected].
C.-K. Kwoh is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected].
Abstract
This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for $K$-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SPEC clusterers, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning ten-million-level nonlinearly-separable datasets on a PC with 64GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669.
Index Terms:
Data clustering, Large-scale clustering, Spectral clustering, Ensemble clustering, Large-scale datasets, Nonlinearly separable datasets.
1 Introduction
Data clustering is a fundamental problem in the field of data mining and machine learning [1], whose purpose is to partition a set of objects into a certain number of homogeneous groups, each referred to as a cluster. Out of the large number of clustering algorithms that have been developed, spectral clustering in recent years has been gaining increasing attention due to its promising ability in dealing with nonlinearly separable datasets [2, 3, 4, 5]. However, a critical limitation to conventional spectral clustering lies in its huge time and space complexity, which significantly restricts its application to large-scale problems.
Conventional spectral clustering typically consists of two time- and memory-consuming phases, namely, affinity matrix construction and eigen-decomposition. It generally takes $O(N^2 d)$ time and $O(N^2)$ memory to construct the affinity matrix, and takes $O(N^3)$ time and $O(N^2)$ memory to solve the eigen-decomposition problem [2], where $N$ is the data size and $d$ is the dimension. As the data size $N$ increases, the computational burden of spectral clustering grows dramatically. For example, given a dataset with one million objects, the $N\times N$ affinity matrix alone will consume 7450.58 GB of memory (with each entry in the matrix stored as a double-precision value), which far exceeds the memory capacity of a general-purpose machine, not to mention the next phase of eigen-decomposition.
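The memory figure quoted above can be checked with a few lines of arithmetic. The sketch below assumes 8 bytes per double-precision entry and reads "GB" as $2^{30}$ bytes; the helper name `dense_matrix_gib` is ours, not the paper's.

```python
def dense_matrix_gib(rows: int, cols: int, bytes_per_entry: int = 8) -> float:
    """Memory needed to store a dense rows x cols matrix, in units of 2**30 bytes."""
    return rows * cols * bytes_per_entry / 2**30


n_objects = 1_000_000
# Full N x N affinity matrix for one million objects.
print(f"{dense_matrix_gib(n_objects, n_objects):.2f} GB")  # prints "7450.58 GB"
```

The same helper applies to any dense sub-matrix by passing different row and column counts.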
To alleviate the huge computational burden of spectral clustering, a commonly used strategy is to sparsify the affinity matrix and solve the eigen-decomposition problem by some sparse eigen-solvers [2]. The matrix sparsification strategy can reduce the memory cost of storing the affinity matrix and facilitate the eigen-decomposition, but it still requires the computation of all entries in the original affinity matrix. Besides matrix sparsification, another widely-studied strategy is based on sub-matrix construction [3, 4]. The Nyström method [3] randomly selects $p$ representatives from the original dataset and builds an $N\times p$ affinity sub-matrix. Cai and Chen [4] extended the Nyström method and proposed the landmark based spectral clustering (LSC) method, which performs $k$-means on the dataset to get $p$ cluster centers as the $p$ representatives. However, these sub-matrix based spectral clustering methods [3, 4] are typically restricted by an $O(Np)$ complexity bottleneck, which has been a critical hurdle for them to deal with extremely large-scale datasets where a larger $p$ is often desired for achieving better approximation [4]. Moreover, the clustering results of these methods heavily rely on their one-shot approximation via the sub-matrix, which makes their clustering robustness unstable. Despite the considerable efforts that have been made in recent years [2, 3, 4, 5], it remains a highly challenging problem to enable spectral clustering to efficiently and robustly cluster extremely large-scale datasets (which may even be nonlinearly separable) with rather limited computing resources.
In light of this, this paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets. Specifically, this paper proposes two novel large-scale algorithms, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a new hybrid representative selection strategy is presented to efficiently find a set of $p$ representatives, which reduces the time complexity of $k$-means based selection from $O(Npdt)$ to $O(p^2 dt)$, where $t$ is the number of $k$-means iterations. Then, a fast approximation method for $K$-nearest representatives is designed to efficiently build a sparse sub-matrix with $O(Np^{1/2}d)$ time and $O(Np^{1/2})$ memory. With the sparse sub-matrix serving as the cross-affinity matrix, a bipartite graph is constructed between the dataset and the representative set. By taking advantage of the bipartite graph structure, the transfer cut [6] is utilized to solve the eigen-decomposition problem with $O(NK(K+k)+p^3)$ time, where $k$ is the number of clusters and $K$ is the number of nearest representatives. Finally, the $k$-means discretization is adopted to construct the clustering result from a set of $k$ eigenvectors, which takes $O(Nk^2 t)$ time. As it generally holds that $k, K \ll p \ll N$, the time and space complexity of our U-SPEC algorithm are respectively dominated by $O(Np^{1/2}d)$ and $O(Np^{1/2})$. Further, to go beyond the one-shot approximation of U-SPEC and provide better clustering robustness, the U-SENC algorithm is proposed by integrating multiple U-SPEC clusterers into a unified ensemble clustering framework, whose time and space complexity are respectively dominated by $O(Nmp^{1/2}d)$ and $O(Np^{1/2})$, where $m$ is the number of base clusterings. Extensive experiments have been conducted on ten large-scale datasets (including five synthetic datasets and five real datasets), which have shown the superiority of our U-SPEC and U-SENC algorithms over the state-of-the-art in terms of both clustering robustness and scalability.
To summarize, the main contributions of this paper are listed as follows:
1. A hybrid representative selection strategy is proposed to strike a balance between the efficiency of random selection and the effectiveness of $k$-means based selection.
2. A fast approximation method for $K$-nearest representatives is designed, which is time- and memory-efficient for constructing the sparse affinity sub-matrix between objects and representatives.
3. A large-scale spectral clustering algorithm termed U-SPEC is developed based on efficient affinity sub-matrix construction and bipartite graph formulation. Its time and space complexity are dominated by $O(Np^{1/2}d)$ and $O(Np^{1/2})$ respectively.
4. By integrating multiple U-SPEC clusterers, a new large-scale ensemble clustering algorithm termed U-SENC is developed, which significantly enhances the robustness of U-SPEC while maintaining high scalability. Its time and space complexity are dominated by $O(Nmp^{1/2}d)$ and $O(Np^{1/2})$ respectively.
The notations that are used throughout the paper are summarized in Table I. The rest of the paper is organized as follows. The related work on large-scale spectral clustering and ensemble clustering is reviewed in Section 2. The proposed U-SPEC and U-SENC algorithms are described in Section 3. The experimental results are reported in Section 4. Finally, the paper is concluded in Section 5.
2 Related Work
In this section, we review the literature related to spectral clustering and ensemble clustering, with special emphasis on their recent large-scale extensions.
2.1 Spectral Clustering
Given a dataset of $N$ objects, conventional spectral clustering [2] first computes an $N\times N$ affinity matrix, in which each entry corresponds to the similarity of two objects according to some similarity metric. Then, the eigen-decomposition is performed on the graph Laplacian of the affinity matrix to obtain the $k$ eigenvectors associated with the first $k$ eigenvalues. By embedding the dataset into the low-dimensional space via the obtained $k$ eigenvectors, the final clustering can be achieved via $k$-means or some other discretization techniques [2].
Although spectral clustering has shown promising advantages in finding clusters of arbitrary shapes from complex data, its $O(N^3)$ time complexity and $O(N^2)$ space complexity significantly restrict its application in large-scale tasks. To alleviate the huge computational cost, some researchers sparsified the affinity matrix by considering $K$-nearest neighbors or $\epsilon$-neighbors, and then solved the eigen-decomposition problem by some sparse eigen-solvers [2], which, however, still requires the computation of all the entries in the original affinity matrix.
To avoid the computation of the full affinity matrix, the sub-matrix based approximation has emerged as a powerful and efficient tool for spectral clustering [3, 4, 5]. The Nyström approximation [3] randomly selects $p$ representatives from the dataset and builds an $N\times p$ affinity sub-matrix between the $N$ objects and the $p$ representatives. The sub-matrix construction takes $O(Npd)$ time and $O(Np)$ memory, which are much lower than the full affinity matrix construction. Although the random representative selection is very efficient, it is often unstable with regard to the quality of the selected representatives (see Fig. 1). Moreover, while it has been shown that a larger $p$ is often favorable for better approximation [3], the $O(Np)$ memory cost of the sub-matrix construction can still be a critical bottleneck when dealing with very large datasets. To address the potential instability of random selection, Cai and Chen [4] proposed the LSC algorithm, which first partitions the dataset into $p$ clusters via $k$-means and then uses the $p$ cluster centers as the representatives. With the $N\times p$ sub-matrix constructed, they further sparsified it by preserving the $K$-nearest representatives for each row and zeroing out the others [4]. Despite its progress over the previous methods, there are still three computational bottlenecks in the LSC algorithm [4]. First, although the $k$-means based selection often provides a better set of representatives, it comes with a time complexity of $O(Npdt)$, where $t$ is the number of $k$-means iterations. Second, the calculation of all possible entries in the sub-matrix is still required before the sparsification, which comes with a time complexity of $O(Npd)$. Third, the computation of the $K$-nearest representatives for all objects comes with a time complexity of $O(NpK)$. More recently, instead of using representatives, He et al. [5] used Fourier features to represent data objects in kernel space, and built a sub-matrix between the objects and the selected Fourier features, upon which efficient eigen-decomposition can be performed.
The time and space complexity of the fast explicit spectral clustering (FastESC) algorithm in [5] are respectively and , which are still restricted by the complexity bottleneck. By incorporating a newly-designed positive Euler kernel, Wu et al. [7] proposed the Euler spectral clustering (EulerSC) method and proved that the EulerSC is equivalent to the weighted positive Euler k-means, which can be iteratively optimized with time. However, EulerSC can only use the positive Euler kernel to define the pair-wise similarity, and is not feasible for the general spectral clustering formulation with other similarity metrics. Moreover, its clustering robustness heavily relies on the proper selection of the Euler kernel parameter, which is difficult to find without prior knowledge.
2.2 Ensemble Clustering
Ensemble clustering has been a popular technique in recent years, which aims to combine multiple base clusterings into a better and more robust consensus clustering [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. The existing ensemble clustering algorithms can be mainly classified into three categories.
The first category is the pair-wise co-occurrence based methods [8, 9, 21]. Fred and Jain [8] proposed the evidence accumulation clustering (EAC) method, which makes use of the co-association matrix by considering the frequency of pair-wise co-occurrence among multiple base clusterings. With the co-association matrix treated as the similarity matrix, the agglomerative clustering algorithms [1] were then performed to obtain the consensus clustering. Iam-On et al. [9] presented the weighted connected triple (WCT) method, which extends the EAC method by refining the co-association matrix via the common neighborhood information between clusters.
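As an illustration of the pair-wise co-occurrence idea, the following minimal sketch builds a co-association matrix from a toy ensemble. It only shows how the co-occurrence frequencies are counted; the function name `co_association` and the toy labels are assumptions for illustration, and the agglomerative step that EAC applies afterwards is not included.

```python
def co_association(base_clusterings):
    """Return an N x N matrix whose (i, j) entry is the fraction of base
    clusterings that place objects i and j in the same cluster."""
    n_clusterings = len(base_clusterings)
    n_objects = len(base_clusterings[0])
    counts = [[0] * n_objects for _ in range(n_objects)]
    for labels in base_clusterings:
        for i in range(n_objects):
            for j in range(n_objects):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_clusterings for c in row] for row in counts]


# Three hypothetical base clusterings of four objects (one label list each).
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
ca = co_association(ensemble)
print(ca[0][1], ca[0][3])  # objects 0 and 1 always co-occur; 0 and 3 never do
```

Note that the matrix has $N^2$ entries, which is one reason methods in this category are hard to scale to very large datasets.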
The second category is the graph partitioning based methods [18, 22, 11, 12]. Strehl and Ghosh [18] transformed the multiple base clusterings into a hypergraph representation, based on which three graph partitioning based ensemble clustering methods were presented. Fern and Brodley [22] built a bipartite graph structure by treating both base clusters and data objects as graph nodes, and then partitioned the graph via the METIS algorithm [23].
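The bipartite formulation can be sketched as an incidence matrix with one row per object and one column per base cluster. This is a minimal sketch under the assumption of binary edge weights (1 if the object belongs to the cluster, 0 otherwise); the function name `object_cluster_incidence` and the toy ensemble are hypothetical, and the graph partitioning step itself is omitted.

```python
def object_cluster_incidence(base_clusterings):
    """Build the object-by-cluster incidence matrix of an ensemble.

    Columns enumerate every cluster of every base clustering as a
    (clustering index, cluster label) pair.
    """
    n_objects = len(base_clusterings[0])
    columns = []
    for index, labels in enumerate(base_clusterings):
        for label in sorted(set(labels)):
            columns.append((index, label))
    incidence = [
        [1 if base_clusterings[index][i] == label else 0 for index, label in columns]
        for i in range(n_objects)
    ]
    return columns, incidence


# Three hypothetical base clusterings of four objects.
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
columns, incidence = object_cluster_incidence(ensemble)
for row in incidence:
    print(row)
```

Each row sums to the number of base clusterings, since every object falls into exactly one cluster per base clustering, so the matrix stays sparse as the ensemble grows.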
The third category is the median partition based methods [24, 17], which cast the ensemble clustering problem into an optimization problem that aims to find a median clustering (or partition) by maximizing the similarity between this clustering and the multiple base clusterings. Franek and Jiang [24] formulated the median partition problem as a Euclidean median problem and solved it by the Weiszfeld algorithm [25]. Huang et al. [17] cast the median partition problem into a binary linear programming problem and solved it by the factor graph model.
These ensemble clustering algorithms have shown their advantages in improving clustering accuracy and robustness. However, due to the efficiency bottleneck, most of them are not suitable for very large-scale applications. Recently some efforts have been made to (partially) address the scalability problem for ensemble clustering. To reduce the problem size, Huang et al. [11] exploited the microcluster representation, which maps the data objects onto a smaller set of microclusters. Then, the set of microclusters is treated as the primitive objects, based on which two novel algorithms, i.e., the probability trajectory accumulation (PTA) and the probability trajectory based graph partitioning (PTGP), are proposed. Wu et al. [10] transformed the ensemble clustering problem into a $k$-means based consensus clustering (KCC) framework, which significantly facilitated the computation of the consensus function. Liu et al. [15] proved that the spectral clustering of the co-association matrix is equivalent to an instance of weighted $k$-means clustering, and presented the spectral ensemble clustering (SEC) algorithm. While there are two phases in ensemble clustering (i.e., ensemble generation and consensus function), these algorithms [11, 10, 15] generally focus on the efficiency of the consensus function. In ensemble generation, they mostly exploited $k$-means to produce base clusterings [11, 10, 15]. Note that the time complexity of ensemble generation by $k$-means is $O(Nmkdt)$ for $m$ base clusterings, which can still be computationally expensive when dealing with very large-scale datasets. Moreover, the performance of $k$-means may significantly deteriorate when handling nonlinearly separable datasets, which has a critical influence on the robustness of the ensemble clustering algorithms.
Unlike the common practice that typically exploits multiple $k$-means clusterers as base clusterers, the proposed U-SENC algorithm integrates a diverse set of large-scale U-SPEC clusterers into a highly efficient ensemble clustering framework, which for the first time, to our knowledge, simultaneously tackles the scalability and nonlinear separability issues in both the ensemble generation and consensus function phases in ensemble clustering.
3 Proposed Framework
In this section, we describe the proposed U-SPEC and U-SENC algorithms in Sections 3.1 and 3.2, respectively.
3.1 Ultra-Scalable Spectral Clustering (U-SPEC)
To deal with extremely large-scale datasets, the proposed U-SPEC algorithm complies with the sub-matrix based formulation [3, 4] and aims to break through the efficiency bottleneck of previous algorithms via three phases. Specifically, in the first phase, we present a hybrid representative selection strategy to strike a balance between the efficiency of the random selection and the effectiveness of the $k$-means based selection. In the second phase, we develop a coarse-to-fine method to efficiently approximate the $K$-nearest representatives for each data object, and construct a sparse affinity sub-matrix between the objects and the representatives. In the third phase, the sub-matrix is interpreted as a bipartite graph, which can be efficiently partitioned to obtain the final clustering result. These three phases of U-SPEC will be described in Sections 3.1.1, 3.1.2, and 3.1.3, respectively.
3.1.1 Hybrid Representative Selection
Let $\mathcal{X} = \{x_1, \dots, x_N\}$ denote a dataset with $N$ objects, where $x_i \in \mathbb{R}^d$ is the $i$-th object and $d$ is the dimension. To capture the relationship between all objects in $\mathcal{X}$, an $N\times N$ affinity matrix can be constructed in conventional spectral clustering [2], which consumes $O(N^2 d)$ time and $O(N^2)$ memory and is not feasible for large-scale datasets. To avoid the computation of the full affinity matrix, the sub-matrix representation is often adopted in the literature of large-scale spectral clustering [3, 4]. The sub-matrix representation generally exploits a set of $p$ representatives to encode the overall structure of the dataset. These representatives play a crucial role in the sub-matrix representation, and can be selected by random selection [3] or $k$-means based selection [4]. Though the random selection strategy [3] is highly efficient, it suffers from the inherent randomness and may lead to a set of low-quality representatives (see Fig. 1). To deal with the instability of random selection, the $k$-means based selection [4] first groups the entire dataset into $p$ clusters via $k$-means and then uses the $p$ cluster centers as the representatives. However, the $k$-means based selection brings in an extra time cost of $O(Npdt)$, where $t$ is the number of $k$-means iterations, which restricts its feasibility for very large-scale datasets.
In this paper, we propose a hybrid representative selection strategy, which is designed to find a balance between the efficiency of random selection and the effectiveness of $k$-means based selection. The process of the hybrid representative selection strategy is illustrated in Fig. 2. Different from the $k$-means based selection, which attempts to cluster the entire dataset even when the data size is extremely large, the proposed hybrid selection strategy first randomly samples a set of $p'$ candidate representatives such that $p < p' < N$. Then, upon the $p'$ candidates, we perform the $k$-means method to obtain $p$ clusters and exploit the $p$ cluster centers as the set of representatives. Empirically, the number of candidates is suggested to be several times larger than $p$, e.g., $p' = 10p$, so as to provide enough candidates while still keeping $p'$ much smaller than $N$ in large-scale datasets. Formally, we denote the set of selected representatives as
$$\mathcal{R} = \{r_1, r_2, \dots, r_p\}, \tag{1}$$
where $r_i$ is the $i$-th representative in $\mathcal{R}$.
By introducing an intermediate stage of random pre-sampling, the computational complexity of the $k$-means based selection is reduced from $O(Npdt)$ to $O(p'pdt) = O(p^2dt)$, since $k$-means now runs on only $p' = 10p$ candidates. As illustrated in Fig. 1, the set of representatives produced by the hybrid selection can better reflect the data distribution than the random selection while requiring much less computational cost than the $k$-means based selection. A quantitative evaluation of the proposed hybrid selection strategy against random selection and $k$-means based selection is provided in Section 4.6.
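The hybrid selection described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`hybrid_select`, `kmeans`, `sq_dist`), the fixed iteration count, and the choice of initial centers are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, n_clusters, n_iter=10, rng=None):
    """Plain Lloyd's k-means; returns the list of cluster centers."""
    rng = rng or random.Random(0)
    centers = [list(c) for c in rng.sample(points, n_clusters)]
    for _ in range(n_iter):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in range(n_clusters)]
        for x in points:
            j = min(range(n_clusters), key=lambda c: sq_dist(x, centers[c]))
            groups[j].append(x)
        # Update step: move each center to the mean of its group.
        for j, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                dim = len(members[0])
                centers[j] = [sum(m[t] for m in members) / len(members)
                              for t in range(dim)]
    return centers

def hybrid_select(data, p, candidate_factor=10, n_iter=10, seed=0):
    """Randomly pre-sample p' = candidate_factor * p candidates,
    then run k-means on the candidates only and return the p centers."""
    rng = random.Random(seed)
    p_prime = min(len(data), candidate_factor * p)
    candidates = rng.sample(data, p_prime)
    return kmeans(candidates, p, n_iter=n_iter, rng=rng)
```

Because $k$-means only ever sees the $p'$ candidates, the loop cost is independent of $N$; the only step that touches the full dataset is the random sampling.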
3.1.2 Approximation of $K$-Nearest Representatives
With the representatives obtained, the next objective is to encode the pair-wise relationship of the entire dataset via the small set of representatives.
In the sub-matrix formulation of the Nyström algorithm [3], the construction of the $N \times p$ affinity sub-matrix between $N$ objects and $p$ representatives takes $O(Npd)$ time and $O(Np)$ memory, which is the main efficiency bottleneck of the overall algorithm [3]. Given a dataset with ten million objects and a set of one thousand representatives, the storage of the $N \times p$ sub-matrix alone takes 74.51GB of memory, while the later manipulations of the sub-matrix require even more memory. Cai and Chen [4] proposed to sparsify the affinity matrix by $K$-nearest representatives (with $K \ll p$), which, however, still requires the computation of all the distances between the $N$ objects and the $p$ representatives. Moreover, besides the calculation of the total of $N \times p$ entries, the sparsification step also consumes $O(NpK)$ time [4].
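The 74.51GB figure can be checked with a short calculation. The sketch below assumes 8-byte double-precision entries and reads "GB" as $2^{30}$ bytes; both are assumptions, since the text does not state them.

```python
def dense_submatrix_gib(n_objects, n_reps, bytes_per_entry=8):
    """Memory of a dense n_objects x n_reps matrix in GiB (2**30 bytes)."""
    return n_objects * n_reps * bytes_per_entry / 2**30

# Ten million objects, one thousand representatives.
print(round(dense_submatrix_gib(10_000_000, 1_000), 2))  # 74.51
```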
Before introducing our acceleration strategy, we first investigate the characteristics of the sparse $N \times p$ sub-matrix between $N$ objects and $p$ representatives, where each object is only connected to its $K$-nearest representatives. There are $K$ non-zero entries in each row of the matrix, and $NK$ non-zero entries in the entire matrix. Assuming $p = 1000$ and $K = 5$, the proportion of non-zero entries in the matrix is $K/p = 0.5\%$. However, to exactly identify such a small proportion of useful entries via $K$-nearest representatives, the entire matrix must first be calculated, and $99.5\%$ of its entries are only intermediate results. To break the efficiency bottleneck, the key problem here is how to significantly reduce the calculation of these intermediate entries when building the sub-matrix with $K$-nearest representatives.
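As a quick sanity check of the sparsity argument, the following sketch counts the retained and discarded entries. The values $N = 10^7$, $p = 1000$, and $K = 5$ are illustrative assumptions, and `knr_sparsity` is a hypothetical helper name.

```python
def knr_sparsity(n_objects, n_reps, k_nearest):
    """Return (non-zero entries, total entries, non-zero fraction)
    for an n_objects x n_reps matrix with k_nearest entries per row."""
    nonzeros = n_objects * k_nearest
    total = n_objects * n_reps
    return nonzeros, total, nonzeros / total

nonzeros, total, fraction = knr_sparsity(10_000_000, 1000, 5)
wasted = total - nonzeros  # entries computed only to be discarded
```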
In this section, our aim is to alleviate the computational cost of the exact $K$-nearest representative calculation [4] by designing a time- and memory-efficient approximation method. Though the $K$-nearest representative approximation problem and the classical $K$-nearest neighbor ($K$-NN) approximation problem [26, 27, 28] have some characteristics in common, they face very different computational issues in practice. Different from the conventional $K$-NN approximation scenarios, which mostly deal with a general graph with an $N \times N$ affinity matrix, our aim here is to find the $K$-nearest representatives in a heavily imbalanced bipartite graph with an $N \times p$ affinity sub-matrix, where $p$ is generally far smaller than $N$. This imbalanced nature is crucial to our $K$-nearest representative approximation problem. On the one hand, it makes the conventional $K$-NN approximation methods [26, 27, 28] (which are typically designed for general graphs with $N \times N$ affinity matrices) inappropriate here. On the other hand, it can be exploited in the design of our $K$-nearest representative approximation strategy. To take advantage of the imbalanced structure, it is intuitive to pre-process the graph on the side of the $p$ representatives and minimize the computation on the side of the $N$ objects.
In particular, we present a new $K$-nearest representative approximation method based on the coarse-to-fine mechanism, and build the sparse affinity sub-matrix with $O(Np^{1/2}d)$ complexity. The main idea of our $K$-nearest representative approximation is to first find the nearest region, then find the nearest representative (denoted as $r_l$) in the nearest region, and finally find the $K$-nearest representatives in the neighborhood of $r_l$. To efficiently implement the approximation, two preprocessing steps are required:
- Pre-step 1. The set of representatives is grouped into $z_1$ rep-clusters via $k$-means (with $z_1 < p$). The time complexity is $O(pz_1dt)$.
- Pre-step 2. For each representative in $\mathcal{R}$, its $K'$-nearest neighbors are computed and stored (with $K' > K$). Computing the pairwise distances between the representatives takes $O(p^2d)$ time.
In pre-step 1, each rep-cluster consists of a certain number of representatives, and can be regarded as a local region of the representative set (see Fig. 3). Formally, the obtained rep-clusters are denoted as
$$\mathcal{RC} = \{rc_1, rc_2, \dots, rc_{z_1}\}, \tag{2}$$
where $rc_i$ is the $i$-th rep-cluster in $\mathcal{RC}$. Given an object $x_i$ and a rep-cluster $rc_j$, their distance is defined as the distance between $x_i$ and the center of $rc_j$. That is
$$Dist(x_i, rc_j) = \|x_i - y_j\|, \tag{3}$$
$$y_j = \frac{1}{|rc_j|} \sum_{r_l \in rc_j} r_l, \tag{4}$$
where $|rc_j|$ denotes the number of representatives in the rep-cluster $rc_j$ and $\|u - v\|$ computes the Euclidean distance between two vectors $u$ and $v$.
With the distance between objects and rep-clusters defined, for each object $x_i$, we approximately find its $K$-nearest representatives according to three main steps:
Step 1
Find the nearest rep-cluster of $x_i$, denoted as $rc_j$.
Step 2
Find the nearest representative of $x_i$ inside the rep-cluster $rc_j$, denoted as $r_l$.
Step 3
Out of $r_l$ and its $K'$-nearest neighbors, find the $K$-nearest representatives of $x_i$.
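The two pre-steps and the three lookup steps can be sketched as follows. This is an illustrative, unoptimized version in plain Python; the function names (`preprocess`, `approx_knr`), the brute-force neighbor search in pre-step 2, and the handling of empty rep-clusters are assumptions of this sketch, not details taken from the paper.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def preprocess(reps, k_prime, n_iter=10, seed=0):
    """Pre-steps 1 and 2: rep-clusters via k-means, then the
    K'-nearest neighbors of every representative."""
    rng = random.Random(seed)
    p, dim = len(reps), len(reps[0])
    z1 = max(1, math.isqrt(p))  # floor(sqrt(p)) rep-clusters
    centers = [list(r) for r in rng.sample(reps, z1)]
    members = []
    for _ in range(n_iter):
        members = [[] for _ in range(z1)]
        for idx, r in enumerate(reps):
            j = min(range(z1), key=lambda c: dist(r, centers[c]))
            members[j].append(idx)
        for j, idxs in enumerate(members):
            if idxs:
                centers[j] = [sum(reps[i][t] for i in idxs) / len(idxs)
                              for t in range(dim)]
    # Drop rep-clusters that ended up empty so step 2 always has candidates.
    kept = [j for j in range(z1) if members[j]]
    centers = [centers[j] for j in kept]
    members = [members[j] for j in kept]
    # Pre-step 2: brute-force K'-nearest neighbors among representatives.
    neighbors = []
    for i in range(p):
        others = sorted((j for j in range(p) if j != i),
                        key=lambda j: dist(reps[i], reps[j]))
        neighbors.append(others[:k_prime])
    return centers, members, neighbors

def approx_knr(x, reps, centers, members, neighbors, k):
    """Steps 1 to 3 for one object x; returns indices of the
    approximate K-nearest representatives, nearest first."""
    # Step 1: nearest rep-cluster (distance to the cluster center).
    j = min(range(len(centers)), key=lambda c: dist(x, centers[c]))
    # Step 2: nearest representative inside that rep-cluster.
    l = min(members[j], key=lambda i: dist(x, reps[i]))
    # Step 3: search only r_l and its K' stored neighbors.
    pool = [l] + neighbors[l]
    pool.sort(key=lambda i: dist(x, reps[i]))
    return pool[:k]
```

Each lookup touches roughly $z_1$ centers, the members of one rep-cluster, and $K' + 1$ candidates, instead of all $p$ representatives. The result is approximate: a true nearest representative that lies in a neighboring rep-cluster can be missed in step 2, which is why step 3 widens the search to the stored neighborhood.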
More details are illustrated in Fig. 3. For a dataset with $N$ objects, the time cost of step 1 is $O(Nz_1d)$. The time cost of step 2 is $O(Nz_2d)$, where $z_2 = p/z_1$ denotes the average size of the rep-clusters. The time cost of step 3 is $O(NK'd)$. Since $z_1 z_2 = p$, the sum $z_1 + z_2$ reaches its minimum when $z_1 = z_2 = p^{1/2}$. Thus, to minimize the cost, $z_1 = \lfloor p^{1/2} \rfloor$ is used in this work, where $\lfloor \cdot \rfloor$ denotes the floor of a value. The candidate neighborhood size $K'$ is suggested to be several times larger than $K$, and can be set to $K' = 10K$ in practice. Then, the total time complexity of the $K$-nearest representative approximation is $O(N(z_1 + z_2 + K')d)$, which can be re-written as $O(N(p^{1/2} + K)d)$. As $K \ll p$, the dominant term in the complexity is $O(Np^{1/2}d)$.
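The choice of the number of rep-clusters can be checked numerically. By the AM-GM inequality, $z_1 + p/z_1 \ge 2p^{1/2}$, with equality at $z_1 = p^{1/2}$. The sketch below compares the per-object cost at $z_1 = \lfloor p^{1/2} \rfloor$ with this lower bound; $p = 1000$ is an assumed value used only for illustration, and `lookup_cost` is a hypothetical helper.

```python
import math

def lookup_cost(p, z1):
    """Distance computations per object in steps 1 and 2:
    z1 rep-cluster centers plus p / z1 representatives on average."""
    return z1 + p / z1

p = 1000
z1 = math.isqrt(p)                   # floor of the square root
cost_at_floor = lookup_cost(p, z1)   # cost with the rounded-down choice
lower_bound = 2 * math.sqrt(p)       # continuous minimum from AM-GM
exhaustive = p                       # exact search compares against all p
```

The rounded choice stays within a small fraction of the continuous minimum, while an exact search would need all $p$ distance computations per object.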
With the $K$-nearest representatives of each object obtained, a sparse affinity sub-matrix can thereby be constructed. In this paper, the Gaussian kernel is used as the similarity kernel. Thus the sparse affinity sub-matrix can be represented as
$$B = \{b_{ij}\}_{N \times p}, \tag{5}$$
$$b_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - r_j\|^2}{2\sigma^2}\right), & \text{if } r_j \in N_K(x_i), \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$
where $N_K(x_i)$ denotes the set of $K$-nearest representatives of $x_i$ and the kernel parameter $\sigma$ is set to the average Euclidean distance between the objects and their $K$-nearest representatives. Note that $B$ is a sparse matrix which only contains $NK$ non-zero entries.
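A minimal sketch of the sparse affinity construction is given below. It assumes the Gaussian kernel takes the common form $\exp(-\mathrm{dist}^2 / (2\sigma^2))$ and stores each row as a dictionary; both the kernel normalization and the data layout (`knr_dists`, `build_sparse_affinity`) are assumptions of this example.

```python
import math

def build_sparse_affinity(knr_dists):
    """knr_dists[i] maps a representative index to the Euclidean distance
    from object i, for the K nearest representatives of object i only.
    Returns (rows, sigma), where rows[i] maps representative index to
    affinity and sigma is the average of all stored distances."""
    all_d = [d for row in knr_dists for d in row.values()]
    sigma = sum(all_d) / len(all_d)  # assumes not all distances are zero
    rows = [{j: math.exp(-d * d / (2 * sigma * sigma))
             for j, d in row.items()}
            for row in knr_dists]
    return rows, sigma

# Two objects, K = 2 (hypothetical distances).
rows, sigma = build_sparse_affinity([{0: 1.0, 2: 3.0}, {1: 2.0, 2: 2.0}])
```

Only the $NK$ stored entries are ever evaluated, so both time and memory stay linear in $N$.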
3.1.3 Bipartite Graph Partitioning
The affinity sub-matrix $B$ reflects the relationship between the objects in $\mathcal{X}$ and the representatives in $\mathcal{R}$, which can be naturally interpreted as a bipartite graph $G = \{\mathcal{X}, \mathcal{R}, B\}$, where $\mathcal{X} \cup \mathcal{R}$ is the node set and $B$ is the cross-affinity matrix (as shown in Fig. 4). By taking advantage of the bipartite graph structure, the transfer cut [6] can be used to efficiently partition the graph and achieve the final clustering result.
To start, if we view the graph $G$ as a general graph with $N + p$ nodes, then its full affinity matrix $E \in \mathbb{R}^{(N+p) \times (N+p)}$ can be denoted as
$$E = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}. \tag{7}$$
Spectral clustering seeks to partition the graph by solving the following generalized eigen-problem [29]:
$$Lu = \gamma Du, \tag{8}$$
where $L = D - E$ is the graph Laplacian and $D$ is the degree matrix, i.e., the diagonal matrix of the row sums of $E$. By treating $G$ as a general graph, it takes $O((N+p)^3)$ time to solve the eigen-problem (8) [30], which is not computationally feasible for very large-scale datasets.
By exploiting the bipartite structure, we resort to the transfer cut [6] to reduce the eigen-problem (8) on the graph $G$ (with $N + p$ nodes) to an eigen-problem on a much smaller graph $G_R$ (with $p$ nodes). Specifically, the graph $G_R$ is constructed as $G_R = \{\mathcal{R}, E_R\}$, where $\mathcal{R}$ is the node set, $E_R = B^\top D_X^{-1} B$ is the affinity matrix (whose computation takes $O(NK^2)$ time), and $D_X \in \mathbb{R}^{N \times N}$ is a diagonal matrix with its $(i,i)$-th entry being the sum of the $i$-th row of $B$. Let $L_R = D_R - E_R$ be the graph Laplacian, where $D_R$ is the degree matrix of $G_R$. Then, the generalized eigen-problem on the graph $G_R$ can be represented as
$$L_R v = \lambda D_R v. \tag{9}$$
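It has been proved by Li et al. [6] that solving the eigen-problem (8) on the graph $G$ is equivalent to solving the eigen-problem (9) on the graph $G_R$. Let the first $k$ eigen-pairs for the eigen-problem (9) be denoted as $\{(\lambda_i, v_i)\}_{i=1}^{k}$ with $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_k < 1$, and the first $k$ eigen-pairs for the eigen-problem (8) be denoted as $\{(\gamma_i, u_i)\}_{i=1}^{k}$ with $0 \le \gamma_i < 1$. It has been shown that [6]
$$\gamma_i (2 - \gamma_i) = \lambda_i, \tag{10}$$
$$u_i = \begin{bmatrix} h_i \\ v_i \end{bmatrix}, \tag{11}$$
$$h_i = \frac{1}{1 - \gamma_i} T v_i, \tag{12}$$
where $T = D_X^{-1} B$ is the transition probability matrix. It takes $O(p^3)$ time to compute the first $k$ eigen-pairs for the eigen-problem (9). As $B$ is a sparse matrix with $NK$ non-zero entries, it takes $O(NKk)$ time to compute $\{u_i\}_{i=1}^{k}$ from $\{v_i\}_{i=1}^{k}$ according to Eqs. (10), (11), and (12). Therefore, the total cost of computing the first $k$ eigenvectors for the eigen-problem (8) will be $O(NK(K + k) + p^3)$.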
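The relation between the small and the large eigen-problem can be verified numerically on a toy example. The sketch below assumes the standard transfer-cut relations: $\gamma(2 - \gamma) = \lambda$, and $u$ obtained by stacking $h = Tv / (1 - \gamma)$ on top of $v$, with $T = D_X^{-1} B$. The matrix `B` holds hypothetical values, and the closed-form eigenpair is specific to the two-representative case used here.

```python
import math

def lift_eigenpair(B, lam, v):
    """Map an eigenpair (lam, v) of the small problem L_R v = lam D_R v
    to an eigenpair (gamma, u) of the bipartite problem L u = gamma D u."""
    gamma = 1.0 - math.sqrt(1.0 - lam)  # root of gamma*(2-gamma)=lam in [0,1)
    h = []
    for row in B:
        d_i = sum(row)                                     # entry of D_X
        t_v = sum(b * vj for b, vj in zip(row, v)) / d_i   # entry of T v
        h.append(t_v / (1.0 - gamma))
    return gamma, h + list(v)

def residual(B, gamma, u):
    """Largest absolute entry of L u - gamma D u, where L = D - E and
    E = [[0, B], [B^T, 0]]."""
    n, p = len(B), len(B[0])
    h, v = u[:n], u[n:]
    res = []
    for i in range(n):   # rows belonging to the objects
        d_i = sum(B[i])
        e_u = sum(B[i][j] * v[j] for j in range(p))
        res.append(d_i * h[i] - e_u - gamma * d_i * h[i])
    for j in range(p):   # rows belonging to the representatives
        d_j = sum(B[i][j] for i in range(n))
        e_u = sum(B[i][j] * h[i] for i in range(n))
        res.append(d_j * v[j] - e_u - gamma * d_j * v[j])
    return max(abs(r) for r in res)

# Toy cross-affinity matrix: 4 objects, 2 representatives (hypothetical).
B = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.3, 1.0]]
# With two representatives, the non-trivial eigenpair of the small
# problem has a closed form in terms of the off-diagonal entry of E_R.
c1 = sum(row[0] for row in B)   # degree of representative 1 in G_R
c2 = sum(row[1] for row in B)   # degree of representative 2 in G_R
e12 = sum(row[0] * row[1] / sum(row) for row in B)  # (E_R)_12
lam = e12 * (1.0 / c1 + 1.0 / c2)
v = [1.0 / c1, -1.0 / c2]
gamma, u = lift_eigenpair(B, lam, v)
err = residual(B, gamma, u)     # should be at rounding-error level
```

The lifted pair satisfies the large eigen-problem up to floating-point error, although only a $2 \times 2$ problem was solved. For realistic $p$, a numerical eigen-solver would replace the closed form.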
With the eigen-problem solved, the obtained $k$ eigenvectors are stacked to form an $(N + p) \times k$ matrix. By treating each row of this matrix as a new feature vector, the $N$ rows corresponding to the original objects are used, upon which the $k$-means discretization can be performed to obtain the final clustering result with $O(Nk^2t)$ time complexity.
3.1.4 Computational Complexity
In this section, we summarize the time and memory cost of our U-SPEC algorithm.
The hybrid representative selection takes $O(p^2dt)$ time. The affinity construction takes $O(Np^{1/2}d)$ time. The eigen-decomposition takes $O(NK(K + k) + p^3)$ time. The $k$-means discretization takes $O(Nk^2t)$ time. With consideration to $k, K \ll p \ll N$, the overall time complexity of U-SPEC is $O(N(p^{1/2}d + K^2 + Kk + k^2t) + p^3 + p^2dt)$, where $O(Np^{1/2}d)$ is the dominant term. Table II provides a comparison of the time complexity of our U-SPEC algorithm against several other large-scale spectral clustering algorithms.
Besides the time cost, the memory cost of U-SPEC can be either $O(NK)$ or $O(Np^{1/2})$, depending on the actual implementation of the $K$-nearest representative approximation. As the $K$-nearest representative approximations for the $N$ objects are independent of each other, one strategy is to perform the approximation for the objects one after the other (i.e., in a serial processing manner), where the memory cost is dominated by the storage of the $N \times p$ cross-affinity matrix with $NK$ non-zero entries. Another strategy is to first construct an $N \times z_1$ affinity matrix between the $N$ objects and the $z_1$ rep-cluster centers and then approximate the $K$-nearest representatives for the objects in a batch processing manner. For some matrix-oriented software, such as MATLAB, it is much faster to perform the approximation in a batch processing manner (with optimized matrix computation) than in a serial processing manner. To facilitate the matrix computation, our implementation of U-SPEC actually takes $O(Np^{1/2})$ memory. Similarly, the LSC algorithm [4] also has a theoretically minimum memory cost of $O(NK)$, but the implementation provided by its authors (www.cad.zju.edu.cn/home/dengcai/Data/Clustering.html) actually takes $O(Np)$ memory, which is also due to the matrix-computation consideration.
3.2 Ultra-Scalable Ensemble Clustering (U-SENC)
Starting from U-SPEC, this section proposes the U-SENC algorithm to integrate multiple U-SPEC’s into a unified ensemble clustering framework, aiming to further enhance the clustering robustness while maintaining high efficiency.
3.2.1 Ensemble Generation via Multiple U-SPEC’s
Ensemble clustering has been a popular research topic in recent years, due to its promising ability in enhancing clustering robustness by incorporating multiple base clusterers [10, 11, 14, 15, 12]. The general ensemble clustering process consists of two phases. The first phase is the ensemble generation, which involves producing a set of diverse and high-quality base clusterings. The second phase is the consensus function, which involves combining multiple base clusterings into a better and more robust consensus clustering.
In ensemble generation, the previous ensemble clustering algorithms mostly use the $k$-means method to generate an ensemble of multiple base clusterings [10, 11, 14, 15, 12]. Though $k$-means has the advantage of high efficiency, it typically favors spherical distributions and lacks the ability to properly partition nonlinearly separable datasets. Some researchers have exploited the spectral clustering technique in ensemble generation [31, 32], but the large computational cost of conventional spectral clustering significantly restricts its feasibility for scalable applications.
To address this, we utilize multiple instances of U-SPEC as the base clusterers in our ensemble clustering framework. To generate an ensemble of $m$ base clusterings, a set of $m$ U-SPEC clusterers are required, which are denoted as U-SPEC$_1$, U-SPEC$_2$, $\dots$, U-SPEC$_m$. The diversity, which is highly desired in ensemble generation, is incorporated from two aspects. First, the set of representatives for each base clusterer is independently obtained by the hybrid selection strategy. There are two components in hybrid selection, i.e., random pre-selection and $k$-means based post-selection, both of which are non-deterministic and can bring in diversity for the multiple base clusterers. Second, the number of clusters for each base clustering is randomly selected to further enhance the diversity. Formally, given the dataset $\mathcal{X}$, the set of $p'$ candidate representatives for the $i$-th base clusterer (i.e., U-SPEC$_i$) are randomly selected from $\mathcal{X}$. Then $k$-means is used to partition the $p'$ candidates into $p$ clusters. After that, the $p$ cluster centers are used as the set of representatives for U-SPEC$_i$, denoted as
$$\mathcal{R}^i = \{r^i_1, r^i_2, \dots, r^i_p\}. \tag{13}$$
With the $p$ representatives obtained, the sparse affinity sub-matrix $B^i$ for U-SPEC$_i$ can be built between the dataset $\mathcal{X}$ and the representative set $\mathcal{R}^i$ via the fast approximation of $K$-nearest representatives.
By treating $\mathcal{X} \cup \mathcal{R}^i$ as the node set and $B^i$ as the cross-affinity matrix, the bipartite graph $G^i = \{\mathcal{X}, \mathcal{R}^i, B^i\}$ is built and its first $k^i$ eigenvectors are then computed via the transfer cut [6]. Note that the number of clusters $k^i$ is randomly selected as
$$k^i = \lfloor \tau (k_{max} - k_{min}) \rfloor + k_{min}, \tag{14}$$
where $\tau \in [0, 1]$ is a random variable and $k_{max}$ and $k_{min}$ are respectively the upper bound and lower bound of the cluster number. Then, the obtained $k^i$ eigenvectors are stacked to form a new matrix, upon which $k$-means is applied to construct the base clustering result for U-SPEC$_i$. With the $m$ U-SPEC clusterers, the ensemble of $m$ base clusterings can be generated, which are represented as
$$\Pi = \{\pi^1, \pi^2, \dots, \pi^m\}, \tag{15}$$
where $\pi^i$ denotes the $i$-th base clustering.
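The random choice of the cluster number for each base clusterer can be sketched as below, assuming the rule is $k^i = \lfloor \tau (k_{max} - k_{min}) \rfloor + k_{min}$ with $\tau$ drawn uniformly. The function name and the bounds in the usage line are illustrative assumptions.

```python
import math
import random

def random_cluster_number(k_min, k_max, rng):
    """Draw a cluster number between k_min and k_max (inclusive bounds)."""
    tau = rng.random()  # uniform in [0, 1); tau = 1 is never produced
    return math.floor(tau * (k_max - k_min)) + k_min

rng = random.Random(0)
ks = [random_cluster_number(20, 60, rng) for _ in range(1000)]
```

Note that with Python's `random.random()`, which excludes 1, the upper bound itself is never returned; drawing with `rng.randint(k_min, k_max)` would be a simple alternative if the bound should be reachable.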
3.2.2 Consensus Function with Bipartite Graph
Having obtained the set of multiple base clusterings, this section presents the consensus function with bipartite graph for obtaining the consensus clustering.
Each base clustering consists of a certain number of clusters. For clarity, we denote the set of clusters in the ensemble of $m$ base clusterings as
$$\mathcal{C} = \{C_1, C_2, \dots, C_{k_c}\}, \tag{16}$$
where $C_i$ is the $i$-th cluster and $k_c$ is the total number of clusters in $\Pi$. It is obvious that $k_c = \sum_{i=1}^{m} k^i$.
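By treating both objects and clusters as graph nodes, the bipartite graph for the ensemble is defined as
$$\tilde{G} = \{\mathcal{X}, \mathcal{C}, \tilde{B}\}, \tag{17}$$
where $\mathcal{X} \cup \mathcal{C}$ is the node set and $\tilde{B}$ is the cross-affinity matrix. In this bipartite graph, a (non-zero) edge exists between two nodes if and only if one node is an object and the other one is the cluster that contains it. Formally, the cross-affinity matrix is constructed as follows:
$$\tilde{B} = \{\tilde{b}_{ij}\}_{N \times k_c}, \tag{18}$$
$$\tilde{b}_{ij} = \begin{cases} 1, & \text{if } x_i \in C_j, \\ 0, & \text{otherwise}. \end{cases} \tag{19}$$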
Inside the same base clustering, there is no intersection between two different clusters, i.e., $\forall\, C_i, C_j \in \mathcal{C}$, if $C_i$ and $C_j$ belong to the same base clustering and $i \neq j$, then $C_i \cap C_j = \emptyset$. Each object belongs to one and only one cluster in each base clustering, and thus each object belongs to exactly $m$ clusters in the ensemble of $m$ base clusterings. Therefore, there are exactly $m$ non-zero entries in each row of $\tilde{B}$. Although the cross-affinity matrix $\tilde{B}$ is an $N \times k_c$ matrix, it can be stored as a sparse matrix with $O(Nm)$ memory, which corresponds to the exactly $Nm$ non-zero entries in $\tilde{B}$. Besides the memory cost, the time cost of building the sparse matrix $\tilde{B}$ is $O(Nm)$.
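The construction of the object-cluster cross-affinity matrix can be sketched as follows. Each row is stored as the list of cluster indices that contain the object, which is one possible sparse layout; the function name and the label encoding are assumptions of this example.

```python
def build_cross_affinity(base_clusterings):
    """base_clusterings[i][n] is the label of object n in the i-th base
    clustering. Returns (rows, k_c), where rows[n] lists the global
    indices of the clusters containing object n (the non-zero columns of
    row n) and k_c is the total number of clusters in the ensemble."""
    label_maps, k_c = [], 0
    for labels in base_clusterings:
        distinct = sorted(set(labels))
        # Clusters of each base clustering get consecutive global indices.
        label_maps.append({lab: k_c + idx
                           for idx, lab in enumerate(distinct)})
        k_c += len(distinct)
    n_objects = len(base_clusterings[0])
    rows = [[label_maps[i][labels[n]]
             for i, labels in enumerate(base_clusterings)]
            for n in range(n_objects)]
    return rows, k_c

# Three base clusterings of four objects (hypothetical labels).
base = [[0, 0, 1, 1], [0, 1, 2, 2], [1, 1, 1, 0]]
rows, k_c = build_cross_affinity(base)
```

Every row has exactly one entry per base clustering, so the storage is $Nm$ indices regardless of how large $k_c$ becomes.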
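As shown in Section 3.1.3, solving the eigen-problem for the bipartite graph $\tilde{G}$ is equivalent to solving the eigen-problem for a much smaller graph $G_C = \{\mathcal{C}, E_C\}$, that is
$$L_C v = \lambda D_C v, \tag{20}$$
where $E_C = \tilde{B}^\top \tilde{D}_X^{-1} \tilde{B}$ is the affinity matrix, $\tilde{D}_X \in \mathbb{R}^{N \times N}$ is a diagonal matrix with its $(i,i)$-th entry being the sum of the $i$-th row of $\tilde{B}$, $L_C = D_C - E_C$ is the graph Laplacian, and $D_C$ is the degree matrix of $G_C$.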
Let $\{v_1, \dots, v_k\}$ denote the first $k$ eigenvectors for the eigen-problem (20), which can be computed with a time cost of $O(k_c^3)$. Based on the eigenvectors for $G_C$, the first $k$ eigenvectors (denoted as $\{\tilde{u}_1, \dots, \tilde{u}_k\}$) for the bipartite graph $\tilde{G}$ can be computed with $O(Nmk)$ time (see Eqs. (10), (11), and (12)). Finally, by stacking the $k$ eigenvectors to form a new matrix, the consensus clustering result in U-SENC can be obtained by $k$-means discretization with $O(Nk^2t)$ time.
3.2.3 Computational Complexity
This section summarizes the time and memory cost of the proposed U-SENC algorithm.
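The ensemble generation of the U-SENC algorithm runs $m$ U-SPEC clusterers, so its time cost is $m$ times that of a single U-SPEC run (see Section 3.1.4), with $O(Nmp^{1/2}d)$ as the leading term. The consensus function of U-SENC adds the graph construction, eigen-decomposition, and discretization costs given in Section 3.2.2. With consideration to $m, k, K \ll p \ll N$, the dominant term of the overall time complexity of U-SENC is $O(Nmp^{1/2}d)$.
Meanwhile, the memory costs of the ensemble generation and the consensus function of our U-SENC algorithm are respectively $O(Np^{1/2})$ and $O(Nm)$.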
4 Experiments
In this section, we conduct experiments on a variety of real and synthetic datasets to compare the proposed U-SPEC and U-SENC algorithms against several state-of-the-art spectral clustering and ensemble clustering algorithms.
All experiments are conducted in Matlab 2016b on a PC with an Intel i5-6600 CPU and 64GB of RAM.
4.1 Datasets and Evaluation Measures
Our experiments are conducted on ten large-scale datasets (including five real datasets and five synthetic datasets), whose data sizes range from ten thousand to as large as twenty million. Specifically, the five real datasets are PenDigits [33], USPS [34], Letters [33], MNIST [34], and Covertype [33]. The five synthetic datasets are Two Bananas-1M (TB-1M), Smiling Face-2M (SF-2M), Concentric Circles-5M (CC-5M), Circles and Gaussians-10M (CG-10M), and Flower-20M. The details of the datasets are provided in Table III and Fig. 5.
To evaluate the clustering results by different algorithms, two widely used evaluation measures are adopted, namely, normalized mutual information (NMI) [18] and clustering accuracy (CA) [35]. To rule out the factor of getting lucky occasionally, in each experiment, every test method will be conducted 20 times and their average NMI, CA, and time costs will be reported. Note that larger values of NMI and CA indicate better clustering results.
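For readers who want to reproduce the evaluation, NMI can be computed directly from two label vectors. The sketch below assumes the geometric-mean normalization $I(U;V) / \sqrt{H(U)H(V)}$; other normalizations exist, and the convention used for the degenerate single-cluster case is a choice of this example.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same
    objects, normalized by the geometric mean of the two entropies."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual = sum(c / n * math.log(n * c / (count_a[a] * count_b[b]))
                 for (a, b), c in count_ab.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual / math.sqrt(h_a * h_b)
```

NMI is invariant to how clusters are numbered, so a clustering that matches the ground truth up to a relabeling scores 1, while statistically independent labelings score 0.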
4.2 Baseline Methods and Experimental Settings
In the experiments, we first compare our algorithms against the classical $k$-means algorithm [36] as well as seven spectral clustering algorithms (including the original algorithm and six large-scale algorithms). The baseline spectral clustering algorithms are listed as follows:
1. SC [2]: original spectral clustering.
2. ESCG [37]: efficient spectral clustering on graphs.
3. Nyström [3]: Nyström spectral clustering.
4. LSC-K [4]: landmark based spectral clustering using $k$-means based landmark selection.
5. LSC-R [4]: landmark based spectral clustering using random landmark selection.
6. FastESC [5]: fast explicit spectral clustering.
7. EulerSC [7]: Euler spectral clustering.
Besides these large-scale spectral clustering algorithms, we also compare our algorithms against seven ensemble clustering algorithms, which are listed as follows:
1. EAC [8]: evidence accumulation clustering.
2. WCT [9]: weighted connected triple method.
3. KCC [10]: $k$-means based consensus clustering.
4. PTGP [11]: probability trajectory based graph partitioning.
5. ECC [14]: entropy based consensus clustering.
6. SEC [15]: spectral ensemble clustering.
7. LWGP [12]: locally weighted graph partitioning.
There are several common parameters among the above-mentioned algorithms. In our experiments, we comply with the following experimental settings:
- The SC and ESCG methods need to take the affinity matrix as input. The $N \times N$ affinity matrix is constructed using the same Gaussian kernel as Eq. (6) with $K$-nearest neighbors.
- The U-SPEC, U-SENC, Nyström, LSC-K, and LSC-R methods have a common parameter $p$. In the experiments, $p = 1000$ is used for these methods. Their performances with varying $p$ will be further evaluated in Section 4.5.1.
- The U-SPEC, U-SENC, LSC-K, and LSC-R methods have a common parameter $K$. In the experiments, $K = 5$ is used. Their performances with varying $K$ will be further evaluated in Section 4.5.2.
- For the seven ensemble clustering methods, the base clusterings are generated by $k$-means as suggested by their papers [8, 9, 10, 11, 14, 15, 12]. The number of clusters in each base clustering is randomly selected in $[20, 60]$. The number of base clusterings, i.e., $m$, is set to $20$. Their performances with varying $m$ will be further evaluated in Section 4.5.3.
- The true number of classes on each dataset is used as the number of clusters for all the test methods.
- Besides these common parameters, the other parameters in the baseline methods are set as suggested by the corresponding papers.
4.3 Comparison with Spectral Clustering Methods
In this section, we compare our U-SPEC and U-SENC algorithms with several state-of-the-art large-scale spectral clustering algorithms.
As the data sizes range from ten thousand to twenty million, most of the baseline algorithms are not computationally feasible for ten-million-level datasets. We use N/A to indicate an out-of-memory error in the results. As shown in Tables IV and V, the SC and ESCG methods are not able to handle the datasets larger than MNIST (which consists of 70,000 objects), due to the memory consumption of constructing and manipulating the $N \times N$ affinity matrix. The Nyström, LSC-K, LSC-R, and FastESC methods can at most partition a dataset with two million objects, and cannot deal with datasets larger than that. Out of the total of nine spectral clustering methods, only three methods (i.e., U-SPEC, U-SENC, and EulerSC) can deal with all of the benchmark datasets. As shown in Tables IV and V, our U-SENC and U-SPEC methods achieve the best and the second best scores, respectively, on most of the ten benchmark datasets.
In Tables IV and V, we also provide the average score, normalized average score (N-Avg. score), and average rank of each method across the ten datasets. To obtain the normalized average score, the scores in each row are first divided by the maximum score in this row, so that the maximum score in each row becomes $1$. Then we take the average of these normalized rows as the normalized average score. Note that if a baseline method cannot process all the datasets, it will not have the average score and normalized average score information, but it will still have the average rank information. For example, if only three methods are efficient enough to process the CC-5M dataset, then all the other infeasible methods are treated as equally ranked in the fourth position on this dataset. As shown in Tables IV and V, our U-SENC method ranks in the first position on nine out of the ten datasets, and achieves an average rank of 1.10 w.r.t. both NMI and CA. Our U-SPEC method achieves an average rank of 2.40 w.r.t. NMI and 2.00 w.r.t. CA. In terms of average score and normalized average score, our U-SENC and U-SPEC methods also significantly outperform the other methods.
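The two summary statistics described above can be sketched as follows. The score values in the usage lines are made-up toy numbers for illustration only, not results from the paper; `None` marks a method that could not process a dataset, and ties are given the same (best) rank, which is an assumption of this sketch.

```python
def normalized_average(scores):
    """scores[r] holds the per-method scores on dataset r (no missing
    values). Each row is divided by its maximum, then averaged per method."""
    n_methods = len(scores[0])
    normalized = [[s / max(row) for s in row] for row in scores]
    return [sum(row[m] for row in normalized) / len(normalized)
            for m in range(n_methods)]

def average_rank(scores):
    """Average rank per method; larger scores rank first. Methods marked
    None on a dataset all share the rank right after the feasible ones."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        feasible = sorted((s for s in row if s is not None), reverse=True)
        for m, s in enumerate(row):
            if s is None:
                rank = len(feasible) + 1
            else:
                rank = feasible.index(s) + 1
            totals[m] += rank
    return [t / len(scores) for t in totals]

# Toy scores (hypothetical): 2 datasets x 3 methods.
n_avg = normalized_average([[0.8, 0.4], [0.5, 1.0]])
ranks = average_rank([[0.8, 0.4, None], [0.5, 1.0, 0.25]])
```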
Table VI reports the time costs of different methods on the benchmark datasets. The U-SPEC shows superior efficiency on most of the datasets, especially on the datasets larger than one million. The U-SENC requires a larger time cost than U-SPEC, but it still provides better scalability than most of the baseline methods and scales well for ten-million-level datasets due to its memory efficiency. As U-SENC is a spectral clustering algorithm and also an ensemble clustering algorithm, in the following, we will further compare it with other state-of-the-art ensemble clustering algorithms.
4.4 Comparison with Ensemble Clustering Methods
In this section, we compare our algorithms with several state-of-the-art ensemble clustering algorithms.
Note that U-SPEC is not an ensemble clustering algorithm; its clustering results are provided in Tables VII, VIII, and IX for reference only. As shown in Tables VII and VIII, our U-SENC algorithm obtains the highest NMI and CA scores on all of the ten datasets. In terms of average score across the ten datasets, U-SENC achieves the best average NMI(%) and CA(%) scores, ahead of the second best ensemble clustering method (i.e., LWGP); the exact values are listed in Tables VII and VIII. Similar advantages of U-SENC can also be observed in the normalized average scores. In terms of average rank, U-SENC obtains an average rank of 1.00 w.r.t. both NMI and CA, while the second best method obtains an average rank of 2.80 w.r.t. NMI and 2.90 w.r.t. CA.
In Table IX, the time costs of different ensemble clustering methods are provided. As can be seen in Table IX, the proposed U-SENC method has shown its advantage in efficiency over the other ensemble clustering methods, especially on the large-scale datasets whose data sizes go beyond millions.
4.5 Parameters Analysis
In this section, we evaluate the performances of our algorithms and several baseline algorithms with varying parameters. Because some important baseline methods (such as Nyström, LSC-K, and LSC-R) cannot go beyond two-million-level datasets, we perform the parameter analysis on four benchmark datasets, namely, MNIST, Covertype, TB-1M, and SF-2M, so that the influence of the common parameters can be tested fairly. These are the four largest datasets whose sizes are no larger than two million.
4.5.1 Number of Representatives
The parameter $p$ denotes the number of representatives (or landmarks), which is a common parameter in the sub-matrix based spectral clustering methods, such as Nyström, LSC-K, LSC-R, and our U-SPEC and U-SENC methods. As can be seen in Table X, a larger $p$ generally leads to better performance, but also brings in an increasing time cost. In terms of NMI and CA, our U-SENC method consistently outperforms the other methods with varying $p$ on all of the four datasets. LSC-K outperforms U-SPEC on the MNIST dataset, but on the other three datasets U-SPEC achieves better or significantly better NMI and CA scores than LSC-K. In terms of computational cost, the LSC-K and Nyström methods cannot deal with some of the tested numbers of representatives on the SF-2M dataset with two million objects (see Table X). On the benchmark datasets, U-SPEC is overall the fastest method with varying $p$ (as shown in Table X).
4.5.2 Number of Nearest Representatives
The parameter $K$ denotes the number of nearest representatives (or landmarks), which is a common parameter in LSC-K, LSC-R, and our U-SPEC and U-SENC methods. Note that the Nyström method does not have the parameter $K$, but we still report its performance in Table XI as a benchmark. As illustrated in Table XI, on the MNIST dataset, U-SENC and LSC-K are respectively the best and the second best methods w.r.t. NMI and CA, while U-SPEC is the third best method. On all of the other three benchmark datasets, U-SENC and U-SPEC are overall the best two methods w.r.t. both NMI and CA with varying $K$ (as shown in Table XI).
4.5.3 Ensemble Size
The parameter $m$ denotes the number of base clusterings, which is a common parameter in all of the ensemble clustering methods, including U-SENC as well as the baseline ensemble clustering methods. Note that U-SPEC is not an ensemble clustering method and does not have the parameter $m$, but we still report its performance in Table XII for reference only. As shown in Table XII, U-SENC outperforms, or even significantly outperforms, the other ensemble clustering methods w.r.t. both NMI and CA on the benchmark datasets with varying ensemble size $m$. Meanwhile, U-SENC consistently requires a lower computational cost than the other ensemble clustering methods.
4.6 Influence of Representative Selection Strategies
In this section, we compare the performances of our algorithms using different representative selection strategies. Specifically, Table XIII illustrates the performances of U-SPEC using hybrid selection (U-SPEC-H), U-SPEC using random selection (U-SPEC-R), and U-SPEC using $k$-means based selection (U-SPEC-K), whereas Table XIV illustrates the performances of U-SENC using hybrid selection (U-SENC-H), U-SENC using random selection (U-SENC-R), and U-SENC using $k$-means based selection (U-SENC-K). As shown in Tables XIII and XIV, the random representative selection is very efficient compared to $k$-means based selection, but may degrade the clustering quality due to its inherent instability. The $k$-means based selection generally leads to better clustering quality than random selection, but brings in a much larger computational cost. Compared to random selection and $k$-means based selection, our hybrid selection strategy strikes a balance between efficiency and clustering robustness. It achieves comparable efficiency to the random selection and significantly better efficiency than the $k$-means based selection, and also yields competitive clustering quality as compared to the $k$-means based selection.
4.7 Influence of Approximate $K$-Nearest Neighbors
In this section, we compare our algorithms using approximate $K$-nearest representatives against using exact $K$-nearest representatives, where four variants are evaluated, i.e., U-SPEC(A), U-SPEC(E), U-SENC(A), and U-SENC(E). The purpose of using approximate $K$-nearest representatives (see Section 3.1.2) is to alleviate the time and memory cost of the affinity sub-matrix construction while maintaining the overall clustering quality. As shown in Tables XV and XVI, using approximate $K$-nearest representatives can achieve comparable clustering quality (w.r.t. NMI and CA) to using exact $K$-nearest representatives while alleviating the computational cost. As our approximation of $K$-nearest representatives reduces the time complexity from $O(Npd)$ to $O(Np^{1/2}d)$, the improvement in efficiency is more significant for high-dimensional datasets, such as the MNIST dataset, whose dimension is 784. Even for the low-dimensional datasets, such as TB-1M and SF-2M, the use of approximate $K$-nearest representatives can still consistently reduce the time cost. Besides the time efficiency, the approximate $K$-nearest representatives also alleviate the memory burden. Specifically, on a machine with 64GB memory, the computation of conventional $K$-nearest representatives can hardly go beyond five million objects, whereas the proposed approximation method for $K$-nearest representatives can scale well even for ten-million-level datasets.
5 Conclusion
This paper proposes two large-scale clustering algorithms, termed ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC), respectively. In U-SPEC, a new hybrid representative selection strategy is designed to strike a balance between the efficiency of random selection and the effectiveness of $k$-means based selection. Then a new approximation method for $K$-nearest representatives is presented to efficiently construct a bipartite graph between the original data objects and the set of representatives, upon which the transfer cut can be utilized to obtain the clustering result. Starting from the U-SPEC algorithm, we further integrate multiple U-SPEC clusterers into a unified ensemble clustering framework and propose the U-SENC algorithm. Specifically, multiple U-SPEC’s are exploited in the ensemble generation phase to produce an ensemble of diverse and high-quality base clusterings. The multiple base clusterings are incorporated into a new bipartite graph, which treats both objects and base clusters as graph nodes and is then efficiently partitioned to achieve the final consensus clustering. Extensive experiments have been conducted on ten large-scale datasets, which demonstrate the scalability and robustness of our algorithms.
Acknowledgments
This project was supported by NSFC (61602189, 61876193 & 61876104), National Key Research and Development Program of China (2016YFB1001003), and Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306014).
References
- [1] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
- [2] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
- [3] W. Y. Chen, Y. Song, H. Bai, C. J. Lin, and E. Y. Chang, “Parallel spectral clustering in distributed systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 568–586, 2011.
- [4] D. Cai and X. Chen, “Large scale spectral clustering via landmark-based sparse representation,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
- [5] L. He, N. Ray, Y. Guan, and H. Zhang, “Fast large-scale spectral clustering via explicit feature mapping,” IEEE Transactions on Cybernetics, in press, 2018.
- [6] Z. Li, X.-M. Wu, and S.-F. Chang, “Segmentation using superpixels: A bipartite graph partitioning approach,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- [7] J. S. Wu, W. S. Zheng, J. H. Lai, and C. Y. Suen, “Euler clustering on large-scale dataset,” IEEE Transactions on Big Data, in press, 2018.
- [8] A. L. N. Fred and A. K. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835–850, 2005.
