Deep Kernel Learning for Clustering
Chieh Wu, Zulqarnain Khan, Yale Chang, Stratis Ioannidis, Jennifer Dy

TL;DR
This paper introduces a deep learning method for creating custom kernels that improve clustering accuracy, offering faster training and better out-of-sample performance compared to existing methods.
Contribution
It presents a novel neural network-based kernel learning approach optimized with the Hilbert-Schmidt Independence Criterion, outperforming traditional and deep clustering techniques.
Findings
Outperforms state-of-the-art deep clustering methods
Faster training due to gradient-based optimization
Effective on both real-life and synthetic datasets
Abstract
We propose a deep learning approach for discovering kernels tailored to identifying clusters over sample data. Our neural network produces sample embeddings that are motivated by, and are at least as expressive as, spectral clustering. Our training objective, based on the Hilbert-Schmidt Independence Criterion, can be optimized via gradient adaptations on the Stiefel manifold, leading to significant acceleration over spectral methods relying on eigendecompositions. Finally, our trained embedding can be directly applied to out-of-sample data. We show experimentally that our approach outperforms several state-of-the-art deep clustering methods, as well as traditional approaches such as $k$-means and spectral clustering over a broad array of real-life and synthetic datasets.

Table 1: Dataset summary ($N$: number of samples, $c$: number of clusters, $d$: number of features, $\sigma$: Gaussian kernel width).

| Data | $N$ | $c$ | $d$ | Type | $\sigma$ |
|---|---|---|---|---|---|
| Moon | 1000 | 2 | 2 | Geometric Shape | 0.1701 |
| Spiral1 | 3000 | 3 | 2 | Geometric Shape | 0.1708 |
| Spiral2 | 30000 | 3 | 2 | Geometric Shape | 0.1811 |
| Cancer | 683 | 2 | 9 | Medical | 3.3194 |
| Wine | 178 | 3 | 13 | Classification | 4.939 |
| RCV | 10000 | 4 | 5 | Text | 2.364 |
| Face | 624 | 20 | 27 | Image | 6.883 |

Table 2: Clustering NMI (in %, mean ± standard deviation over 10 runs) on the full datasets.

| Dataset | AEC | DEC | IMSAT | SN | SC | $k$-means | KNetEIG | KNetSMA |
|---|---|---|---|---|---|---|---|---|
| Moon | 56.2 ± 0.0 | 42.2 ± 0.0 | 51.3 ± 20.3 | 100 ± 0.0 | 72.0 ± 0.0 | 66.1 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
| Spiral1 | 28.3 ± 0.0 | 32.0 ± 0.01 | 59.6 ± 7.5 | 100.0 ± 0.0 | 100.0 ± 0.0 | 42.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
| Cancer | 79.9 ± 0.2 | 79.2 ± 0.0 | 74.6 ± 2.2 | 82.9 ± 0.0 | 69.8 ± 0.0 | 73.0 ± 0.0 | 84.2 ± 0.4 | 82.5 ± 0.1 |
| Wine | 54.6 ± 0.0 | 80.6 ± 0.0 | 72.3 ± 11.4 | 79.7 ± 0.2 | 88.0 ± 0.0 | 42.8 ± 0.0 | 91.0 ± 0.8 | 90.0 ± 0.7 |
| RCV | 39.3 ± 0.0 | 51.3 ± 0.0 | 39.0 ± 5.5 | 43.5 ± 0.2 | 46 ± 0.0 | 56.0 ± 0 | 46.3 ± 0.4 | 46.1 ± 0.2 |
| Face | 76.8 ± 0.0 | 75.8 ± 1.6 | 83.8 ± 3.5 | 75.6 ± 0.1 | 66.0 ± 0.4 | 91.8 ± 0.0 | 93.0 ± 0.3 | 92.6 ± 0.5 |

Table 3: Preprocessing time (Prep) and runtime (RT) of each algorithm on the full datasets.

| Dataset | AEC Prep | AEC RT | DEC Prep | DEC RT | IMSAT Prep | IMSAT RT | SN RT | SC RT | $k$-means RT | KNet Prep | KNetEIG RT | KNetSMA RT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Moon | 18.23 | 28.04 | 331.80 | 3.52 | 1.20 | 34.34 | 324.00 | 0.34 | 0.04 | 129.00 | 28.90 | 18.40 |
| Spiral1 | 2574.00 | 7920.00 | 343.80 | 121.20 | 53.06 | 309.00 | 444.00 | 4.20 | 0.12 | 300.00 | 42.00 | 19.00 |
| Cancer | 99.00 | 280.20 | 342.00 | 38.73 | 1.10 | 144.00 | 234.00 | 0.18 | 0.03 | 150.00 | 19.30 | 10.30 |
| Wine | 33.81 | 41.69 | 345.00 | 38.02 | 1.10 | 33.32 | 330.00 | 0.03 | 0.06 | 462.00 | 7.20 | 3.40 |
| RCV | 141.60 | 2536.80 | 381.00 | 784.20 | 10.39 | 73.20 | 438.00 | 83.40 | 0.35 | 1080.00 | 1830.00 | 1116.00 |
| Face | 27.50 | 189.00 | 344.40 | 254.10 | 1.20 | 121.80 | 170.90 | 0.26 | 0.15 | 1320.00 | 20.90 | 3.30 |

Table 4: Out-of-sample clustering NMI (in %). Each algorithm is trained on the indicated fraction of the data and then applied to the full dataset.

| Dataset | Data % | AEC | DEC | IMSAT | SN | $k$-means | KNetEIG | KNetSMA |
|---|---|---|---|---|---|---|---|---|
| Moon | 25% | 51.1 | 45.6 | 45.5 | 100.0 | 21.5 | 100.0 | 100.0 |
| Spiral1 | 10.0% | 32.2 | 49.5 | 48.7 | 100.0 | 56.7 | 100.0 | 100.0 |
| Cancer | 30.0% | 76.4 | 76.9 | 74.9 | 82.2 | Fails | 84.0 | 83.6 |
| Wine | 75.0% | 49.1 | 81.5 | 69.4 | 77.2 | 25.0 | 91.1 | 89.3 |
| RCV | 6.0% | 26.8 | 43.2 | 35.2 | 41.3 | 52.5 | 45.2 | 43.1 |
| Face | 35.0% | 52.5 | 67.1 | 77.8 | 75.5 | 87.2 | 92.7 | 91.1 |

Table 5: Preprocessing time (Prep) and runtime (RT) of each algorithm in the out-of-sample setting (times in seconds).

| Dataset | AEC Prep | AEC RT | DEC Prep | DEC RT | IMSAT Prep | IMSAT RT | SN RT | $k$-means RT | KNet Prep | KNetEIG RT | KNetSMA RT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Moon | 15.01 | 22.30 | 201.00 | 2.95 | 1.10 | 27.63 | 140.10 | 0.03 | 48.00 | 2.30 | 1.10 |
| Spiral1 | 192.00 | 498.60 | 250.20 | 99.00 | 47.05 | 229.80 | 223.30 | 0.07 | 82.00 | 3.10 | 1.40 |
| Cancer | 55.30 | 75.00 | 306.00 | 30.01 | 1.65 | 100.20 | 38.90 | 0.03 | 16.00 | 4.00 | 1.80 |
| Wine | 25.30 | 35.31 | 279.00 | 32.31 | 2.40 | 25.60 | 20.40 | 0.03 | 74.00 | 1.30 | 0.09 |
| RCV | 63.00 | 316.20 | 339.00 | 672.60 | 3.21 | 56.00 | 40.67 | 0.03 | 72.00 | 5.50 | 2.10 |
| Face | 15.10 | 171.60 | 315.00 | 233.40 | 1.10 | 93.60 | 35.09 | 0.10 | 76.80 | 2.20 | 0.96 |
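
Table 6: HSIC, autoencoder (AE) reconstruction error, and NMI at convergence on the Wine dataset for different values of $\lambda$. Empty $\lambda$ cells are values that did not survive extraction.

| $\lambda$ | KNetEIG HSIC | KNetEIG AE error | KNetEIG NMI | KNetSMA HSIC | KNetSMA AE error | KNetSMA NMI |
|---|---|---|---|---|---|---|
| | 98.11 | 21.58 | 0.90 | 92.01 | 8.98 | 0.89 |
| | 98.80 | 72.95 | 0.88 | 94.21 | 72.62 | 0.87 |
| | 112.32 | 101.45 | 0.87 | 101.11 | 88.39 | 0.89 |
| 0.005 | 109.01 | 105.37 | 0.91 | 108.22 | 107.22 | 0.90 |
| | 113.18 | 124.56 | 0.91 | 110.18 | 126.63 | 0.91 |
| 0 | 110.89 | 127.23 | 0.92 | 109.36 | 127.11 | 0.91 |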

HSIC, autoencoder (AE) reconstruction error, and NMI at convergence for different values of $\lambda$ (additional dataset). Empty $\lambda$ cells are values that did not survive extraction.

| $\lambda$ | KNetEIG HSIC | KNetEIG AE error | KNetEIG NMI | KNetSMA HSIC | KNetSMA AE error | KNetSMA NMI |
|---|---|---|---|---|---|---|
| | 2.989 | 0.001 | 0.446 | 2.989 | 0 | 0.421 |
| | 2.989 | 0.0001 | 0.412 | 2.989 | 0 | 0.414 |
| | 2.989 | 0.0001 | 0.421 | 2.989 | 0 | 0.418 |
| | 2.989 | 0.001 | 0.434 | 2.989 | 0 | 0.419 |
| | 2.988 | 0.001 | 0.421 | 2.988 | 0 | 0.418 |
| | 2.965 | 0.099 | 0.557 | 2.979 | 0.033 | 0.558 |
| | 1.982 | 0.652 | 0.644 | 2.33 | 0.583 | 0.56 |
| | 2.124 | 0.669 | 0.969 | 2.037 | 0.66 | 0.824 |
| | 2.249 | 0.69 | 1 | 2.368 | 0.67 | 1 |
| 0 | 2.278 | 0.715 | 1 | 2.121 | 0.718 | 1 |

Training hyperparameters of each algorithm:

| Algorithm | Learning Rate | Batch size | Optimizer |
|---|---|---|---|
| AEC | - | 100 | SGD |
| DEC | 0.01 | 256 | SGDSolver |
| IMSAT | 0.002 | 250 | Adam |
| SN | 0.001 | 128 | RMSProp |
| KNet | 0.001 | 5 | Adam |

Deep Kernel Learning for Clustering
Chieh Wu, Zulqarnain Khan, Stratis Ioannidis, Jennifer G. Dy
All authors are associated with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA.
Source code is available at https://github.com/neu-spiral/kernel_net
1 Introduction
Clustering algorithms group similar samples together based on some predefined notion of similarity. One way of representing this similarity is through kernels. However, the choice for an appropriate kernel is data-dependent; as a result, the kernel design process is frequently an art that requires intimate knowledge of the data. A common alternative is to simply use a general-purpose kernel that performs well under various conditions (e.g., polynomial or Gaussian kernels).
In this paper, we propose KernelNet (KNet), a methodology for learning a kernel and an induced clustering directly from the observed data. In particular, we train a deep kernel by combining a neural network representation with a Gaussian kernel. More specifically, given a dataset of $N$ samples in $\mathbb{R}^d$, we learn a kernel of the form:
$$\tilde{k}_\theta(x_i, x_j) = \frac{1}{\sqrt{d_i d_j}} \exp\left(-\frac{\|\psi_\theta(x_i) - \psi_\theta(x_j)\|_2^2}{2\sigma^2}\right),$$
where $\psi_\theta$ is an embedding function modeled as a neural network parametrized by $\theta$, and $\sigma$, $d_i$, $d_j$ are normalizing constants. Intuitively, by incorporating a neural network (NN) parameterization into a Gaussian kernel, we are able to learn a flexible deep kernel for clustering, tailored specifically to a given dataset.
We train our deep kernel with a spectral clustering objective based on the Hilbert Schmidt Independence Criterion [1]. This training can be interpreted as learning a non-linear transformation as well as its spectral embedding simultaneously. Via an appropriate and intuitive initialization of our training process, we ensure that our clustering method is at least as powerful as spectral clustering. In particular, just as spectral clustering, our learned kernel and the induced clustering work exceptionally well on non-convex clusters. In practice, by training the kernel directly from the data, our proposed method significantly outperforms spectral clustering.
The non-linear transformation $\psi_\theta$ learned directly from the dataset allows us to readily handle out-of-sample data. Given a new unobserved sample $x$, we can easily identify its cluster label by first computing its image $\psi_\theta(x)$, thus embedding it in the same space as the (already clustered) existing dataset. This is in contrast to spectral clustering, which would require re-executing the algorithm from scratch on the combined dataset of $N+1$ samples. The aforementioned properties of our algorithm are illustrated in Fig. 1. A dataset with non-convex spiral clusters is shown in Fig. 1(a). Applying a Gaussian kernel directly to these samples leads to a highly uninformative kernel matrix, as shown in Fig. 1(b). We train our embedding $\psi_\theta$ on only 1% of the samples, and apply it to the entire dataset; the embedded data, shown in Fig. 1(c), consists of nearly convex, linearly separable clusters. More importantly, the corresponding learned kernel $\tilde{k}_\theta$ yields a highly informative kernel matrix that clearly exhibits a block diagonal structure, as shown in Fig. 1(d). In summary, our major contributions are:
We propose a novel methodology of discovering a deep kernel tailored for clustering directly from data, using an objective based on the Hilbert-Schmidt Independence Criterion.
We propose an algorithm for training the kernel by maximizing this objective, as well as for selecting a good parameter initialization. Our algorithm, KNet, can be perceived as alternating between training the kernel and discovering its spectral embedding.
We evaluate the performance of KNet with synthetic and real data compared to multiple state-of-the-art methods for deep clustering. In 5 out of 6 datasets, KNet outperforms state-of-the-art by as much as 57.8%; this discrepancy is more pronounced in datasets with non-convex clusters, which KNet handles very well.
Finally, we demonstrate that the algorithm does well in clustering out-of-sample data. This generalization capability means we can significantly accelerate KNet through subsampling: learning the embedding on only 1%-35% of the data can be used to cluster an entire dataset, leading only to a 0%-3% degradation of clustering performance.
2 Related Work
Several recent works propose autoencoders specifically designed for clustering. Song et al. [2] combine an autoencoder with $k$-means, including a penalty w.r.t. distance to cluster centers. They optimize this objective by alternating between stochastic gradient descent (SGD) and cluster center assignment. Ji et al. [3] incorporate a subspace clustering penalty into an autoencoder, and alternate between SGD and dictionary learning. Tian et al. [4] learn a stacked autoencoder initialized via a similarity matrix. Xie et al. [5] incorporate a KL-divergence penalty between the encoding and a soft cluster assignment, both of which are again alternately optimized; a similar approach is followed by Guo et al. [6] and Hu et al. [7]. In KNet, we significantly depart from these methods by using an HSIC-based objective, motivated by spectral clustering. In practice, this makes KNet better tailored to learning non-convex clusters, on which the aforementioned techniques perform poorly. We demonstrate this experimentally in Section 6.
Our work is closest to SpectralNet [8] by Shaham et al. in that the authors propose a neural network approach for spectral clustering. However, they first learn a similarity matrix using a Siamese network, and then keeping this similarity fixed they optimize a spectral clustering objective to learn the spectral embedding. In contrast, KNet learns both the kernel similarity matrix and the spectral embedding jointly, iteratively improving both. By not having a mechanism to improve upon the previously learnt similarity matrix once a spectral embedding is learnt, as KNet does, SpectralNet can only do as well as the initially learnt similarity matrix: this is evidenced by the overall improved performance of KNet over SpectralNet (see also Sec. 6).
KNet also has some relationship to methods for kernel learning. A series of papers [11, 9, 10] regress deep kernels to model Gaussian processes. Zhou et al. [12] learn (shallow) linear combinations of given kernels. Closest to us, Niu et al. [13] use HSIC to jointly discover a subspace in which data lives as well as its spectral embedding; the latter is used to cluster the data. This corresponds to learning a kernel over a (shallow) projection of the data to a linear subspace. KNet, therefore, generalizes the work by Niu et al. [13] to learning a deep, non-linear kernel representation (c.f. Eq. (4.10)), which improves upon spectral embeddings and is used directly to cluster the data.
3 Hilbert-Schmidt Independence Criterion
Proposed by Gretton et al. [1], the Hilbert Schmidt Independence Criterion (HSIC) is a statistical dependence measure between two random variables. Like Mutual Information (MI), it measures dependence by comparing the joint distribution of the two random variables with the product of their marginal distributions. However, compared to MI, HSIC is easier to compute empirically, since it does not require a direct estimation of the joint distribution. It is used in many applications due to this advantage, including dimensionality reduction [13], feature selection [16], and alternative clustering [14], to name a few.
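Formally, consider a set of $N$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^c$ are drawn from a joint distribution. Let $X \in \mathbb{R}^{N \times d}$ and $Y \in \mathbb{R}^{N \times c}$ be the corresponding matrices comprising a sample in each row. Let also $k_X$ be any characteristic kernel; in this paper we consider the Gaussian kernel $k_X(x_i, x_j) = \exp\left(-\|x_i - x_j\|_2^2 / (2\sigma^2)\right)$. Let $k_Y$ be another characteristic kernel, assumed here to be the linear kernel $k_Y(y_i, y_j) = y_i^\top y_j$. Define $K_X, K_Y \in \mathbb{R}^{N \times N}$ to be the kernel matrices with entries $k_X(x_i, x_j)$ and $k_Y(y_i, y_j)$, respectively, and let $\tilde{K}_X$ be the normalized Gaussian kernel matrix given by
$$\tilde{K}_X = D^{-1/2} K_X D^{-1/2},$$
where the degree matrix
$$D = \mathrm{diag}(K_X \mathbf{1}_N) \tag{3.3}$$
is a normalizing diagonal matrix. Then, the HSIC between $X$ and $Y$ is estimated empirically via:
$$\mathbb{H}(X, Y) = \frac{1}{(N-1)^2} \mathrm{Tr}\left(\tilde{K}_X H K_Y H\right), \tag{3.4}$$
where $H = I - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top$ is the centering matrix. Intuitively, HSIC empirically measures the dependence between samples of the two random variables. Though HSIC can be more generally defined for arbitrary characteristic kernels, this particular choice has a direct relationship with (and motivation from) spectral clustering. In particular, given $X$, consider the optimization:
$$\max_{U \in \mathbb{R}^{N \times c}} \mathbb{H}(X, U) \quad \text{subject to} \quad U^\top U = I, \tag{3.5}$$
where $\mathbb{H}$ is given by (3.4). Then, the optimal solution $U_0$ to (3.5) is precisely the spectral embedding of $X$ [13]. Indeed, $U_0$ comprises the top $c$ eigenvectors of the normalized similarity matrix, given by:
$$\mathcal{L} = H D^{-1/2} K_X D^{-1/2} H. \tag{3.6}$$
For completeness, we prove this in Appendix A.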
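The empirical HSIC estimate described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the toy two-cluster data, and the choice $\sigma = 1$ are assumptions made for the example. It uses a Gaussian kernel on the first argument, normalizes it by the degree matrix, and uses a linear kernel on the second argument.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def normalized_kernel(K):
    """D^{-1/2} K D^{-1/2}, where D = diag(K 1) is the degree matrix."""
    inv_sqrt_d = 1.0 / np.sqrt(K.sum(axis=1))
    return K * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]

def empirical_hsic(X, Y, sigma):
    """Tr[K~_X H K_Y H] / (N - 1)^2 with a linear kernel on Y."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K_tilde = normalized_kernel(gaussian_kernel(X, sigma))
    K_Y = Y @ Y.T                                # linear kernel
    return np.trace(K_tilde @ H @ K_Y @ H) / (n - 1) ** 2

# Toy data (assumption): two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels = np.repeat([0, 1], 20)
Y_dependent = np.eye(2)[labels]                  # one-hot cluster indicators
Y_shuffled = Y_dependent[rng.permutation(40)]    # same labels, dependence broken

hsic_dep = empirical_hsic(X, Y_dependent, sigma=1.0)
hsic_shuf = empirical_hsic(X, Y_shuffled, sigma=1.0)
print(hsic_dep, hsic_shuf)
```

On this toy data the estimate is larger when the second argument carries the true cluster structure than when the labels are shuffled, which is the behavior a dependence measure should show.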
4 Problem Formulation
We are given a dataset of $N$ samples grouped in $c$ (potentially) non-convex clusters. Our objective is to cluster samples by first embedding them into a space in which the clusters become convex. Given such an embedding, clusters can subsequently be identified via, e.g., $k$-means. We would like the embedding, modeled as a neural network, to be at least as expressive as spectral clustering: clusters separable by spectral clustering should also become separable via our embedding. In addition, the embedding should generalize to out-of-sample data, thereby enabling us to cluster new samples outside the original dataset.
Learning the Embedding and a Deep Kernel. Formally, we wish to identify $c$ clusters over a dataset $X \in \mathbb{R}^{N \times d}$ of $N$ samples and $d$ features. Let $\psi_\theta : \mathbb{R}^d \to \mathbb{R}^{d'}$ be an embedding of a sample to $\mathbb{R}^{d'}$, modeled as a DNN parametrized by $\theta$; we denote by $\psi_\theta(x)$ the image of $x \in \mathbb{R}^d$ under parameters $\theta$. With a slight abuse of notation, we also denote by $\psi_\theta$ the embedding of the entire dataset induced by applying it to each row, and use $\psi_\theta(X) \in \mathbb{R}^{N \times d'}$ for the image of $X$.
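Let $U_0 \in \mathbb{R}^{N \times c}$ be the spectral embedding of $X$, obtained via spectral clustering. We can train $\psi_\theta$ to induce similar behavior as $U_0$ via the following optimization:
$$\max_{\theta} \; \mathbb{H}(\psi_\theta(X), U_0), \tag{4.7}$$
where $\mathbb{H}$ is given by Eq. (3.4). Since HSIC is a dependence measure, by training $\theta$ so that $\psi_\theta(X)$ is maximally dependent on $U_0$, $\psi_\theta$ becomes a surrogate to the spectral embedding, sharing similar properties.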
However, the surrogate $\psi_\theta$ learned via (4.7) is restricted by $U_0$, hence it can only be as discriminative as $U_0$. To address this issue, we depart from (4.7) by jointly discovering both $\psi_\theta$ as well as a coupled spectral embedding $U$. In particular, we solve the following optimization problem w.r.t. both the embedding $\psi_\theta$ and $U$:
$$\max_{\theta, \theta', U} \; \mathbb{H}(\psi_\theta(X), U) - \lambda \|X - f_{\theta,\theta'}(X)\|_F^2 \tag{4.8a}$$
$$\text{subject to} \quad U^\top U = I, \tag{4.8b}$$
where
$$f_{\theta,\theta'}(X) = g_{\theta'}(\psi_\theta(X))$$
is an autoencoder, comprising $\psi_\theta$ and $g_{\theta'}$ as an encoder and decoder, respectively. This autoencoder objective is theoretically necessary to ensure that the embedding is injective, as stated in Theorem 1 by Li et al. in [30]. However, as observed in [30], our empirical experiments also suggest that this autoencoder objective is not necessary in practice: the dependence of the local minimum found by gradient descent on the starting point ensures that the trained embedding is representative of the input even in the absence of this penalty (see Sec. 6.1).
To gain some intuition into how problem (4.8) generalizes (4.7), observe that if the embedding $\psi_\theta$ is fixed to be the identity map (i.e., $\psi_\theta(x) = x$ for all $x \in \mathbb{R}^d$) then, by Eq. (3.5), optimizing only for $U$ produces the spectral embedding $U_0$. The joint optimization of both $\psi_\theta$ and $U$ allows us to further improve upon $U_0$, as well as on the coupled $\psi_\theta$; we demonstrate experimentally in Section 6 that this significantly improves clustering quality.
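Kernel Learning. The optimization (4.8) can also be interpreted as an instance of kernel learning. Indeed, as discussed in the introduction, by learning $\psi_\theta$, we discover in effect a normalized kernel of the form
$$\tilde{k}_\theta(x_i, x_j) = \frac{1}{\sqrt{d_i d_j}} \exp\left(-\frac{\|\psi_\theta(x_i) - \psi_\theta(x_j)\|_2^2}{2\sigma^2}\right), \tag{4.10}$$
where $d_i$, $d_j$ are the corresponding diagonal elements of the degree matrix $D$.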
Out-of-Sample Data. The embedding $\psi_\theta$ can readily be applied to clustering out-of-sample data. In particular, having trained $\psi_\theta$ over dataset $X$, given a new dataset $X'$, we can cluster this new dataset efficiently as follows. First, we use the pre-trained $\psi_\theta$ to map every sample in $X'$ to its image, producing $\psi_\theta(X')$: this effectively embeds $X'$ in the same space as $\psi_\theta(X)$. From this point on, clusters can be recomputed efficiently via, e.g., $k$-means, or by mapping the images to the closest existing cluster head. In contrast to, e.g., spectral clustering, this avoids recomputing the joint embedding of the entire dataset from scratch.
The ability to handle out-of-sample data can also be leveraged to accelerate training. In particular, given the original dataset $X$, computation can be sped up by training the embedding $\psi_\theta$ by solving (4.8) on a small subset of $X$. The resulting trained $\psi_\theta$ can be used to embed, and subsequently cluster, the entire dataset. We show in Section 6 that this approach works very well, leading to a significant acceleration in computations without degrading clustering quality.
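The "closest existing cluster head" option for out-of-sample data can be sketched as follows. This is a minimal illustration under stated assumptions: `embed` is a hypothetical stand-in for a trained embedding (a fixed `tanh` map here, not a trained network), and the toy data and the names `assign_to_nearest_center`, `X_train`, and `X_new` are made up for the example.

```python
import numpy as np

def assign_to_nearest_center(embedded, centers):
    """Label each embedded sample with the index of its closest cluster center."""
    dists = np.linalg.norm(embedded[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def embed(X):
    # Hypothetical stand-in for a trained embedding network (assumption).
    return np.tanh(X)

# Toy "training" set (assumption): two blobs with known cluster labels.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2.0, 0.3, (30, 2)), rng.normal(2.0, 0.3, (30, 2))])
train_labels = np.repeat([0, 1], 30)

# Cluster centers are computed once, in the embedded space.
Z_train = embed(X_train)
centers = np.vstack([Z_train[train_labels == k].mean(axis=0) for k in range(2)])

# New samples are embedded with the same map and assigned without retraining.
X_new = np.array([[-1.8, -2.1], [2.2, 1.9]])
new_labels = assign_to_nearest_center(embed(X_new), centers)
print(new_labels)
```

The point of the sketch is the cost profile: new samples need one forward pass through the embedding and a distance computation against $c$ centers, rather than a fresh eigendecomposition over the combined dataset.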
Convex Cluster Images. The first term in objective (4.8a) naturally encourages $\psi_\theta$ to form convex clusters. To see this, as derived in Appendix A, ignoring the reconstruction error, the objective (4.8a) becomes:
$$\max_{\theta, U} \; \frac{1}{(N-1)^2} \sum_{i,j} \Gamma_{i,j} \exp\left(-\frac{\|\psi_\theta(x_i) - \psi_\theta(x_j)\|_2^2}{2\sigma^2}\right), \tag{4.11}$$
where $\Gamma_{i,j}$ are the elements of matrix $\Gamma = D^{-1/2} H U U^\top H D^{-1/2}$. The exponential terms in Eq. (4.11) compel samples for which $\Gamma_{i,j} > 0$ to become attracted to each other, while samples for which $\Gamma_{i,j} < 0$ drift farther apart. This is illustrated in Figure 2. Linearly separable, increasingly convex cluster images arise over several iterations of our algorithm for solving (4.8). The algorithm, KNet, is described in the next section.
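The rewriting of the HSIC term as a weighted sum of kernel entries can be checked numerically. The sketch below assumes the weights are $\Gamma = D^{-1/2} H U U^\top H D^{-1/2}$ (with $D$ the degree matrix of the Gaussian kernel on the embedded samples and $H$ the centering matrix) and drops the constant factor $1/(N-1)^2$ on both sides; the random stand-in `Z` for the embedded samples and the random feasible `U` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, sigma = 25, 2, 1.0
Z = rng.normal(size=(n, 2))                     # stand-in for the embedded samples
U, _ = np.linalg.qr(rng.normal(size=(n, c)))    # any U with U^T U = I

# Gaussian kernel on the embedded samples and its degree normalization.
sq = np.sum(Z ** 2, axis=1)
K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0) / (2.0 * sigma ** 2))
D_inv_sqrt = np.diag(1.0 / np.sqrt(K.sum(axis=1)))
H = np.eye(n) - np.ones((n, n)) / n

# Trace form of the (unscaled) HSIC term versus the weighted-sum form.
trace_form = np.trace(D_inv_sqrt @ K @ D_inv_sqrt @ H @ U @ U.T @ H)
Gamma = D_inv_sqrt @ H @ U @ U.T @ H @ D_inv_sqrt
sum_form = np.sum(Gamma * K)

print(trace_form, sum_form)
print("positive weights:", int((Gamma > 0).sum()), "negative weights:", int((Gamma < 0).sum()))
```

Both forms agree up to round-off, and $\Gamma$ contains entries of both signs: pairs with positive weight are pulled together by maximizing the sum, and pairs with negative weight are pushed apart.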
5 KNet Algorithm
We solve optimization problem (4.8) by iteratively adapting $\theta$ and $U$. In particular, we initialize $\psi_\theta$ to be the identity map, and $U$ to be the spectral embedding $U_0$, and then alternate between adapting $\theta$ and $U$. We optimize $\theta$ via stochastic gradient ascent (SGA). To optimize $U$, we adopt two approaches: one based on eigendecomposition, and one based on optimization over the Stiefel manifold. We describe each of these steps in detail below; a summary can be found in Algorithm 1.
**Initialization.** The non-convexity of (4.8) necessitates a principled approach for selecting good initialization points for $\theta$ and $U$. We initialize $U$ to $U_0$, computed via the top-$c$ eigenvectors of the normalized similarity matrix of $X$, given by (3.6). We initialize $\theta$ so that $\psi_\theta$ is the identity map; this is accomplished by pre-training $\theta$, $\theta'$ via SGD as solutions to:
$$\min_{\theta, \theta'} \; \|X - \psi_\theta(X)\|_F^2 + \|X - g_{\theta'}(\psi_\theta(X))\|_F^2. \tag{5.12}$$
Note that, in this construction, we use $d' = d$.
Updating $\theta$. A simple way to update $\theta$ is via gradient ascent, i.e.:
$$\theta_{k+1} = \theta_k + \gamma_k \nabla_\theta F(\theta_k, U),$$
for $k \in \mathbb{N}$, where $F$ is the objective (4.8a) and $\gamma_k$ is the step size. In practice, we wish to apply stochastic gradient ascent over mini-batches; for $U$ fixed, the first term in the objective (4.8a) reduces to (4.11); however, the terms in the sum are coupled via the normalizing degree matrix $D$, which depends on $\theta$ via (3.3). This significantly increases the cost of computing mini-batch gradients. To simplify this computation, we instead hold both $U$ and $D$ fixed, and update $\theta$ via one epoch of SGA over (4.8a). At the conclusion of one epoch, we update the Gaussian kernel and the corresponding degree matrix $D$ via Eq. (3.3). We implemented both this heuristic and regular SGA, and found that the heuristic led to a significant speedup without any observable degradation in clustering performance (see also Section 6).
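**Updating $U$ via Eigendecomposition.** Our first approach to adapting $U$ relies on the fact that, holding $\theta$ constant, problem (4.8) reduces to the form (3.5). That is, at each iteration, for $\theta$ fixed, the optimal solution
$$U^* = \arg\max_{U : U^\top U = I} \mathbb{H}(\psi_\theta(X), U)$$
is given by the top $c$ eigenvectors of matrix
$$\mathcal{L}_\theta = H D^{-1/2} K_{\psi_\theta(X)} D^{-1/2} H,$$
where $K_{\psi_\theta(X)}$ is the Gaussian kernel matrix of the embedded samples and $D$ is its degree matrix. Hence, given $\theta$ at an iteration, we update $U$ by returning $U^*$. Note that when $c \ll N$, there are several algorithms for computing the top eigenvectors efficiently (see, e.g., [15, 17]).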
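The eigendecomposition update can be sketched as below. This is an illustrative sketch, not the authors' code: it builds the centered, degree-normalized kernel matrix from a given kernel matrix and returns its top-$c$ eigenvectors using a dense `numpy.linalg.eigh`, which computes all eigenvectors (specialized solvers that compute only the leading ones would be preferable for large $N$). The stand-in data `Z` and the function name are assumptions.

```python
import numpy as np

def spectral_embedding(K, c):
    """Top-c eigenvectors of H D^{-1/2} K D^{-1/2} H for a symmetric kernel matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    inv_sqrt_d = 1.0 / np.sqrt(K.sum(axis=1))
    L = H @ (K * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]) @ H
    L = (L + L.T) / 2.0                      # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, -c:], L

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 3))                 # stand-in for the embedded samples
sq = np.sum(Z ** 2, axis=1)
K = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0) / 2.0)  # sigma = 1

U_star, L = spectral_embedding(K, c=2)
U_rand, _ = np.linalg.qr(rng.normal(size=(30, 2)))   # another feasible point

obj_star = np.trace(U_star.T @ L @ U_star)
obj_rand = np.trace(U_rand.T @ L @ U_rand)
print(obj_star, obj_rand)
```

The returned matrix has orthonormal columns, and its trace objective is at least as large as that of any other feasible point, such as the random orthonormal matrix used for comparison.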
**Updating $U$ via Stiefel Manifold Ascent.** The manifold in $\mathbb{R}^{N \times c}$ defined by constraint (4.8b), a.k.a. the Stiefel manifold, is not convex; nevertheless, techniques for optimization over this set, such as those outlined in [18, 19], are available in the literature. These techniques exploit the fact that descent directions that maintain feasibility can be computed efficiently. In particular, following [19], treating $\theta$ as a constant, and given a feasible $U$ and the gradient $G$ of the objective w.r.t. $U$, define
$$A = G U^\top - U G^\top.$$
Using $A$ and a predefined step length $\tau$, the maximization proceeds iteratively via:
$$U_{k+1} = Q(\tau) \, U_k, \tag{5.16}$$
where $Q(\tau)$ is the so-called Cayley transform, defined as
$$Q(\tau) = \left(I - \frac{\tau}{2} A\right)^{-1} \left(I + \frac{\tau}{2} A\right). \tag{5.17}$$
The Cayley transform satisfies several important properties [19]. First, starting from a feasible point, it maintains feasibility over the Stiefel manifold (4.8b) for all $\tau$. Second, for small enough $\tau$, it is guaranteed to follow an ascent direction; combined with line-search, convergence is guaranteed to a stationary point. Finally, the update given by (5.17) can be computed efficiently, avoiding a full matrix inversion, by using the Sherman-Morrison-Woodbury identity [20]: this reduces the cost of (5.16), which becomes significantly faster than eigendecomposition when $c \ll N$. In our second approach to updating $U$, we apply (5.16) rather than eigendecomposition of the normalized similarity matrix when adapting $U$ iteratively. Both approaches are summarized in line 7 of Alg. 1; we refer to them as KNetEIG and KNetSMA, respectively, in our experiments in Sec. 6.
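A Cayley-transform ascent step can be sketched as follows. Several things here are assumptions made for illustration: the sign convention $A = G U^\top - U G^\top$ with $Q(\tau) = (I - \frac{\tau}{2} A)^{-1}(I + \frac{\tau}{2} A)$ (chosen so that small steps increase the objective), the fixed step length, the random symmetric stand-in `L`, and the use of a dense $N \times N$ solve instead of the Sherman-Morrison-Woodbury shortcut mentioned in the text.

```python
import numpy as np

def cayley_ascent_step(U, G, tau):
    """One step U <- Q(tau) U, Q = (I - tau/2 A)^{-1} (I + tau/2 A), A = G U^T - U G^T."""
    n = U.shape[0]
    A = G @ U.T - U @ G.T                    # skew-symmetric, so Q(tau) is orthogonal
    I = np.eye(n)
    return np.linalg.solve(I - 0.5 * tau * A, (I + 0.5 * tau * A) @ U)

rng = np.random.default_rng(0)
n, c = 40, 3
B = rng.normal(size=(n, n))
L = B @ B.T / n                              # symmetric stand-in for the similarity matrix
U, _ = np.linalg.qr(rng.normal(size=(n, c))) # feasible starting point: U^T U = I

def objective(V):
    return np.trace(V.T @ L @ V)

before = objective(U)
for _ in range(50):
    G = 2.0 * L @ U                          # gradient of Tr(U^T L U) w.r.t. U
    U = cayley_ascent_step(U, G, tau=0.05)
after = objective(U)

print(before, after)
print("feasibility error:", np.linalg.norm(U.T @ U - np.eye(c)))
```

Because $A$ is skew-symmetric, each step is an orthogonal transformation of $U$, so the iterate stays on the Stiefel manifold without any re-orthogonalization; a line search on the step length, as the text notes, is what gives a convergence guarantee.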
6 Experimental Evaluation
Datasets. The datasets we use are summarized in Table 1. The first three datasets (Moon, Spiral1, Spiral2) are synthetic and comprise non-convex clusters; they are shown in Figure 3. Among the remaining four real-life datasets, the features of Breast Cancer [21, 22] are discrete integer values between 0 and 10. The features of the Wine dataset [23] consist of a mix of real and integer values. The Reuters dataset (RCV) is a collection of news articles labeled by topic. We represent each article via a tf-idf vector using the 500 most frequent words and apply PCA to further reduce the dimension to $d = 5$. The Face dataset [24] consists of grey-scale images of 20 faces in different orientations. We reduce the dimension to $d = 27$ via PCA. As a final preprocessing step, we center and scale all datasets so that the mean of each feature is 0 and its standard deviation is 1.
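The final preprocessing step (zero mean and unit standard deviation per feature) can be sketched as below. The function name, the toy matrix, the use of the population standard deviation, and the handling of constant features are assumptions for this sketch; the text does not specify them.

```python
import numpy as np

def standardize(X):
    """Center each feature to mean 0 and scale it to standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population standard deviation (assumption)
    std[std == 0.0] = 1.0        # leave constant features unscaled (assumption)
    return (X - mean) / std

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```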
**Clustering Algorithms.** We evaluate 8 algorithms, including our two versions of KNet described in Alg. 1. For existing algorithms, we use architecture designs (e.g., depth, width) as recommended by the respective authors during training. We provide details for each algorithm below.
**$k$-means:** We use the CUDA implementation (https://github.com/src-d/kmcuda) by [25].
**SC:** We use the Python scikit-learn implementation of the classic spectral clustering algorithm by [26].
**AEC:** Proposed by [2], this algorithm incorporates a $k$-means objective in an autoencoder. As suggested by the authors, we use 3 hidden layers of width 1000, 250, and 50, respectively, with an output layer dimension of 10 (https://github.com/developfeng/DeepClustering).
**DEC:** Proposed by [5], this algorithm couples an autoencoder with a soft cluster assignment via a KL-divergence penalty. As recommended, we use 3 hidden layers of width 500, 500, and 2000 with an output layer dimension of 10 (https://github.com/XifengGuo/DEC-keras).
**IMSAT:** Proposed by [7], this algorithm trains a network adversarially by generating augmented datasets. It uses 2 hidden layers of width 1200 each; the output layer size equals the number of clusters $c$ (https://github.com/weihua916/imsat).
**SN:** Proposed by [8], SN uses an objective motivated by spectral clustering to map $X$ to a target similarity matrix (https://github.com/KlugerLab/SpectralNet).
**KNetEIG and KNetSMA:** These are the two versions of KNet, as described in Alg. 1, in which $U$ is updated via eigendecomposition and Stiefel Manifold Ascent, respectively. For both versions, the encoder and decoder have 3 layers. For Cancer, Wine, RCV, and Face dataset, we set the width of all hidden layers to . For Moon and Spiral1, we set the width of all hidden layers to 20. We set the Gaussian kernel width $\sigma$ to be the median of the pairwise Euclidean distances between samples in each dataset (see Table 1). Our code is available at https://github.com/neu-spiral/kernel_net.
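The median rule for the Gaussian kernel width can be sketched as follows. The function name and the toy points are assumptions, and so is the convention of counting each unordered pair of distinct samples exactly once; the text does not say whether self-distances are included.

```python
import numpy as np

def median_pairwise_distance(X):
    """Median Euclidean distance over all unordered pairs of distinct samples."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))
    upper = np.triu_indices(X.shape[0], k=1)   # each pair once, no self-distances
    return np.median(dists[upper])

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
sigma = median_pairwise_distance(X)            # pairwise distances are 5, 10, 5
print(sigma)
```

This builds the full $N \times N \times d$ difference array, which is fine for small datasets; for the larger ones in Table 1 a subsample of pairs would be a natural substitute.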
Evaluation Metrics. We evaluate the clustering quality of each algorithm by comparing the clustering assignment generated to the ground truth assignment via the Normalized Mutual Information (NMI). NMI is a similarity metric lying in $[0, 1]$, with 0 denoting no similarity and 1 an identical match between the assignments. Originally recommended by [27], this statistic has been widely used for clustering quality validation [13, 28, 14, 29]. We provide a formal definition in Appendix A in the supplement.
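A small pure-Python sketch of NMI is given below. The normalization used here, mutual information divided by the geometric mean of the two label entropies, is an assumption: it is one common convention, and the formal definition referred to in the text is not reproduced in this section. The function name and the toy label lists are also made up for the example.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Mutual information normalized by sqrt(H(a) * H(b)); lies in [0, 1]."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    entropy_a = -sum((m / n) * math.log(m / n) for m in count_a.values())
    entropy_b = -sum((m / n) * math.log(m / n) for m in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        # Degenerate case: at least one assignment has a single cluster.
        return 1.0 if entropy_a == entropy_b else 0.0
    return mutual_info / math.sqrt(entropy_a * entropy_b)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, independent)
```

The first call illustrates that NMI is invariant to renaming cluster labels, which is why it is suitable for comparing an unsupervised assignment against ground truth; the second shows two assignments that carry no information about each other.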
For each algorithm, we also measure the execution time, separating it into preprocessing time (Prep) and runtime (RT); in doing so, we separately evaluate the cost of, e.g., parameter initialization from training.
**Experimental Setup.** We execute all algorithms on a machine with 16 dual-core CPUs (Intel Xeon E5-2630 v3 @ 2.40GHz), 32 GB of RAM, and an NVIDIA 1b80 GPU. For methods that we can parallelize either over the GPU or over the 16 CPUs (IMSAT, SN, $k$-means, KNet), we ran both executions and recorded the fastest time. The code provided for DEC could only be parallelized over the GPU, while AEC and SC could only be executed over the CPUs. As SGA is randomized, for each dataset in Table 2 we run all algorithms on the full dataset 10 times and report the mean and standard deviation of the NMI of the resulting clustering against the ground truth.
For algorithms that can be executed out-of-sample (AEC, DEC, IMSAT, SN, KNet), we repeat the above experiment by training the embedding only on a random subset of the dataset. Subsequently, we apply the trained embedding to the entire dataset and cluster it via $k$-means. For comparison, we also find cluster centers via $k$-means on the subset and assign new samples to the nearest cluster center. For each dataset, we set the size of the subset (reported in Table 4) so that the spectrum of the resulting subset is close to the spectrum of $X$.
6.1 Results
Selecting $\lambda$. As clustering is unsupervised, we cannot rely on ground truth labels to identify the best hyperparameter $\lambda$. We, therefore, need an unsupervised method for selecting this value. We find that, in practice, just like [30], selecting $\lambda = 0$ works quite well. Because the problem is not convex, local optima reached by KNet depend highly on the initialization. Initializing (a) $\psi_\theta$ to approximate the identity map via (5.12), and (b) $U$ to be the spectral embedding $U_0$ indeed leads to a local maximum that is highly dependent on the input $X$, eschewing the need for the reconstruction error in the objective (4.8).
Our choice of $\lambda$ is further grounded in the experiments shown in Table 6, which reports results for the Wine dataset. We found that smaller values of $\lambda$ result in improved performance both in terms of HSIC and NMI; tables for additional datasets can be found in Appendix 7 in the supplement. Along with the HSIC, we also provide the AE reconstruction error at convergence. Beyond the good performance of $\lambda = 0$, the table suggests that an alternative unsupervised method is to select $\lambda$ so that the ratio of the two terms at convergence is close to one. Of course, this comes at the cost of parameter exploration; in the remainder of this section, we report the performance of KNet with $\lambda = 0$.
**Comparison Against State-of-the-Art.** Table 2 shows the NMI performance of different algorithms over different datasets. With the exception of the RCV dataset, KNet matches or outperforms every algorithm in Table 2. AEC, DEC, and IMSAT perform especially poorly when encountering non-convex clusters, as shown in the first two rows of the table. Spectral clustering (SC), and SN, which is also based on a spectral-clustering-motivated objective, perform as well as KNet on discovering non-convex clusters. Nevertheless, KNet outperforms them on real datasets: e.g., for the Face dataset, KNetEIG surpasses SN by 28%. Note that, for the RCV dataset, $k$-means outperformed all methods, though overall performance is quite poor; a reason for this may be the poor quality of the features extracted via tf-idf and PCA.
KNet’s ability to handle non-convex clusters is evident in its improvement over $k$-means on the first two datasets. The kernel matrix shown in Fig. 1(b) illustrates why $k$-means performs poorly on this dataset. In contrast, the increasingly convex cluster images learned by KNet, as shown in Fig. 1(c), lead to much better separability. This is consistently observed for both the Moon and Spiral1 datasets, for which KNet achieves 100% NMI; we elaborate on this further in Appendix A. These results demonstrate KNet’s ability to generate convex representations even when the initial representation is non-convex.
We also note that KNet consistently outperforms spectral clustering. This is worth noting because, as discussed in Sec. 4, KNet’s initialization of both $\psi_\theta$ and $U$ is tied to the spectral embedding. Table 2 indicates that alternately learning both the kernel and the corresponding spectral embedding indeed leads to improved clustering performance.
Table 3 shows the time performance of each algorithm. In terms of total time, KNet is faster than AEC and DEC. We also observe that SN is faster than most algorithms in terms of run time. However, SN does require extensive hyperparameter tuning to reach the reported NMI performance (see App. 8). We note that a significant percentage of the total time for KNet is spent in the preprocessing step, with KNetSMA being faster than KNetEIG. This is due to the initialization of , i.e., training the corresponding autoencoder. Improving this initialization process could dramatically speed up the total runtime. Alternatively, as we discuss in the next section, using only a small subset to train the embedding and clustering out-of-sample can also significantly accelerate the total runtime, without a considerable NMI degradation.
**Out-of-Sample Clustering.** We report out-of-sample clustering NMI performance in Table 4; note that SC cannot be executed out-of-sample. Each algorithm is trained using only a subset of samples, whose size is indicated in the table's first column. Once trained, each algorithm is applied to the full set without retraining, and the table reports the resulting clustering quality. We observe that, with the exception of RCV, KNet clearly outperforms all benchmark algorithms in terms of clustering quality. This implies that KNet is capable of generalizing its results using as little as 6% of the data.
By comparing Table 2 against Table 4, we see that AEC, DEC, and IMSAT suffer a significant drop in performance, while KNet suffers a degradation of at most 3%. Therefore, training on a small subset of the data not only yields high-quality results, but these results are also almost as good as those obtained by training on the full set. Table 5, which reports the corresponding times, indicates that this can also lead to a significant acceleration, especially of the preprocessing step. Together, these two observations indicate that KNet can indeed be applied to clustering large non-convex datasets by training the embedding on only a small subset of the provided samples.
7 Conclusions
KNet performs unsupervised kernel discovery using only a subset of the data. By discovering a kernel that optimizes the Spectral Clustering objective, it simultaneously discovers an approximation of its embedding through a DNN. Furthermore, experimental results have confirmed that KNet can be trained using only a subset.
Appendix A Relating HSIC to Spectral Clustering
- Proof.
Using Eq. (3.4) to compute HSIC empirically, Eq. (3.5) can be rewritten as
[TABLE]
where the two kernel matrices are computed from the data and from the clustering embedding, respectively. As shown by [13], if we let the embedding kernel be linear, add an orthogonality constraint on the embedding, and rotate the trace terms, we get
[TABLE]
By setting the Laplacian accordingly, the formulation becomes identical to Spectral Clustering:
[TABLE]
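A sketch of the standard argument behind these steps, under assumed notation: $K_X$ and $K_U$ denote the two kernel matrices, $U$ the embedding, $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ the centering matrix of the empirical HSIC estimator of [1], and $\mathcal{L}$ the Laplacian. These symbols and the exact form of $\mathcal{L}$ are assumptions made for illustration, and positive scaling constants are dropped.

```latex
\begin{align*}
\max_{U:\,U^\top U = I}\ \operatorname{Tr}(K_X H K_U H)
  &= \max_{U:\,U^\top U = I}\ \operatorname{Tr}(K_X H U U^\top H)
     && \text{(linear kernel, } K_U = U U^\top\text{)} \\
  &= \max_{U:\,U^\top U = I}\ \operatorname{Tr}(U^\top H K_X H\, U)
     && \text{(cyclic property of the trace)} \\
  &= \max_{U:\,U^\top U = I}\ \operatorname{Tr}(U^\top \mathcal{L}\, U),
     && \mathcal{L} := H K_X H .
\end{align*}
```

Under these assumptions, the last line is a trace maximization over orthonormal matrices, whose solution is given by the leading eigenvectors of $\mathcal{L}$, which is the form of the spectral clustering problem.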
Appendix A Effect of on HSIC, AE reconstruction error, and NMI
Appendix A Derivation for Eq. (4.11)
By ignoring the reconstruction error, the loss from Eq. (4.8) becomes
[TABLE]
From [1], this expression can be expanded into
[TABLE]
Since the factor that does not depend on the kernel being learned is a constant matrix, we denote it by a single symbol, and the objective becomes
[TABLE]
The trace term can be converted into a matrix sum as
[TABLE]
By replacing the kernel with the Gaussian kernel, Eq. (4.11) emerges as
[TABLE]
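The trace-to-sum step can be checked numerically. The following is a minimal sketch in plain Python, assuming that the constant matrix is the centered kernel of a fixed embedding (called `gamma` here) and that the learned kernel is Gaussian with bandwidth `sigma`. All names and the random stand-in data are hypothetical and are not taken from the paper.

```python
import math
import random


def gaussian_kernel(points, sigma):
    """K[i][j] = exp(-||p_i - p_j||^2 / (2 * sigma^2))."""
    n = len(points)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def centered(k):
    """Return H K H with H = I - (1/n) * ones * ones^T."""
    n = len(k)
    h = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    return matmul(matmul(h, k), h)


random.seed(0)
n = 6
# Hypothetical stand-ins: embedded samples and a fixed one-column embedding.
embedded = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
u = [[random.gauss(0, 1)] for _ in range(n)]

# Linear kernel of the fixed embedding, then centering (assumed form of gamma).
k_u = [[sum(a * b for a, b in zip(u[i], u[j])) for j in range(n)] for i in range(n)]
gamma = centered(k_u)
k_x = gaussian_kernel(embedded, sigma=1.0)

# Trace form versus element-wise sum; they agree because gamma is symmetric.
product = matmul(k_x, gamma)
trace_form = sum(product[i][i] for i in range(n))
sum_form = sum(gamma[i][j] * k_x[i][j] for i in range(n) for j in range(n))
```

The element-wise form is the convenient one for gradient-based training, since each term depends on the network only through one pairwise distance.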
Appendix A Normalized Mutual Information
Consider two clustering assignments that assign labels from a common label set to the samples in a dataset. We represent these two assignments through two partitions of the dataset, where
[TABLE]
are the sets of samples receiving each label under each assignment. Define the empirical distribution of labels to be:
[TABLE]
The NMI is then given by the ratio
[TABLE]
where the numerator is the mutual information and the denominator is formed from the entropies of the marginals of the joint distribution (1.26).
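The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments. It assumes that the mutual information is normalized by the geometric mean of the two marginal entropies, which is one common convention. The function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """NMI of two label assignments over the same samples.

    Assumption: normalization by sqrt(H(A) * H(B)).
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information of the empirical joint distribution of label pairs.
    mutual_info = sum(
        (c / n) * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )

    # Entropies of the two marginal label distributions.
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    denominator = math.sqrt(entropy_a * entropy_b)
    # If either assignment uses a single label, its entropy is zero and the
    # ratio is undefined; this sketch returns 0.0 in that case.
    return 0.0 if denominator == 0.0 else mutual_info / denominator


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on the partitions, relabeling the clusters leaves the score unchanged, which is why it suits clustering evaluation where label names are arbitrary.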
Appendix A Moon and Spiral dataset
In this appendix, we illustrate that for the Moon and the Spiral datasets, we are able to learn (a) convex images for the clusters and (b) kernels that produce block diagonal structures. The kernel matrix is constructed with a Gaussian kernel; its values therefore lie between 0 and 1. The kernel matrices shown in the figures below use white as 0 and dark blue as 1; all values in between are shown as a gradient between the two colors.
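To make the block diagonal structure concrete, here is a minimal sketch in plain Python. It uses two hypothetical, well-separated toy clusters and an arbitrary bandwidth `sigma`, neither of which comes from the paper, and builds the Gaussian kernel matrix with samples ordered by cluster.

```python
import math


def gaussian_kernel_matrix(points, sigma):
    """K[i][j] = exp(-||p_i - p_j||^2 / (2 * sigma^2)), so 0 < K[i][j] <= 1."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))
                      / (2.0 * sigma ** 2))
             for q in points] for p in points]


# Hypothetical toy data: two compact clusters, listed one after the other.
cluster_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
cluster_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
points = cluster_a + cluster_b
size_a = len(cluster_a)

kernel = gaussian_kernel_matrix(points, sigma=0.5)
n = len(points)

# Entries whose row and column fall in the same cluster form the diagonal blocks.
within = [kernel[i][j] for i in range(n) for j in range(n)
          if (i < size_a) == (j < size_a)]
# Entries linking the two clusters form the off-diagonal blocks.
between = [kernel[i][j] for i in range(n) for j in range(n)
           if (i < size_a) != (j < size_a)]

for row in kernel:
    print(" ".join(f"{value:.2f}" for value in row))
```

In this toy case the diagonal blocks are close to 1 and the off-diagonal blocks are close to 0, which corresponds to the dark blue and white regions described for the figures.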
In Figure 4, the Moon dataset is plotted in Fig. 4(a) and its kernel block structure in Fig. 4(b). After training the embedding network, the image of the data under the network is shown in Fig. 4(c) along with its block diagonal structure in Fig. 4(d). Using the same trained network, we distort the data with Gaussian noise and plot it in Fig. 5(a) along with its kernel matrix in Fig. 5(b). We then pass the distorted data into the network and plot the resulting image in Fig. 5(c) along with its kernel matrix in Fig. 5(d). This example demonstrates KNet's ability to embed data into convex clusters even under Gaussian noise.
In Figure 6, a subset of 300 samples of the Spiral dataset is plotted in Fig. 6(a) and its kernel block structure in Fig. 6(b). After training the embedding network, the image of the subset under the network is shown in Fig. 6(c) along with its block diagonal structure in Fig. 6(d). The full dataset is shown in Fig. 7(a) along with its kernel matrix in Fig. 7(b). Using the same network trained on the subset in Fig. 6(a), we pass the full dataset into the network and plot the resulting image in Fig. 7(c) along with its kernel matrix in Fig. 7(d). This example demonstrates KNet's ability to generalize a convex embedding using only 1% of the data.
Appendix A Algorithm Hyperparameter Details
The hyperparameters for each network are set as outlined in the respective papers. In the case of SN, the hyperparameters included the number of neighbors for calculation, the number of neighbors to use for the graph Laplacian affinity matrix, the number of neighbors to use to calculate the scale of the Gaussian graph Laplacian, and the threshold for calculating the closest neighbors in the Siamese network. They were set by a grid search over values ranging from 2 to 10, using the loss over a portion of the training data.
- [1] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, International Conference on Algorithmic Learning Theory (2005), pp. 63–77.
- [2] C. Song, F. Liu, Y. Huang, L. Wang, and T. Tan, Auto-encoder based data clustering, Iberoamerican Congress on Pattern Recognition (2013), pp. 117–124.
- [3] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, Deep Subspace clustering Network, Advances in Neural Information Processing Systems (2017).
- [4] F. Tian, B. Gao, Q. Cui, E. Chen, and T. Liu, Learning deep representations for graph clustering, AAAI (2014), pp. 1293–1299.
- [5] J. Xie, R. Girshick, and A. Farhadi, Unsupervised deep embedding for clustering analysis, International Conference on Machine Learning (2016), pp. 478–487.
- [6] X. Guo, X. Liu, E. Zhu, and J. Yin, Deep Clustering with Convolutional Autoencoders, International Conference on Neural Information Processing (2017), pp. 373–382.
- [7] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, Learning Discrete Representations via Information Maximizing Self Augmented Training, arXiv preprint arXiv:1702.08720 (2017).
- [8] U. Shaham, K. Stanton, H. Li, R. Basri, B. Nadler, and Y. Kluger, SpectralNet: Spectral Clustering using Deep Neural Networks, International Conference on Learning Representations (2018).
