Validity of Clusters Produced By kernel-$k$-means With Kernel-Trick
Mieczysław A. Kłopotek

TL;DR
This paper revises foundational theorems related to kernel-$k$-means clustering, ensuring the mathematical validity of the kernel trick by correcting previous proofs about kernel functions and their Euclidean embeddings.
Contribution
It provides corrected proofs for key theorems in Gower's work, clarifying the conditions under which kernel functions are valid for clustering.
Findings
Corrected proof of the existence of kernel functions from distance matrices.
Clarified conditions for kernel matrices to be embeddable in Euclidean space.
Ensured the mathematical soundness of kernel-$k$-means clustering methods.
Institute of Computer Science of the Polish Academy of Sciences
ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Abstract
This paper, an extension of the conference paper [8], corrects the proof of Theorem 2 from Gower's paper [4, page 5] and corrects Theorem 7 from Gower's paper [6]. The first correction is needed to establish, on the grounds of a distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm. The correction supplies the missing if-part of the proof and drops unnecessary conditions.
The second correction deals with the transformation of a kernel matrix into one embeddable in Euclidean space.
1 The Problem
A number of approaches to solving various data mining problems, including clustering, are based on the so-called kernel approach. The kernel approach may be seen as applying a mapping $\Phi$ to the data points so that they are represented in a high-dimensional Euclidean space (called the feature space), in which it is hoped that the data points can be separated more easily by simple geometrical constructs (e.g. hyperplanes) than in their original low-dimensional representation space. In this way, a number of data mining methods that require linear separability can be applied to data sets that are not linearly separable.
The kernel approach is most frequently applied in conjunction with Support Vector Machine based analysis methods, but it is also used with the $k$-means clustering algorithm (for an overview of the kernel $k$-means algorithm see e.g. [3]), which is the subject of this paper. We introduce this algorithm in Section 3.
The kernel-based approaches assume the availability of a similarity function $K(\cdot,\cdot)$, and in particular of the similarity matrix $K$, also called a kernel function and a kernel matrix respectively, which express similarities between the data points at hand. The similarity function must have the property that, for any two data points $\mathbf{x}, \mathbf{y}$ in the original space, $K(\mathbf{x},\mathbf{y}) = \Phi(\mathbf{x})^T \Phi(\mathbf{y})$ (the dot product of the two image vectors), and for any two data points $\mathbf{x}_i, \mathbf{x}_j$ in the data set under consideration the similarity matrix must be available such that $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$.
For a number of algorithms, including $k$-means, the so-called kernel trick has been elaborated. The essence of the kernel trick is that the kernel algorithm can be run on the kernel matrix $K$ alone, without explicit knowledge of the mapping $\Phi$. Section 3 explains the use of the kernel trick for the $k$-means algorithm.
Nonetheless, the very existence of the mapping $\Phi$, and hence of the kernel function, is of vital importance for the validity of applying the $k$-means algorithm in the feature space. $\Phi$ transforms the data to points in a Euclidean space, so that $k$-means can be applied at all. Inversion of $\Phi$ would provide the cluster centers produced by kernel-$k$-means. Furthermore, $k$-means uses distances rather than similarities. We can easily imagine that no kernel function exists for a given similarity matrix $K$. We may also face the situation that multiple kernel functions, and multiple mappings $\Phi$, are related to the same kernel matrix $K$. Does this mean that there exist multiple feature spaces in which the very same data set can be clustered differently by kernel-$k$-means, depending on the function $\Phi$ we choose? A closely related issue is the following: for algorithms like $k$-means, instead of the kernel matrix $K$, the matrix $D$ of distances between the objects in the feature space may be available, being a Euclidean distance matrix. We will call such a $D$ a Euclidean matrix.
We are faced with the following questions:
- (1) What properties should the kernel matrix $K$ have in order to really be a matrix of dot products?
- (2) What properties should the kernel matrix $K$ have in order to enable recovery of the function $\Phi$ at the data points from the kernel matrix?
- (3) Can we obtain the matrix $K$ from the distance matrix $D$?
- (4) Can we obtain from the matrix $K$ a function $\Phi$ such that the distances in the feature space are exactly those given by the matrix $D$?
- (5) If we derived the matrix $K$ from $D$ and $K$ turns out to yield $\Phi$, do we then know that $D$ really was a Euclidean distance matrix?
Questions (1) and (2) may seem pretty easy and were partially addressed e.g. by Schölkopf [17]. Schölkopf investigates what kinds of kernel functions lead to a distance measure in the feature space. However, he does not consider the converse, that is, a Euclidean distance matrix leading to a kernel function. Nor does he investigate finding an explicit form of the function $\Phi$.
The answer to the third question seems to be easily derivable from the paper by Balaji et al. [1]. One should use the transformation

$$K = -\frac{1}{2}\left(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T\right) D_{sq} \left(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T\right) \tag{1}$$

(where $D_{sq}$ is a matrix containing as entries the squared distances from $D$, $m$ is the number of objects and $\mathbf{1}$ is the all-ones vector), a result presented as going back to a paper by Schoenberg [15]. The problem is that this paper of Schoenberg does not contain any such statement. The result should rather be ascribed to the paper [16]. (Schoenberg [15] proposed yet another distance-to-kernel-matrix transform, parameterised by a positive constant, which we will not discuss here.)
The most general proposal of a distance-to-kernel-matrix transform seems to be that of Gower [4, Theorem 2, page 5], who generalizes the aforementioned transform (1) to

$$K = \left(I - \mathbf{1}\mathbf{s}^T\right)\left(-\frac{1}{2} D_{sq}\right)\left(I - \mathbf{s}\mathbf{1}^T\right) \tag{2}$$

for an appropriate choice of the vector $\mathbf{s}$. A generally accepted proof of the validity of this transformation can be found in the same paper by Gower [4, Theorem 2, page 5]. If this proof were correct, questions (4) and (5) would have been answered. Regrettably, the proof is incomplete, as we explain in Section 4. For this reason, these questions remain open.
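As a minimal sketch of the distance-to-kernel transform discussed above (not code from the paper), the NumPy function below maps a distance matrix to a kernel matrix. The name `gower_kernel`, the toy points `X` and the default uniform choice of the weight vector `s` are assumptions made for illustration; with the uniform `s` the function reduces to the classical double-centering transform.

```python
import numpy as np


def gower_kernel(D, s=None):
    """Return (I - 1 s^T) (-1/2 D_sq) (I - s 1^T) for a distance matrix D."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    if s is None:
        s = np.full(m, 1.0 / m)  # uniform weights: classical double centering
    s = np.asarray(s, dtype=float)
    if not np.isclose(s.sum(), 1.0):
        raise ValueError("the components of s must sum to 1")
    P = np.eye(m) - np.outer(np.ones(m), s)  # I - 1 s^T
    return P @ (-0.5 * D**2) @ P.T           # P.T equals I - s 1^T


# Hypothetical toy data: four points in the plane and their distance matrix.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

K = gower_kernel(D)
X_centered = X - X.mean(axis=0)
# For Euclidean input the result is the Gram matrix of the centered points.
print(np.allclose(K, X_centered @ X_centered.T))
```

Choosing a different `s` (for example a unit vector) moves the origin of the implied coordinate system to another point but leaves the pairwise distances unchanged.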
Therefore, we provide a correction of the proof of Gower's theorem, presented in Section 5. This correction is needed to establish, on the grounds of a distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm.
The question left open by Gower was: do there exist special cases in which two different functions $\Phi$, complying with a given kernel matrix, generate different distance matrices in the feature space, perhaps in some special, "sublimated" cases? This would mean that under some special conditions the output of kernel $k$-means could differ radically, not just because of random causes but in a systematic way. The answer given to this open question in this paper is definitely NO. We closed all the conceivable gaps in this respect. So the use of (linear and non-linear) kernel matrices that are positive semidefinite is safe in this respect.
Let us underline here that we did not impose any a priori restrictions on the form of the function $\Phi$ itself. It may be a linear or non-linear mapping from the sample space to the feature space. What we insist on is that the feature space has to be Euclidean. This is the requirement for the applicability of the (kernel) $k$-means clustering algorithm. If the feature space is not metric, the results of (kernel) $k$-means clustering are questionable.
In Section 6 we provide a numerical example illustrating some of the distance matrix transformations discussed in Section 5.
The second problem with the use of kernel-$k$-means is related to the basic assumption that $k$-means has been designed for Euclidean space. In a number of applications, like clustering based on Laplacians, the embeddability of the kernel matrix can be guaranteed from the theoretical standpoint. However, this need not always be the case. Therefore we need to answer the following questions: (6) what does kernel-$k$-means produce for non-Euclidean kernel matrices, (7) can a non-Euclidean kernel matrix be turned into a Euclidean kernel matrix, and (8) how does the latter matrix transformation impact the results of kernel-$k$-means clustering? These questions could have been easily answered if Theorem 7 of Gower from [6] were correct. Regrettably, this theorem requires a quantitative correction. We handle these issues in Section 7.
In the next section, Section 2, we point at research directions for which the correction proposed here is of importance.
2 The Background
The $k$-means algorithm has the very attractive property of being easy to implement, and there exist variants of it, like $k$-means++, that even possess closeness-to-optimum guarantees. The drawback of this algorithm is that it accepts numeric attributes only and requires an embedding in Euclidean space. Embeddings into other spaces, like hyperbolic space, have been investigated, but the computation of cluster centers, which is vital and very easy in Euclidean space, is not that easy in other spaces.
However, real-world objects are frequently described by non-numeric attributes, or are not embedded in any space whatsoever, and only the similarity, dissimilarity or distance between objects is known. In such cases the kernel-$k$-means clustering algorithm can be used, which at least partially inherits the good properties of $k$-means. Then, however, the very existence of an embedding into Euclidean space (even if it is not used explicitly) is of vital importance, because otherwise the clustering results may be unreliable. The same holds for other kernel algorithms whose original versions rely on a Euclidean space.
Therefore, research like that of [9] is performed in order to find ways of transforming a similarity matrix into the closest proper positive definite kernel matrix, so that an approximating Euclidean embedding exists, or to learn the distances themselves.
These efforts to establish a proper kernel matrix make sense only if Theorem 2 of Gower [4] is valid. However, a study of the literature seems to reveal that nobody except Gower himself was aware of the mentioned flaw in the proof of his theorem, and the result is used as an established truth.
Gower's paper [4] is, according to Google Scholar, cited over 200 times in a number of research and application contexts. For example, Pekalska et al. [12] derive the necessity of creating a generalized kernel for handling dissimilarities on the grounds that the kernel according to equation (2) is positive definite if and only if the underlying distance matrix is Euclidean, which has not been proven by Gower [4]. The same motivation lies behind the work of Nikolentzos et al. [10] on seeking appropriate embeddings. Pavoine et al. [11] rely on the property, suggested by Gower [4], that the decomposition of the kernel can be shifted while performing PCA.
Kernel-trick based $k$-means algorithms are applied in various areas (e.g. gene expression clustering [7], spectral clustering of graphs [3]).
The validity of the Gower transform underpins various improvements of kernel $k$-means clustering, like single pass clustering [14], global kernel $k$-means [18], subsampling kernel $k$-means [2], robust kernel $k$-means [19] and others.
Furthermore, let us stress here that the aforementioned papers do not care at all about whether or not the kernel matrices are embeddable in Euclidean space, which is the basic assumption behind applying the basic form of kernel-$k$-means. Non-Euclidean spaces require a serious modification of $k$-means, accommodating the fact that the gravity center of a cluster can no longer serve as the cluster center (gradient descent methods are needed, for example; see [13, Section 6]).
For these reasons a definite resolution of the Gower theorem dilemma seems to be of utmost importance.
3 Kernel-$k$-means
The well-known $k$-means clustering algorithm is claimed to minimize an objective function equal to the sum of squared distances of the data points to their cluster centers. It consists of the following steps: (1) create an initial clustering, (2) compute the cluster center of each cluster, (3) create a new clustering by assigning each data point to the cluster defined by the closest cluster center, (4) repeat steps (2) and (3) until some terminating condition is met. There exists a large variety of variants of this algorithm. For example, step (1) may consist in a random selection of $k$ distinct data points as cluster centers followed by step (3). Another variant may replace step (2) with a step (2') in which a single data point is moved from one cluster to another if and only if the move decreases the cost function, followed by the proper step (2). Steps (2) and (2') may also be applied alternately in subsequent iterations, and so on.
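The kernel-based $k$-means clustering algorithm (clustering $m$ objects $\mathbf{x}_1,\dots,\mathbf{x}_m$ into $k$ clusters $C_1,\dots,C_k$) consists in switching to a multidimensional feature space and searching therein for prototypes $\boldsymbol{\mu}_j$ minimizing the error

$$\sum_{i=1}^{m} \min_{1\le j\le k} \left\|\Phi(\mathbf{x}_i) - \boldsymbol{\mu}_j\right\|^2 \tag{3}$$

where $\Phi$ is a (usually non-linear) mapping of the space of objects into the feature space. The so-called "kernel trick" means the possibility of applying $k$-means clustering without knowing the function $\Phi$ explicitly, using instead the so-called kernel matrix $K$ with elements $K_{ij} = \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$.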
In analogy to the classical $k$-means algorithm, the prototype vectors are updated according to the equation

$$\boldsymbol{\mu}_j = \frac{1}{n_j} \sum_{i \in C_j} \Phi(\mathbf{x}_i) \tag{4}$$

where $n_j$ is the cardinality of the $j$-th cluster. A direct application of this equation is not possible unless the function $\Phi$ is known. It is still feasible, however, if we know the so-called kernel matrix $K$ with elements being dot products of data points in the feature space, that is $K_{ij} = \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$. Given the matrix $K$, it is possible to compute the distances between the object images and the prototypes in the feature space by making use of the so-called "kernel trick". The kernel trick relies on the fact that the following transformation is possible:

$$\left\|\Phi(\mathbf{x}_i) - \boldsymbol{\mu}_j\right\|^2 = K_{ii} - \frac{2}{n_j}\sum_{h \in C_j} K_{ih} + \frac{1}{n_j^2} \sum_{r \in C_j}\sum_{l \in C_j} K_{rl} \tag{5}$$

where, as already stated, $K_{ij} = \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$.
In this way, one can update the elements of clusters without determining the prototypes explicitly.
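The kernel trick described above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the implementation used in the paper: the function name `kernel_kmeans`, the explicit `init_labels` argument (used instead of a random initial clustering so that the run is reproducible) and the toy data are all hypothetical. The squared distance of each point to each cluster prototype is computed from kernel matrix entries only.

```python
import numpy as np


def kernel_kmeans(K, init_labels, max_iter=100):
    """Cluster using only the kernel matrix K, starting from init_labels."""
    K = np.asarray(K, dtype=float)
    labels = np.asarray(init_labels).copy()
    m = K.shape[0]
    k = labels.max() + 1
    diag = np.diag(K)
    for _ in range(max_iter):
        dist = np.full((m, k), np.inf)  # squared feature-space distances
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # an empty cluster attracts no points
            n_j = members.size
            cross = K[:, members].sum(axis=1) / n_j
            within = K[np.ix_(members, members)].sum() / n_j**2
            dist[:, j] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels


# Hypothetical data: two well separated groups, linear kernel K = X X^T.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
K = X @ X.T
labels = kernel_kmeans(K, init_labels=[0, 1, 0, 1, 0, 1])
print(labels)
```

Note that the prototypes themselves never appear in the code; only the cluster memberships are updated, which is exactly what makes the existence of an underlying Euclidean embedding a separate question.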
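Let $Y$ be the matrix $Y = (\Phi(\mathbf{x}_1), \dots, \Phi(\mathbf{x}_m))^T$. Then apparently $K = YY^T$. Hence for any non-zero vector $\mathbf{u}$ we have $\mathbf{u}^T K \mathbf{u} = \mathbf{u}^T Y Y^T \mathbf{u} = \mathbf{y}^T\mathbf{y} \ge 0$, where $\mathbf{y} = Y^T\mathbf{u}$, so $K$ must be positive semidefinite. A symmetric matrix is positive semidefinite iff all its eigenvalues are non-negative. Furthermore, as $K$ is real and symmetric, its eigenvalues are real and its eigenvectors can be chosen real-valued.
So to identify $\Phi$ at the data points, one has to find all eigenvalues $\lambda_l$, $l = 1,\dots,m$, and corresponding eigenvectors $\mathbf{v}_l$ of the matrix $K$. If all eigenvalues are non-negative, then construct the matrix $Y$ that has as columns the products $\sqrt{\lambda_l}\,\mathbf{v}_l$. The rows of this matrix (up to permutations) are the values of the function $\Phi$ at the data points $\mathbf{x}_i$. More formally, if the matrix $V = (\mathbf{v}_1,\dots,\mathbf{v}_m)$ and $\boldsymbol{\lambda}$ is the vector of eigenvalues, then

$$Y = V\,\mathrm{diag}(\boldsymbol{\lambda})^{1/2} \tag{6}$$

where $\mathrm{diag}$ turns a vector into a diagonal matrix. It may be verified that kernel-$k$-means with the above matrix $K$ and ordinary $k$-means for $Y$ yield the same results.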
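The eigen-decomposition step can be illustrated with the following sketch. The function name `embedding_from_kernel`, the tolerance `tol` and the toy points are assumptions introduced here; the code simply scales each eigenvector by the square root of its eigenvalue and refuses to proceed when a clearly negative eigenvalue is present.

```python
import numpy as np


def embedding_from_kernel(K, tol=1e-10):
    """Return Y with Y @ Y.T == K (up to rounding) for a PSD kernel matrix."""
    K = np.asarray(K, dtype=float)
    lam, V = np.linalg.eigh((K + K.T) / 2.0)  # eigenvalues in ascending order
    scale = max(1.0, np.abs(lam).max())
    if lam.min() < -tol * scale:
        raise ValueError("negative eigenvalue: K is not a matrix of dot products")
    keep = lam > tol * scale                  # drop numerically zero directions
    return V[:, keep] * np.sqrt(lam[keep])    # columns sqrt(lambda_l) * v_l


# Hypothetical check: start from known points and recover an embedding.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
K = X @ X.T
Y = embedding_from_kernel(K)
print(Y.shape, np.allclose(Y @ Y.T, K))
```

The recovered `Y` generally differs from `X` by a rotation or reflection, which does not affect distances and hence does not affect $k$-means.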
4 Gower's formulation of the distance-to-kernel-matrix transformation
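Let us recall that a matrix $D \in \mathbb{R}^{m\times m}$ is a Euclidean distance matrix between points $1,\dots,m$ if and only if there exists a matrix $X \in \mathbb{R}^{m\times n}$, the rows of which ($\mathbf{x}_1^T,\dots,\mathbf{x}_m^T$) are the coordinate vectors of these points in an $n$-dimensional Euclidean space, and

$$d_{ij} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T(\mathbf{x}_i - \mathbf{x}_j)} \tag{7}$$

Gower in [4] claims that
Theorem 1
*$D$ is Euclidean iff the matrix $F = (I - \mathbf{1}\mathbf{s}^T)(-\frac{1}{2} D_{sq})(I - \mathbf{s}\mathbf{1}^T)$ is positive semidefinite for any vector $\mathbf{s}$ such that $\mathbf{s}^T\mathbf{1} = 1$ and $D_{sq}\mathbf{s} \ne \mathbf{0}$.*
whereas in [6] he claims:
Theorem 2
*$D$ is Euclidean iff the matrix $F = (I - \mathbf{1}\mathbf{s}^T)(-\frac{1}{2} D_{sq})(I - \mathbf{s}\mathbf{1}^T)$ is positive semidefinite for any vector $\mathbf{s}$ such that $\mathbf{s}^T\mathbf{1} = 1$.*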
Apparently the two claims do not quite match (with respect to the additional condition on $\mathbf{s}$). It must be underlined, however, that the paper [4] provides strong clues as to how Theorem 2 of [4] should be proven, though incompletely, so in what follows we use these clues to establish the result. We claim that Gower's theorem has the following deficiencies:
- the additional requirement on $\mathbf{s}$ is not needed in Theorem 1;
- the if-part was demonstrated neither for Theorem 1 nor for its correction in [6].
It should be noted at this point that in a 1985 paper Gower [5] derives his theorem in the latter version from a paper by Schoenberg [16]. The problem is, first, that Gower's result does not need this second derivation and, second, that the paper by Schoenberg [16] does not prove what Gower [5] claims. So the issue is open and we want to address it here more thoroughly. We provide a correction, completing Gower's proof, in Section 5. See Section 6 for some numerical examples of the matrices and vectors that we operate on in Section 5. In Section 8 we draw some conclusions from the corrective proof.
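The criterion stated in the theorems above can be turned into a numerical test. The sketch below is only an illustration under assumptions: the function name `is_euclidean`, the tolerance `tol`, the default uniform vector `s` and both example matrices are hypothetical. It checks positive semidefiniteness for a single admissible `s`, relying on the claim established in this paper that the choice of `s` does not matter.

```python
import numpy as np


def is_euclidean(D, s=None, tol=1e-9):
    """Test whether F(s) = (I - 1 s^T)(-1/2 D_sq)(I - s 1^T) is PSD."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    if not (np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)):
        return False  # not even a symmetric matrix with zero diagonal
    if s is None:
        s = np.full(m, 1.0 / m)  # assumed default; any s with sum 1 may be used
    s = np.asarray(s, dtype=float)
    P = np.eye(m) - np.outer(np.ones(m), s)
    F = P @ (-0.5 * D**2) @ P.T
    lam = np.linalg.eigvalsh((F + F.T) / 2.0)
    return bool(lam.min() >= -tol * max(1.0, np.abs(lam).max()))


# Hypothetical inputs: distances between real points, and a "distance" table
# that violates the triangle inequality (1 + 1 < 3).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D_good = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
D_bad = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])
print(is_euclidean(D_good), is_euclidean(D_bad))
```

Because of floating-point rounding, eigenvalues that should be exactly zero may come out slightly negative, which is why a relative tolerance is used instead of a strict comparison with zero.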
5 Correction of Gower's result
In this section we correct Gower's result from [4].
For construction purposes we need still another formulation of the theorem, which is slightly more elaborate:
Theorem 3
1. If the matrix $D$ is a matrix of Euclidean distances, then for each vector $\mathbf{s}$ such that $\mathbf{s}^T\mathbf{1} = 1$ the matrix

$$F(\mathbf{s}) = -\frac{1}{2}\left(I - \mathbf{1}\mathbf{s}^T\right) D_{sq} \left(I - \mathbf{s}\mathbf{1}^T\right) \tag{8}$$

is positive semidefinite ($D_{sq}$ being the matrix whose entries are the squares of the entries of the matrix $D$).
2. If $D$ is a symmetric matrix with zero diagonal and, for a vector $\mathbf{s}$ such that $\mathbf{s}^T\mathbf{1} = 1$, the matrix $F(\mathbf{s})$ is positive semidefinite, then $D$ is Euclidean.
3. If $D$ is Euclidean, then for each vector $\mathbf{s}$ such that $\mathbf{s}^T\mathbf{1} = 1$ the matrix $D$ can be derived from the matrix $F(\mathbf{s})$ in such a way that its squared entries can be computed as $d_{ij}^2 = f_{ii} + f_{jj} - 2f_{ij}$.
4. If $D$ is Euclidean, then for each vector $\mathbf{s}$ such that $\mathbf{s}^T\mathbf{1} = 1$ the matrix $F(\mathbf{s})$ can be expressed as $F(\mathbf{s}) = YY^T$, where $Y$ is a real-valued matrix, and the rows of $Y$ can be considered as coordinates of data points the distances between which are those from the matrix $D$.
Let be a matrix of Euclidean distances between objects . Let be a matrix of squared Euclidean distances between objects with identifiers . This means that there must exist a matrix for some , rows of which represent coordinates of these objects in an -dimensional space. This real-valued matrix represents an embedding of the Euclidean distance matrix into . A distance matrix can be called Euclidean if and only if an embedding exists. If ( with dimensions ), then .
As a rigid set of points in Euclidean space can be moved (shifted, rotated, flipped symmetrically; Gower does not consider flipping) without changing their relative distances, there may exist many other matrices rows of which represent coordinates of these same objects in the same -dimensional space after some isomorphic transformation. Let us denote the set of all such embeddings . And if a matrix , then for the product we have . We will say that
For an define a matrix . Hence . Obviously then
[TABLE]
(as for all ). This implies that
[TABLE]
that is
[TABLE]
So is of the form
[TABLE]
with components of equal .
Therefore, to find for an Euclidean matrix we need only to consider matrices deviating from by for some . Let us denote with the set of all matrices such that . So for each matrix if then , but not vice versa. We stress that we work with an Euclidean matrix . So we would like to find an such that is decomposable into real-valued matrices such that so that would represent an embedding of an Euclidean distance matrix. But first of all even if is not Euclidean, or even not metric, such an embedding may be found. (see Gower et al. [6]).
As Gower et al. [6] states, see their Theorem 1, any non-metric dissimilarity measure for where is finite, can be turned into a (metric) distance function where is a constant where . Furthermore, Gower et al. [6] recall that any dissimilarity matrix may be turned to an Euclidean distance matrix, see their Theorem 7, by adding an appropriate constant, e.g. where is a constant such that , being the smallest eigenvalue of , is the matrix of squared values of elements of , is the number of rows/columns in .
So even if is actually an Euclidean distance matrix, and , there is no warranty, that the distance matrix induced by corresponding is identical with .
For an consider the matrix . We obtain
[TABLE]
Let us investigate :
[TABLE]
Let us make the following choice (always possible) of with respect to : , .
Then we obtain from the above equation
[TABLE]
By analogy
[TABLE]
By substituting (19) and (20) into (17) we obtain
[TABLE]
So for any , hence an we can find an such that:
[TABLE]
For any matrix for some with we say that is in multiplicative form or .
If , that is is decomposable, then also
[TABLE]
is decomposable. But
[TABLE]
where is a shift vector by which the whole matrix is shifted to a new location in the Euclidean space. So the distances between objects computed from are the same as those from , hence if , then .
Therefore, to find a matrix , yielding an embedding of in the Euclidean dimensional space we need only to consider matrices of the form , subject to the already stated constraint , that is ones from .
So we can conclude: If is a matrix of Euclidean distances, then there must exist a positive semidefinite matrix for some vector such that , and . These last two conditions are implied by the following fact: is known to be not negative semidefinite, so that would not be positive semidefinite in at least the following cases: (see reasoning prior to formula (30)) or (see reasoning prior to formula (31)). So if is an Euclidean distance matrix, then there exists an .
Let us investigate other vectors such that . Note that
[TABLE]
Therefore, for a matrix
[TABLE]
But if , then
[TABLE]
and hence each is also in , though with a different placement (by a shift) in the coordinate systems of the embedded data points. So if one element of is in , then all of them are.
So we have established that: if is an Euclidean distance matrix (this means that there exists a matrix such that rows are coordinates of objects in an Euclidean space with distances as in ), then there exists a decomposable matrix which is in , hence . For each matrix in there exists a multiplicative form matrix in . But if it exists, all multiplicative forms are there:
In this way we have proven points 1, 3 and 4 of Theorem 3, and also the only-if part of the corrected Gower theorem from [6].
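Points 1, 3 and 4 of Theorem 3 can also be checked numerically. The sketch below is an illustration, not part of the proof; the helper names `gower_F` and `check_points_1_3_4`, the toy coordinates and the three weight vectors are assumptions. It uses the identity that, for points with coordinate matrix `X`, the transformed matrix equals the Gram matrix of the same points after moving the origin to their `s`-weighted mean.

```python
import numpy as np


def gower_F(D, s):
    """F(s) = (I - 1 s^T)(-1/2 D_sq)(I - s 1^T)."""
    m = D.shape[0]
    P = np.eye(m) - np.outer(np.ones(m), s)
    return P @ (-0.5 * D**2) @ P.T


def check_points_1_3_4(X, s):
    """Check points 1, 3 and 4 of Theorem 3 for coordinates X and vector s."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    F = gower_F(D, s)
    f = np.diag(F)
    X_shifted = X - s @ X  # same points, origin moved to the s-weighted mean
    return {
        # point 1: F(s) is positive semidefinite
        "psd": bool(np.linalg.eigvalsh((F + F.T) / 2.0).min() > -1e-9),
        # point 3: squared distances are f_ii + f_jj - 2 f_ij
        "distances": bool(np.allclose(D**2, f[:, None] + f[None, :] - 2 * F)),
        # point 4: F(s) is a Gram matrix of real coordinates
        "gram": bool(np.allclose(F, X_shifted @ X_shifted.T)),
    }


# Hypothetical points and three admissible vectors s (each sums to 1).
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [2.0, 5.0]])
for s in (np.full(4, 0.25),
          np.array([1.0, 0.0, 0.0, 0.0]),
          np.array([0.7, 0.4, -0.3, 0.2])):
    print(s, check_points_1_3_4(X, s))
```

The third vector contains a negative component on purpose: only the condition that the components sum to one is required.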
However, two things remain to be clarified and are addressed neither in [4] nor in [6]: the if-part of the corrected theorem from [6] (given a matrix $D$ such that $F(\mathbf{s})$ is positive semidefinite, is $D$ a Euclidean distance matrix? See point 2 of Theorem 3) and the status of the additional condition on $\mathbf{s}$ in Theorem 1.
Gower [4] makes the following remark: is to be positive semidefinite for Euclidean . However, for non-zero vectors
[TABLE]
But is known to be not negative semidefinite, so that would not be positive semidefinite in at least the following cases: and . Let us have a brief look at these conditions and why they are neither welcome nor actually existent:
Situation is not welcome, because there exists a vector such that and under we could solve the equation and thus demonstrate that for some
[TABLE]
However this situation is impossible, because for
[TABLE]
which means that the rows are linearly dependent, hence is guaranteed by earlier assumption about ; so this concern by Gower needs to be dismissed as pointless.
Situation is not welcome, because then
[TABLE]
and thus
[TABLE]
denying positive semidefiniteness of . Gower does not consider this further, but such a situation is impossible. Recall that because is Euclidean, there must exist a vector such that and
[TABLE]
is in . Hence for any such that
[TABLE]
is positive semidefinite. This allows us to conclude that for such . Therefore if then . What is more, if then implies , for which of course .
Hence the last assumption of Theorem 1 needs to be dropped as unnecessary, which simplifies it to the corrected theorem from [6] (Theorem 2).
As we can see from the first point above, , given by
[TABLE]
does not need to identify uniquely a matrix , as is not invertible. Though of course it identifies an Euclidean distance matrix.
Let us now demonstrate the missing part of Gower's proof, namely that $D$ is uniquely defined given a decomposable $F(\mathbf{s})$.
So assume that for some (of which we do not know if it is Euclidean, but is symmetric and with zero diagonal), and is decomposable that is . Let be the distance matrix derived from (that is the distance matrix for which is an embedding). That means is decomposable into properly distanced points with respect to . And is in additive form with respect to it, that is Therefore there must exist some such that the as valid multiplicative form with respect to , and it holds that . But recall that
[TABLE]
Hence
[TABLE]
So we need to demonstrate that for two symmetric matrices with zero diagonals such that the equation holds.
It is easy to see that . Denote .
[TABLE]
[TABLE]
With denote the vector and with the scaler . So we have
[TABLE]
So in the row , column of the above equation we have: . Let us add cells and and subtract from them cells and . . But as the diagonals of and are zeros, hence . So . But because are symmetric. Hence so . This means that .
This means that the two distance matrices are identical. Hence the decomposition of $F(\mathbf{s})$ is sufficient to prove the Euclidean space embedding of $D$ and yields this embedding. This proves the if-part of Gower's Theorem 1, of the corrected theorem from [6] (Theorem 2), and point 2 of Theorem 3.
6 A numerical example
Let us illustrate the process of generating a kernel matrix from a distance table and show that the distances between the objects in the feature space really match the distances of the original distance matrix. We took a data matrix describing 7 objects
[TABLE]
and derived from it an original Euclidean distance matrix
[TABLE]
We applied to it the transformation from equation (8) using the vector
[TABLE]
and obtained the (kernel) matrix
[TABLE]
After eigen-decomposition of this kernel matrix, we get via equation (6) the embedding matrix (after dropping the columns that correspond to near-zero eigenvalues)
[TABLE]
which produces the distance matrix
[TABLE]
The sum of squared differences between the corresponding entries of the recovered and the original distance matrices amounts to 5.727256e-25.
It can easily be seen that the recovered distance matrix is (nearly) identical with the original one, though the recovered embedding and the original data matrix differ. The $k$-means algorithm as implemented in R (kmeans, centers=2, nstart=100) was run both for the original data matrix and for the recovered embedding, yielding in both cases the clustering [2, 2, 2, 1, 1, 1, 1].
A version of kernel-$k$-means, as described in this paper, was also implemented and produced for the kernel matrix the very same clustering [2, 2, 2, 1, 1, 1, 1].
Note that the same distance matrix can be turned into a kernel matrix using different vectors $\mathbf{s}$. We applied to it the transformation from equation (8) using the vector
[TABLE]
and obtained the (kernel) matrix
[TABLE]
After eigen-decomposition of this second kernel matrix, we get via equation (6) the embedding matrix (after dropping the columns that correspond to near-zero eigenvalues)
[TABLE]
which produces the distance matrix
[TABLE]
The sum of squared differences between the corresponding entries of the recovered and the original distance matrices amounts to 2.090166e-25.
Not surprisingly, the same implementation of kernel-$k$-means produced for this kernel matrix the very same clustering [2, 2, 2, 1, 1, 1, 1].
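The procedure of this section can be reproduced in outline with the sketch below. The data matrix is a hypothetical stand-in (7 objects with 3 attributes), not the data of the example above, and the helper names and the two weight vectors are likewise assumptions. The script builds the distance matrix, applies the transformation with two different vectors, recovers an embedding by eigen-decomposition and reports the sum of squared differences between the recovered and the original distances.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix of the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)


def gower_F(D, s):
    """F(s) = (I - 1 s^T)(-1/2 D_sq)(I - s 1^T)."""
    m = D.shape[0]
    P = np.eye(m) - np.outer(np.ones(m), s)
    return P @ (-0.5 * D**2) @ P.T


def embed(F, tol=1e-9):
    """Embedding from the eigen-decomposition of a PSD matrix F."""
    lam, V = np.linalg.eigh((F + F.T) / 2.0)
    keep = lam > tol * lam.max()  # ignore near-zero eigenvalues
    return V[:, keep] * np.sqrt(lam[keep])


# Hypothetical stand-in for the data matrix (7 objects, 3 attributes).
X = np.array([[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [1.0, 1.0, 2.0],
              [8.0, 9.0, 7.0], [9.0, 8.0, 8.0], [8.0, 8.0, 9.0],
              [9.0, 9.0, 9.0]])
D = pairwise_distances(X)

vectors = {
    "uniform": np.full(7, 1.0 / 7.0),
    "two_points": np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]),
}
sse = {}
for name, s in vectors.items():
    Y = embed(gower_F(D, s))
    sse[name] = float(((pairwise_distances(Y) - D) ** 2).sum())
    print(name, Y.shape, sse[name])
```

The two kernel matrices differ, yet both recovered embeddings reproduce the original distances up to rounding error, so any distance-based clustering of them has to agree.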
7 $k$-means under non-Euclidean kernels
In many cases, like Laplacians of graphs, we know in advance that the matrices can be deemed kernels embeddable in Euclidean space, so that there are no obstacles to applying kernel-$k$-means clustering. However, this need not always be the case. Let us now discuss the concerns about applying kernel-$k$-means in such situations and about the validity of the obtained clusters.
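Let $w_1,\dots,w_m$ be non-negative weights of the data points $\mathbf{x}_1,\dots,\mathbf{x}_m$. Let $C$ be a subset of $\{1,\dots,m\}$ such that $\sum_{i\in C} w_i > 0$. Define $\boldsymbol{\mu}_w(C)$ as the weighted center of the data points of $C$ as follows:

$$\boldsymbol{\mu}_w(C) = \frac{\sum_{i\in C} w_i\,\Phi(\mathbf{x}_i)}{\sum_{i\in C} w_i}$$

It is easily seen that it is possible to compute the squared distance of any data point to the weighted center of a set from the kernel matrix:

$$\left\|\Phi(\mathbf{x}_l) - \boldsymbol{\mu}_w(C)\right\|^2 = K_{ll} - \frac{2\sum_{i\in C} w_i K_{li}}{\sum_{i\in C} w_i} + \frac{\sum_{i\in C}\sum_{j\in C} w_i w_j K_{ij}}{\left(\sum_{i\in C} w_i\right)^2}$$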
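A minimal sketch of the weighted-center distance follows. The function name `sq_dist_to_weighted_center` and the toy data are assumptions; the code expands the squared distance to a weighted mean of feature-space images into kernel entries and, for a linear kernel, compares the result with the distance computed from explicit coordinates.

```python
import numpy as np


def sq_dist_to_weighted_center(K, members, weights):
    """Squared feature-space distance of every point to the weighted center
    of the points listed in `members`, computed from the kernel matrix only."""
    K = np.asarray(K, dtype=float)
    members = np.asarray(members)
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    if total <= 0:
        raise ValueError("the weights of the subset must have a positive sum")
    cross = K[:, members] @ w / total
    within = w @ K[np.ix_(members, members)] @ w / total**2
    return np.diag(K) - 2.0 * cross + within


# Hypothetical check against explicit coordinates (linear kernel).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
K = X @ X.T
members, weights = [0, 2, 3], [10.0, 1.0, 1.0]
center = np.average(X[members], axis=0, weights=weights)
expected = ((X - center) ** 2).sum(axis=1)
print(np.allclose(sq_dist_to_weighted_center(K, members, weights), expected))
```

With equal weights the function reduces to the ordinary kernel-trick distance to a cluster prototype. For a kernel matrix that is not positive semidefinite the returned values can no longer be interpreted as squared Euclidean distances and may even be negative.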
Let us now pay some attention to the consequences of applying the kernel-$k$-means algorithm when a Euclidean embedding is missing.
The kernel-$k$-means algorithm consists in switching to a multidimensional feature space, and it is claimed to search therein for prototypes $\boldsymbol{\mu}_j$ minimizing the error

$$\sum_{i=1}^{m} \min_{1\le j\le k} \left\|\Phi(\mathbf{x}_i) - \boldsymbol{\mu}_j\right\|^2$$

over all possible choices of the set of cluster centers $\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_k$.
But this is actually not the entire truth. Each $\boldsymbol{\mu}_j$ may only be equal to

$$\frac{1}{|C|}\sum_{i\in C} \Phi(\mathbf{x}_i)$$

for some subset $C$ of all the data points, and no other vectors in the feature space are taken into account. If the feature space is Euclidean, it is guaranteed that no other vector from the feature space needs ever to be considered as a cluster center, because the clustering would then not be optimal. This is not so in the case of non-Euclidean feature spaces. To demonstrate this, we will use an example.
Consider the following non-Euclidean distance matrix
[TABLE]
and the corresponding kernel matrix
[TABLE]
If we apply kernel-$k$-means clustering with $k=2$, we obtain the clustering [2, 2, 1, 2, 2, 1] with a total value of the cost function of 1325. Other clusterings would not be better. Check e.g. that the clustering [1, 1, 1, 2, 2, 2] produces a cost function value of 1400, which is higher than what kernel-$k$-means produces.
But consider now a different clustering, [1, 1, 1, 2, 2, 2], where weighted cluster centers with the weights [10, 1, 1, 10, 1, 1] are chosen instead of the $k$-means cluster centers. Then the cost function amounts to 1175, which is below what kernel-$k$-means produces.
In this way we have proven that
Theorem 4
Kernel-$k$-means does not optimize the cost function
[TABLE]
for non-Euclidean kernel matrices.
We have already mentioned Theorem 7 of Gower et al. [6], stating that any dissimilarity matrix $D$ may be turned into a Euclidean distance matrix by adding a constant to the squared distances as follows: $d'_{ij} = \sqrt{d_{ij}^2 + h}$ for $i \ne j$, where $h$ is a constant such that $h \ge -\lambda_m$, $\lambda_m$ being the smallest eigenvalue of $(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)(-\frac{1}{2}D_{sq})(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)$, $D_{sq}$ is the matrix of squared values of the elements of $D$, and $m$ is the number of rows/columns in $D$.
Gower's Theorem 7 is actually wrong. Let us continue the above example. Gower's constant for $D$ amounts to $h = 757.205$. Upon modifying the distance matrix we get the new kernel matrix
[TABLE]
which is again non-Euclidean, because its lowest eigenvalue equals -378.603, that is, the lowest eigenvalue has been raised by $h/2$ only.
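Let us now propose a correction of Gower's "euclidesation" theorem:
Theorem 5
Any dissimilarity matrix $D$ may be turned into a Euclidean distance matrix (compare Theorem 7 of [6]) by adding an appropriate constant to the squared non-diagonal elements, e.g. $d'_{ij} = \sqrt{d_{ij}^2 + h}$ for $i \ne j$, where $h$ is a constant such that $h \ge -2\lambda_m$, $\lambda_m$ being the smallest eigenvalue of $(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)(-\frac{1}{2}D_{sq})(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)$, $D_{sq}$ is the matrix of squared values of the elements of $D$, and $m$ is the number of rows/columns in $D$.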
Proof 1
The equation (24) allows us to conclude that given
[TABLE]
for a dissimilarity matrix , the following holds:
[TABLE]
Let be an eigenvector of for a non-zero eigenvalue . Therefore
[TABLE]
Assuming that , we get:
[TABLE]
which means that is also an eigenvector of for the same eigenvalue. Notably, The sum of components of is equal zero.
Consider now the following expression for some number .
[TABLE]
[TABLE]
[TABLE]
Now consider an eigenvector of for a non-zero eigenvalue , such that the sum of its components equals zero. For each such a vector always exists. We see immediately that
[TABLE]
[TABLE]
[TABLE]
that is that is an eigenvalue of with eigenvector .
This means that by subtracting $\sigma$ from the non-diagonal elements of $-\frac{1}{2}D_{sq}$ in the computation of $K$, we increase by $\sigma$ every eigenvalue of $K$ whose eigenvector has a zero sum of components. But subtracting $\sigma$ from the non-diagonal elements of $-\frac{1}{2}D_{sq}$ means adding $\sigma$ to the non-diagonal elements of $\frac{1}{2}D_{sq}$, or adding $2\sigma$ to the non-diagonal elements of $D_{sq}$, or just replacing the non-diagonal elements $d(z)$ of $D$ with $\sqrt{d(z)^2+2\sigma}$. If we add at least the negation of the lowest eigenvalue of a non-Euclidean $K$ to all its eigenvalues, then of course it turns into a Euclidean one, given that all eigenvectors with non-zero eigenvalues have zero sums of components.
How can we now tell if all such eigenvectors have zero sums? In case that all eigenvalues are different, this is simple. As shown, each eigenvalue has the zero sum eigenvector, and this is the only one up to scaling factor.
The details of handling special cases (of identical eigenvalues) follow now. Consider the set of all eigenvectors related to a multiple eigenvalue. The whole set can be represented as a linear combination of some number of orthogonal vectors from this set with the number equal to the multiplicity of the eigenvalue. Let be one of these orthogonal vectors. Then any linear combination of all the other orthogonal vectors is orthogonal to . Let be an example from this combination. Then clearly . But also . Hence . So is orthogonal to . As the latter represents any vector orthogonal to of the subspace co-spanned by , so must be identical to up to scaling factor. So the subspace of eigenvectors can be spanned by a set of orthogonal vectors with component sums equal zero. Therefore all the eigenvectors of have this property and hence adding the respective constant adds to all the eigenvalues of the matrix . This completes the proof.
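The key step of the proof can be checked numerically without any eigenvalue routine. The sketch below assumes the usual double-centering construction of the kernel matrix, $J(-\frac{1}{2}D_{sq})J$ with $J = I - \mathbf{1}\mathbf{1}^T/m$. The three-point dissimilarity matrix, the constant `c`, and all function names are hypothetical. It shows that adding `c` to the non-diagonal squared dissimilarities changes the kernel matrix by exactly `(c/2) * J`, so every eigenvalue belonging to a zero-sum eigenvector moves up by `c/2`, not by `c`. This agrees with the numbers reported above, where a constant of 757.205 left a lowest eigenvalue of -378.603.

```python
def double_center(D_sq):
    """Return J (-1/2 D_sq) J with J = I - 11^T/m (double centring)."""
    m = len(D_sq)
    A = [[-0.5 * D_sq[i][j] for j in range(m)] for i in range(m)]
    row = [sum(A[i]) / m for i in range(m)]                          # row means
    col = [sum(A[i][j] for i in range(m)) / m for j in range(m)]     # column means
    tot = sum(row) / m                                               # grand mean
    return [[A[i][j] - row[i] - col[j] + tot for j in range(m)] for i in range(m)]


def add_to_off_diagonal(D_sq, c):
    """Add the constant c to every non-diagonal squared dissimilarity."""
    m = len(D_sq)
    return [[D_sq[i][j] + (c if i != j else 0.0) for j in range(m)] for i in range(m)]


# Hypothetical dissimilarities that violate the triangle inequality (1 + 1 < 3).
D = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 3.0],
     [1.0, 3.0, 0.0]]
D_sq = [[d * d for d in row] for row in D]
m = len(D)
c = 8.0  # assumed constant, chosen arbitrarily

K_before = double_center(D_sq)
K_after = double_center(add_to_off_diagonal(D_sq, c))
shift = [[K_after[i][j] - K_before[i][j] for j in range(m)] for i in range(m)]

# Centring matrix J; the shift should equal (c / 2) * J entry by entry.
J = [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)] for i in range(m)]
max_dev = max(abs(shift[i][j] - (c / 2.0) * J[i][j]) for i in range(m) for j in range(m))
print("largest deviation from (c/2) * J:", max_dev)
```

Because $Jv = v$ for any vector $v$ whose components sum to zero, a shift of `(c/2) * J` raises the corresponding eigenvalue by `c/2`, which is why the constant added to the squared dissimilarities has to be twice the desired eigenvalue shift.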
Let us illustrate Theorem 5 by continuing the previous example. Upon modifying the distance matrix according to Theorem 5, we get the new kernel matrix
[TABLE]
which is now Euclidean, because its lowest eigenvalue equals 0. This kernel matrix implies the clustering [1, 2, 1, 1, 2, 1] with a total cost of 4353.821. Other clusterings would not do better. For example, the clustering [1,1,1,2,2,2] yields a cost of 4428.821, which is higher than what kernel-$k$-means produces.
Consider again the clustering [1,1,1,2,2,2] with weighted cluster centers, using the weights [10,1,1,10,1,1] instead of the $k$-means cluster centers. The cost function then amounts to 5907.533, which is again higher than what kernel-$k$-means produces. In Euclidean space, kernel-$k$-means produces appropriate results.
Note that the clustering obtained is identical to the clustering delivered by kernel-$k$-means from the original kernel matrix.
Let us investigate this phenomenon more generally.
Theorem 6
If we perform kernel-$k$-means clustering, seeking the optimum only among sets of cluster centers each of which is equal to
[TABLE]
for some subset of all the data points, and no other vectors in the feature space are taken into account, then after adding a constant to the squared distances between different elements the optimal clustering will remain the same.
Proof 2
If we add in a cluster of cardinality for an element to all its distances , then its squared distance to the cluster center will increase by because is unchanged . So in all the cluster cost function will change by . So the overall cost function of all clusters will increase by . That is, the increase is independent of the actual clustering. Hence the optimum clustering of $k$-means, achievable by kernel-$k$-means, will remain unchanged after this addition.
Under these circumstances
Theorem 7
For kernel-$k$-means, adding a constant to the squared dissimilarities of non-identical elements is a clustering-preserving and embeddability-preserving operation.
Note that the transformation mentioned above (1) increases all distances, (2) increases the smallest distances the most and the largest distances the least in absolute terms, and (3) therefore creates no new clustering structures. In this way we define a new axiom/property of $k$-means: we require that the clustering algorithm yield the same result under the mentioned distance transformation.
The idea behind this is that in the permissible domain of $k$-means (Euclidean space) the optimum is unchanged if we add a constant to the squared distances between different elements. By conceptual extension, we can carry this assumption backwards to non-Euclidean distances.
Then we need to define under what regime we compute the permissible optimum of $k$-means, because in the whole space itself this is not true. Only if we limit the permissible space in a reasonable way can we still assume that we are computing the $k$-means optimum. So if we agree that the kernel function for kernel-$k$-means is deemed to map the data points into Euclidean space under the mentioned invariance transformation, then it is permissible to apply kernel-$k$-means without checking for embeddability.
8 Concluding remarks
In this paper we corrected the proof of Theorem 2 from Gower's paper [4, page 5]. This correction was needed in order to establish, on the grounds of the distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm.
Let us underline here that we did not impose any a priori restrictions on the form of the function itself. It may be a linear or non-linear mapping from the sample space to the feature space. But what we insist on is that the feature space has to be Euclidean. This is the requirement for the applicability of the (kernel) $k$-means clustering algorithm. If the feature space is not metric, the results of (kernel) $k$-means clustering are questionable.
But this is not enough. The same kernel matrix may be related to infinitely many functions.
The question that was left open by Gower was: do there exist special cases where two different functions, complying with a given kernel matrix, generate different distance matrices in the feature space, maybe in some special, "sublimated" cases? The answer given to this open question in this paper is definitely NO. We closed all the conceivable gaps in this respect. So the use of (linear and non-linear) kernel matrices that are positive semidefinite is safe in this respect.
Furthermore, we resolved the issue of the applicability of kernel-$k$-means to non-embeddable kernel matrices. If we accept the eigenvalue-shift transformation as a legitimate kernel matrix transformation, and the kernel-$k$-means clustering in the kernel matrix obtained via such euclidesation as the valid clustering for the original kernel matrix, then we can apply kernel-$k$-means also in the non-Euclidean space.
Software
Please feel free to experiment with an R package (source code) implementing kernel-$k$-means functionality: install.packages("https://home.ipipan.waw.pl/m.klopotek/ipi_archiv/kernelKmeansAndPlusPlusDemo_1.0.tar.gz", repos=NULL, type="source")
References
- [1] R. Balaji and R.B. Bapat. On Euclidean distance matrices. Linear Algebra and its Applications, 424(1):108–117, 2007.
- [2] Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 895–903, New York, NY, USA, 2011. ACM.
- [3] I.S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 551–556, New York, NY, USA, 2004. ACM.
- [4] J.C. Gower. Euclidean distance geometry. Math. Scientist, 7:1–14, 1982.
- [5] J.C. Gower. Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications, 67:81–97, 1985.
- [6] J.C. Gower and P. Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, 1986. Here [4] is cited in Theorem 4, but with a different form of conditions for D and s.
- [7] T. Handhayani and L. Hiryanto. Intelligent kernel k-means for clustering gene expression. In International Conference on Computer Science and Computational Intelligence (ICCSCI 2015), Procedia Computer Science, volume 59, pages 171–177, 2015.
- [8] Mieczysław A. Kłopotek. On the existence of kernel function for kernel-trick of k-means. In Foundations of Intelligent Systems - 23rd International Symposium, ISMIS 2017, Warsaw, Poland, June 26-29, 2017, Proceedings, pages 97–104, 2017.
