Validity of Clusters Produced By kernel-$k$-means With Kernel-Trick

Mieczys{\l}aw A. K{\l}opotek

arXiv:1701.05335·cs.LG·December 24, 2018


TL;DR

This paper revises foundational theorems related to kernel-$k$-means clustering, ensuring the mathematical validity of the kernel trick by correcting previous proofs about kernel functions and their Euclidean embeddings.

Contribution

It provides corrected proofs for key theorems in Gower's work, clarifying the conditions under which kernel functions are valid for clustering.

Findings

01. Corrected proof of the existence of kernel functions from distance matrices.

02. Clarified conditions for kernel matrices to be embeddable in Euclidean space.

03. Ensured the mathematical soundness of kernel-$k$-means clustering methods.

Abstract

This paper corrects the proof of Theorem 2 from Gower's paper \cite[page 5]{Gower:1982} and corrects Theorem 7 from Gower's paper \cite{Gower:1986}. The first correction is needed in order to establish, on the grounds of a distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm. The correction supplies the missing proof of the if-part and drops unnecessary conditions. The second correction deals with the transformation of a kernel matrix into one that is embeddable in Euclidean space.

Equations

$$K=-\frac{1}{2}\left(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m}\right)D_{sq}\left(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m}\right)$$

$$\kappa_{d}(\mathbf{x},\mathbf{y})=\exp(-\gamma d^{2}(\mathbf{x},\mathbf{y}))$$

$$K=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\left(-\frac{1}{2}D_{sq}\right)\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$$

$$\sum_{i=1}^{m}\min_{1\leq j\leq k}\|\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\|^{2}$$

$$\boldsymbol{\mu}_{j}^{\Phi}=\frac{1}{m_{j}}\sum_{i\in C_{j}}\Phi(i)$$

$$\begin{aligned}
\|\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\|^{2}&=\big(\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\big)^{T}\big(\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\big)\\
&=\Phi(i)^{T}\Phi(i)-2\Phi(i)^{T}\boldsymbol{\mu}_{j}^{\Phi}+(\boldsymbol{\mu}_{j}^{\Phi})^{T}\boldsymbol{\mu}_{j}^{\Phi}\\
&=\Phi(i)^{T}\Phi(i)-\frac{2}{m_{j}}\sum_{h\in C_{j}}\Phi(i)^{T}\Phi(h)+\frac{1}{m_{j}^{2}}\sum_{r\in C_{j}}\sum_{s\in C_{j}}\Phi(r)^{T}\Phi(s)\\
&=k_{ii}-\frac{2}{m_{j}}\sum_{h\in C_{j}}k_{hi}+\frac{1}{m_{j}^{2}}\sum_{r\in C_{j}}\sum_{s\in C_{j}}k_{rs}
\end{aligned}$$

$$Y=V\,\mathrm{diag}(\Lambda)^{1/2}$$

$$d_{ij}=\sqrt{(\mathbf{x}_{i}-\mathbf{x}_{j})^{T}(\mathbf{x}_{i}-\mathbf{x}_{j})}$$

$$F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\left(-\frac{1}{2}\right)D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$$

$$\begin{aligned}
d_{ij}^{2}&=f_{ii}+f_{jj}-2f_{ij}\\
&=\left(g_{ii}-\frac{1}{2}d_{ii}^{2}\right)+\left(g_{jj}-\frac{1}{2}d_{jj}^{2}\right)-2\left(g_{ij}-\frac{1}{2}d_{ij}^{2}\right)\\
&=g_{ii}+g_{jj}-2g_{ij}+d_{ij}^{2}
\end{aligned}$$

$$0=g_{ii}+g_{jj}-2g_{ij}$$

$$g_{ij}=\frac{g_{ii}+g_{jj}}{2}$$

$$G=\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}$$

$$\begin{aligned}
F^{*}&=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\left(\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}-\frac{1}{2}D_{sq}\right)\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}\\
&=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{g}\mathbf{1}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}+\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}\\
&\quad-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}
\end{aligned}$$

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)=\mathbf{1}\mathbf{g}^{T}-\mathbf{1}\mathbf{g}^{T}\mathbf{s}\mathbf{1}^{T}-\mathbf{1}\mathbf{s}^{T}\mathbf{1}\mathbf{g}^{T}+\mathbf{1}\mathbf{s}^{T}\mathbf{1}\mathbf{g}^{T}\mathbf{s}\mathbf{1}^{T}$$

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)=\mathbf{1}\mathbf{g}^{T}-\mathbf{1}\cdot 0\cdot\mathbf{1}^{T}-\mathbf{1}\mathbf{g}^{T}+\mathbf{1}\mathbf{s}^{T}\mathbf{1}\cdot 0\cdot\mathbf{1}^{T}=\mathbf{0}\mathbf{0}^{T}$$

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{g}\mathbf{1}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}=\left(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)\right)^{T}=\mathbf{0}\mathbf{0}^{T}$$

$$F^{*}=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$$

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)F\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$$

$$F^{*}=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)YY^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}=\left(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)Y\right)\left(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)Y\right)^{T}=Y^{*}{Y^{*}}^{T}$$

$$Y^{*}=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)Y=Y-\mathbf{1}\mathbf{s}^{T}Y=Y-\mathbf{1}\mathbf{v}^{T}$$

$$\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)=\mathbf{I}-\mathbf{1}\mathbf{t}^{T}-\mathbf{1}\mathbf{s}^{T}+\mathbf{1}\mathbf{s}^{T}=\mathbf{I}-\mathbf{1}\mathbf{t}^{T}$$

$$\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)F\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)^{T}=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)^{T}$$

$$F^{\prime}=\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)YY^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{t}^{T}\right)^{T}=\left(Y-\mathbf{1}\mathbf{t}^{T}Y\right)\left(Y-\mathbf{1}\mathbf{t}^{T}Y\right)^{T}$$

$$\mathbf{u}^{T}F\mathbf{u}=-\frac{1}{2}\mathbf{u}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)\mathbf{u}=-\frac{1}{2}\left(\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)\mathbf{u}\right)^{T}D_{sq}\left(\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)\mathbf{u}\right)$$


Taxonomy

Topics: Face and Expression Recognition · Advanced Clustering Algorithms Research · Remote-Sensing Image Classification

Full text

Validity of Clusters Produced By kernel-$k$-means With Kernel-Trick

Mieczysław A. Kłopotek

Institute of Computer Science of the Polish Academy of Sciences

ul. Jana Kazimierza 5, 01-248 Warszawa, Poland

[email protected]

Abstract

This paper, an extension of the conference paper [8], corrects the proof of Theorem 2 from Gower's paper [4, page 5] and corrects Theorem 7 from Gower's paper [6]. The first correction is needed in order to establish, on the grounds of a distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm. The correction supplies the missing proof of the if-part and drops unnecessary conditions.

The second correction deals with the transformation of a kernel matrix into one that is embeddable in Euclidean space.

1 The Problem

A number of approaches to solving various data mining problems, including clustering, are based on the so-called kernel approach. The kernel approach may be seen as the application of a mapping $\Phi$ to the data points such that they are represented in a high-dimensional Euclidean space (called the feature space), in which it is hoped that the data points can be separated more easily by simpler geometric constructs (e.g. hyperplanes) than in their original low-dimensional representation space. In this way, a number of data mining methods requiring linear data separation can be applied to non-linearly separated data sets.

The kernel approach is most frequently applied in conjunction with Support Vector Machine based analysis methods, but it is also used with the $k$-means clustering algorithm (for an overview of the kernel $k$-means algorithm see e.g. [3]), in which we are interested in this paper. We will introduce this algorithm in Section 3.

The kernel-based approaches assume the availability of a similarity function $\kappa()$ and in particular of the similarity matrix $K$, also called a kernel function and a kernel matrix respectively, which express similarities between the data points at hand. This similarity function must have the property that, for any two data points $\mathbf{i},\mathbf{j}$ in the original space, we have $\kappa(\mathbf{i},\mathbf{j})=\Phi(\mathbf{i})\circ\Phi(\mathbf{j})$ (the $\circ$ operator denotes the dot product of vectors), and for the data set under consideration the similarity matrix $K$ is available such that $K_{ij}=\kappa(\mathbf{i},\mathbf{j})$.

For a number of algorithms, including $k$-means, the so-called kernel trick has been elaborated. The essence of the kernel trick is that we can perform the kernel algorithms based on the kernel matrix $K$ alone, without explicit knowledge of the mapping $\Phi$. Section 3 explains the use of the kernel trick for the $k$-means algorithm.

Nonetheless, the very existence of the mapping $\Phi$, and hence of the kernel function $\kappa()$, is of vital importance to the validity of applying the $k$-means algorithm in the feature space. $\Phi$ transforms the data into points in a Euclidean space so that $k$-means can be applied at all. Inverting $\Phi$ would yield the cluster centers produced by kernel-$k$-means. Furthermore, $k$-means uses distances rather than similarities. We can easily imagine that no kernel function $\kappa()$ exists for a given similarity matrix $K$. We may also face the situation that there exist multiple kernel functions $\kappa$, as well as multiple mappings $\Phi$, related to the same kernel matrix $K$. Could this mean that there exist multiple feature spaces in which the very same data set can be clustered differently via kernel-$k$-means, depending on the $\Phi$ function we choose? Closely related is the following issue: for algorithms like $k$-means, instead of the kernel matrix, the matrix $D$ of Euclidean distances between the objects in the feature space may be available. We will call such a $D$ a Euclidean matrix.

We are faced with the following questions:

(1) What properties must the kernel matrix have in order to really be a matrix of dot products?

(2) What properties must the kernel matrix have in order to enable the recovery of the function $\Phi()$ at the data points from the kernel matrix?

(3) Can we obtain the matrix $K$ from the distance matrix $D$?

(4) Can we obtain from the matrix $K$ a function $\Phi()$ such that the distances in the feature space are exactly the same as those given by the matrix $D$?

(5) If we derived the matrix $K$ from $D$ and $K$ turns out to yield $\Phi()$, can we then know that $D$ really was a Euclidean distance matrix?

Questions (1), (2) may seem to be pretty easy and were partially addressed e.g. by Schölkopf [17]. Schölkopf investigates what kinds of kernel functions may lead to a distance measure in the feature space. However, he does not consider the inverse, that is Euclidean distance matrix leading to a kernel function. He does not investigate finding explicit form of the $\Phi$ function either.

The answer to the third question seems to be easily derivable from the paper by Balaji et al. [1]. One should use the transformation

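$$K=-\frac{1}{2}\left(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m}\right)D_{sq}\left(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m}\right)\tag{1}$$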

(where $D_{sq}$ is the matrix whose entries are the squared distances from $D$), a result traced back there to a paper by Schoenberg [15]. The problem is that this paper of Schoenberg does not contain any such statement; the result should rather be ascribed to the paper [16]. (Schoenberg [15] proposed yet another distance-to-kernel-matrix transform, $\kappa_{d}(\mathbf{x},\mathbf{y})=\exp(-\gamma d^{2}(\mathbf{x},\mathbf{y}))$ for any positive $\gamma$, which we will not discuss here.)

The most general proposal of a distance-to-kernel-matrix transform seems to be that of Gower [4, Theorem 2, page 5], who generalizes the aforementioned transform (1) to

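$$K=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\left(-\frac{1}{2}D_{sq}\right)\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)\tag{2}$$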

for an appropriate choice of $\mathbf{s}$ . A generally accepted proof of this transformation can be found in the paper by Gower [4, Theorem 2, page 5]. If this proof were correct, the questions (4) and (5) would have been answered. Regrettably, the proof of the validity of the latter is incomplete, as we will explain in Section 4. For this reason, these questions still remain open.
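The distance-to-kernel-matrix transform with a weight vector $\mathbf{s}$ can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the paper: the function names `squared_distances`, `gower_kernel` and `distances_from_kernel`, as well as the sample points and weights, are assumptions made for the example. The product $(\mathbf{I}-\mathbf{1}\mathbf{s}^{T})(-\frac{1}{2}D_{sq})(\mathbf{I}-\mathbf{s}\mathbf{1}^{T})$ is expanded entrywise instead of being computed by matrix multiplication.

```python
def squared_distances(points):
    """D_sq: entry (i, j) is the squared Euclidean distance between points i and j."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def gower_kernel(d_sq, s):
    """K = (I - 1 s^T)(-1/2 D_sq)(I - s 1^T), expanded entrywise.

    With A = -1/2 D_sq (symmetric), a = A s and c = s^T A s, the product
    equals K_ij = A_ij - a_i - a_j + c.
    """
    m = len(d_sq)
    assert abs(sum(s) - 1.0) < 1e-12, "s must satisfy s^T 1 = 1"
    A = [[-0.5 * d_sq[i][j] for j in range(m)] for i in range(m)]
    a = [sum(A[i][j] * s[j] for j in range(m)) for i in range(m)]
    c = sum(s[i] * a[i] for i in range(m))
    return [[A[i][j] - a[i] - a[j] + c for j in range(m)] for i in range(m)]

def distances_from_kernel(K):
    """Squared distances implied by a kernel matrix: k_ii + k_jj - 2 k_ij."""
    m = len(K)
    return [[K[i][i] + K[j][j] - 2 * K[i][j] for j in range(m)] for i in range(m)]

# Illustrative data: four points in the plane.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
d_sq = squared_distances(points)
m = len(points)
K_centered = gower_kernel(d_sq, [1.0 / m] * m)          # s = 1/m gives double centering
K_general = gower_kernel(d_sq, [0.7, 0.1, 0.4, -0.2])   # another s with s^T 1 = 1
```

In this example both choices of $\mathbf{s}$ return the original squared distances through $k_{ii}+k_{jj}-2k_{ij}$, which is the behaviour the text attributes to the transform for Euclidean $D$.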

Therefore, we provide a correction of the proof of Gower's theorem, which we present in Section 5. This correction is needed in order to establish, on the grounds of a distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm.

The question that was left open by Gower was: do there exist special cases where two different $\Phi()$ functions, complying with a given kernel matrix, generate different distance matrices in the feature space, maybe in some special, "sublimated" cases? This would mean that under some "special" conditions the output of kernel $k$-means could differ radically, not merely due to random causes but in a systematic way. The answer given to this open question in this paper is definitely NO. We have closed all the conceivable gaps in this respect. So the use of (linear and non-linear) kernel matrices that are positive semidefinite is safe in this respect.

Let us underline here that we did not impose any a priori restrictions on the form of the $\Phi()$ function itself. It may be a linear or non-linear mapping from the sample space to the feature space. But what we insist on is that the feature space has to be Euclidean. This is the requirement for the applicability of the (kernel) $k$-means clustering algorithm. If the feature space is not metric, the results of (kernel) $k$-means clustering are questionable.

In Section 6 we provide a numerical example illustrating some of the distance matrix transformations discussed in Section 5.

The second problem with the use of kernel-$k$-means is related to the basic assumption of $k$-means that it has been designed for Euclidean space. In a number of applications, like clustering based on Laplacians, the embeddability of the kernel matrix can be guaranteed from the theoretical standpoint. However, this need not always be the case. Therefore we need to answer the following questions: (6) what does kernel-$k$-means produce for non-Euclidean kernel matrices, (7) can a non-Euclidean kernel matrix be turned into a Euclidean kernel matrix, and (8) how does the latter matrix transformation impact the results of kernel-$k$-means clustering? These questions could have been easily answered if Theorem 7 of Gower from [6] were correct. Regrettably, this theorem requires a quantitative correction. We handle these issues in Section 7.

In Section 2 we point to research directions for which the correction proposed here is of importance.

2 The Background

The $k$-means algorithm has the very attractive property of being easy to implement, and there exist variants of it, like $k$-means++, which even possess closeness-to-optimum properties. The drawback of this algorithm is that it accepts numeric attributes only and requires an embedding in Euclidean space. Embeddings into other spaces, such as hyperbolic space, have been investigated, but the computation of cluster centers, which is vital and very easy in Euclidean space, is not that easy in other spaces.

However, real-world objects are frequently described by non-numeric attributes, or are not embedded in any space whatsoever, and instead only the similarity, dissimilarity or distance between objects is known. In such cases the kernel-$k$-means clustering algorithm can be used, which at least partially inherits the good properties of $k$-means. Then, however, the very existence of an embedding into Euclidean space (even if it is not used explicitly) is of vital importance, because otherwise the clustering results may be unreliable. The same holds for other kernel algorithms whose original algorithm relies on a Euclidean space.

Therefore, research such as that of [9] is performed in order to find ways of transforming a similarity matrix into the closest proper positive definite kernel matrix, so that an approximating Euclidean embedding exists, or of learning the distances themselves.

These efforts to establish a proper kernel matrix make sense only if Theorem 2 of Gower [4] is valid. However, a study of the literature seems to reveal that nobody except Gower himself was aware of the mentioned flaw in the proof of his theorem, and the result is taken for granted.

Gower's paper [4], according to Google Scholar, has been cited over 200 times in a number of research and application contexts. For example, Pekalska et al. [12] derive the necessity of creating a generalized kernel handling of dissimilarity on the grounds that the kernel according to equation (2) is positive definite if and only if the underlying distance matrix is Euclidean, which has not been proven by Gower [4]. The same motivation lies behind the work of Nikolentzos et al. [10] on seeking appropriate embeddings. Pavoine et al. [11] rely on the property, suggested by Gower [4], that the decomposition of the kernel can be shifted while performing PCA analysis.

Kernel-trick based $k$-means algorithms are applied in various areas (e.g. gene expression clustering [7], spectral clustering of graphs [3]).

The validity of the Gower transform underpins various improvements of kernel $k$-means clustering, like single pass clustering [14], global kernel $k$-means [18], subsampling kernel $k$-means [2], robust kernel $k$-means [19] and others.

Furthermore, let us stress here that the aforementioned papers do not care at all about whether or not the kernel matrices are embeddable in Euclidean space, which is the basic assumption for applying the basic form of kernel-$k$-means. Non-Euclidean spaces require a serious modification of $k$-means, accommodating the fact that the gravity center of a cluster can no longer serve as the cluster center (gradient descent methods are needed, for example, see [13, Section 6]).

For these reasons a definitive resolution of the Gower theorem dilemma seems to be of utmost importance.

3 Kernel- $k$ -means

The well-known $k$-means clustering algorithm is claimed to minimize an objective function that is the sum of squared distances of the data points to their cluster centers. It consists of the following steps: (1) creating the initial clustering, (2) computing the cluster center of each cluster, (3) creating a new clustering by assigning each data point to the cluster defined by the closest cluster center, and (4) repeating steps (2) and (3) until some termination condition is met. There exists a large variety of variants of this algorithm. For example, step (1) may consist in the random selection of $k$ distinct data points as cluster centers followed by step (3). Another variant may replace step (2) with a step (2') in which a single data point is moved from one cluster to another if and only if the move decreases the cost function, followed by step (2) proper. Steps (2) and (2') may also be applied alternately in subsequent iterations, and so on.

The kernel-based $k$-means clustering algorithm (clustering objects $1,\dots,m$ into $k$ clusters $1,\dots,k$) consists in switching to a multidimensional feature space $\mathcal{F}$ and searching therein for prototypes $\boldsymbol{\mu}_{j}^{\Phi}$ minimizing the error

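$$\sum_{i=1}^{m}\min_{1\leq j\leq k}\|\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\|^{2}$$

where $\Phi\colon\{1,\dots,m\}\to\mathcal{F}$ is a (usually non-linear) mapping of the space of objects into the feature space. The so-called "kernel trick" means the possibility of applying $k$-means clustering without knowing the function $\Phi(i)$ explicitly, using instead the so-called kernel matrix with elements $k_{ij}=\Phi(i)^{T}\Phi(j)=K(i,j)$.

In analogy to the classical $k$-means algorithm, the prototype vectors are updated according to the equation

$$\boldsymbol{\mu}_{j}^{\Phi}=\frac{1}{m_{j}}\sum_{i\in C_{j}}\Phi(i)$$

where $m_{j}$ is the cardinality of the $j$-th cluster $C_{j}$. A direct application of this equation is not possible unless the function $\Phi$ is known. But it may still be feasible if we know the so-called kernel matrix $K$, whose elements are the dot products of the data points in the feature space, that is $k_{ij}=\Phi(i)^{T}\Phi(j)=K(i,j)$. Given the matrix $K$, it is possible to compute the distances between the object images and the prototypes in the feature space by making use of the so-called "kernel trick". The "kernel trick" relies on the fact that the following transformation is possible:

$$\begin{aligned}
\|\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\|^{2}&=\big(\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\big)^{T}\big(\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\big)\\
&=\Phi(i)^{T}\Phi(i)-2\Phi(i)^{T}\boldsymbol{\mu}_{j}^{\Phi}+(\boldsymbol{\mu}_{j}^{\Phi})^{T}\boldsymbol{\mu}_{j}^{\Phi}\\
&=\Phi(i)^{T}\Phi(i)-\frac{2}{m_{j}}\sum_{h\in C_{j}}\Phi(i)^{T}\Phi(h)+\frac{1}{m_{j}^{2}}\sum_{r\in C_{j}}\sum_{s\in C_{j}}\Phi(r)^{T}\Phi(s)\\
&=k_{ii}-\frac{2}{m_{j}}\sum_{h\in C_{j}}k_{hi}+\frac{1}{m_{j}^{2}}\sum_{r\in C_{j}}\sum_{s\in C_{j}}k_{rs}
\end{aligned}$$

where, as already stated, $k_{ij}=\Phi(i)^{T}\Phi(j)=K(i,j)$.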

In this way, one can update the elements of clusters without determining the prototypes explicitly.
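The reassignment step based on the kernel matrix alone can be sketched as follows. This is a minimal illustration under stated assumptions: the names `kernel_distance_sq` and `assign` are hypothetical, clusters are given as lists of object indices, and a linear kernel is used only so that the result can be checked against explicit coordinates.

```python
def kernel_distance_sq(K, i, cluster):
    """Squared distance of Phi(i) to the prototype of `cluster`, from kernel entries only:
    k_ii - (2/m_j) * sum_h k_hi + (1/m_j^2) * sum_r sum_s k_rs."""
    m_j = len(cluster)
    cross = sum(K[h][i] for h in cluster)
    within = sum(K[r][s] for r in cluster for s in cluster)
    return K[i][i] - 2.0 * cross / m_j + within / m_j ** 2

def assign(K, clusters):
    """One reassignment step: each object goes to the cluster with the nearest prototype."""
    return [min(range(len(clusters)), key=lambda j: kernel_distance_sq(K, i, clusters[j]))
            for i in range(len(K))]

# Linear kernel (Phi is the identity), chosen so the distances are easy to verify by hand.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
K = [[sum(a * b for a, b in zip(p, q)) for q in X] for p in X]
clusters = [[0, 1], [2, 3]]
labels = assign(K, clusters)
```

Here the prototype of the first cluster is $(0,1)$ and that of the second is $(10,1)$, so the squared distance of object $0$ to the second prototype is $101$, and the function obtains it without ever forming the prototypes.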

Let $Y$ be the matrix $Y=(\Phi(1),\Phi(2),\dots,\Phi(m))^{T}$. Then evidently $K=YY^{T}$. Hence for any non-zero vector $\mathbf{u}$ we have $\mathbf{u}^{T}K\mathbf{u}=\mathbf{u}^{T}YY^{T}\mathbf{u}=(Y^{T}\mathbf{u})^{T}(Y^{T}\mathbf{u})=\mathbf{y}^{T}\mathbf{y}\geq 0$, where $\mathbf{y}=Y^{T}\mathbf{u}$, so $K$ must be positive semidefinite. But a symmetric matrix is positive semidefinite iff all its eigenvalues are non-negative. Furthermore, since $K$ is real and symmetric, its eigenvectors can be chosen to be real-valued.

So to identify $\Phi()$ at the data points, one has to find all eigenvalues $\lambda_{l}$, $l=1,\dots,m$, and corresponding orthonormal eigenvectors $\mathbf{v}_{l}$ of the matrix $K$. If all the eigenvalues are non-negative, then construct the matrix $Y$ that has as columns the products $\sqrt{\lambda_{l}}\mathbf{v}_{l}$. The rows of this matrix (up to permutations) are the values of the function $\Phi()$ at the data points $1,\dots,m$. More formally, if $V=(\mathbf{v}_{1},\dots,\mathbf{v}_{m})$ and $\Lambda$ is the vector of eigenvalues, then

$$Y=V\,\mathrm{diag}(\Lambda)^{1/2}$$

where $diag()$ turns a vector into a diagonal matrix and the exponent $1/2$ denotes the square root of each diagonal entry. It may be verified that kernel-$k$-means with the above $K$ matrix and ordinary $k$-means for $Y$ yield the same results.
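The construction of $Y$ from the eigen-decomposition of $K$ can be sketched as below. The text does not prescribe how the eigenvalues are obtained; this sketch swaps in cyclic Jacobi rotations purely so that it runs with the standard library, and the names `jacobi_eigh` and `embed`, the tolerances, and the sample matrix are assumptions made for the illustration.

```python
import math

def jacobi_eigh(A, sweeps=100, tol=1e-22):
    """Eigenvalues and orthonormal eigenvectors (columns of V) of a symmetric
    matrix, computed with cyclic Jacobi rotations."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(A[p][q] ** 2 for p in range(n) for q in range(n) if p != q) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp).
                if A[q][q] == A[p][p]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):              # right-multiply A and V by the rotation
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):            # left-multiply A by the transposed rotation
                    ap, aq = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], V

def embed(K, eps=1e-9):
    """Rows of the returned Y play the role of Phi(1), ..., Phi(m), so that K = Y Y^T."""
    lam, V = jacobi_eigh(K)
    if min(lam) < -eps:
        raise ValueError("K has a negative eigenvalue, so it is not a matrix of dot products")
    root = [math.sqrt(max(l, 0.0)) for l in lam]
    return [[V[i][l] * root[l] for l in range(len(K))] for i in range(len(K))]

# Illustrative positive definite kernel matrix.
K = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
Y = embed(K)
K_back = [[sum(a * b for a, b in zip(yi, yj)) for yj in Y] for yi in Y]
```

Multiplying the recovered $Y$ by its transpose reproduces $K$ up to rounding, which is the sense in which the rows of $Y$ serve as feature-space images of the objects.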

4 Gower formulation of distance-to-kernel-matrix transformation

Let us recall that a matrix $D\in\mathbb{R}^{m\times m}$ is a Euclidean distance matrix between points $1,\dots,m$ if and only if there exists a matrix $X\in\mathbb{R}^{m\times n}$ whose rows ($\mathbf{x}_{1}^{T},\dots,\mathbf{x}_{m}^{T}$) are the coordinate vectors of these points in an $n$-dimensional Euclidean space and

$$d_{ij}=\sqrt{(\mathbf{x}_{i}-\mathbf{x}_{j})^{T}(\mathbf{x}_{i}-\mathbf{x}_{j})}.$$

Gower [4] claims that

Theorem 1

$D$ is Euclidean iff the matrix $F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(-\frac{1}{2})D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ is positive semidefinite for any vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$ and $D_{sq}\mathbf{s}\neq\mathbf{0}$.

whereas in [6] he claims:

Theorem 2

$D$ is Euclidean iff the matrix $F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(-\frac{1}{2})D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ is positive semidefinite for any vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$.

Evidently the two claims do not quite match (with respect to the condition $D_{sq}\mathbf{s}\neq\mathbf{0}$). It must be underlined, however, that the paper [4] provides strong, though incomplete, clues as to how its Theorem 2 should be proven, so in what follows we use these clues to establish the result. We claim that Gower's theorem has the following deficiencies:

• the requirement $D_{sq}\mathbf{s}\neq\mathbf{0}$ is not needed in Theorem 1;

• the if-part was demonstrated neither for Theorem 1 nor for its correction in [6].

It should be noted at this point that in a 1985 paper Gower [5] derives his theorem in the latter version from a paper by Schoenberg [16]. The problem is that, first of all, Gower's result does not need this second derivation and, second, the paper by Schoenberg [16] does not prove what Gower [5] claims. So the issue is open and we want to address it here more thoroughly. We provide a correction, completing Gower's proof, in Section 5. See Section 6 for some numerical examples of the matrices and vectors that we operate on in Section 5. In Section 8 we draw some conclusions from the corrective proof.

5 Correction of Gower's result

In this section we shall correct Gower's result from [4].

For construction purposes we need still another formulation of the theorem, which is slightly more elaborate:

Theorem 3

1. If the matrix $D$ is a matrix of Euclidean distances, then for each vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$ the matrix

$$F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\left(-\frac{1}{2}\right)D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$$

is positive semidefinite ($D_{sq}$ being the matrix whose entries are the squares of the entries of the matrix $D$).

2. If $D$ is a symmetric matrix with zero diagonal and, for a vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$, the matrix $F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(-\frac{1}{2})D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ is positive semidefinite, then $D$ is Euclidean.

3. If $D$ is Euclidean, then for each vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$ the matrix $D$ can be derived from the matrix $F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(-\frac{1}{2})D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ in such a way that its squared entries can be computed as $d_{ij}^{2}=f_{ii}+f_{jj}-2f_{ij}$.

4. If $D$ is Euclidean, then for each vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$ the matrix $F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(-\frac{1}{2})D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ can be expressed as $F=YY^{T}$, where $Y$ is a real-valued matrix, and the rows of $Y$ can be considered as coordinates of data points the distances between which are those from the matrix $D$.

Let $D\in\mathbb{R}^{m\times m}$ be a matrix of Euclidean distances $d_{ij}$ between objects $i,j\in\{1,\dots,m\}$ . Let $D_{sq}$ be a matrix of squared Euclidean distances $d_{ij}^{2}$ between objects with identifiers $1,\dots,m$ . This means that there must exist a matrix $X\in\mathbb{R}^{m\times n}$ for some $n$ , rows of which represent coordinates of these objects in an $n$ -dimensional space. This real-valued matrix $X$ represents an embedding of the Euclidean distance matrix $D$ into $\mathbb{R}^{m\times n}$ . A distance matrix can be called Euclidean if and only if an embedding exists. If $E=XX^{T}$ ( $E$ with dimensions $m\times m$ ), then $d^{2}_{ij}=e_{ii}+e_{jj}-2e_{ij}$ .

As a rigid set of points in Euclidean space can be moved (shifted, rotated, flipped symmetrically; Gower does not consider flipping) without changing their relative distances, there may exist many other matrices $Y$ whose rows represent coordinates of these same objects in the same $n$-dimensional space after some isometric transformation. Let us denote the set of all such embeddings by $\mathcal{E}(D)$. If a matrix $Y\in\mathcal{E}(D)$, then for the product $F=YY^{T}$ we have $d^{2}_{ij}=f_{ii}+f_{jj}-2f_{ij}$. We will then say that $F\in\mathcal{E}_{dp}(D)$.

For an $F\in\mathcal{E}_{dp}(D)$ define a matrix $G=F+\frac{1}{2}D_{sq}$ . Hence $F=G-\frac{1}{2}D_{sq}$ . Obviously then

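$$\begin{aligned}
d_{ij}^{2}&=f_{ii}+f_{jj}-2f_{ij}\\
&=\left(g_{ii}-\frac{1}{2}d_{ii}^{2}\right)+\left(g_{jj}-\frac{1}{2}d_{jj}^{2}\right)-2\left(g_{ij}-\frac{1}{2}d_{ij}^{2}\right)\\
&=g_{ii}+g_{jj}-2g_{ij}+d_{ij}^{2}
\end{aligned}$$

(as $d_{jj}=0$ for all $j$). This implies that

$$0=g_{ii}+g_{jj}-2g_{ij}$$

that is

$$g_{ij}=\frac{g_{ii}+g_{jj}}{2}$$

So $G$ is of the form

$$G=\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}$$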

with components of $\mathbf{g}\in\mathbb{R}^{m}$ equal $g_{i}=\frac{1}{2}g_{ii}$ .

Therefore, to find $F\in\mathcal{E}_{dp}(D)$ for a Euclidean matrix $D$ we need only consider matrices deviating from $-\frac{1}{2}D_{sq}$ by $\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}$ for some $\mathbf{g}$. Let us denote by $\mathcal{G}(D)$ the set of all matrices $F$ such that $F=\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}-\frac{1}{2}D_{sq}$. So for each matrix $F$, if $F\in\mathcal{E}_{dp}(D)$ then $F\in\mathcal{G}(D)$, but not vice versa. We stress that we work with a Euclidean matrix $D$. So we would like to find an $F$ that is decomposable into real-valued matrices $Y$ such that $F=YY^{T}$, so that $Y$ would represent an embedding of a Euclidean distance matrix. But first of all, even if $D$ is not Euclidean, or not even metric, such an embedding may be found (see Gower et al. [6]).

As Gower et al. [6] state in their Theorem 1, any non-metric dissimilarity measure $d(\mathfrak{z},\mathfrak{y})$ for $\mathfrak{z},\mathfrak{y}\in{\mathfrak{X}}$, where ${\mathfrak{X}}$ is finite, can be turned into a (metric) distance function $d^{\prime}(\mathfrak{z},\mathfrak{y})=d(\mathfrak{z},\mathfrak{y})+c$, where $c$ is a constant such that $c\geq\max_{\mathfrak{x},\mathfrak{y},\mathfrak{z}\in\mathfrak{X}}|d(\mathfrak{x},\mathfrak{y})+d(\mathfrak{y},\mathfrak{z})-d(\mathfrak{z},\mathfrak{x})|$. Furthermore, Gower et al. [6] recall in their Theorem 7 that any dissimilarity matrix $D$ may be turned into a Euclidean distance matrix by adding an appropriate constant, e.g. $d^{\prime}(\mathfrak{z},\mathfrak{y})=\sqrt{d(\mathfrak{z},\mathfrak{y})^{2}+\sigma}$, where $\sigma$ is a constant such that $\sigma\geq-\lambda_{m}$, $\lambda_{m}$ being the smallest eigenvalue of $(\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m)(-\frac{1}{2}D_{sq})(\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m)$, $D_{sq}$ is the matrix of squared values of the elements of $D$, and $m$ is the number of rows/columns in $D$.

So even if $D$ is actually a Euclidean distance matrix and $F=-\frac{1}{2}D_{sq}+\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}$, there is no guarantee that the distance matrix induced by the corresponding $Y$ is identical to $D$.

For an $F\in\mathcal{G}(D)$ consider the matrix $F^{*}=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)F\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$ . We obtain

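$$\begin{aligned}
F^{*}&=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\left(\mathbf{g}\mathbf{1}^{T}+\mathbf{1}\mathbf{g}^{T}-\frac{1}{2}D_{sq}\right)\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}\\
&=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{g}\mathbf{1}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}+\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}\\
&\quad-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}
\end{aligned}\tag{17}$$

Let us investigate $\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$:

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)=\mathbf{1}\mathbf{g}^{T}-\mathbf{1}\mathbf{g}^{T}\mathbf{s}\mathbf{1}^{T}-\mathbf{1}\mathbf{s}^{T}\mathbf{1}\mathbf{g}^{T}+\mathbf{1}\mathbf{s}^{T}\mathbf{1}\mathbf{g}^{T}\mathbf{s}\mathbf{1}^{T}$$

Let us make the following choice (always possible) of $\mathbf{s}$ with respect to $\mathbf{g}$: $\mathbf{s}^{T}\mathbf{1}=1$, $\mathbf{s}^{T}\mathbf{g}=0$.

Then we obtain from the above equation

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)=\mathbf{1}\mathbf{g}^{T}-\mathbf{1}\cdot 0\cdot\mathbf{1}^{T}-\mathbf{1}\mathbf{g}^{T}+\mathbf{1}\mathbf{s}^{T}\mathbf{1}\cdot 0\cdot\mathbf{1}^{T}=\mathbf{0}\mathbf{0}^{T}\tag{19}$$

By analogy

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{g}\mathbf{1}^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}=\left(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)\mathbf{1}\mathbf{g}^{T}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)\right)^{T}=\mathbf{0}\mathbf{0}^{T}\tag{20}$$

By substituting (19) and (20) into (17) we obtain

$$F^{*}=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$$

So for any $\mathbf{g}$, hence for any $F\in\mathcal{G}(D)$, we can find an $\mathbf{s}$ such that:

$$\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)F\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$$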
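The elimination of the $\mathbf{g}$ terms can be checked numerically. The following is a minimal sketch with assumed data: the helper names (`matmul`, `transpose`, `centering`, `sandwich`), the 3-4-5 triangle used for $D_{sq}$, and the particular $\mathbf{g}$ and $\mathbf{s}$ are all chosen for illustration, with $\mathbf{s}$ picked so that $\mathbf{s}^{T}\mathbf{1}=1$ and $\mathbf{s}^{T}\mathbf{g}=0$.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def centering(s):
    """The matrix I - 1 s^T, whose (i, j) entry is delta_ij - s_j."""
    m = len(s)
    return [[(1.0 if i == j else 0.0) - s[j] for j in range(m)] for i in range(m)]

def sandwich(P, M):
    """P M P^T."""
    return matmul(matmul(P, M), transpose(P))

# Squared distances of a 3-4-5 right triangle.
d_sq = [[0.0, 9.0, 16.0], [9.0, 0.0, 25.0], [16.0, 25.0, 0.0]]
g = [1.0, 2.0, -1.0]      # arbitrary vector g
s = [0.5, 0.0, 0.5]       # satisfies s^T 1 = 1 and s^T g = 0
m = len(s)

P = centering(s)
F = [[g[i] + g[j] - 0.5 * d_sq[i][j] for j in range(m)] for i in range(m)]  # element of G(D)
F_star = sandwich(P, F)
target = sandwich(P, [[-0.5 * x for x in row] for row in d_sq])
one_g = [[g[j] for j in range(m)] for _ in range(m)]   # the matrix 1 g^T
vanishing = sandwich(P, one_g)

max_diff = max(abs(F_star[i][j] - target[i][j]) for i in range(m) for j in range(m))
max_vanishing = max(abs(x) for row in vanishing for x in row)
```

For this data `max_diff` and `max_vanishing` are both zero up to rounding: the $\mathbf{1}\mathbf{g}^{T}$ term is annihilated and $F^{*}$ coincides with the multiplicative form.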

For any matrix $F=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$ for some $\mathbf{s}$ with $\mathbf{1}^{T}\mathbf{s}=1$ we say that $F$ is in multiplicative form or $F\in\mathcal{M}(D)$ .

If $F=YY^{T}$ , that is $F$ is decomposable, then also

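$$F^{*}=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)YY^{T}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}=\left(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)Y\right)\left(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)Y\right)^{T}=Y^{*}{Y^{*}}^{T}$$

is decomposable. But

$$Y^{*}=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)Y=Y-\mathbf{1}\mathbf{s}^{T}Y=Y-\mathbf{1}\mathbf{v}^{T}$$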

where $\mathbf{v}=Y^{T}\mathbf{s}$ is a shift vector by which the whole matrix $Y$ is shifted to a new location in the Euclidean space. So the distances between objects computed from $Y^{*}$ are the same as those from $Y$ , hence if $F\in\mathcal{E}_{dp}(D)$ , then $Y^{*}\in\mathcal{E}(D)$ .
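That subtracting the same shift vector from every row leaves all pairwise distances unchanged can be illustrated with a short sketch. The coordinates in `Y`, the weights `s` and the helper name `pairwise_sq` are assumptions made for the example.

```python
def pairwise_sq(Y):
    """Squared Euclidean distances between the rows of Y."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in Y] for p in Y]

Y = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]   # illustrative embedding, one object per row
s = [0.2, 0.3, 0.5]                        # weights with s^T 1 = 1

# Shift vector v = Y^T s, i.e. the s-weighted combination of the rows of Y.
v = [sum(s[i] * Y[i][c] for i in range(len(Y))) for c in range(len(Y[0]))]

# Y* = Y - 1 v^T: the same vector v is subtracted from every row.
Y_star = [[y - vc for y, vc in zip(row, v)] for row in Y]

before, after = pairwise_sq(Y), pairwise_sq(Y_star)
weighted_origin = [sum(s[i] * Y_star[i][c] for i in range(len(Y))) for c in range(len(Y[0]))]
```

In addition to the unchanged distances, the $\mathbf{s}$-weighted combination of the rows of $Y^{*}$ is the zero vector, so the choice of $\mathbf{s}$ only decides which point of the configuration is placed at the origin.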

Therefore, to find a matrix $F\in\mathcal{E}_{dp}(D)$ , yielding an embedding of $D$ in the Euclidean $n$ dimensional space we need only to consider matrices of the form $-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$ , subject to the already stated constraint $\mathbf{s}^{T}\mathbf{1}=1$ , that is ones from $\mathcal{M}(D)$ .

So we can conclude: if $D$ is a matrix of Euclidean distances, then there must exist a positive semidefinite matrix $F=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ for some vector $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$, $\det(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right))=0$ and $D_{sq}\mathbf{s}\neq\mathbf{0}$. These last two conditions are implied by the following fact: $D_{sq}$ is known not to be negative semidefinite, so that $F$ would not be positive semidefinite in at least the following cases: $\det(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right))\neq 0$ (see the reasoning prior to formula (30)) or $D_{sq}\mathbf{s}=\mathbf{0}$ (see the reasoning prior to formula (31)). So if $D$ is a Euclidean distance matrix, then there exists an $F\in\mathcal{M}(D)\cap\mathcal{E}_{dp}(D)$.

Let us investigate other vectors $\mathbf{t}$ such that $\mathbf{t}^{T}\mathbf{1}=1$ . Note that

[TABLE]

Therefore, for a matrix $F\in\mathcal{M}(D)$

[TABLE]

But if $F=YY^{T}\in\mathcal{E}_{dp}(D)$ , then

[TABLE]

and hence each $-\frac{1}{2}(\mathbf{I}-\mathbf{1}\mathbf{t}^{T})D_{sq}(\mathbf{I}-\mathbf{1}\mathbf{t}^{T})^{T}$ is also in $\mathcal{E}_{dp}(D)$ , though with a different placement (by a shift) in the coordinate systems of the embedded data points. So if one element of $\mathcal{M}(D)$ is in $\mathcal{E}_{dp}(D)$ , then all of them are.
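The claim that all elements of $\mathcal{M}(D)$ are decomposable as soon as one of them is can be checked numerically. The following is a minimal sketch under assumptions that are not part of the paper: the data are random, NumPy is used, and the helper names `multiplicative_form` and `is_psd` are hypothetical. It builds the multiplicative form for several vectors $\mathbf{s}$ with $\mathbf{s}^{T}\mathbf{1}=1$ (some with negative components) and tests positive semidefiniteness.

```python
import numpy as np

def multiplicative_form(D, s):
    """Return F = -1/2 (I - 1 s^T) D_sq (I - 1 s^T)^T for a distance matrix D."""
    m = D.shape[0]
    A = np.eye(m) - np.outer(np.ones(m), s)   # I - 1 s^T
    return -0.5 * A @ (D ** 2) @ A.T

def is_psd(F, tol=1e-8):
    """Positive semidefiniteness test via the smallest eigenvalue (relative tolerance)."""
    eigvals = np.linalg.eigvalsh((F + F.T) / 2.0)
    return bool(eigvals.min() >= -tol * max(1.0, np.abs(eigvals).max()))

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 4))                   # illustrative data, not the paper's
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Several vectors s with s^T 1 = 1, including ones with negative entries.
candidates = [np.full(7, 1.0 / 7)]
for _ in range(5):
    t = rng.normal(size=7)
    while abs(t.sum()) < 0.1:                 # avoid dividing by a near-zero sum
        t = rng.normal(size=7)
    candidates.append(t / t.sum())

results = [is_psd(multiplicative_form(D, s)) for s in candidates]
print(results)
```

The relative tolerance is a design choice: vectors $\mathbf{s}$ with large components inflate the entries of $F$, so an absolute threshold would be too strict.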

So we have established that if $D$ is a Euclidean distance matrix (that is, there exists a matrix $X$ whose rows are coordinates of objects in a Euclidean space with distances as in $D$), then there exists a decomposable matrix $F=YY^{T}\in\mathcal{E}_{dp}(D)$ which is in $\mathcal{G}(D)$, hence $\mathcal{E}_{dp}(D)\subset\mathcal{G}(D)$. For each matrix in $\mathcal{G}(D)\cap\mathcal{E}_{dp}(D)$ there exists a multiplicative form matrix in $\mathcal{M}(D)\cap\mathcal{E}_{dp}(D)$. But if it exists, all multiplicative forms are there: $\mathcal{M}(D)\subset\mathcal{E}_{dp}(D)$.

In this way we have proven points 1, 3 and 4 of Theorem 3, and also the only-if-part of the correction of Gower's theorem in [6].

However, two things remain to be clarified and are addressed neither in [4] nor in [6]: the if-part of the corrected theorem in [6] (given a matrix $D$ such that $-0.5\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$ is positive semidefinite, is $D$ a Euclidean distance matrix? See point 2 of Theorem 3) and the status of the additional condition $D_{sq}\mathbf{s}\neq\mathbf{0}$ in Theorem 1.

Gower [4] makes the following remark: $F=\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(-\frac{1}{2}D_{sq})\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ is to be positive semidefinite for Euclidean $D$ . However, for non-zero vectors $\mathbf{u}$

[TABLE]

But $D_{sq}$ is known to be not negative semidefinite, so that $F$ would not be positive semidefinite in at least the following cases: $\det(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right))\neq 0$ and $D_{sq}\mathbf{s}=\mathbf{0}$ . Let us have a brief look at these conditions and why they are neither welcome nor actually existent:

Situation $\det(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right))\neq 0$ is not welcome, because there exists a vector $\mathbf{u^{\prime}}$ such that $\mathbf{u^{\prime}}^{T}D_{sq}\mathbf{u^{\prime}}>0$ and under $\det(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right))\neq 0$ we could solve the equation $\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}\mathbf{u}=\mathbf{u^{\prime}}$ and thus demonstrate that for some $\mathbf{u}$

[TABLE]

However this situation is impossible, because for $F\in\mathcal{M}(D)$

[TABLE]

which means that the rows are linearly dependent, hence $\det(\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right))=0$ is guaranteed by the earlier assumption about $\mathbf{s}$; so this concern by Gower needs to be dismissed as pointless.

Situation $D_{sq}\mathbf{s}=\mathbf{0}$ is not welcome, because then

[TABLE]

and thus

[TABLE]

denying positive semidefiniteness of $F$ . Gower does not consider this further, but such a situation is impossible. Recall that because $D$ is Euclidean, there must exist a vector $\mathbf{r}$ such that $\mathbf{r}^{T}\mathbf{1}=1$ and

[TABLE]

is in $\mathcal{E}_{dp}(D)$ . Hence for any $\mathbf{s}$ such that $\mathbf{s}^{T}\mathbf{1}=1$

[TABLE]

is positive semidefinite. This allows us to conclude that for such $\mathbf{s}$ we have $D_{sq}\mathbf{s}\neq\mathbf{0}$. Therefore, if $D_{sq}\mathbf{s}=\mathbf{0}$, then $\mathbf{s}^{T}\mathbf{1}=0$ (otherwise rescaling $\mathbf{s}$ to unit sum would contradict the previous sentence). What is more, if $\det(D_{sq})\neq 0$, then $D_{sq}\mathbf{s}=\mathbf{0}$ implies $\mathbf{s}=\mathbf{0}$, for which of course $\mathbf{s}^{T}\mathbf{1}=0$.

Hence the last assumption of the if-part of Theorem 1 can be dropped as unnecessary, which simplifies it to the corrected theorem in [6].

As we can see from the first point above, $F$ , given by

[TABLE]

does not need to identify uniquely a matrix $D$, as $\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)$ is not invertible, though of course it identifies a Euclidean distance matrix.

Let us now demonstrate the missing part of Gower's proof, namely that $D$ is uniquely defined given a decomposable $F$.

So assume that for some $D$ (of which we do not know whether it is Euclidean, but which is symmetric and has a zero diagonal), $F=-0.5\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$ and $F$ is decomposable, that is $F=YY^{T}$. Let $\mathcal{D}(Y)$ be the distance matrix derived from $Y$ (that is, the distance matrix for which $Y$ is an embedding). This means that $F$ is decomposable into properly distanced points with respect to $\mathcal{D}(Y)$, and that $F$ is in additive form with respect to it, that is $F\in\mathcal{G}(\mathcal{D}(Y))$. Therefore there must exist some $\mathbf{s^{\prime}}$ such that $F^{\prime}=-0.5\left(\mathbf{I}-\mathbf{1}\mathbf{s^{\prime}}^{T}\right)\mathcal{D}(Y)_{sq}\left(\mathbf{I}-\mathbf{s^{\prime}}\mathbf{1}^{T}\right)$ is a valid multiplicative form with respect to $\mathcal{D}(Y)$, and it holds that $F^{\prime}=\left(\mathbf{I}-\mathbf{1}\mathbf{s^{\prime}}^{T}\right)F\left(\mathbf{I}-\mathbf{s^{\prime}}\mathbf{1}^{T}\right)$. But recall that

[TABLE]

Hence

[TABLE]

So we need to demonstrate that for two symmetric matrices with zero diagonals $D,D^{\prime}$ such that $-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)=-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D^{\prime}_{sq}\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)$ the equation $D=D^{\prime}$ holds.

It is easy to see that $-\frac{1}{2}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)(D_{sq}-D^{\prime}_{sq})\left(\mathbf{I}-\mathbf{s}\mathbf{1}^{T}\right)=\mathbf{0}\mathbf{0}^{T}$ . Denote $\Delta=D_{sq}-D^{\prime}_{sq}$ .

[TABLE]

Denote by $\boldsymbol{\overline{\Delta}}$ the vector $\Delta\mathbf{s}$ and by $c$ the scalar $\mathbf{s}^{T}\Delta\mathbf{s}$. So we have

[TABLE]

So in row $i$, column $j$ of the above equation we have: $\delta_{ij}+c-\overline{\delta}_{i}-\overline{\delta}_{j}=0$. Let us add cells $ii$ and $jj$ and subtract from them cells $ij$ and $ji$. We obtain $\delta_{ii}+c-\overline{\delta}_{i}-\overline{\delta}_{i}+\delta_{jj}+c-\overline{\delta}_{j}-\overline{\delta}_{j}-\delta_{ij}-c+\overline{\delta}_{i}+\overline{\delta}_{j}-\delta_{ji}-c+\overline{\delta}_{j}+\overline{\delta}_{i}=\delta_{ii}+\delta_{jj}-\delta_{ij}-\delta_{ji}=0$. Since the diagonals of $D$ and $D^{\prime}$ are zero, $\delta_{ii}=\delta_{jj}=0$, so $-\delta_{ij}-\delta_{ji}=0$. But $\delta_{ij}=\delta_{ji}$ because $D,D^{\prime}$ are symmetric. Hence $-2\delta_{ji}=0$, so $\delta_{ji}=0$. This means that $D_{sq}=D^{\prime}_{sq}$ and hence, the entries of both matrices being non-negative, $D=D^{\prime}$.

This means that $D$ and $\mathcal{D}(Y)$ are identical. Hence decomposability of $F=-0.5\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)D_{sq}\left(\mathbf{I}-\mathbf{1}\mathbf{s}^{T}\right)^{T}$ is sufficient to prove that $D$ can be embedded in a Euclidean space, and the decomposition yields this embedding. This proves the if-part of Gower's Theorem 1 and of the corrected theorem in [6] (Theorem 2), as well as point 2 of Theorem 3.

6 A numerical example

Let us illustrate the process of generating a kernel matrix from a distance table and show that the distances between the objects in the feature space really match the distances of the original distance matrix. We took an $n=4$-dimensional data matrix with $m=7$ objects.

[TABLE]

and derived from it an original Euclidean distance matrix

[TABLE]

We applied to it the transformation from equation (8) using the vector

[TABLE]

and obtained the (kernel) matrix

[TABLE]

After eigen-decomposition of $F$ , we get via equation (6) the embedding matrix (after ignoring columns with next to zero eigenvalues)

[TABLE]

which produces the distance matrix

[TABLE]

The sum of squared differences between the corresponding entries in the distance matrices $D$ and $D_{0}$ amounts to 5.727256e-25.

It can be easily seen that $D$ is (nearly) identical with $D_{0}$, though the embeddings $X$ and $Y$ differ. The $k$-means algorithm as implemented in R (\texttt{kmeans} with \texttt{centers=2, nstart=100}) was run for both embeddings $X$ and $Y$, yielding in both cases the clustering [2, 2, 2, 1, 1, 1, 1].

A version of kernel-$k$-means, as described in this paper, was also implemented and produced for the kernel matrix $F$ the very same clustering, [2, 2, 2, 1, 1, 1, 1].

Note that the same distance matrix can be turned into a kernel matrix using different $\mathbf{s}$ vectors. We applied to it the transformation from equation (8) using the vector

[TABLE]

and obtained the (kernel) matrix

[TABLE]

After eigen-decomposition of $F^{\prime}$ , we get via equation (6) the embedding matrix (after ignoring columns with next to zero eigenvalues)

[TABLE]

which produces the distance matrix

[TABLE]

The sum of squared differences between the corresponding entries in the distance matrices $D^{\prime}$ and $D_{0}$ amounts to 2.090166e-25.

Not surprisingly, our implementation of kernel-$k$-means, as described in this paper, produced for the kernel matrix $F^{\prime}$ the very same clustering, [2, 2, 2, 1, 1, 1, 1].
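The steps of this numerical example can be sketched in code. The block below is only an illustration under stated assumptions: the paper's data tables are not reproduced, so random data are used instead; the helper names (`kernel_from_distances`, `embedding_from_kernel`, `pairwise_distances`) and the two choices of $\mathbf{s}$ are hypothetical; NumPy replaces the R implementation mentioned above.

```python
import numpy as np

def kernel_from_distances(D, s):
    """F = -1/2 (I - 1 s^T) D_sq (I - 1 s^T)^T."""
    m = D.shape[0]
    A = np.eye(m) - np.outer(np.ones(m), s)
    return -0.5 * A @ (D ** 2) @ A.T

def embedding_from_kernel(F, tol=1e-9):
    """Eigen-decompose F and keep only columns whose eigenvalue exceeds tol."""
    eigvals, eigvecs = np.linalg.eigh((F + F.T) / 2.0)
    keep = eigvals > tol
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

def pairwise_distances(Y):
    return np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 4))            # stand-in for the paper's 7 x 4 data matrix
D0 = pairwise_distances(X)

s_uniform = np.full(7, 1.0 / 7)        # centers the embedding at the mean
s_first = np.zeros(7)
s_first[0] = 1.0                       # places the first object at the origin

Y1 = embedding_from_kernel(kernel_from_distances(D0, s_uniform))
Y2 = embedding_from_kernel(kernel_from_distances(D0, s_first))

# Sum of squared differences between recovered and original distances.
err1 = np.sum((pairwise_distances(Y1) - D0) ** 2)
err2 = np.sum((pairwise_distances(Y2) - D0) ** 2)
print(err1, err2)
```

Both embeddings should reproduce $D_{0}$ up to rounding error although the coordinates themselves differ, which mirrors the observation made for $F$ and $F^{\prime}$.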

7 $k$ -means under non-Euclidean kernels

In many cases, like Laplacians of graphs, we know in advance that the matrices can be deemed kernels embedded into a Euclidean space, so that there are no obstacles to applying kernel-$k$-means clustering. However, this need not always be the case. Let us now discuss the concerns about applying kernel-$k$-means in such situations and about the validity of the obtained clusters.

Let $w_{1},\dots,w_{m}$ be non-negative weights of data points $1,\dots,m$ . Let $C$ be such a subset of $\{1,\dots,m\}$ that $\sum_{i\in C}w_{i}\neq 0$ . Define $\boldsymbol{\mu}_{\mathbf{w}}^{\Phi}(C)$ as a weighted center of the datapoints of $C$ as follows:

$$\boldsymbol{\mu}_{\mathbf{w}}^{\Phi}(C)=\frac{1}{\sum_{i\in C}w_{i}}\sum_{i\in C}w_{i}\Phi(i)$$

It is easily seen that it is possible to compute the squared distance of any data point to a weighted center of a set.

[TABLE]
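A possible way to carry out this computation from kernel entries alone is sketched below. It relies on the standard bilinear expansion of the squared norm, which may differ in notation from the formula used in the paper; the function name `sq_dist_to_weighted_center` and the random test data are assumptions made for illustration only.

```python
import numpy as np

def sq_dist_to_weighted_center(K, i, members, weights):
    """Squared feature-space distance from point i to the weighted center of `members`.

    Only kernel entries are used:
    K_ii - (2/W) sum_j w_j K_ij + (1/W^2) sum_{j,l} w_j w_l K_jl, with W = sum_j w_j,
    where j, l range over `members` and `weights` lists the weights of those members.
    """
    members = np.asarray(members)
    w = np.asarray(weights, dtype=float)
    W = w.sum()
    if W == 0:
        raise ValueError("weights of the cluster must not sum to zero")
    cross = K[i, members] @ w / W
    within = w @ K[np.ix_(members, members)] @ w / W ** 2
    return K[i, i] - 2.0 * cross + within

# Illustrative check against explicit coordinates (random data, not from the paper).
rng = np.random.default_rng(3)
Phi = rng.normal(size=(6, 3))          # rows play the role of Phi(1), ..., Phi(6)
K = Phi @ Phi.T                        # linear kernel, so the feature space is known

members = [0, 1, 2]
weights = [10.0, 1.0, 1.0]
center = np.average(Phi[members], axis=0, weights=weights)

via_kernel = [sq_dist_to_weighted_center(K, i, members, weights) for i in range(6)]
explicit = [float(np.sum((Phi[i] - center) ** 2)) for i in range(6)]
print(np.allclose(via_kernel, explicit))
```

The function never touches the coordinates, so it can also be evaluated for kernel matrices that admit no Euclidean embedding; in that case the returned value may be negative.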

Let us now consider the consequences of applying the kernel-$k$-means algorithm when no Euclidean embedding exists.

The kernel-$k$-means algorithm consists in switching to a multidimensional feature space $\mathcal{F}$, and it is claimed to search therein for prototypes $\boldsymbol{\mu}_{j}^{\Phi}$ minimizing the error

$$\sum_{i=1}^{m}\min_{1\leq j\leq k}\left\|\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\right\|^{2}$$

over all possible choices of the set of cluster centers $\boldsymbol{\mu}_{j}^{\Phi}$ , $j=1,\dots,k$ .

But this is actually not the entire truth. $\boldsymbol{\mu}_{j}^{\Phi}$ may only be equal to

$$\boldsymbol{\mu}_{j}^{\Phi}=\frac{1}{m_{j}}\sum_{i\in C_{j}}\Phi(i)$$

for some subset $C_{j}$ of all the data points and no other vectors in the feature space are taken into account. If the feature space is Euclidean, it is guaranteed that no other vector from the feature space shall ever be considered as cluster center, because the clustering will not be optimal. It is not so in case of non-Euclidean feature spaces. To demonstrate this, we will use an example.

Consider the following non-Euclidean distance matrix

[TABLE]

and the corresponding kernel matrix

[TABLE]

If we apply kernel- $k$ -means clustering with $k=2$ , this implies a clustering [ 2, 2, 1, 2, 2, 1] with the total value of the cost function 1325 . Other clusterings would not be better. Check e.g. that the clustering [1,1,1,2,2,2] produces the cost function amounting to 1400 which is higher than what kernel- $k$ -means produces.

But consider now a different clustering, [1,1,1,2,2,2], where you choose weighted cluster centers with weights [10,1,1,10,1,1], instead of the $k$ -means cluster centers. Then the cost function will amount to 1175 which is below what kernel- $k$ -means produces.

In this way we have proven that

Theorem 4

kernel- $k$ -means does not optimize the cost function

$$\sum_{i=1}^{m}\min_{1\leq j\leq k}\left\|\Phi(i)-\boldsymbol{\mu}_{j}^{\Phi}\right\|^{2}$$

for non-Euclidean kernel matrices.
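The mechanism behind this statement can be reproduced on a much smaller toy case than the one discussed above. The sketch below is an assumption-laden illustration, not the paper's example: it uses three objects whose dissimilarities violate the triangle inequality, a single cluster, and hypothetical helper names. It compares the cost obtained with the ordinary mean as center to the cost obtained with a weighted center.

```python
import numpy as np

def centered_kernel(D_sq):
    """Kernel matrix -1/2 J D_sq J with J = I - 1 1^T / m."""
    m = D_sq.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    return -0.5 * J @ D_sq @ J

def cost_with_center_weights(K, w):
    """Sum over all points of the squared kernel distance to the center sum_j w_j Phi(j)."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                       # normalise the weights to unit sum
    m = K.shape[0]
    return np.trace(K) - 2.0 * np.sum(K @ w) + m * (w @ K @ w)

# Toy dissimilarities violating the triangle inequality (1 + 1 < 3); not the paper's matrix.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])
K = centered_kernel(D ** 2)

cost_mean = cost_with_center_weights(K, [1, 1, 1])       # ordinary k-means center
cost_weighted = cost_with_center_weights(K, [1, 2, 1])   # a weighted center
print(cost_mean, cost_weighted)
```

For this toy matrix the mean-based cost equals $11/3$, while the weights $(1,2,1)$ give $3.5625$, so the mean is not the minimizer. The kernel has the negative eigenvalue $-5/6$, and moving the center along the corresponding eigenvector lowers the cost, which cannot happen for a positive semidefinite kernel.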

We have already mentioned Theorem 7 of Gower and Legendre [6], stating that any dissimilarity matrix $D$ may be turned into a Euclidean distance matrix by adding a constant $\sigma$ to the squared distances as follows: $d^{\prime}(\mathfrak{z},\mathfrak{y})=\sqrt{d(\mathfrak{z},\mathfrak{y})^{2}+\sigma}$, where $\sigma$ is a constant such that $\sigma\geq-\lambda_{m}$, $\lambda_{m}$ being the smallest eigenvalue of $(\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m)(-\frac{1}{2}D_{sq})(\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m)$, $D_{sq}$ is the matrix of squared values of elements of $D$, and $m$ is the number of rows/columns in $D$.

Gower's Theorem 7 is actually incorrect as stated. Let us continue the above example. Gower's constant for ${}_{E}F$ amounts to $\sigma=757.205$. Upon modifying the distance matrix we get the new kernel matrix

[TABLE]

which is again non-Euclidean, because its lowest eigenvalue is equal to $-378.603$.
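The size of the remaining negative eigenvalue can be traced in one line. As a sketch, assume that Gower's constant is taken at its lower bound, $\sigma=-\lambda_{m}$, and write $J=\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m$ (a shorthand introduced here only). Adding $\sigma$ to every off-diagonal squared distance gives $D^{\prime}_{sq}=D_{sq}+\sigma(\mathbf{1}\mathbf{1}^{T}-\mathbf{I})$, and since $J\mathbf{1}=\mathbf{0}$ and $JJ=J$,

$$J\left(-\tfrac{1}{2}D^{\prime}_{sq}\right)J=J\left(-\tfrac{1}{2}D_{sq}\right)J+\tfrac{\sigma}{2}J.$$

So the eigenvalues belonging to eigenvectors with zero component sum rise by $\sigma/2$ only. With $\sigma=757.205$ the lowest eigenvalue moves from $-757.205$ to $-757.205+378.6025=-378.6025$, which agrees with the reported value $-378.603$ up to rounding.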

Let us now propose a correction of Gower's ``euclidesation'' theorem:

Theorem 5

Any dissimilarity matrix $D$ may be turned into a Euclidean distance matrix (compare Theorem 7 in [6]) by adding an appropriate constant to its non-diagonal elements, e.g. $d^{\prime}(\mathfrak{z},\mathfrak{y})=\sqrt{d(\mathfrak{z},\mathfrak{y})^{2}+2\sigma}$, where $\sigma$ is a constant such that $\sigma\geq-\lambda_{m}$, $\lambda_{m}$ being the smallest eigenvalue of $(\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m)(-\frac{1}{2}D_{sq})(\mathbf{I}-\mathbf{1}\mathbf{1}^{T}/m)$, $D_{sq}$ is the matrix of squared values of elements of $D$, and $m$ is the number of rows/columns in $D$.

Proof 1

The equation (24) allows us to conclude that given

[TABLE]

for a dissimilarity matrix $D$ , the following holds:

[TABLE]

Let $\mathbf{v}$ be an eigenvector of $F$ for a non-zero eigenvalue $\lambda$ . Therefore

[TABLE]

Assuming that $\mathbf{v^{\prime}}=(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m})\mathbf{v}$ , we get:

[TABLE]

which means that $\mathbf{v^{\prime}}$ is also an eigenvector of $F$ for the same eigenvalue. Notably, the sum of the components of $\mathbf{v^{\prime}}$ is equal to zero.

Consider now the following expression for some number $\sigma$ .

[TABLE]

Now consider an eigenvector $\mathbf{v^{\prime}}$ of $F^{\prime}$ for a non-zero eigenvalue $\lambda^{\prime}$ , such that the sum of its components equals zero. For each $\lambda$ such a vector always exists. We see immediately that

[TABLE]

that is that $(\lambda^{\prime}-\sigma)$ is an eigenvalue of $F$ with eigenvector $\mathbf{v^{\prime}}$ .

This means that by subtracting $\sigma$ from non-diagonal elements of $-\frac{1}{2}D_{sq}$ in the computation of $F$ we can increase by $\sigma$ those of its eigenvalues that belong to eigenvectors with zero sum. But subtracting $\sigma$ from non-diagonal elements of $-\frac{1}{2}D_{sq}$ means adding $\sigma$ to non-diagonal elements of $\frac{1}{2}D_{sq}$, or adding $2\sigma$ to non-diagonal elements of $D_{sq}$, or just replacing non-diagonal elements $d_{ij}$ of $D$ with $\sqrt{d_{ij}^{2}+2\sigma}$. If we add at least the negation of the lowest eigenvalue of a non-Euclidean $F$ to all its eigenvalues, then of course it turns into a Euclidean one, given that all eigenvectors with non-zero eigenvalues have zero sums of components.

How can we now tell if all such eigenvectors have zero sums? In case that all eigenvalues are different, this is simple. As shown, each eigenvalue has the zero sum eigenvector, and this is the only one up to scaling factor.

The details of handling special cases (of identical eigenvalues) follow now. Consider the set of all eigenvectors related to a multiple eigenvalue. The whole set can be represented as linear combinations of some number of orthogonal vectors from this set, with the number equal to the multiplicity of the eigenvalue. Let $\mathbf{v}$ be one of these orthogonal vectors. Then any linear combination of all the other orthogonal vectors is orthogonal to $\mathbf{v}$. Let $\mathbf{v^{\prime\prime}}$ be an example of such a combination. Then clearly $\mathbf{v^{\prime\prime}}^{T}\mathbf{v}=0$. But also $\mathbf{v^{\prime\prime}}^{T}(F\mathbf{v})=\lambda\mathbf{v^{\prime\prime}}^{T}\mathbf{v}=0$. Hence $\mathbf{v^{\prime\prime}}^{T}(F(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m})\mathbf{v})=\mathbf{v^{\prime\prime}}^{T}\lambda\mathbf{v^{\prime}}=0$. So $\mathbf{v^{\prime}}=(\mathbf{I}-\frac{\mathbf{1}\mathbf{1}^{T}}{m})\mathbf{v}$ is orthogonal to $\mathbf{v^{\prime\prime}}$. As the latter represents any vector orthogonal to $\mathbf{v}$ in the subspace co-spanned by $\mathbf{v}$, $\mathbf{v^{\prime}}$ must be identical to $\mathbf{v}$ up to a scaling factor. So the subspace of eigenvectors can be spanned by a set of orthogonal vectors with component sums equal to zero. Therefore all the eigenvectors of $F$ have this property, and hence adding the respective constant shifts the eigenvalues of all such eigenvectors. This completes the proof.
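The effect of the corrected constant can be checked numerically. The sketch below rests on assumptions not taken from the paper: a three-object toy dissimilarity matrix, NumPy, and hypothetical helper names. It takes $\sigma=-\lambda_{m}$ and compares adding $\sigma$ with adding $2\sigma$ to the off-diagonal squared dissimilarities.

```python
import numpy as np

def centered_kernel(D_sq):
    """Kernel matrix -1/2 J D_sq J with J = I - 1 1^T / m."""
    m = D_sq.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    return -0.5 * J @ D_sq @ J

def add_to_offdiagonal(D_sq, c):
    """Add the constant c to every off-diagonal squared dissimilarity."""
    m = D_sq.shape[0]
    return D_sq + c * (np.ones((m, m)) - np.eye(m))

def smallest_eigenvalue(K):
    return np.linalg.eigvalsh((K + K.T) / 2.0).min()

# Toy non-Euclidean dissimilarities (triangle inequality violated); not the paper's data.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])
D_sq = D ** 2

lam_min = smallest_eigenvalue(centered_kernel(D_sq))      # negative: not Euclidean
sigma = -lam_min

lam_plus_sigma = smallest_eigenvalue(
    centered_kernel(add_to_offdiagonal(D_sq, sigma)))        # constant as in [6]
lam_plus_2sigma = smallest_eigenvalue(
    centered_kernel(add_to_offdiagonal(D_sq, 2.0 * sigma)))  # constant as in Theorem 5
print(lam_min, lam_plus_sigma, lam_plus_2sigma)
```

For this toy matrix the smallest eigenvalue is $-5/6$; adding $\sigma$ raises it only to $-5/12$, whereas adding $2\sigma$ raises it to zero, so that the resulting kernel matrix is positive semidefinite.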

Let us illustrate Theorem 5 by continuing the previous example. Upon modifying the distance matrix according to Theorem 5, the euclidesation of the kernel ${}_{nE}F$ leads to the new kernel matrix

[TABLE]

which is now Euclidean, because its lowest eigenvalue is equal to 0. The kernel matrix ${}_{E}F$ implies a clustering [1, 2, 1, 1, 2, 1] with the total value of the cost function 4353.821. Other clusterings would not do better. Check e.g. that the clustering [1,1,1,2,2,2] produces the cost function amounting to 4428.821, which is higher than what kernel-$k$-means produces.

Consider now a different clustering, [1,1,1,2,2,2], where you choose weighted cluster centers with weights [10,1,1,10,1,1], instead of the $k$ -means cluster centers. Then the cost function will amount to 5907.533 which is again higher than what kernel- $k$ -means produces. In Euclidean space, kernel- $k$ -means produces appropriate results.

Note that the clustering obtained is identical with the clustering delivered by kernel- $k$ -means from the original kernel matrix ${}_{nE}F$ .

Let us investigate this phenomenon more generally.

Theorem 6

If we pursue the kernel- $k$ -means clustering when seeking the optimum among cluster center sets being a subset of the set of $\boldsymbol{\mu}_{j}^{\Phi}$ that may only be equal to

$$\boldsymbol{\mu}_{j}^{\Phi}=\frac{1}{m_{j}}\sum_{i\in C_{j}}\Phi(i)$$

for some subset $C_{j}$ of all the data points, and no other vectors in the feature space are taken into account, then adding a constant $\sigma$ to the distance matrix as follows: $d^{\prime}(\mathfrak{z},\mathfrak{y})=\sqrt{d(\mathfrak{z},\mathfrak{y})^{2}+2\sigma}$ leaves the optimal clustering unchanged.

Proof 2

If, within a cluster $C_{j}$ of cardinality $m_{j}$, the squared distances of an element $i$ to all other elements are modified as above, then its squared distance to the cluster center will increase by $\sigma\frac{m_{j}-1}{m_{j}}$, because $d(i,i)=0$ is unchanged. So the cost function of the whole cluster will change by $\sigma\frac{m_{j}-1}{m_{j}}m_{j}=\sigma\cdot(m_{j}-1)$, and the overall cost function of all $k$ clusters will increase by $\sigma\cdot(m-k)$. This increase depends only on $m$ and $k$, not on the partition itself. Hence the optimum clustering of $k$-means, achievable by kernel-$k$-means, will remain unchanged after this addition.
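The constant increase $\sigma\cdot(m-k)$ can be verified by brute force on a small instance. The sketch below makes several assumptions for illustration: random symmetric dissimilarities (not necessarily Euclidean), $m=6$, $k=2$, $\sigma=3$, and hypothetical helper names. It enumerates all partitions into two non-empty clusters and compares the cost before and after the shift.

```python
import itertools
import numpy as np

def centered_kernel(D_sq):
    """Kernel matrix -1/2 J D_sq J with J = I - 1 1^T / m."""
    m = D_sq.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    return -0.5 * J @ D_sq @ J

def kernel_kmeans_cost(K, labels):
    """Sum of squared kernel distances of points to their (unweighted) cluster means."""
    labels = np.asarray(labels)
    cost = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        block = K[np.ix_(idx, idx)]
        cost += np.trace(block) - block.sum() / len(idx)
    return cost

rng = np.random.default_rng(2)
m, k, sigma = 6, 2, 3.0
B = rng.uniform(1.0, 5.0, size=(m, m))
D_sq = np.triu(B, 1) + np.triu(B, 1).T          # symmetric, zero diagonal, arbitrary
D_sq_shifted = D_sq + 2.0 * sigma * (np.ones((m, m)) - np.eye(m))

K, K_shifted = centered_kernel(D_sq), centered_kernel(D_sq_shifted)

# Fixing the first label removes duplicate relabelings (valid for k = 2).
partitions = [p for p in itertools.product(range(k), repeat=m)
              if p[0] == 0 and len(set(p)) == k]
costs_before = [kernel_kmeans_cost(K, p) for p in partitions]
costs_after = [kernel_kmeans_cost(K_shifted, p) for p in partitions]

increases = [a - b for a, b in zip(costs_after, costs_before)]
best_before = partitions[int(np.argmin(costs_before))]
best_after = partitions[int(np.argmin(costs_after))]
print(min(increases), max(increases), sigma * (m - k))
```

Since every partition is penalised by the same amount, the minimizing partition should be the same before and after the transformation.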

Under these circumstances

Theorem 7

For kernel- $k$ -means, adding a constant to squared dissimilarity measures of non-identical elements is a clustering preserving and embeddability preserving operation.

Note that the transformation mentioned above (1) increases all distances, (2) the absolute increase in distances is the largest for the smallest distances and the smallest for the largest, and (3) therefore no new clustering structures occur under this transformation. We define in this way a new axiom/property of $k$-means, in that we require that the clustering algorithm yields the same result under the mentioned distance transformation.

The idea behind this is that in the permissible domain for $k$-means (Euclidean) the optimum is unchanged if we add a constant to squared distances between different elements. By means of conceptual extension we can carry this assumption backwards into non-Euclidean distances.

Then we need to define under what regime we compute the permissible optimum of $k$-means, because in the whole space itself it is not true. Only if we limit the permissible space in a reasonable way can we still assume that we are computing the $k$-means optimum. So if we agree that the kernel function $\Phi()$ for kernel-$k$-means is deemed to transmit the data points into the Euclidean space under the mentioned invariance transformation, then it is permissible to apply kernel-$k$-means without checking for embeddability.

8 Concluding remarks

In this paper we corrected the proof of Theorem 2 from Gower's paper [4, page 5]. This correction was needed in order to establish, on the grounds of a distance matrix, the existence of the kernel function commonly used in the kernel trick, e.g. for the $k$-means clustering algorithm.

Let us underline here that we did not impose any a priori restrictions on the form of the function $\Phi()$ itself. It may be a linear or non-linear mapping from the sample space to the feature space. But what we insist on is that the feature space has to be Euclidean. This is the requirement for applicability of the (kernel) $k$-means clustering algorithm. If the feature space is not metric, the results of (kernel) $k$-means clustering are questionable.

But this is not enough. The same kernel matrix may be related to infinitely many $\Phi()$ functions.

The question that was left open by Gower was: do there exist special cases where two different $\Phi()$ functions, complying with a given kernel matrix, generate different distance matrices in the feature space, maybe in some special, ``sublimated'' cases? The answer given to this open question in this paper is definitely NO. We closed all the conceivable gaps in this respect. So the usage of (linear and non-linear) kernel matrices that are positive semidefinite is safe in this respect.

Furthermore we resolved the issue of applicability of kernel- $k$ -means for non-embeddable kernel matrices. If we accept the eigen-value-shift transformation as a legitimate kernel matrix transformation and the kernel- $k$ -means clustering in the kernel matrix obtained via such euclidesation as the valid clustering for the original kernel matrix, then we can apply kernel- $k$ -means also in the non-Euclidean space.

Software

Please feel free to experiment with an R package (source code) implementing kernel-$k$-means functionality: \verb|install.packages("https://home.ipipan.waw.pl/m.klopotek/ipi_archiv/kernelKmeansAndPlusPlusDemo_1.0.tar.gz", repos=NULL, type="source")|

Bibliography


[1] R. Balaji and R.B. Bapat. On Euclidean distance matrices. Linear Algebra and its Applications, 424(1):108–117, 2007.
[2] Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 895–903, New York, NY, USA, 2011. ACM.
[3] I.S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 551–556, New York, NY, USA, 2004. ACM.
[4] J.C. Gower. Euclidean distance geometry. Math. Scientist, 7:1–14, 1982.
[5] J.C. Gower. Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications, 67:81–97, 1985.
[6] J.C. Gower and P. Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, 1986. Here [4] is cited in Theorem 4, but with a different form of conditions for $D$ and $\mathbf{s}$.
[7] T. Handhayani and L. Hiryanto. Intelligent kernel k-means for clustering gene expression. In International Conference on Computer Science and Computational Intelligence (ICCSCI 2015), Procedia Computer Science, volume 59, pages 171–177, 2015.
[8] Mieczysław A. Kłopotek. On the existence of kernel function for kernel-trick of k-means. In Foundations of Intelligent Systems - 23rd International Symposium, ISMIS 2017, Warsaw, Poland, June 26-29, 2017, Proceedings, pages 97–104, 2017.
