Spherical Wards clustering and generalized Voronoi diagrams

Marek Śmieja; Jacek Tabor

arXiv:1705.02232·cs.LG·May 8, 2017


TL;DR

This paper introduces a novel spherical Gaussian-based clustering method that operates in non-Euclidean spaces, automatically determines the number of clusters, and uses generalized Voronoi diagrams for visualization.

Contribution

It combines spherical Cross-Entropy Clustering with a generalized Wards approach to handle arbitrary dissimilarity measures in non-Euclidean spaces.

Findings

01

Automatically finds the optimal number of clusters.

02

Supports scale-invariant, spherical clusters of arbitrary sizes.

03

Uses generalized Voronoi diagrams for visualization.

Abstract

The Gaussian mixture model is very useful in many practical problems. Nevertheless, it cannot be directly generalized to non-Euclidean spaces. To overcome this problem, we present a spherical Gaussian-based clustering approach for partitioning data sets with respect to an arbitrary dissimilarity measure. The proposed method combines spherical Cross-Entropy Clustering with a generalized Wards approach. The algorithm finds the optimal number of clusters by automatically removing groups which carry no information. Moreover, it is scale invariant and allows spherically shaped clusters of arbitrary sizes to be formed. In order to graphically represent and interpret the results, the notion of the Voronoi diagram is generalized to non-Euclidean spaces and applied to the introduced clustering method.

Tables

Table I: UCI evaluation. Comparison of clustering results (measured by the Rand index) on UCI data sets between sWards, Wards k-means and Spectral Clustering (Specc) for Euclidean and RBF dissimilarities. The numbers of clusters estimated by sWards (Est. cl.) were also used for the other algorithms. The MLE was applied to set the parameter $N$.

Data	True cl.	$N$	Est. cl. (Euclidean)	sWards (Euclidean)	k-means (Euclidean)	Specc (Euclidean)	Est. cl. (RBF)	sWards (RBF)	k-means (RBF)	Specc (RBF)
Cmc	3	2.64	4	0.61	0.57	0.58	2	0.55	0.51	0.51
Ecoli	8	3.72	9	0.88	0.83	0.77	10	0.84	0.79	0.8
Glass	7	3.07	8	0.71	0.7	0.68	7	0.7	0.71	0.71
Hayes-r.	3	1.85	5	0.62	0.58	0.59	5	0.61	0.5	0.6
Ionosph.	2	5.03	4	0.55	0.52	0.61	4	0.57	0.61	0.58
Iris	3	2.49	4	0.85	0.81	0.83	5	0.85	0.84	0.83
Tae	3	2.06	6	0.61	0.61	0.6	5	0.62	0.58	0.6
Wine	3	1.64	4	0.75	0.63	0.68	5	0.58	0.55	0.55
Yeast	10	4.81	11	0.64	0.73	0.73	10	0.63	0.73	0.73

Equations

$$\sum_{y\in Y} d^{2}(y,\mathrm{m}_{Y})=\frac{1}{2|Y|}\sum_{y,z\in Y} d^{2}(y,z).$$

$$d^{2}(x;Y):=\frac{1}{|Y|}\sum_{y\in Y} d^{2}(x,y)-\frac{1}{|Y|}\mathrm{ss}(Y).$$

$$\mathrm{ss}(Y)=\sum_{y\in Y}\|y-\mathrm{m}_{Y}\|^{2},$$

$$\sum_{j=1}^{k}\mathrm{ss}(Y_{j})$$

$$\sum_{y\in Y}\|y-\mathrm{m}_{Y}\|^{2}=\frac{1}{2|Y|}\sum_{y\in Y}\sum_{z\in Y}\|y-z\|^{2},$$

$$D\langle Y,Z\rangle:=\sum_{y\in Y}\sum_{z\in Z}d^{2}(y,z).$$

$$\mathrm{ss}(Y):=\frac{1}{2|Y|}D\langle Y,Y\rangle=\frac{1}{2|Y|}\sum_{y\in Y}\sum_{z\in Y}d^{2}(y,z).$$

$$\mathrm{E}_{\mathrm{Wards}}(Y_{1},\ldots,Y_{k}):=\sum_{i=1}^{k}\mathrm{ss}(Y_{i}),$$

$$\frac{N}{2}\ln\left(\frac{2\pi e}{N}\right)+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\cdot\left[-\ln\frac{|Y_{i}|}{|X|}+\frac{N}{2}\ln\left(\frac{|X|}{|Y_{i}|}\mathrm{tr}(\Sigma_{Y_{i}})\right)\right],$$

$$\mathrm{tr}(\Sigma_{Y})=\mathrm{ss}(Y).$$

$$\hat{N}_{k}(x)=\frac{1}{k-1}\sum_{j=1}^{k-1}\log\frac{d(x,x_{k})}{d(x,x_{j})},$$

$$\mathrm{E}_{\mathrm{sWards}}(Y_{1},\ldots,Y_{k};N):=\frac{N}{2}\ln\left(\frac{2\pi e}{N}\right)+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\cdot\left[\frac{N}{2}\ln(\mathrm{ss}(Y_{i}))-\frac{N+2}{2}\ln\left(\frac{|Y_{i}|}{|X|}\right)\right],$$

$$\mathrm{ss}(Y\cup\{x\})\text{ and }\mathrm{ss}(Y\setminus\{x\}).$$

$$\mathrm{ss}(Y\cup\{x\})=\frac{|Y|}{|Y|+1}\mathrm{ss}(Y)+\frac{1}{|Y|+1}D\langle\{x\},Y\rangle.$$

$$\mathrm{ss}(Y\setminus\{x\})=\frac{|Y|}{|Y|-1}\mathrm{ss}(Y)-\frac{1}{|Y|-1}D\langle\{x\},Y\rangle.$$

$$\|x-\mathrm{m}_{Y}\|^{2}=\frac{1}{|Y|}\sum_{y\in Y}\|x-y\|^{2}-\frac{1}{2|Y|^{2}}\sum_{y\in Y}\sum_{z\in Y}\|y-z\|^{2}.$$

$$d^{2}(x;Y):=\frac{1}{|Y|}\left(D\langle\{x\},Y\rangle-\mathrm{ss}(Y)\right).$$

$$w:X\ni x\to\begin{cases}w(x)\in[0,+\infty), & x\in Y,\\ 0, & x\in X\setminus Y,\end{cases}$$

$$Y^{w}=\{(y,w(y)):y\in Y\}.$$

$$\mathrm{E}_{\mathrm{Wards}}(Y_{1}^{w},\ldots,Y_{k}^{w})=\sum_{i=1}^{k}\mathrm{ss}(Y_{i}^{w}),$$

$$E^{i}_{x,[Y^{w}_{1},\ldots,Y^{w}_{k}]}:h\to E(Y^{w}_{1},\ldots,Y^{w}_{i-1},(Y_{i}\cup\{x\})^{w+h\delta_{x}},Y^{w}_{i+1},\ldots,Y^{w}_{k}),$$

$$\partial_{i}E(x,[Y_{1}^{w},\ldots,Y_{k}^{w}]):=\left(E^{i}_{x,[Y_{1}^{w},\ldots,Y_{k}^{w}]}\right)'(0),$$

$$\partial_{i}E(x,[Y_{1}^{w},\ldots,Y_{k}^{w}])=d^{2}(x;Y_{i}),$$

$$\begin{aligned}
&\frac{1}{h}\left[E(Y^{w}_{1},\ldots,Y^{w}_{i-1},Y^{w}_{i}\cup\{(x,h)\},Y^{w}_{i+1},\ldots,Y^{w}_{k})-E(Y^{w}_{1},\ldots,Y^{w}_{k})\right]\\
&\quad=\frac{1}{h}\left[\mathrm{ss}((Y_{i}\cup\{x\})^{w+h\delta_{x}})-\mathrm{ss}(Y^{w}_{i})\right]\\
&\quad=\frac{1}{h}\left[\frac{|Y_{i}^{w}|\,\mathrm{ss}(Y_{i}^{w})+D\langle\{(x,h)\},Y_{i}^{w}\rangle}{|Y_{i}^{w}|+h}-\mathrm{ss}(Y_{i}^{w})\right]\\
&\quad=\frac{1}{h}\cdot\frac{|Y_{i}^{w}|\,\mathrm{ss}(Y_{i}^{w})+hD\langle\{(x,1)\},Y_{i}^{w}\rangle-(|Y_{i}^{w}|+h)\,\mathrm{ss}(Y_{i}^{w})}{|Y_{i}^{w}|+h}\\
&\quad=\frac{D\langle\{(x,1)\},Y_{i}^{w}\rangle-\mathrm{ss}(Y_{i}^{w})}{|Y_{i}^{w}|+h}.
\end{aligned}$$

$$\frac{D\langle\{(x,1)\},Y_{i}^{w}\rangle-\mathrm{ss}(Y_{i}^{w})}{|Y_{i}^{w}|+h}=\frac{D\langle\{x\},Y_{i}\rangle-\mathrm{ss}(Y_{i})}{|Y_{i}^{w}|+h}\to\frac{1}{|Y_{i}|}\left(D\langle\{x\},Y_{i}\rangle-\mathrm{ss}(Y_{i})\right),\text{ as }h\to 0,$$

$$\partial_{i}E(x,[Y^{w}_{1},\ldots,Y^{w}_{k}])=\frac{1}{|X|}\left[\frac{N}{2}\left(\ln(\mathrm{ss}(Y_{i}))+|Y_{i}|\frac{d^{2}(x;Y_{i})}{\mathrm{ss}(Y_{i})}\right)-\frac{N+2}{2}(\ln|Y_{i}|+1)\right].$$

$$E_{Y_{i}}(x)=\ln(\mathrm{ss}(Y_{i}))+|Y_{i}|\frac{d^{2}(x;Y_{i})}{\mathrm{ss}(Y_{i})}-\left(1+\frac{2}{N}\right)\ln|Y_{i}|.$$

$$\frac{1}{2}G_{1}(r)+\frac{1}{2}G_{2}(1-r)$$

$$C_{1}=\begin{pmatrix}r&0\\0&r\end{pmatrix},\quad C_{2}=\begin{pmatrix}1-r&0\\0&1-r\end{pmatrix},\quad\text{for }r\in(0,1),$$

$$m_{1}=(-1,0),\quad m_{2}=(1,0).$$


Full text

Spherical Wards clustering and generalized Voronoi diagrams

Marek Śmieja

Faculty of Mathematics and Computer Science

Jagiellonian University

Lojasiewicza 6, 30-348 Krakow, Poland

Email: [email protected]

Jacek Tabor

Faculty of Mathematics and Computer Science

Jagiellonian University

Lojasiewicza 6, 30-348 Krakow, Poland

Email: [email protected]


I Introduction

Distribution-based clustering, such as the Gaussian mixture model (GMM), has proven very useful in many practical problems [1]. The technique has been widely applied in object detection [2], learning and modeling [3], feature selection [4], and classification [5]. The constructed groups are described by optimally fitted probability distributions. Nevertheless, such methods are limited to Euclidean spaces, and clustering data with respect to Gaussian-like probability distributions in arbitrary data spaces, where only a distance or (dis)similarity measure is provided, remains a challenge.

In this paper we show how to partially overcome this problem and propose spherical Wards clustering (sWards), which divides data sets with respect to an arbitrary dissimilarity measure into groups described by spherical Gaussian-like distributions. Figure 1 shows the relationship between sWards and related methods. Moreover, we extend the notion of the Voronoi diagram to the case of an arbitrary criterion function in non-Euclidean spaces and apply it to sWards clustering.

The introduced method permits an informal interpretation of the notion of a spherical Gaussian probability distribution in non-Euclidean spaces. The algorithm is capable of discovering spherically shaped groups of arbitrary sizes (see Example V.2). Moreover, the clustering results are invariant with respect to the scaling of data (see Example V.1). Data sets with unbalanced groups appear very often in practice, e.g. in chemoinformatics, where chemical compounds acting on a specific disease are rarely found [6, 7], or in Natural Language Processing, where the numbers of documents that belong to particular domains differ [8]. Our method can be successfully applied to discovering population districts in biological systems modeled by a random walk procedure (see Examples V.5, V.6). The method is easy to implement and has the same numerical complexity as the k-means version adapted to non-Euclidean spaces [9]. Moreover, our algorithm automatically finds the resultant number of groups by reducing unnecessary clusters on-line. Voronoi diagrams for sWards, k-means and their kernelized versions for a mouse-like set with a non-Euclidean distance function are presented in Figure 2.

The proposed sWards method is a combination of the spherical variant of Cross-Entropy Clustering (CEC) [10] with the generalized Wards approach [9, 11]. Generally, spherical CEC describes clusters by optimally fitted spherical Gaussian distributions, while the Wards method allows for its adaptation to the non-Euclidean case. Spherical CEC performs clustering by optimizing a cross-entropy criterion function (4). Its form is very flexible, since it is based on the within-cluster sums of squares, the cardinalities of clusters and the dimension of the space.

The applied Wards approach generalizes the notion of the within-cluster sum of squares to the case of any dissimilarity measure [9, 11]. The key lies in the observation that in Euclidean space this quantity can be rewritten without the use of the mean $\mathrm{m}_{Y}$ of a cluster $Y$ in the form:

$$\sum_{y\in Y} d^{2}(y,\mathrm{m}_{Y})=\frac{1}{2|Y|}\sum_{y,z\in Y} d^{2}(y,z).$$

On the other hand, note that a dimension does not have to be defined in an arbitrary space. Therefore, to adapt the spherical CEC criterion function to the general case, we recommend estimating its value from data with the Maximum Likelihood Estimator of intrinsic dimension [12, 13].

To graphically represent and interpret the results of clustering, the notion of the Voronoi diagram is widely applied. Its construction requires an answer to the question: to which cluster should an arbitrary unclustered point be assigned? In the case of classical k-means the answer is simple: we assign the point to the cluster with the nearest center. In the Wards method we replace it by a generalization of the distance of a point $x$ from the center of a cluster $Y$, given by [9]

$$d^{2}(x;Y):=\frac{1}{|Y|}\sum_{y\in Y} d^{2}(x,y)-\frac{1}{|Y|}\mathrm{ss}(Y). \tag{1}$$

In our work we calculate the analogue of the above formula (1) for the case of the sWards criterion function (5) (see (8) for the precise formula and Figure 2 for sample effects).

The practical properties of the proposed method are illustrated and examined on synthetic data sets and examples retrieved from the UCI repository [14]. We compare sWards with similar methods which can be applied to non-Euclidean data, such as k-means, Spectral Clustering and their kernelized versions. Our tests demonstrate that the introduced method can be applied to population detection in simple biological systems.

The paper is organized as follows. The next section gives a brief description of related clustering methods. In Section 3 we recall the Wards approach to k-means and present its application to the spherical CEC criterion function. Section 4 demonstrates the generalization of Voronoi diagrams to the case of arbitrary criterion functions on non-Euclidean data, paying particular attention to the sWards method. The results of experiments and potential applications are given in Section 5, while Section 6 contains the conclusion.

II Related works

Hierarchical clustering is probably one of the most popular methods for partitioning data based on any kind of (dis)similarity measure [15]. The well-known k-means algorithm [16] can also be adapted to non-Euclidean data by defining a medoid [17], which plays the role of a generalized mean, or by using the Wards method [9, 11], which reformulates the within-cluster sum of squares without the notion of the cluster mean. Despite the wide use of these methods, they are sometimes unable to discover groups with complex structures and different sizes. Many modifications were also considered to describe clusters with arbitrary shapes [18, 19]. Spectral Clustering uses eigenvectors of the similarity matrix to divide elements into groups [20].

Another issue in clustering non-Euclidean data sets is the appropriate selection of a dissimilarity measure. Examples have shown that interesting effects can be obtained by applying the Gaussian radial basis function (RBF) [21]. The difficulty is that there is no unified methodology for choosing the radius of this function in a particular situation [22, 23].

In order to perform distribution-based clustering, a GMM is widely used in Euclidean space [1]. Nevertheless, it cannot be directly generalized to arbitrary data sets with dissimilarity measures. On the other hand, the family of density-based clustering methods, such as DBSCAN [24], can be applied to non-Euclidean data. Although the method is capable of discovering clusters of arbitrary shapes and does not require the specification of the number of groups, it does not adapt well to clusters with large differences in densities.

The proposed sWards method joins the simplicity and flexibility of k-means with the effects of GMM. It can be applied in non-Euclidean spaces and is based on Gaussian-like probability distributions.

III Clustering method

The proposed sWards clustering is a combination of spherical Cross-Entropy Clustering (sCEC) [10] with a generalized Wards approach [9, 11]. In this section we first introduce basic notation and recall the Wards version of k-means. Then we show how sCEC can be generalized to non-Euclidean data sets via the Wards method.

III-A Wards method

Generally, the k-means method aims at producing a splitting of a data set which optimizes a squared error criterion function. For a group $Y\subset\mathbb{R}^{N}$ the within-cluster sum of squares is defined as:

$$\mathrm{ss}(Y)=\sum_{y\in Y}\|y-\mathrm{m}_{Y}\|^{2},$$

where $\mathrm{m}_{Y}$ is the mean of $Y$. The k-means method looks for a partition of $X\subset\mathbb{R}^{N}$ into $k$ pairwise disjoint sets $Y_{1},\ldots,Y_{k}$ such that the function

$$\sum_{j=1}^{k}\mathrm{ss}(Y_{j})$$

is minimal.

Note that the above formulas cannot be used directly for non-vector data, since the mean is not well defined for general data sets. There are several alternatives [25, 26, 27, 28, 29] which partially overcome this difficulty, such as k-medoids [30] or k-clustering [31]. A technique related to k-clustering and k-means is the generalized Wards method [9, 11], which plays the basic role in our investigations. The key idea is the observation that the within-cluster sum of squares in Euclidean space can be formulated equivalently without the notion of the center of a cluster:

Proposition III.1

[11] If $Y\subset\mathbb{R}^{N}$, then

$$\sum_{y\in Y}\|y-\mathrm{m}_{Y}\|^{2}=\frac{1}{2|Y|}\sum_{y\in Y}\sum_{z\in Y}\|y-z\|^{2},$$

where $|Y|$ is the cardinality of $Y$.
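The identity of Proposition III.1 can be checked numerically. The following is a minimal sketch, assuming NumPy and an arbitrary random sample in $\mathbb{R}^{3}$; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 3))            # 50 arbitrary points in R^3

# Left-hand side: squared distances of the points to their mean.
lhs = np.sum((Y - Y.mean(axis=0)) ** 2)

# Right-hand side: all pairwise squared distances, with no mean involved.
diff = Y[:, None, :] - Y[None, :, :]
sq_dists = np.sum(diff ** 2, axis=-1)   # shape (50, 50)
rhs = sq_dists.sum() / (2 * len(Y))

print(np.isclose(lhs, rhs))
```

Only the matrix of pairwise squared distances enters the right-hand side, which is what makes the quantity usable when no mean exists.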

This allows the within-cluster sum of squares to be reasonably generalized to a general non-Euclidean data set. For this purpose let $X$ be an arbitrary data set and let $d:X\times X\to[0,+\infty)$ be a symmetric dissimilarity measure on $X$, i.e.,

• $d(y,y)=0$,

• $d(y,z)=d(z,y)$,

for $y,z\in X$. Given two subsets $Y,Z$ of $X$ we define a function [31] connected with the average linkage function (also called the average neighbor function) [32, 27] as:

$$D\langle Y,Z\rangle:=\sum_{y\in Y}\sum_{z\in Z}d^{2}(y,z).$$

As the generalized within-cluster sum of squares of $Y\subset X$ we put [9]:

$$\mathrm{ss}(Y):=\frac{1}{2|Y|}D\langle Y,Y\rangle=\frac{1}{2|Y|}\sum_{y\in Y}\sum_{z\in Y}d^{2}(y,z). \tag{2}$$

Then the goal of the Wards method is formulated as follows:

Wards Optimization Problem [9]. Let $X$ be a data set with a dissimilarity measure $d$ and let $k\in\mathbb{N}$. Find a splitting of $X$ into $k$ pairwise disjoint sets $Y_{1},\ldots,Y_{k}$ which minimizes the generalized squared error function:

$$\mathrm{E}_{\mathrm{Wards}}(Y_{1},\ldots,Y_{k}):=\sum_{i=1}^{k}\mathrm{ss}(Y_{i}), \tag{3}$$

where $\mathrm{ss}(\cdot)$ is defined by (2).
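As an illustration, the generalized sum of squares (2) and the squared error (3) can be evaluated directly from a matrix of squared dissimilarities. This is a minimal sketch, assuming NumPy; the function names `ss` and `e_wards` and the toy data are hypothetical and chosen only for this example.

```python
import numpy as np

def ss(D2, idx):
    """Generalized within-cluster sum of squares (2) of the cluster `idx`.

    D2 is the full matrix of squared dissimilarities d^2(x_i, x_j).
    """
    idx = np.asarray(idx)
    block = D2[np.ix_(idx, idx)]          # D<Y, Y> restricted to the cluster
    return block.sum() / (2 * len(idx))

def e_wards(D2, labels):
    """Generalized squared error (3): sum of ss over all clusters."""
    labels = np.asarray(labels)
    return sum(ss(D2, np.flatnonzero(labels == c)) for c in np.unique(labels))

# Toy data: four points on a line, dissimilarity = absolute difference.
x = np.array([0.0, 1.0, 10.0, 11.0])
D2 = (x[:, None] - x[None, :]) ** 2

print(e_wards(D2, [0, 0, 1, 1]))   # compact clusters {0, 1} and {10, 11}
print(e_wards(D2, [0, 1, 0, 1]))   # mixed clusters {0, 10} and {1, 11}
```

The compact partition gives the smaller value, as the optimization problem intends; nothing in the code requires the dissimilarity to come from coordinates.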

III-B Spherical Wards criterion function

Cross-Entropy Clustering (CEC) is a kind of distribution-based clustering which divides a Euclidean data set into groups such that each group is described by an optimally fitted Gaussian probability distribution [10]. The effects of the clustering are similar to those obtained by GMM, but the optimized criterion function is different. Its value determines the statistical code length of memorizing an arbitrary element of a data set in the case when each cluster uses its own coding algorithm. In particular, introducing one more cluster (coding algorithm) requires an additional cost for its identification (an increase of the entropy). In consequence, maintaining too many clusters is not optimal, which allows for the automatic reduction of unnecessary groups. Another advantage of CEC is that the clustering is performed in a time comparable to the computationally efficient k-means method. For more details the reader is referred to [10, 33, 34].

Spherical Cross-Entropy Clustering (sCEC) is a variant of CEC which takes into account the family of spherical Gaussian distributions. Since the optimal spherical Gaussian distribution is matched to every group, the data set is partitioned into spherically shaped clusters. For a splitting $Y_{1},\ldots,Y_{k}$ of $X$ the associated criterion function is defined by [10]

$$\frac{N}{2}\ln\left(\frac{2\pi e}{N}\right)+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\cdot\left[-\ln\frac{|Y_{i}|}{|X|}+\frac{N}{2}\ln\left(\frac{|X|}{|Y_{i}|}\mathrm{tr}(\Sigma_{Y_{i}})\right)\right], \tag{4}$$

where $\Sigma_{Y}$ is the covariance matrix of a group $Y$ and $\mathrm{tr}(\Sigma_{Y})$ is the trace of $\Sigma_{Y}$.

Let us first observe that the notion of the covariance matrix can be easily removed from expression (4).

Proposition III.2

If $Y\subset\mathbb{R}^{N}$, then [10]:

$$\mathrm{tr}(\Sigma_{Y})=\mathrm{ss}(Y).$$

In consequence, the application of the Wards approach (2) facilitates the interpretation of (4) in the non-Euclidean case for a fixed $N>0$.
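As a worked step, using only Proposition III.2 and writing $p_{i}:=|Y_{i}|/|X|$ (a shorthand introduced here for brevity, not notation from the paper), the bracketed term of (4) can be rearranged as follows:

$$-\ln p_{i}+\frac{N}{2}\ln\left(\frac{1}{p_{i}}\,\mathrm{tr}(\Sigma_{Y_{i}})\right)=-\ln p_{i}-\frac{N}{2}\ln p_{i}+\frac{N}{2}\ln\mathrm{ss}(Y_{i})=\frac{N}{2}\ln\mathrm{ss}(Y_{i})-\frac{N+2}{2}\ln p_{i}.$$

The right-hand side involves only the cardinalities, $N$ and $\mathrm{ss}(\cdot)$, so it remains meaningful once $\mathrm{ss}(\cdot)$ is taken in the generalized sense (2).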

To fully interpret formula (4) in the context of a non-Euclidean space, the value of the dimension $N$ has to be specified. As the most reasonable way to set this value we recommend using an estimate of the dimension of $X$. In the present study we apply the Maximum Likelihood Estimation (MLE) of the intrinsic dimension of $X$ proposed in [12] and modified in [13]. More precisely, given $X=\{x_{1},\ldots,x_{n}\}$, the maximum likelihood estimator of the dimension $N$ of $X$ calculated for each $x\in X$ equals [12]:

$$\hat{N}_{k}(x)=\frac{1}{k-1}\sum_{j=1}^{k-1}\log\frac{d(x,x_{k})}{d(x,x_{j})},$$

for $k\in\{1,\ldots,n\}$. Since the above value depends on the choice of $k$ and $x$, one should average the results over $x\in X$ and over $k\in\tilde{K}$, where $\tilde{K}\subset\{1,\ldots,n\}$, to obtain the final estimator of $N$ [13].
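A rough sketch of such an estimator is given below. It rests on several assumptions that are not stated in the text: that $x_{j}$ denotes the $j$-th nearest neighbour of $x$, that the estimator of [12] is the Levina-Bickel maximum likelihood estimator, and that the dimension estimate is the reciprocal of the mean log-ratio (the form that yields values on the scale of a dimension). The function name `mle_dimension` is hypothetical, and averaging over several values of $k$ is left to the caller.

```python
import numpy as np

def mle_dimension(D, k):
    """Average local intrinsic-dimension estimate over all points.

    D : (n, n) array of dissimilarities d(x_i, x_j) (not squared).
    k : number of nearest neighbours used per point, 2 <= k <= n - 1.
    Assumes that no two distinct points are at distance zero.
    """
    n = D.shape[0]
    local = np.empty(n)
    for i in range(n):
        # Distances from x_i to its k nearest neighbours, in ascending order.
        nn = np.sort(np.delete(D[i], i))[:k]
        # Mean of log(d(x, x_k) / d(x, x_j)) over j = 1, ..., k-1.
        mean_log_ratio = np.mean(np.log(nn[k - 1] / nn[: k - 1]))
        # Assumption: the dimension estimate is the reciprocal of this mean.
        local[i] = 1.0 / mean_log_ratio
    return local.mean()

# Points sampled uniformly from the unit square, so the value should be near 2.
rng = np.random.default_rng(0)
P = rng.uniform(size=(500, 2))
D = np.sqrt(((P[:, None] - P[None]) ** 2).sum(-1))
print(mle_dimension(D, k=20))
```

Because only the dissimilarity matrix is used, the same routine applies to data without coordinates.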

Nevertheless, one can tune this value in the learning process or set it to any positive number. In the experimental section we show that for high values of $N$ more clusters are created, while for low values of $N$ the method prefers to reduce the number of groups. From now on, $N$ will be treated as a free parameter selected by the user, but we keep in mind that the easiest way to choose this value is to use the MLE procedure described above.

All in all, the generalized Wards approach and an appropriate choice of the dimension parameter $N$ allow the spherical cross-entropy criterion function to be interpreted in an arbitrary data set with a dissimilarity measure. In consequence, an informal notion of a spherical Gaussian probability distribution based on any dissimilarity measure can be considered. We conclude this subsection with a formulation of the spherical Wards (sWards) optimization problem:

**Spherical Wards Optimization Problem.** Let $X$ be a data set with a dissimilarity measure $d$, let $n\in\mathbb{N}$ be an initial number of clusters and let $N>0$ be a free parameter. Find $k\leq n$ and a partition $Y_{1},\ldots,Y_{k}$ of $X$ which minimizes the spherical Wards criterion function

$$\mathrm{E}_{\mathrm{sWards}}(Y_{1},\ldots,Y_{k};N):=\frac{N}{2}\ln\left(\frac{2\pi e}{N}\right)+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\cdot\left[\frac{N}{2}\ln(\mathrm{ss}(Y_{i}))-\frac{N+2}{2}\ln\left(\frac{|Y_{i}|}{|X|}\right)\right], \tag{5}$$

where $\mathrm{ss}(\cdot)$ is defined by (2).
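The criterion (5) can be evaluated from a matrix of squared dissimilarities alone. The sketch below assumes NumPy and that every cluster has a strictly positive $\mathrm{ss}$ (otherwise the logarithm is undefined); the function name `e_swards` and the toy data are illustrative only.

```python
import numpy as np

def e_swards(D2, labels, N):
    """sWards criterion (5) for a partition given by `labels`.

    D2 : matrix of squared dissimilarities, N : dimension parameter (> 0).
    Assumes ss(Y_i) > 0 for every cluster.
    """
    labels = np.asarray(labels)
    n = len(labels)
    total = 0.5 * N * np.log(2 * np.pi * np.e / N)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        p = len(idx) / n                                   # |Y_i| / |X|
        s = D2[np.ix_(idx, idx)].sum() / (2 * len(idx))    # ss(Y_i), formula (2)
        total += p * (0.5 * N * np.log(s) - 0.5 * (N + 2) * np.log(p))
    return total

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 13.0])
D2 = (x[:, None] - x[None, :]) ** 2
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
N = 2.0

print(e_swards(D2, good, N), e_swards(D2, bad, N))
# Doubling every dissimilarity multiplies D2 by 4 and shifts the value by N * ln 2.
print(e_swards(4.0 * D2, good, N) - e_swards(D2, good, N), N * np.log(2.0))
```

Multiplying all dissimilarities by a constant $c$ adds $N\ln c$ to (5) for every partition, since the cluster probabilities sum to one, so the ranking of partitions is unchanged; this is one way to see the scale invariance mentioned in the introduction.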

III-C Clustering algorithm

One can show that a natural modification of the Hartigan algorithm [9, 16, 10] can be used to minimize the sWards criterion function (5). We now discuss its technical aspects.

The procedure can be divided into two parts: initialization and iteration. In the initialization phase $n\in\mathbb{N}$ groups are created randomly. During iteration the algorithm reassigns elements between clusters in order to minimize the sWards criterion function (5).

More precisely, in the iteration part we repeatedly go over all elements of $X$, applying the following steps:

1. Reassign $x\in X$ to the cluster for which the decrease of energy (5) is maximal,

2. If the probability of some cluster is less than a fixed number $\varepsilon>0$, then remove this cluster and assign its elements to the groups for which the increase of energy (5) is minimal,

until no group membership has changed.

The number $\varepsilon$ was introduced to speed up the reduction of redundant clusters. In our experiments we always use the value $\varepsilon=1\%$. Thus, a group is removed if it contains less than $1\%$ of all elements of $X$. Clearly, the procedure is not deterministic and leads to a local minimum of (5) [25]. Therefore, to obtain satisfactory results the algorithm should be run several times; the final result is the one which gives the minimal value of the sWards criterion function.

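The above algorithm can be seen as an online version of the standard partitional clustering procedure which is able to reduce unnecessary groups. Every time an element is processed, the cluster parameters are recalculated. This implies that to apply this procedure efficiently we have to recompute

$$\mathrm{ss}(Y\cup\{x\})\text{ and }\mathrm{ss}(Y\setminus\{x\}).$$

For this purpose the following formulas are useful:

Proposition III.3

[11] Let $Y\subset X$ and $x\in X$.

a) If $x\not\in Y$, then

$$\mathrm{ss}(Y\cup\{x\})=\frac{|Y|}{|Y|+1}\mathrm{ss}(Y)+\frac{1}{|Y|+1}D\langle\{x\},Y\rangle.$$

b) If $x\in Y$, then

$$\mathrm{ss}(Y\setminus\{x\})=\frac{|Y|}{|Y|-1}\mathrm{ss}(Y)-\frac{1}{|Y|-1}D\langle\{x\},Y\rangle.$$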
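The two update rules of Proposition III.3 can be compared against a direct recomputation of $\mathrm{ss}$. This is a small sketch, assuming NumPy and random Euclidean points used only to produce a valid dissimilarity matrix; the helper names `D` and `ss` mirror the notation of the text but are otherwise hypothetical.

```python
import numpy as np

def D(D2, A, B):
    """D<A, B>: sum of squared dissimilarities between index sets A and B."""
    return D2[np.ix_(A, B)].sum()

def ss(D2, idx):
    """Generalized within-cluster sum of squares (2)."""
    return D(D2, idx, idx) / (2 * len(idx))

rng = np.random.default_rng(1)
P = rng.normal(size=(8, 2))
D2 = ((P[:, None] - P[None]) ** 2).sum(-1)

Y = [0, 1, 2, 3, 4]
n = len(Y)

# a) Adding a point x that is not in Y.
x = 7
added = n / (n + 1) * ss(D2, Y) + D(D2, [x], Y) / (n + 1)
print(np.isclose(added, ss(D2, Y + [x])))

# b) Removing a point x that belongs to Y.
x = 2
removed = n / (n - 1) * ss(D2, Y) - D(D2, [x], Y) / (n - 1)
print(np.isclose(removed, ss(D2, [0, 1, 3, 4])))
```

Each update needs only $D\langle\{x\},Y\rangle$, which costs $|Y|$ lookups, instead of the $|Y|^{2}$ terms of a full recomputation.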

Given $k$ clusters, one iteration of the standard Hartigan procedure requires about $k\cdot N\cdot|X|$ operations (for data sets contained in $\mathbb{R}^{N}$). When applying the Wards approach this complexity changes to $k\cdot|X|^{2}$ operations. Since the mean of a cluster is not defined in the general situation, one has to pay the additional cost of recalculating the within-cluster sum of squares during every reassignment. However, we do not need to recalculate the distance between the reassigned elements and the mean of a cluster, which decreases the computational cost $N$ times.

IV Generalized Voronoi diagram

A natural problem arises of how to graphically present the clustering results. Clearly, we can mark the elements of each cluster with a different label. However, in practice it is usually clearer to show the division of the whole space. In this section we show that we can naturally obtain an equivalent of the Voronoi diagram for any criterion function in a non-Euclidean space. In particular, we apply these results to define the Voronoi diagram for sWards.

IV-A Classical diagram

Let us recall that in the classical version of the Voronoi diagram ($k$-means method) a point $x$ is associated with the cluster whose center is the closest to $x$. More precisely, it is assigned to the cluster $Y_{i}$ which minimizes $d(x;\mathrm{m}_{i})$, where $\mathrm{m}_{i}$ is the mean of $Y_{i}$. We would like to mention that one can consider the alternative to Voronoi diagrams described in [35]. It provides a partition of the data but does not induce a natural partition of the space (see [35] for more details).

To generalize the notion of the Voronoi diagram to a non-Euclidean space (Wards k-means), we need to be able to compute the distance of a point from the center of a cluster without using the center in the computations.

Proposition IV.1

[11] Let $x\in\mathbb{R}^{N}$ be fixed and let $Y\subset\mathbb{R}^{N}$ have mean $\mathrm{m}_{Y}$. Then

$$\|x-\mathrm{m}_{Y}\|^{2}=\frac{1}{|Y|}\sum_{y\in Y}\|x-y\|^{2}-\frac{1}{2|Y|^{2}}\sum_{y\in Y}\sum_{z\in Y}\|y-z\|^{2}.$$

The above allows us to formulate an analogue of the square of the “classical” distance of a point $x$ from the center of $Y$. Let $Y$ be a subset of a data space $X$ with a dissimilarity measure $d$ and let $x\in X$ be fixed. We define the mean square distance of $x$ from $Y$ by

$$d^{2}(x;Y):=\frac{1}{|Y|}\left(D\langle\{x\},Y\rangle-\mathrm{ss}(Y)\right). \tag{6}$$

Applying the above formula, one can draw an equivalent of the Voronoi diagram for Wards k-means, i.e. an element $x\in X$ is assigned to the cluster which minimizes (6).
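The assignment rule based on (6) can be sketched as follows, assuming NumPy and a precomputed matrix of squared dissimilarities in which the query point is one of the rows. The names `mean_sq_dist` and `assign` are hypothetical; the Euclidean sample serves only to compare against the ordinary distance to the mean.

```python
import numpy as np

def mean_sq_dist(D2, x, Y):
    """Mean square distance (6) of the point with index x from the cluster Y."""
    Y = np.asarray(Y)
    d_xY = D2[x, Y].sum()                              # D<{x}, Y>
    ss_Y = D2[np.ix_(Y, Y)].sum() / (2 * len(Y))       # ss(Y), formula (2)
    return (d_xY - ss_Y) / len(Y)

def assign(D2, x, clusters):
    """Index of the cluster that minimizes (6) for the point x."""
    return int(np.argmin([mean_sq_dist(D2, x, Y) for Y in clusters]))

rng = np.random.default_rng(2)
P = rng.normal(size=(12, 3))
D2 = ((P[:, None] - P[None]) ** 2).sum(-1)
clusters = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
x = 11

# In the Euclidean case (6) agrees with the squared distance to the mean.
for Y in clusters:
    direct = np.sum((P[x] - P[Y].mean(axis=0)) ** 2)
    print(np.isclose(mean_sq_dist(D2, x, Y), direct))
print(assign(D2, x, clusters))
```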

IV-B Diagram for arbitrary criterion function

We now present a reasoning which allows us to create a kind of Voronoi diagram for an arbitrary criterion function. This will be useful for constructing a division of the space in the case of the sWards method. The obtained results are consistent with the classical Voronoi diagram in the case of Wards $k$-means presented in the previous section.

Let $X$ be a space with a dissimilarity measure $d$ and let $Y\subset X$ represent our data. We extend $X$ by introducing a weight function

$$w:X\ni x\to\begin{cases}w(x)\in[0,+\infty), & x\in Y,\\ 0, & x\in X\setminus Y,\end{cases}$$

which assigns a weight to every element of $X$. Then we consider an extended data set

$$Y^{w}=\{(y,w(y)):y\in Y\}.$$

We define the operations $D\langle\cdot,\cdot\rangle$ and $\mathrm{ss}(\cdot)$ adapted for $Y^{w}$. Given $Z,Y_{1},Y_{2}\subset Y$ we put:

1. $|Z^{w}|:=\sum\limits_{z\in Z}w(z)$,

2. $D\langle Y_{1}^{w},Y_{2}^{w}\rangle:=\sum\limits_{y_{1}\in Y_{1}}\sum\limits_{y_{2}\in Y_{2}}d^{2}(y_{1},y_{2})w(y_{1})w(y_{2})$,

3. $\mathrm{ss}(Z^{w}):=\frac{1}{2|Z^{w}|}D\langle Z^{w},Z^{w}\rangle$.

Then the analogue of the k-means criterion function equals:

$$\mathrm{E}_{\mathrm{Wards}}(Y_{1}^{w},\ldots,Y_{k}^{w})=\sum_{i=1}^{k}\mathrm{ss}(Y_{i}^{w}), \tag{7}$$

where $Y_{1},\ldots,Y_{k}$ is a splitting of $Y$. If $w_{|Y}\equiv 1$, then (7) coincides with (3).

In order to explain our technique, assume that $Y_{1},\ldots,Y_{k}$ is a splitting of the data set $Y$ and $E$ is an arbitrary criterion function. For a fixed point $x\in X$ we consider a mapping

$$E^{i}_{x,[Y^{w}_{1},\ldots,Y^{w}_{k}]}:h\to E(Y^{w}_{1},\ldots,Y^{w}_{i-1},(Y_{i}\cup\{x\})^{w+h\delta_{x}},Y^{w}_{i+1},\ldots,Y^{w}_{k}),$$

where $h\geq 0$ and $i\in\{1,\ldots,k\}$. It determines the value of the criterion function $E$ when $x\in X$ is associated with the $i$-th cluster with a weight increased by $h$.

We define the functions (wherever they exist)

$$\partial_{i}E(x,[Y_{1}^{w},\ldots,Y_{k}^{w}]):=\left(E^{i}_{x,[Y_{1}^{w},\ldots,Y_{k}^{w}]}\right)'(0),$$

for $i\in\{1,\ldots,k\}$. Observe that $\partial_{i}E$ coincides with the infinitesimal change in energy when we add $x$ to the $i$-th cluster. Thus, in the Voronoi diagram the point $x\in X$ should be assigned to the cluster which minimizes $\partial_{i}E(x,[Y^{w}_{1},\ldots,Y^{w}_{k}])$.

Let us show that the above reasoning is consistent with the classical results (6) for the Wards $k$-means criterion function (7):

Theorem IV.1

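Let $Y$ be a subset of a space $X$ with a dissimilarity measure $d$ and let $w(y)=1$, for all $y\in Y$, be a weight function. If $E$ denotes the squared error function (7) and $Y_{1},\ldots,Y_{k}$ is a fixed splitting of $Y$, then

$$\partial_{i}E(x,[Y_{1}^{w},\ldots,Y_{k}^{w}])=d^{2}(x;Y_{i}),$$

for $x\in X$ and $i\in\{1,\ldots,k\}$.

Proof:

Let $h>0$. By Proposition III.3, we have

$$\begin{aligned}
&\frac{1}{h}\left[E(Y^{w}_{1},\ldots,Y^{w}_{i-1},Y^{w}_{i}\cup\{(x,h)\},Y^{w}_{i+1},\ldots,Y^{w}_{k})-E(Y^{w}_{1},\ldots,Y^{w}_{k})\right]\\
&\quad=\frac{1}{h}\left[\mathrm{ss}((Y_{i}\cup\{x\})^{w+h\delta_{x}})-\mathrm{ss}(Y^{w}_{i})\right]\\
&\quad=\frac{1}{h}\left[\frac{|Y_{i}^{w}|\,\mathrm{ss}(Y_{i}^{w})+D\langle\{(x,h)\},Y_{i}^{w}\rangle}{|Y_{i}^{w}|+h}-\mathrm{ss}(Y_{i}^{w})\right]\\
&\quad=\frac{1}{h}\cdot\frac{|Y_{i}^{w}|\,\mathrm{ss}(Y_{i}^{w})+hD\langle\{(x,1)\},Y_{i}^{w}\rangle-(|Y_{i}^{w}|+h)\,\mathrm{ss}(Y_{i}^{w})}{|Y_{i}^{w}|+h}\\
&\quad=\frac{D\langle\{(x,1)\},Y_{i}^{w}\rangle-\mathrm{ss}(Y_{i}^{w})}{|Y_{i}^{w}|+h}.
\end{aligned}$$

Since $w_{|Y}\equiv 1$, then

$$\frac{D\langle\{(x,1)\},Y_{i}^{w}\rangle-\mathrm{ss}(Y_{i}^{w})}{|Y_{i}^{w}|+h}=\frac{D\langle\{x\},Y_{i}\rangle-\mathrm{ss}(Y_{i})}{|Y_{i}^{w}|+h}\to\frac{1}{|Y_{i}|}\left(D\langle\{x\},Y_{i}\rangle-\mathrm{ss}(Y_{i})\right),\text{ as }h\to 0,$$

which yields the assertion of the theorem. ∎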
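The statement of Theorem IV.1 can be illustrated with a finite difference. The sketch below assumes NumPy, unit weights on the cluster, and a small positive weight $h$ attached to a point outside it; the name `ss_weighted` is hypothetical and implements the weighted $\mathrm{ss}$ from the definitions above.

```python
import numpy as np

def ss_weighted(D2, w):
    """ss(Z^w) = D<Z^w, Z^w> / (2 |Z^w|), with w equal to zero outside the cluster."""
    return (w @ D2 @ w) / (2.0 * w.sum())

rng = np.random.default_rng(3)
P = rng.normal(size=(9, 2))
D2 = ((P[:, None] - P[None]) ** 2).sum(-1)

Y_i = np.arange(8)            # cluster members, each with weight 1
x = 8                         # the point being attached to the cluster
w = np.zeros(9)
w[Y_i] = 1.0

ss_Y = ss_weighted(D2, w)
D_xY = D2[x, Y_i].sum()                       # D<{x}, Y_i>
d2_xY = (D_xY - ss_Y) / len(Y_i)              # formula (6)

h = 1e-4
w_h = w.copy()
w_h[x] += h                                   # weight function w + h * delta_x
quotient = (ss_weighted(D2, w_h) - ss_Y) / h  # difference quotient from the proof

print(quotient, d2_xY)
```

For any $h>0$ the quotient equals $(D\langle\{x\},Y_{i}\rangle-\mathrm{ss}(Y_{i}))/(|Y_{i}|+h)$, as in the last line of the proof, so the two printed values agree up to a relative difference of order $h/|Y_{i}|$.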

IV-C Voronoi diagram for sWards

The following theorem presents how to create the Voronoi diagram for sWards criterion function:

Theorem IV.2

Let $Y$ be a subset of a space $X$ with a dissimilarity measure $d$ and let $w(y)=1$ , for all $y\in Y$ , be a weight function. If $E$ denotes the sWards criterion function for a data set with weights and $Y_{1},\ldots,Y_{k}$ is a fixed splitting of $Y$ then

[TABLE]

Proof:

Roughly speaking, Theorem IV.1 says that $\partial_{i}\mathrm{ss}(Y^{w}_{i})=d^{2}(x;Y_{i})$ . Moreover, $\partial_{i}|Y^{w}_{i}|=1$ . Applying the operator $\partial_{i}$ and the above to (5) we easily get the assertion of the theorem. ∎

Consequently, given a partition $Y_{1},\ldots,Y_{k}$ of $Y$ , to associate a point $x\in X$ to a cluster it is sufficient to find $i\in\{1,\ldots,k\}$ which minimizes

[TABLE]

If $X$ is infinite, then one can quantize it into a finite number of regions before constructing the Voronoi diagram. The reader is referred to Figure 3 for a more detailed explanation of the procedure described above.
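The quantization step can be sketched as follows: the space is replaced by a finite grid of cells and each cell is assigned to the cluster with the lowest score. This is only an illustration. The default score used here is the one for the Wards k-means criterion with unit weights (mean squared dissimilarity to the cluster minus $\mathrm{ss}(Y_{i})/|Y_{i}|$), not the sWards score, which would have to be supplied through the `score` argument. All names are hypothetical.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


def kmeans_score(d2_cells_to_cluster, d2_within):
    """Per-cell score for one cluster under the k-means criterion."""
    n = d2_within.shape[0]
    return d2_cells_to_cluster.mean(axis=1) - d2_within.sum() / (2.0 * n * n)


def voronoi_labels(cells, clusters, sq_dissimilarity=sq_dists, score=kmeans_score):
    """Assign every cell to the cluster that minimises its score."""
    scores = np.column_stack([
        score(sq_dissimilarity(cells, c), sq_dissimilarity(c, c))
        for c in clusters
    ])
    return scores.argmin(axis=1)


# Quantise the square [-3, 3]^2 into a 25 x 25 grid of cell centres.
axis = np.linspace(-3.0, 3.0, 25)
cells = np.array([(u, v) for u in axis for v in axis])

rng = np.random.default_rng(1)
clusters = [
    rng.normal(loc=(-1.5, 0.0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(1.5, 0.3), scale=0.5, size=(40, 2)),
]
labels = voronoi_labels(cells, clusters)
```

Because the score needs only pairwise dissimilarities, `sq_dissimilarity` can be replaced by any squared dissimilarity measure on $X$.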

V Experiments

In this section we discuss some fundamental properties and potential applications of the proposed clustering method and present a short evaluation study. The implementation of sWards is available from http://www.ii.uj.edu.pl/~smieja/sWards-app.zip (contact the first author for explanations).

V-A Synthetic data sets

In order to show the capabilities of sWards we examined its robustness to changes of scale and its sensitivity to unbalanced data. We compared the clustering results with those obtained by related methods that can be applied in non Euclidean spaces: Wards k-means and Spectral Clustering (the kernlab R package was used for the implementation of the latter [36]). Since sWards automatically detects the final number of groups, we ran it with 10 initial clusters, while the other methods used the number of groups returned by sWards (this way of setting the number of clusters was chosen so that the clustering results of all methods are comparable). The value of the parameter $N$ (dimension of the space) for sWards was set automatically using the MLE method [12, 13]. To provide more stable results, each algorithm was run 10 times and the result with the lowest value of the criterion function was chosen.

Example V.1

Scale invariance

In the first experiment we examined the invariance of the algorithms to changes of scale. A data set was generated from the mixture of two spherical Gaussian distributions,

[TABLE]

with different covariance matrices

[TABLE]

centered at

[TABLE]

The parameter $r$ controls the width of Gaussians.

Figure 4 presents the ratios of the resulting cluster sizes. The sWards method is robust to changes of scale: the clusters remained almost equally sized for all $r\in(0,1)$ . The result of k-means depended most strongly on the widths of the Gaussians.

Example V.2

Unbalanced data

We have also tested how the number of elements generated from the individual distributions affects the clustering results. For this purpose data was generated from the mixture of two Gaussians

[TABLE]

with identical covariance matrices

[TABLE]

but different centers

[TABLE]

The number of elements generated from each Gaussian is determined by the value of parameter $\omega$ .

The ratios of cluster sizes are shown in Figure 5. One can observe that the proportions specified by $\omega$ were preserved by the sWards method. The results of Spectral Clustering are less stable. Wards k-means, on the other hand, tends to build equally sized clusters.

V-B Dimension estimation

To apply the sWards criterion function in an arbitrary non Euclidean space, the value of the dimension parameter $N$ needs to be specified. In the previous subsection we showed that reasonable clustering results can be obtained when this value is calculated with the MLE method [12, 13]. We will now show experimentally how the clustering results differ when the value of $N$ changes.

Example V.3

Clusters number detection

Let us first examine the impact of the value of the parameter $N$ on the detected number of groups. For this purpose a mouse-like set (see Figure 2) was clustered with different values of $N$ , starting from 100 initial groups. The resulting numbers of groups are illustrated in Figure 6.

The immediate observation is that the increase of the value of $N$ results in the increase of the detected number of groups. One can observe that for $N<1$ the entire data set was recognized as one group. For $N\in(1,2)$ the mouse-like set was partitioned into three groups which seems to be the most appropriate partitioning. For $N>2$ the number of groups began to grow rapidly.

Example V.4

Shape of criterion function

To get more insight into the influence of the dimension parameter on the discovered number of clusters, we analyzed the shape of the sWards criterion function for different values of $N$ . Since sWards automatically reduces unnecessary clusters, it is not possible to directly specify the number of groups. Therefore, a mouse-like data set (see Figure 2) was first partitioned into the expected number of groups using k-means. Then, the sWards criterion function was calculated for each partition.

It is clear from Figure 8 that the criterion function attains a global minimum for 3 clusters when $N=2$ . Therefore, in most cases the algorithm ends with 3 groups. For $N=1$ the cost of maintaining clusters increases and the algorithm generally includes all elements in one group. For $N=3$ the function is decreasing, which means that the method rarely reduces clusters. The last case can be a very useful variant of sWards when the resulting number of groups should not be discovered by the algorithm but specified directly by the user.

V-C Applications

In this section we show that the proposed method is very useful in the analysis of biological models of populations. It is assumed that a population follows a random walk model $P(x,n,t)$ on a plane [37], where at each unit of time an instance moves randomly in one of four directions: left, right, up or down. More precisely, given a starting point (seed) $x\in X$ , $n$ instances are generated from a random walk model assuming $t$ time units. It is worth mentioning that the probability distribution of a population converges to a spherical Gaussian [37]. Given a data set consisting of $k$ populations we would like to discover them during a clustering process. The constructed Voronoi diagram determines the corresponding population districts in the whole space.
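The random walk model $P(x,n,t)$ can be simulated in a few lines. The sketch below is a minimal illustration, assuming unit steps on an integer lattice and equal probability for the four directions; the function name and these details are assumptions, not taken from the original experiments.

```python
import numpy as np

# Unit moves: left, right, up, down.
STEPS = np.array([[-1, 0], [1, 0], [0, 1], [0, -1]])


def random_walk_population(seed_point, n, t, rng):
    """Final positions of n instances started at seed_point after t time units.

    Assumption: each of the four directions is chosen with probability 1/4.
    """
    choices = rng.integers(0, 4, size=(n, t))   # one direction per instance and time unit
    displacement = STEPS[choices].sum(axis=1)    # shape (n, 2)
    return np.asarray(seed_point) + displacement


rng = np.random.default_rng(0)
population = random_walk_population((5, -2), n=200, t=15, rng=rng)
```

A data set of $k$ populations is then obtained by calling the function once per seed and stacking the results.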

Let us observe that, in practice, the environment does not represent a Euclidean space. Indeed, a plane is usually crossed by rivers and barriers. Moreover, the environment can be divided into various regions, e.g. meadows, seas, forests etc., which change the speed of movement of individuals. These modifications change the classical Euclidean metric: the distance between elements has to take into account all the aforementioned circumstances. In the experiments we analyze two cases of population environments.

Example V.5

Environment with barriers. Let us consider three populations living in the environment shown in Figure 7, which is crossed by two barriers that modify the Euclidean distance function. Basically, the distance between elements located on opposite sides of a barrier is calculated as the length of the shortest path which does not cross the barrier.

Regions occupied by populations can be obtained with use of the Voronoi diagram. It is clear from Figure 7(a) that Wards k-means discovered population districts as horizontal stripes, which is not an appropriate model. A more accurate partition results from sWards (see Figure 7(b)), where the detected regions form circular shapes. Partition agreement measured by the Rand index [38] equals $96\%$ for Wards k-means and $98\%$ for sWards.
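The barrier-aware distance described in Example V.5 can be approximated on a discretized plane with a breadth-first search. The sketch below is a grid approximation under assumptions of our own: moves are restricted to the four lattice directions, so path lengths are measured in lattice steps rather than Euclidean units. It is not the procedure used to produce Figure 7.

```python
from collections import deque


def grid_path_length(blocked, start, goal):
    """Number of 4-connected steps on the shortest path avoiding blocked cells.

    blocked[r][c] is True where a barrier occupies the cell.
    Returns None when the goal cannot be reached.
    """
    rows, cols = len(blocked), len(blocked[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not blocked[nr][nc] and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None


# 5 x 5 environment; the barrier fills column 2 in rows 0..3.
open_grid = [[False] * 5 for _ in range(5)]
walled = [[c == 2 and r < 4 for c in range(5)] for r in range(5)]

print(grid_path_length(open_grid, (0, 0), (0, 4)))  # 4
print(grid_path_length(walled, (0, 0), (0, 4)))     # 12
```

The barrier lengthens the path from 4 to 12 steps because the only gap is in the bottom row; squaring such path lengths gives the $d^{2}$ values used by the criterion functions.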

Example V.6

Environment with regions. In the second example let us assume that a data space $X$ is divided into two regions $X_{1}$ and $X_{5}$ . In $X_{5}$ the individuals move 5 times faster than in $X_{1}$ . This induces a dissimilarity measure on $X$ by:

[TABLE]

where $d_{E}(\cdot,\cdot)$ denotes the Euclidean distance. We consider two populations shown in Figure 9 with starting points marked with white dots in $X_{1}$ and $X_{5}$ respectively.

One can observe in Figure 9(b) that despite the form of the above dissimilarity measure, sWards detected the circular-like districts of populations very well. This result can be compared with k-means clustering (see Figure 9(a)), where the produced partition does not match the population distributions. The value of the Rand index equals $92\%$ for sWards and $61\%$ for Wards k-means.

V-D Evaluation

After establishing the properties and demonstrating the basic capabilities and potential applications of the introduced method, we present a short evaluation. We carried out the experiments on selected UCI data sets [14]. In all experiments the initial number of clusters for sWards was set to twice the actual number of groups. To keep the clustering results comparable, the other examined methods took the number of groups returned by sWards as the input number of clusters.

The Rand index (RI) [38] was used as a measure of agreement between partitions. It is defined as the ratio of the number of pairs of examples on which two partitions agree (true positives and true negatives) to the number of all pairs of examples. Values close to $1$ indicate that two partitions are very similar. MLE was used to calculate the optimal value of the parameter $N$ . Two kinds of dissimilarity measures were considered: the Euclidean distance and the dissimilarity determined by the Gaussian radial basis function (RBF). The value of sigma for RBF was estimated as the median of the squared distances between all pairs of data set elements [39].
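For reference, the Rand index counts the pairs of examples on which two partitions agree, i.e. pairs grouped together in both partitions (true positives) or separated in both (true negatives), and divides by the number of all pairs. A minimal sketch with a hypothetical function name, using a plain quadratic loop for clarity:

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same examples")
    if len(labels_a) < 2:
        raise ValueError("at least two examples are required")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b   # both together or both apart
        total += 1
    return agree / total


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0, labels are only renamed
```

The index is invariant to renaming of cluster labels, which is why it can compare a clustering with reference classes directly.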

The results presented in Table I show that sWards determined the final number of groups reasonably well. The advantage of our method over k-means and Spectral Clustering is most evident for the Ecoli data set with the Euclidean distance. The worst results were obtained for the Ionosphere data set. The use of the RBF dissimilarity rarely improved the accuracy of clustering. This could be caused by the fact that it is very difficult to set the optimal value of the RBF sigma parameter in a particular situation.

To extend the above evaluation, in Figure 10 we present the clustering accuracies on UCI data sets for a wide range of the dimension parameter $N\in(0,10)$ . One can observe that in most cases the best results were obtained when $N$ was estimated as the dimension of the data. The exceptions are the Glass and Yeast data sets, where a slight improvement was achieved for higher values of $N$ .

VI Conclusion

In this paper a generalization of spherical Cross-Entropy Clustering to non Euclidean spaces was presented. The proposed method uses a Wards approach to modify the cross-entropy criterion function for the case of arbitrary data sets. In consequence, the obtained method allows for partitioning of non vector data into spherically-shaped clusters of arbitrary sizes. It is a scale invariant technique which detects the final number of groups automatically. Our method works in time comparable to the generalized Wards method, while the clustering results are similar to those produced by GMM when focusing on spherical Gaussian distributions in Euclidean spaces.

Moreover, we generalized the notion of Voronoi diagram to the case of an arbitrary criterion function based on the Wards approach. This leads to identical results in the case of classical methods such as k-means, while it allows for a formal division of the data space when focusing on non Euclidean methods such as sWards.

Acknowledgment

This work was partially funded by the Polish Ministry of Science and Higher Education from the budget for science in the years 2013–2015, Grant no. IP2012 055972 and by the National Science Centre (Poland), Grant No. 2014/13/B/ST6/01792.

Bibliography

[1] G. McLachlan and T. Krishnan, The EM algorithm and extensions. John Wiley & Sons, 2007, vol. 382.
[2] M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 3, pp. 381–396, 2002.
[3] J. Samuelsson, “Waveform quantization of speech using Gaussian mixture models,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 1. IEEE, 2004, pp. I–165.
[4] E. P. Xing, M. I. Jordan, R. M. Karp et al., “Feature selection for high-dimensional genomic microarray data,” in Proceedings of the 18th International Conference on Machine Learning, vol. 1. Citeseer, 2001, pp. 601–608.
[5] R. J. Povinelli, M. T. Johnson, A. C. Lindgren, and J. Ye, “Time series classification using Gaussian mixture models of reconstructed phase spaces,” Knowledge and Data Engineering, IEEE Transactions on, vol. 16, no. 6, pp. 779–783, 2004.
[6] J. Gasteiger, Handbook of chemoinformatics. Wiley Online Library, 2003, vol. 1.
[7] S. L. Dixon and H. O. Villar, “Investigation of classification methods for the prediction of activity in diverse chemical libraries,” Journal of Computer-Aided Molecular Design, vol. 13, no. 5, pp. 533–545, 1999.
[8] O. Zamir and O. Etzioni, “Web document clustering: A feasibility demonstration,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1998, pp. 46–54.