An Aposteriorical Clusterability Criterion for $k$-Means++ and Simplicity of Clustering

Mieczysław A. Kłopotek

arXiv:1704.07139·cs.LG·April 7, 2020

TL;DR

This paper introduces a new a posteriori criterion for assessing the clusterability of data sets in $k$-means clustering, enabling efficient validation of clustering quality after algorithm execution.

Contribution

It proposes a novel clusterability check that is computationally feasible and does not require identifying the optimal clustering, unlike previous methods.

Findings

01

The criterion can be applied after running $k$-means to verify clusterability.

02

If $k$-means++ fails to find a well-clusterable clustering, the data is likely not well-clusterable.

03

The check has polynomial complexity, making it practical for real-world data sets.
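
Finding 02 relies on the seeding step of $k$-means++. As a reminder of how that seeding works, the sketch below implements the standard $D^2$ sampling rule in plain Python: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance to the nearest seed chosen so far. The names `kmeanspp_seeds` and `sq_dist` are illustrative and not taken from the paper, and the sketch covers only the seeding, not the subsequent $k$-means iterations.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centres by D^2 sampling (k-means++ seeding only).

    Illustrative sketch: `points` is a list of equal-length tuples and
    `rng` is an optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random(0)
    # First seed: uniform choice among all points.
    seeds = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest seed.
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Degenerate case: every point coincides with some seed.
            nxt = rng.choice(points)
        else:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            nxt = points[idx]
        seeds.append(nxt)
        # Refresh nearest-seed distances with the newly added seed.
        d2 = [min(old, sq_dist(p, nxt)) for p, old in zip(points, d2)]
    return seeds
```

Because an already chosen point has weight zero, distinct input points yield distinct seeds, and points far from every current seed are the most likely to be picked next.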

Abstract

We define the notion of a well-clusterable data set by combining the objective of the $k$-means clustering algorithm (minimising the centric spread of data elements) with common sense (clusters should be separated by gaps). We identify conditions under which the optimum of the $k$-means objective coincides with a clustering under which the data is separated by predefined gaps. We investigate two cases: when whole clusters are separated by some gap, and when only the cores of the clusters meet some separation condition. We overcome a major obstacle to using clusterability criteria: known approaches to clusterability checking are tied to the optimal clustering, which is NP-hard to identify. Compared to other approaches to clusterability, the novelty consists in the possibility of an a posteriori (after running…
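
To make the idea of an a posteriori check concrete, the following sketch tests whether the clusters returned by some clustering run are separated by gaps. It is a deliberately simplified stand-in, not the criterion derived in the paper: each cluster is enclosed in a ball around its centroid, and the check passes when every pair of balls is at least `gap_ratio` times the largest radius apart. The function names and the `gap_ratio` parameter are assumptions made for illustration; the default of 2 merely echoes the $g/r_{max} = 2$ setting mentioned in the Table 1 caption.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def gap_check(clusters, gap_ratio=2.0):
    """Simplified a posteriori separation test (illustrative assumption).

    Returns True when, for every pair of clusters, the distance between
    their enclosing balls is at least gap_ratio * r_max, where r_max is
    the largest cluster radius.
    """
    centres = [centroid(c) for c in clusters]
    # Radius of a cluster: largest distance from its centroid to a member.
    radii = [max(math.dist(p, ctr) for p in cl)
             for cl, ctr in zip(clusters, centres)]
    required_gap = gap_ratio * max(radii)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            gap = math.dist(centres[i], centres[j]) - radii[i] - radii[j]
            if gap < required_gap:
                return False
    return True
```

The check needs only the clustering that was actually produced, never the optimal one, and for $n$ points in $d$ dimensions split into $k$ clusters it costs on the order of $nd + k^2 d$ operations, which is in the spirit of the polynomial complexity noted in the findings.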

Tables

Table 1: Dependence of the number of errors on the $M$ to $m$ proportion under a fixed gap $g/r_{max} = 2$. Other parameters fixed at $k = 2$, $d = 2$, $breakgap = 1$.

$M, m =$	15, 9	480, 9	720, 9	960, 9
Hartigan-Wong 1x	0 $\pm$ 0	961.3 $\pm$ 8.8	974.3 $\pm$ 3.2	980.9 $\pm$ 4.9
1000 execs in s / relQ	0.32 / 1	0.57 / 1.07	0.67 / 1	0.79 / 1
Lloyd 1x	1.7 $\pm$ 3.6	962.5 $\pm$ 5.6	974.1 $\pm$ 4.9	980.4 $\pm$ 4.5
1000 execs in s / relQ	0.31 / 1.01	0.63 / 1.08	0.71 / 1.01	0.79 / 1.01
Forgy 1x	0.8 $\pm$ 1.8	963.3 $\pm$ 6.5	975.3 $\pm$ 3.7	980.9 $\pm$ 4.6
1000 execs in s / relQ	0.3 / 1.01	0.63 / 1.08	0.71 / 1.01	0.79 / 1.01
MacQueen 1x	1.2 $\pm$ 3.2	961.8 $\pm$ 4.6	973.1 $\pm$ 5.2	982.9 $\pm$ 5.9
1000 execs in s / relQ	0.31 / 1.01	0.54 / 1.07	0.64 / 1.01	0.72 / 1
Hartigan-Wong 10x	0 $\pm$ 0	680.9 $\pm$ 25.2	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.81 / 1	7.23 / 1.05	10.24 / 1	13.42 / 1
Lloyd 10x	0 $\pm$ 0	678.6 $\pm$ 10.1	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.72 / 1	7.92 / 1.05	10.73 / 1	13.6 / 1
Forgy 10x	0 $\pm$ 0	666.7 $\pm$ 14.2	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.72 / 1	7.94 / 1.05	10.73 / 1	13.52 / 1
MacQueen 10x	0 $\pm$ 0	677.2 $\pm$ 12	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.73 / 1	7.15 / 1.05	9.99 / 1	12.92 / 1
Hartigan-Wong 20x	0 $\pm$ 0	457.5 $\pm$ 14	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	1.08 / 1	8.75 / 1.03	12.63 / 1	16.59 / 1
Lloyd 20x	0 $\pm$ 0	451.6 $\pm$ 9.5	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.9 / 1	9.96 / 1.03	13.58 / 1	16.79 / 1
Forgy 20x	0 $\pm$ 0	457.8 $\pm$ 12.4	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.9 / 1	9.95 / 1.03	13.44 / 1	16.78 / 1
MacQueen 20x	0 $\pm$ 0	452.6 $\pm$ 14.7	1000 $\pm$ 0	1000 $\pm$ 0
1000 execs in s / relQ	0.89 / 1	8.54 / 1.03	12.11 / 1	15.69 / 1
cmeans Fuzzy	0 $\pm$ 0	973.7 $\pm$ 7.9	995.4 $\pm$ 6.6	1000 $\pm$ 0
1000 execs in s / relQ	0.67 / 1	5.37 / 1.09	7.37 / 1.02	9.43 / 1.02
ufcl Fuzzy	235.5 $\pm$ 59.6	964.4 $\pm$ 8.9	976.8 $\pm$ 6.5	981.3 $\pm$ 5.4
1000 execs in s / relQ	0.8 / 1.52	6.3 / 1.14	8.79 / 1.06	11.22 / 1.05
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.2 / 1	13.76 / 1	34.39 / 1.06	49.01 / 1.15
kmeans++	0 $\pm$ 0	734.7 $\pm$ 17	811 $\pm$ 11.5	844 $\pm$ 12.5
1000 execs in s / relQ	1.14 / 1	16.01 / 1.05	22.53 / 1.01	28.77 / 1.02
kmeans++ 2x	0 $\pm$ 0	544.7 $\pm$ 28.2	959.3 $\pm$ 6.1	976.9 $\pm$ 4.6
1000 execs in s / relQ	2.26 / 1	31.96 / 1.04	44.38 / 1	57.49 / 1

Table 2: Reconstruction of the clusters in synthetic data. Notation in text.

Dataset	Errors ($k$-means)	Errors ($k$-means++)	WC disc. (cc)	WC disc. (wc)	WC not disc. (cc)	WC not disc. (wc)
$k = 2, 𝔭 = 0, g p = 0.5$	0	0	0	0	100	0
$k = 2, 𝔭 = 0.1, g p = 0.5$	0	0	0	0	100	0
$k = 3, 𝔭 = 0, g p = 0.5$	38	0	0	0	100	0
$k = 3, 𝔭 = 0.1, g p = 0.5$	26	0	0	0	100	0
$k = 5, 𝔭 = 0, g p = 0.5$	74	0	0	0	100	0
$k = 5, 𝔭 = 0.1, g p = 0.5$	66	0	0	0	100	0
$k = 2, 𝔭 = 0, g p = 0.7$	0	0	0	0	100	0
$k = 2, 𝔭 = 0.1, g p = 0.7$	0	0	0	0	100	0
$k = 3, 𝔭 = 0, g p = 0.7$	50	2	0	0	98	2
$k = 3, 𝔭 = 0.1, g p = 0.7$	37	0	0	0	100	0
$k = 5, 𝔭 = 0, g p = 0.7$	78	0	0	0	100	0
$k = 5, 𝔭 = 0.1, g p = 0.7$	73	0	0	0	100	0
$k = 2, 𝔭 = 0, g p = 0.9$	0	0	0	0	100	0
$k = 2, 𝔭 = 0.1, g p = 0.9$	0	0	0	0	100	0
$k = 3, 𝔭 = 0, g p = 0.9$	41	1	0	0	99	1
$k = 3, 𝔭 = 0.1, g p = 0.9$	29	1	0	0	99	1
$k = 5, 𝔭 = 0, g p = 0.9$	74	0	0	0	100	0
$k = 5, 𝔭 = 0.1, g p = 0.9$	78	0	0	0	100	0
$k = 2, 𝔭 = 0, g p = 1.1$	0	0	100	0	0	0
$k = 2, 𝔭 = 0.1, g p = 1.1$	0	0	100	0	0	0
$k = 3, 𝔭 = 0, g p = 1.1$	41	0	100	0	0	0
$k = 3, 𝔭 = 0.1, g p = 1.1$	39	0	100	0	0	0
$k = 5, 𝔭 = 0, g p = 1.1$	80	0	100	0	0	0
$k = 5, 𝔭 = 0.1, g p = 1.1$	80	0	100	0	0	0
$k = 2, 𝔭 = 0, g p = 1.3$	0	0	100	0	0	0
$k = 2, 𝔭 = 0.1, g p = 1.3$	0	0	100	0	0	0
$k = 3, 𝔭 = 0, g p = 1.3$	43	0	100	0	0	0
$k = 3, 𝔭 = 0.1, g p = 1.3$	38	0	100	0	0	0
$k = 5, 𝔭 = 0, g p = 1.3$	80	0	100	0	0	0
$k = 5, 𝔭 = 0.1, g p = 1.3$	75	0	100	0	0	0
$k = 2, 𝔭 = 0, g p = 1.5$	0	0	100	0	0	0
$k = 2, 𝔭 = 0.1, g p = 1.5$	0	0	100	0	0	0
$k = 3, 𝔭 = 0, g p = 1.5$	33	1	99	0	0	1
$k = 3, 𝔭 = 0.1, g p = 1.5$	33	0	100	0	0	0
$k = 5, 𝔭 = 0, g p = 1.5$	75	0	100	0	0	0
$k = 5, 𝔭 = 0.1, g p = 1.5$	79	0	100	0	0	0

Table 3: Reconstruction of the clusters in real data from the R library datasets. Notation in text.

Dataset	gap type	clusters	$𝔭$	Errors ($k$-means)	Errors ($k$-means++)	WC disc. (cc)	WC disc. (wc)	WC not disc. (cc)	WC not disc. (wc)
DNase	orig	2	0	69	39	0	0	61	39
DNase	orig	2	0.1	59	38	0	0	62	38
DNase	g/2	2	0	0	0	0	0	100	0
DNase	g/2	2	0.1	0	0	0	0	100	0
DNase	g	2	0	0	0	100	0	0	0
DNase	g	2	0.1	0	0	100	0	0	0
DNase	2g	2	0	0	0	100	0	0	0
DNase	2g	2	0.1	0	0	100	0	0	0
DNase	orig	3	0	81	25	0	0	75	25
DNase	orig	3	0.1	80	28	0	0	72	28
DNase	g/2	3	0	63	1	0	0	99	1
DNase	g/2	3	0.1	59	1	0	0	99	1
DNase	g	3	0	58	0	100	0	0	0
DNase	g	3	0.1	51	0	100	0	0	0
DNase	2g	3	0	50	0	100	0	0	0
DNase	2g	3	0.1	49	0	100	0	0	0
DNase	orig	5	0	75	35	0	0	65	35
DNase	orig	5	0.1	79	30	0	0	70	30
DNase	g/2	5	0	84	0	0	0	100	0
DNase	g/2	5	0.1	86	0	0	0	100	0
DNase	g	5	0	80	0	100	0	0	0
DNase	g	5	0.1	79	0	100	0	0	0
DNase	2g	5	0	83	0	100	0	0	0
DNase	2g	5	0.1	81	0	100	0	0	0
iris	orig	2	0	0	0	0	0	100	0
iris	orig	2	0.1	0	0	0	0	100	0
iris	g/2	2	0	0	0	0	0	100	0
iris	g/2	2	0.1	0	0	0	0	100	0
iris	g	2	0	0	0	100	0	0	0
iris	g	2	0.1	0	0	100	0	0	0
iris	2g	2	0	0	0	100	0	0	0
iris	2g	2	0.1	0	0	100	0	0	0
iris	orig	3	0	19	17	0	0	83	17
iris	orig	3	0.1	23	15	0	0	85	15
iris	g/2	3	0	35	0	0	0	100	0

$k =$	2	4	8	16
Hartigan-Wong 1x	0 $\pm$ 0	583 $\pm$ 158.5	938.3 $\pm$ 18.7	999.4 $\pm$ 0.8
1000 execs in s / relQ	0.24 / 1	0.3 / 701.98	0.44 / 54638.17	0.68 / 3276669.6
Lloyd 1x	0 $\pm$ 0	655.8 $\pm$ 78.2	952.7 $\pm$ 9.4	999.7 $\pm$ 0.5
1000 execs in s / relQ	0.23 / 1	0.29 / 1056.44	0.43 / 71083.66	0.66 / 4201145
Forgy 1x	0 $\pm$ 0	645 $\pm$ 75.3	950.1 $\pm$ 8.1	999.9 $\pm$ 0.3
1000 execs in s / relQ	0.23 / 1	0.28 / 1112.96	0.43 / 71431.91	0.66 / 4175279.99
MacQueen 1x	0 $\pm$ 0	642.5 $\pm$ 64.2	949.9 $\pm$ 14.7	999.6 $\pm$ 0.7
1000 execs in s / relQ	0.23 / 1	0.28 / 1109.83	0.43 / 73197.95	0.66 / 4131048.66
Hartigan-Wong 10x	0 $\pm$ 0	13 $\pm$ 10.4	540.9 $\pm$ 107.6	991.5 $\pm$ 10.7
1000 execs in s / relQ	0.58 / 1	0.61 / 7.26	0.99 / 4419.42	1.38 / 816542.71
Lloyd 10x	0 $\pm$ 0	18.8 $\pm$ 16.9	621.4 $\pm$ 55.9	995.2 $\pm$ 4.6
1000 execs in s / relQ	0.51 / 1	0.55 / 9.83	0.92 / 6123.88	1.27 / 1063785.95
Forgy 10x	0 $\pm$ 0	20.8 $\pm$ 15.4	613.9 $\pm$ 58.5	996.2 $\pm$ 2.1
1000 execs in s / relQ	0.51 / 1	0.55 / 11.67	0.92 / 5976.84	1.27 / 1059280.68
MacQueen 10x	0 $\pm$ 0	16.2 $\pm$ 11.9	616.8 $\pm$ 68.3	996.1 $\pm$ 4.4
1000 execs in s / relQ	0.51 / 1	0.53 / 9.28	0.87 / 5611.13	1.24 / 1044377.39
Hartigan-Wong 20x	0 $\pm$ 0	0 $\pm$ 0	281.8 $\pm$ 101.9	986 $\pm$ 17.2
1000 execs in s / relQ	0.76 / 1	0.83 / 1	1.28 / 1681.53	1.78 / 536122.93
Lloyd 20x	0 $\pm$ 0	0.4 $\pm$ 0.7	385.6 $\pm$ 66.5	992.8 $\pm$ 6.3
1000 execs in s / relQ	0.63 / 1	0.72 / 1.13	1.14 / 2597.6	1.58 / 693511.33
Forgy 20x	0 $\pm$ 0	0.8 $\pm$ 1	389.2 $\pm$ 62.2	992.6 $\pm$ 6.5
1000 execs in s / relQ	0.63 / 1	0.71 / 1.37	1.14 / 2670.53	1.58 / 680391.36
MacQueen 20x	0 $\pm$ 0	0.7 $\pm$ 0.8	381.4 $\pm$ 91.3	992.7 $\pm$ 6.3
1000 execs in s / relQ	0.63 / 1	0.66 / 1.37	1.06 / 2374.48	1.52 / 695009
cmeans Fuzzy	0 $\pm$ 0	163 $\pm$ 184.5	40 $\pm$ 63.3	35.1 $\pm$ 67.3
1000 execs in s / relQ	0.48 / 1	0.56 / 48.39	1.02 / 148.2	2.05 / 1213.9
ufcl Fuzzy	7.8 $\pm$ 6.4	725 $\pm$ 62	971.4 $\pm$ 7.5	999.8 $\pm$ 0.4
1000 execs in s / relQ	0.56 / 1.14	0.68 / 1646.94	1.51 / 109702.3	3.03 / 5599855.51
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.15 / 1	0.18 / 1	0.33 / 1	0.56 / 1
kmeans++	0 $\pm$ 0	2.1 $\pm$ 1.6	0.2 $\pm$ 0.4	0 $\pm$ 0
1000 execs in s / relQ	0.78 / 1	1.78 / 1.99	9.4 / 1.55	33.43 / 1
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	1.51 / 1	3.53 / 1	18.76 / 1	66.45 / 1

$d =$	2	4	8	16
Hartigan-Wong 1x	718.9 $\pm$ 65.5	695.8 $\pm$ 46	748.9 $\pm$ 43.2	750.8 $\pm$ 28.7
1000 execs in s / relQ	0.34 / 1762.09	0.42 / 749.98	0.59 / 474.19	0.92 / 447.61
Lloyd 1x	741.8 $\pm$ 59.9	721.2 $\pm$ 31.9	779.8 $\pm$ 35.2	764.7 $\pm$ 14.2
1000 execs in s / relQ	0.34 / 2108.83	0.41 / 860.33	0.57 / 520.88	0.9 / 470.34
Forgy 1x	742.5 $\pm$ 52.2	718.5 $\pm$ 28.7	773.6 $\pm$ 41.9	756.3 $\pm$ 16.7
1000 execs in s / relQ	0.33 / 2116.47	0.41 / 844.73	0.57 / 517.54	0.89 / 467.36
MacQueen 1x	747 $\pm$ 47.3	716.2 $\pm$ 38.5	765.4 $\pm$ 46.7	751.2 $\pm$ 22.3
1000 execs in s / relQ	0.34 / 2173.94	0.41 / 831	0.57 / 503.37	0.9 / 459.45
Hartigan-Wong 10x	45 $\pm$ 33.8	35 $\pm$ 21	61.3 $\pm$ 26.4	52.9 $\pm$ 13.6
1000 execs in s / relQ	0.77 / 28.8	0.92 / 14.86	1.28 / 18.92	2.11 / 21.8
Lloyd 10x	57.1 $\pm$ 37	37.2 $\pm$ 14.4	86 $\pm$ 41.6	63.4 $\pm$ 14.8
1000 execs in s / relQ	0.67 / 35.61	0.85 / 15.55	1.21 / 26.3	2 / 26.12
Forgy 10x	56.7 $\pm$ 35	38.7 $\pm$ 15.2	86.1 $\pm$ 47.1	63.3 $\pm$ 12.5
1000 execs in s / relQ	0.68 / 35.61	0.85 / 16.23	1.22 / 26.04	2.01 / 26.02
MacQueen 10x	55.4 $\pm$ 40.6	41.7 $\pm$ 17.7	85.5 $\pm$ 49.9	63.1 $\pm$ 11.6
1000 execs in s / relQ	0.67 / 34.85	0.82 / 17.41	1.18 / 25.78	2 / 25.96
Hartigan-Wong 20x	3.7 $\pm$ 4.1	0.8 $\pm$ 1.3	3.5 $\pm$ 3.5	4.2 $\pm$ 2.8
1000 execs in s / relQ	1 / 3.29	1.16 / 1.33	1.54 / 1.98	2.42 / 2.63
Lloyd 20x	4.1 $\pm$ 4	1.9 $\pm$ 1.8	10.1 $\pm$ 9.5	3.5 $\pm$ 1.7
1000 execs in s / relQ	0.84 / 3.28	1.03 / 1.73	1.42 / 3.79	2.25 / 2.41
Forgy 20x	4.4 $\pm$ 6.9	2 $\pm$ 1.9	9.9 $\pm$ 12.3	4.3 $\pm$ 2.5
1000 execs in s / relQ	0.84 / 3.41	1.03 / 1.75	1.42 / 3.68	2.25 / 2.67
MacQueen 20x	4.8 $\pm$ 5.4	2.3 $\pm$ 2.5	8.6 $\pm$ 9.2	3.2 $\pm$ 1.9
1000 execs in s / relQ	0.82 / 3.85	0.98 / 1.83	1.36 / 3.42	2.21 / 2.27
cmeans Fuzzy	155 $\pm$ 159.3	16 $\pm$ 35.3	19.2 $\pm$ 60.7	0 $\pm$ 0
1000 execs in s / relQ	0.71 / 101.05	0.78 / 7.21	0.95 / 6.14	1.33 / 1
ufcl Fuzzy	800.7 $\pm$ 37.8	797.8 $\pm$ 28.4	867.8 $\pm$ 21.9	883.5 $\pm$ 20.3
1000 execs in s / relQ	0.92 / 5952.77	1.1 / 1097.38	1.57 / 649.79	2.69 / 522.86
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.24 / 1	0.29 / 1	0.43 / 1	0.7 / 1
kmeans++	1.2 $\pm$ 0.9	1.9 $\pm$ 1.9	4.8 $\pm$ 3.6	2.7 $\pm$ 1.7
1000 execs in s / relQ	3.45 / 1.75	3.34 / 2.06	3.69 / 2.54	4.52 / 2.17
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	6.83 / 1	6.58 / 1	7.09 / 1	8.53 / 1

$M, m =$	15, 9	30, 9	60, 9	120, 9
Hartigan-Wong 1x	723.7 $\pm$ 43.5	801 $\pm$ 58.1	762.9 $\pm$ 49.4	736.1 $\pm$ 76.4
1000 execs in s / relQ	0.36 / 6216.55	0.37 / 10441.43	0.41 / 2427.55	0.45 / 5018.41
Lloyd 1x	730.4 $\pm$ 50.7	804.9 $\pm$ 67.4	777.3 $\pm$ 54.6	733.8 $\pm$ 88.3
1000 execs in s / relQ	0.35 / 7085.08	0.36 / 11054.14	0.4 / 2456.29	0.45 / 5112.28
Forgy 1x	731.3 $\pm$ 50.2	809.8 $\pm$ 67	775.2 $\pm$ 57.9	748.9 $\pm$ 81
1000 execs in s / relQ	0.34 / 7404.61	0.36 / 10798.16	0.41 / 2508.64	0.45 / 5033.54
MacQueen 1x	726.8 $\pm$ 44.9	809 $\pm$ 70.7	777.7 $\pm$ 53	741.3 $\pm$ 88.3
1000 execs in s / relQ	0.35 / 7206.3	0.35 / 11617.69	0.39 / 2508.61	0.43 / 4884.21
Hartigan-Wong 10x	44.4 $\pm$ 26.7	126.3 $\pm$ 73.5	83.8 $\pm$ 52	69 $\pm$ 47.1
1000 execs in s / relQ	1.1 / 24.72	1.32 / 119.48	2.56 / 59.89	3.79 / 47.41
Lloyd 10x	46.6 $\pm$ 29.8	138.8 $\pm$ 88.7	93.2 $\pm$ 55.9	75.2 $\pm$ 51.4
1000 execs in s / relQ	1.01 / 25.78	1.23 / 114.56	2.57 / 63.23	3.83 / 51.64
Forgy 10x	48.7 $\pm$ 34.1	143.5 $\pm$ 91.1	95 $\pm$ 56.1	69.9 $\pm$ 49.1
1000 execs in s / relQ	1.01 / 27.43	1.22 / 149.46	2.56 / 64.55	3.82 / 48.17
MacQueen 10x	46.5 $\pm$ 25.4	138.6 $\pm$ 87.6	82.5 $\pm$ 52.4	66.9 $\pm$ 46.8
1000 execs in s / relQ	0.99 / 26.53	1.19 / 142.97	2.38 / 57.64	3.57 / 46.12
Hartigan-Wong 20x	2.2 $\pm$ 3.6	17.2 $\pm$ 15.7	8.7 $\pm$ 11.2	6.3 $\pm$ 6.8
1000 execs in s / relQ	1.38 / 2.15	1.63 / 9.17	3.04 / 6.78	4.45 / 4.91
Lloyd 20x	3.5 $\pm$ 4.4	26.3 $\pm$ 27.5	11 $\pm$ 13.7	8.6 $\pm$ 10
1000 execs in s / relQ	1.19 / 2.81	1.45 / 16.83	3.08 / 7.88	4.58 / 6.92
Forgy 20x	3.2 $\pm$ 4	27.7 $\pm$ 28.1	11.2 $\pm$ 10.3	7.9 $\pm$ 8.4
1000 execs in s / relQ	1.19 / 2.62	1.45 / 13.49	3.08 / 7.91	4.57 / 6.1
MacQueen 20x	3.8 $\pm$ 6.1	23.7 $\pm$ 24.6	9.8 $\pm$ 12.4	7.4 $\pm$ 8.1
1000 execs in s / relQ	1.17 / 2.89	1.4 / 11.74	2.72 / 7.28	4.07 / 5.74
cmeans Fuzzy	100 $\pm$ 120.3	119.2 $\pm$ 113.8	46.9 $\pm$ 51.9	59 $\pm$ 83
1000 execs in s / relQ	0.99 / 51.88	1.15 / 64.07	2.09 / 35.43	3.07 / 48.86
ufcl Fuzzy	810.6 $\pm$ 19.5	866.6 $\pm$ 23.3	828.8 $\pm$ 31.9	846.4 $\pm$ 25.1
1000 execs in s / relQ	1.48 / 15995.33	1.83 / 14805.01	3.88 / 3690.75	5.91 / 14135.23
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.34 / 1	0.44 / 1	1.47 / 1	3.4 / 1
kmeans++	1.3 $\pm$ 1.1	1.3 $\pm$ 1.1	1.4 $\pm$ 1.2	1.1 $\pm$ 1.4
1000 execs in s / relQ	7.25 / 1.7	9.59 / 1.63	23.52 / 2.69	37.31 / 1.69
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	14.48 / 1	19.12 / 1	47.03 / 1	74.88 / 1

$M, m =$	15, 9	30, 18	60, 36	120, 72
Hartigan-Wong 1x	714.5 $\pm$ 55.3	726.7 $\pm$ 41	751.6 $\pm$ 35.1	739.4 $\pm$ 48.6
1000 execs in s / relQ	0.36 / 2222.82	0.38 / 9506.9	0.4 / 2452.25	0.48 / 1017.77
Lloyd 1x	731.6 $\pm$ 39.9	737.9 $\pm$ 34.2	743.6 $\pm$ 42.8	733.5 $\pm$ 34.2
1000 execs in s / relQ	0.35 / 2542.88	0.37 / 9537.46	0.4 / 2376.59	0.48 / 1011.3
Forgy 1x	725.9 $\pm$ 46.8	734.9 $\pm$ 49	756.1 $\pm$ 39	739.2 $\pm$ 36.8
1000 execs in s / relQ	0.35 / 2599.24	0.37 / 9968.81	0.39 / 2606.82	0.48 / 1007.41
MacQueen 1x	736.8 $\pm$ 41.3	735.2 $\pm$ 39.8	753 $\pm$ 42.5	734.5 $\pm$ 35.1
1000 execs in s / relQ	0.35 / 2547.24	0.36 / 9369.12	0.38 / 2509.95	0.45 / 970.99
Hartigan-Wong 10x	41.8 $\pm$ 25.7	48.1 $\pm$ 29.8	61.1 $\pm$ 29.7	54.1 $\pm$ 32.9
1000 execs in s / relQ	1.05 / 37.25	1.84 / 28.47	2.36 / 39.86	4.83 / 19.06
Lloyd 10x	47.3 $\pm$ 32.4	53.7 $\pm$ 27.2	64.8 $\pm$ 31.4	53.8 $\pm$ 30.3
1000 execs in s / relQ	0.97 / 40.99	1.63 / 31.44	2.28 / 42.89	4.9 / 18.99
Forgy 10x	52.6 $\pm$ 33.2	49.3 $\pm$ 26.7	64.6 $\pm$ 27.4	51.3 $\pm$ 25.2
1000 execs in s / relQ	0.96 / 45.82	1.63 / 29.53	2.28 / 42.94	4.91 / 18.29
MacQueen 10x	50.2 $\pm$ 31.2	50.5 $\pm$ 27	65.2 $\pm$ 28.9	54 $\pm$ 29.2
1000 execs in s / relQ	0.95 / 44.05	1.57 / 29.42	2.19 / 42.68	4.59 / 19.07
Hartigan-Wong 20x	2.7 $\pm$ 3.5	3.3 $\pm$ 3.3	3.7 $\pm$ 2.8	2.5 $\pm$ 2.3
1000 execs in s / relQ	1.33 / 3.24	2.07 / 2.84	2.81 / 3.58	5.64 / 1.82
Lloyd 20x	4.5 $\pm$ 4.9	2.9 $\pm$ 3	5 $\pm$ 3.9	2.3 $\pm$ 2.9
1000 execs in s / relQ	1.16 / 4.61	1.91 / 2.61	2.67 / 4.13	5.81 / 1.78
Forgy 20x	4 $\pm$ 4.5	3 $\pm$ 3	4.6 $\pm$ 3.9	3.1 $\pm$ 3.1
1000 execs in s / relQ	1.16 / 4.22	1.91 / 2.66	2.66 / 3.87	5.82 / 2.06
MacQueen 20x	2.7 $\pm$ 3.3	2.9 $\pm$ 2.9	3.8 $\pm$ 3.8	2.3 $\pm$ 2.5
1000 execs in s / relQ	1.11 / 3.2	1.81 / 2.61	2.51 / 3.33	5.2 / 1.75
cmeans Fuzzy	23.9 $\pm$ 33.9	70.7 $\pm$ 117.4	82.4 $\pm$ 87.2	55.3 $\pm$ 50.3
1000 execs in s / relQ	0.92 / 22.94	1.45 / 39.24	2 / 62.64	3.94 / 18.83
ufcl Fuzzy	813.5 $\pm$ 24.8	808.6 $\pm$ 15.9	805.2 $\pm$ 18.5	825.3 $\pm$ 20.2
1000 execs in s / relQ	1.39 / 4980.44	2.5 / 12524.72	3.57 / 5516.03	7.71 / 1798.34
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.32 / 1	0.68 / 1	1.27 / 1	6.13 / 1
kmeans++	0.8 $\pm$ 1.3	0.6 $\pm$ 0.8	0.6 $\pm$ 0.7	1.9 $\pm$ 1.3
1000 execs in s / relQ	6.57 / 1.61	14.13 / 1.32	21.48 / 1.43	49.4 / 1.73
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	13.09 / 1	28.24 / 1	42.93 / 1	98.78 / 1

$breakgap =$	1	2	4	8
Hartigan-Wong 1x	727.8 $\pm$ 47.5	713.8 $\pm$ 35.1	697.3 $\pm$ 69.2	663.3 $\pm$ 42.7
1000 execs in s / relQ	0.36 / 1312.52	0.36 / 710.88	0.37 / 133.68	0.36 / 55.72
Lloyd 1x	735.7 $\pm$ 42.4	728 $\pm$ 35.7	717.7 $\pm$ 29.6	690.8 $\pm$ 32.5
1000 execs in s / relQ	0.35 / 1469.84	0.35 / 720.23	0.37 / 143.39	0.36 / 61.49
Forgy 1x	733.1 $\pm$ 39.7	729.8 $\pm$ 27.5	721.2 $\pm$ 39.9	689.2 $\pm$ 22.3
1000 execs in s / relQ	0.35 / 1421.55	0.34 / 715.94	0.37 / 145.41	0.35 / 63.16
MacQueen 1x	736.1 $\pm$ 43.4	717.4 $\pm$ 35.9	711.9 $\pm$ 51.5	684.9 $\pm$ 29.1
1000 execs in s / relQ	0.35 / 1440.8	0.34 / 764.85	0.36 / 140.76	0.34 / 63.32
Hartigan-Wong 10x	52.6 $\pm$ 25.3	36.7 $\pm$ 14.8	31.4 $\pm$ 21.2	20.9 $\pm$ 12.3
1000 execs in s / relQ	1.01 / 22.4	0.98 / 5.98	1.06 / 2.18	1 / 1.23
Lloyd 10x	49.5 $\pm$ 22	39.7 $\pm$ 17.1	41.4 $\pm$ 25.1	24.8 $\pm$ 4.7
1000 execs in s / relQ	0.92 / 21.14	0.91 / 5.92	0.99 / 2.62	0.96 / 1.29
Forgy 10x	51.4 $\pm$ 22.1	42.6 $\pm$ 14.9	42.3 $\pm$ 20.6	26.6 $\pm$ 8.9
1000 execs in s / relQ	0.93 / 22.14	0.91 / 6.52	1 / 2.65	0.96 / 1.3
MacQueen 10x	50.4 $\pm$ 25.9	39.2 $\pm$ 17.6	40.9 $\pm$ 20.5	20.7 $\pm$ 6.9
1000 execs in s / relQ	0.9 / 21.57	0.88 / 5.86	0.95 / 2.61	0.89 / 1.24
Hartigan-Wong 20x	3.4 $\pm$ 2.6	1.6 $\pm$ 1.9	1.7 $\pm$ 2.9	0.9 $\pm$ 1.5
1000 execs in s / relQ	1.27 / 2.32	1.25 / 1.19	1.35 / 1.06	1.27 / 1.01
Lloyd 20x	2.8 $\pm$ 2.7	1.8 $\pm$ 1.5	1.7 $\pm$ 2.2	0.7 $\pm$ 0.8
1000 execs in s / relQ	1.11 / 2.09	1.11 / 1.23	1.22 / 1.06	1.21 / 1.01
Forgy 20x	3.1 $\pm$ 2.4	3.4 $\pm$ 1.6	2.8 $\pm$ 3.2	0.9 $\pm$ 0.9
1000 execs in s / relQ	1.11 / 2.19	1.12 / 1.42	1.21 / 1.11	1.21 / 1.01
MacQueen 20x	2.4 $\pm$ 2.5	2.2 $\pm$ 2.5	1.1 $\pm$ 1.9	0.3 $\pm$ 0.5
1000 execs in s / relQ	1.07 / 1.92	1.05 / 1.26	1.13 / 1.04	1.06 / 1
cmeans Fuzzy	57.2 $\pm$ 67.9	159.8 $\pm$ 153.7	171.5 $\pm$ 109.1	247.8 $\pm$ 128.8
1000 execs in s / relQ	0.89 / 25.26	0.9 / 26.38	0.98 / 7.94	0.93 / 4.18
ufcl Fuzzy	800.8 $\pm$ 29.2	801.4 $\pm$ 16.7	853.2 $\pm$ 34.1	901.3 $\pm$ 36.2
1000 execs in s / relQ	1.32 / 1989.2	1.3 / 1645.95	1.4 / 289.28	1.31 / 105.34
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.3 / 1	0.3 / 1	0.32 / 1	0.3 / 1
kmeans++	1.6 $\pm$ 1.1	4.9 $\pm$ 3.2	12.1 $\pm$ 6.5	47.5 $\pm$ 19.4
1000 execs in s / relQ	6.1 / 1.68	5.95 / 1.66	6.54 / 1.49	6.03 / 1.57
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	3.2 $\pm$ 2.9
1000 execs in s / relQ	12.12 / 1	11.82 / 1	13.08 / 1	11.96 / 1.03

$k =$	2	4	8	16
Hartigan-Wong 1x	0 $\pm$ 0	531 $\pm$ 93.8	941.3 $\pm$ 11.5	999.3 $\pm$ 0.8
1000 execs in s / relQ	0.25 / 1	0.33 / 1484.76	0.49 / 494624.56	0.81 / 52872221.3
Lloyd 1x	0 $\pm$ 0	520.8 $\pm$ 97.7	942.3 $\pm$ 11.7	999.4 $\pm$ 0.7
1000 execs in s / relQ	0.24 / 1	0.32 / 1492.16	0.48 / 502271.64	0.8 / 58530442.16
Forgy 1x	0 $\pm$ 0	524.9 $\pm$ 81.9	940 $\pm$ 12.2	998.8 $\pm$ 1.1
1000 execs in s / relQ	0.24 / 1	0.32 / 1548	0.48 / 520427.63	0.8 / 54534626.71
MacQueen 1x	0 $\pm$ 0	520.5 $\pm$ 82.6	945 $\pm$ 13.2	999.4 $\pm$ 0.5
1000 execs in s / relQ	0.24 / 1	0.32 / 1495.12	0.47 / 532067.28	0.79 / 57986428.7
Hartigan-Wong 10x	0 $\pm$ 0	2.6 $\pm$ 2.8	541 $\pm$ 62.9	992.9 $\pm$ 2.6
1000 execs in s / relQ	0.63 / 1	0.93 / 3.41	1.54 / 16469.01	2.86 / 7858388.63
Lloyd 10x	0 $\pm$ 0	3.3 $\pm$ 3.1	563.2 $\pm$ 68.1	994.1 $\pm$ 3
1000 execs in s / relQ	0.56 / 1	0.84 / 4.16	1.44 / 18712.34	2.77 / 8511396.87
Forgy 10x	0 $\pm$ 0	3.4 $\pm$ 3	558.3 $\pm$ 71.5	993.5 $\pm$ 3.5
1000 execs in s / relQ	0.56 / 1	0.84 / 4.2	1.44 / 17368.35	2.76 / 8325871.74
MacQueen 10x	0 $\pm$ 0	2.9 $\pm$ 2.5	551.2 $\pm$ 59.2	992.9 $\pm$ 2.7
1000 execs in s / relQ	0.56 / 1	0.83 / 3.82	1.4 / 17832.1	2.66 / 8382457.42
Hartigan-Wong 20x	0 $\pm$ 0	0 $\pm$ 0	299.9 $\pm$ 66.5	986.3 $\pm$ 3.9
1000 execs in s / relQ	0.84 / 1	1.18 / 1	1.94 / 4916.09	3.61 / 4501700.03
Lloyd 20x	0 $\pm$ 0	0 $\pm$ 0	322.2 $\pm$ 69.9	987 $\pm$ 4.4
1000 execs in s / relQ	0.69 / 1	1.01 / 1	1.74 / 5325.33	3.43 / 4989161.44
Forgy 20x	0 $\pm$ 0	0.1 $\pm$ 0.3	318.9 $\pm$ 69.3	987.4 $\pm$ 4.7
1000 execs in s / relQ	0.69 / 1	1 / 1.09	1.74 / 5318.02	3.43 / 4869931.56
MacQueen 20x	0 $\pm$ 0	0 $\pm$ 0	308.4 $\pm$ 68.3	986.3 $\pm$ 3.8
1000 execs in s / relQ	0.69 / 1	0.99 / 1	1.65 / 5183.37	3.27 / 4810210.65
cmeans Fuzzy	0 $\pm$ 0	32.6 $\pm$ 43.8	64.3 $\pm$ 74.2	53.7 $\pm$ 76.1
1000 execs in s / relQ	0.51 / 1	0.78 / 36.04	1.67 / 857.64	5.06 / 14553.66
ufcl Fuzzy	10.3 $\pm$ 8.1	637.2 $\pm$ 22.1	958.8 $\pm$ 8.6	999.4 $\pm$ 1
1000 execs in s / relQ	0.61 / 1.27	1.1 / 3550.22	2.78 / 899036.18	8.83 / 104163506.23
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.16 / 1	0.27 / 1	0.57 / 1	1.48 / 1
kmeans++	0 $\pm$ 0	0.6 $\pm$ 0.7	0.2 $\pm$ 0.4	0 $\pm$ 0
1000 execs in s / relQ	0.89 / 1	4.1 / 1.81	21.14 / 3.59	115.57 / 1
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	1.77 / 1	8.19 / 1	42.03 / 1	230.99 / 1

$d =$	2	4	8	16
Hartigan-Wong 1x	707.1 $\pm$ 50.5	744.6 $\pm$ 38.5	753.3 $\pm$ 21.4	764.7 $\pm$ 19.4
1000 execs in s / relQ	0.36 / 8483.66	0.45 / 3683.63	0.62 / 1761.1	0.97 / 2022.89
Lloyd 1x	717 $\pm$ 52.5	760.8 $\pm$ 39.9	753.4 $\pm$ 18.8	770.1 $\pm$ 16.8
1000 execs in s / relQ	0.35 / 8896.63	0.43 / 3873.14	0.61 / 1802.6	0.95 / 2088.42
Forgy 1x	711.6 $\pm$ 53.1	755.1 $\pm$ 36.5	756 $\pm$ 16.9	765.2 $\pm$ 25.3
1000 execs in s / relQ	0.35 / 8982.63	0.43 / 3741.16	0.61 / 1795.73	0.95 / 2105.67
MacQueen 1x	720.5 $\pm$ 44.3	753 $\pm$ 35.8	748.9 $\pm$ 20.4	769.5 $\pm$ 25
1000 execs in s / relQ	0.35 / 8986.52	0.43 / 3875.21	0.61 / 1797.93	0.95 / 2104.26
Hartigan-Wong 10x	38.9 $\pm$ 20.8	53.8 $\pm$ 24.6	59.8 $\pm$ 13.3	70.7 $\pm$ 23.2
1000 execs in s / relQ	1.07 / 87.64	1.4 / 67.03	2.02 / 76.71	3.27 / 106.07
Lloyd 10x	48 $\pm$ 23.2	65.4 $\pm$ 25.9	62.8 $\pm$ 12.6	81.5 $\pm$ 24.7
1000 execs in s / relQ	0.97 / 106.61	1.31 / 81.6	1.92 / 81.15	3.17 / 122.66
Forgy 10x	43.9 $\pm$ 21.5	68.1 $\pm$ 26.1	64.2 $\pm$ 12.5	82.4 $\pm$ 20.5
1000 execs in s / relQ	0.97 / 98.81	1.3 / 85.49	1.91 / 82.4	3.16 / 124.99
MacQueen 10x	43.6 $\pm$ 20.8	64.6 $\pm$ 23.6	64.5 $\pm$ 12.1	80.5 $\pm$ 23.7
1000 execs in s / relQ	1.35 / 99.1	1.29 / 81.23	1.89 / 83.85	3.12 / 121.61
Hartigan-Wong 20x	2.3 $\pm$ 2.4	3.3 $\pm$ 2.7	4.2 $\pm$ 3.3	5.1 $\pm$ 3.3
1000 execs in s / relQ	1.35 / 5.9	1.71 / 4.99	2.37 / 6.34	3.72 / 8.42
Lloyd 20x	2.9 $\pm$ 2.7	4.2 $\pm$ 3.5	4.3 $\pm$ 2.9	6.1 $\pm$ 3.1
1000 execs in s / relQ	1.15 / 7.34	1.51 / 6.08	2.17 / 6.42	3.52 / 10.17
Forgy 20x	1.6 $\pm$ 1.6	5.9 $\pm$ 3.9	3.2 $\pm$ 0.9	7.5 $\pm$ 3.1
1000 execs in s / relQ	1.15 / 4.41	1.51 / 8.15	2.17 / 4.98	3.51 / 12.02
MacQueen 20x	3.4 $\pm$ 2.6	5.7 $\pm$ 3.2	2.9 $\pm$ 1.7	8.2 $\pm$ 5.3
1000 execs in s / relQ	1.13 / 8.21	1.48 / 7.92	2.13 / 4.57	3.43 / 12.97
cmeans Fuzzy	41.5 $\pm$ 65.6	10.6 $\pm$ 18	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.97 / 92.62	1.04 / 14.1	1.21 / 1	1.58 / 1
ufcl Fuzzy	784.6 $\pm$ 20.6	805.6 $\pm$ 15.7	800.3 $\pm$ 19.4	814.9 $\pm$ 13.6
1000 execs in s / relQ	1.46 / 27141.38	1.86 / 4984.43	2.66 / 1940.95	4.39 / 2223.44
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.33 / 1	0.41 / 1	0.56 / 1	0.87 / 1
kmeans++	0.4 $\pm$ 0.7	0.5 $\pm$ 0.7	0.7 $\pm$ 0.8	0.5 $\pm$ 0.7
1000 execs in s / relQ	6.89 / 2	7.21 / 1.64	7.44 / 1.96	7.98 / 1.8
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	14.03 / 1	14.3 / 1	14.62 / 1	15.45 / 1

$M, m =$	15, 9	30, 9	60, 9	120, 9
Hartigan-Wong 1x	709.6 $\pm$ 40.7	778.2 $\pm$ 20.9	789 $\pm$ 36.6	765.3 $\pm$ 42.5
1000 execs in s / relQ	0.52 / 9331.26	0.51 / 9154.2	0.56 / 11397.3	0.68 / 58999.38
Lloyd 1x	730.6 $\pm$ 38.3	781.8 $\pm$ 19.6	786.1 $\pm$ 35.1	780.7 $\pm$ 49.8
1000 execs in s / relQ	0.5 / 10181.63	0.49 / 9648.81	0.54 / 11275.24	0.67 / 58343.78
Forgy 1x	723.8 $\pm$ 38.2	786.5 $\pm$ 28	792.6 $\pm$ 37	778 $\pm$ 42.6
1000 execs in s / relQ	0.49 / 10529.17	0.49 / 9641	0.55 / 11645.35	0.68 / 58111.99
MacQueen 1x	733.6 $\pm$ 32.4	780.2 $\pm$ 24.7	784.3 $\pm$ 38.6	767.7 $\pm$ 49.8
1000 execs in s / relQ	0.49 / 10804.58	0.48 / 10006.28	0.53 / 11472.56	0.63 / 54103.2
Hartigan-Wong 10x	38.3 $\pm$ 23.1	87.6 $\pm$ 28.5	96.1 $\pm$ 57.6	89.6 $\pm$ 44.1
1000 execs in s / relQ	1.63 / 81.12	2.15 / 239.35	3.04 / 500.28	6.44 / 602.73
Lloyd 10x	48.1 $\pm$ 25.7	87.2 $\pm$ 25	96.7 $\pm$ 46.4	79.1 $\pm$ 40.7
1000 execs in s / relQ	1.47 / 105.41	2.01 / 239.14	2.93 / 509.78	6.5 / 582.16
Forgy 10x	47.9 $\pm$ 19.8	86 $\pm$ 25.5	102.1 $\pm$ 45.6	89.9 $\pm$ 40.4
1000 execs in s / relQ	1.46 / 104.24	2 / 233.97	2.93 / 546.54	6.54 / 653.92
MacQueen 10x	46.6 $\pm$ 22.5	95.3 $\pm$ 26.4	100.8 $\pm$ 58.3	87 $\pm$ 45.5
1000 execs in s / relQ	1.42 / 102.86	1.96 / 259.02	2.8 / 522.96	6.13 / 556.78
Hartigan-Wong 20x	1.6 $\pm$ 3.1	8.2 $\pm$ 5.3	12.7 $\pm$ 15	7.7 $\pm$ 6.8
1000 execs in s / relQ	1.99 / 4.26	2.62 / 23.24	3.62 / 65.01	7.55 / 35.06
Lloyd 20x	2.7 $\pm$ 3.5	7 $\pm$ 4.1	13.3 $\pm$ 15.1	10 $\pm$ 8.3
1000 execs in s / relQ	1.74 / 6.54	2.35 / 19.65	3.41 / 67.85	7.72 / 59.9
Forgy 20x	2.6 $\pm$ 3.3	7.6 $\pm$ 4.1	12.7 $\pm$ 14.5	8.2 $\pm$ 5.9
1000 execs in s / relQ	1.74 / 6.32	2.34 / 21.5	3.42 / 66.49	7.81 / 39.85
MacQueen 20x	3.2 $\pm$ 2.9	8.2 $\pm$ 5.6	12.9 $\pm$ 14.1	9.2 $\pm$ 7.1
1000 execs in s / relQ	1.71 / 7.68	2.27 / 22.95	3.23 / 66.18	6.96 / 42.72
cmeans Fuzzy	43.1 $\pm$ 85.3	29.8 $\pm$ 59.5	4.8 $\pm$ 15.2	29.9 $\pm$ 54.3
1000 execs in s / relQ	1.42 / 89.96	1.84 / 66.83	2.52 / 23.17	5.23 / 164.93
ufcl Fuzzy	786.1 $\pm$ 26.5	815.4 $\pm$ 24.5	825.9 $\pm$ 32.1	830 $\pm$ 33.3
1000 execs in s / relQ	2.25 / 22096.93	3.22 / 13196.22	4.73 / 17181.09	10.65 / 83971.71
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.49 / 1	0.82 / 1	1.48 / 1	7.58 / 1
kmeans++	0.6 $\pm$ 0.7	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	10.72 / 2.4	17.33 / 1	27.18 / 1	65.24 / 1
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	21.64 / 1	34.62 / 1	54.08 / 1	129.98 / 1

$M, m =$	15, 9	30, 18	60, 36	120, 72
Hartigan-Wong 1x	739.4 $\pm$ 28.1	741 $\pm$ 44	745.7 $\pm$ 70	713 $\pm$ 64.6
1000 execs in s / relQ	0.37 / 42678.96	0.38 / 20676.98	0.42 / 5466.12	0.49 / 138752.72
Lloyd 1x	734.4 $\pm$ 28.5	741.4 $\pm$ 48.9	743.5 $\pm$ 69.7	710.4 $\pm$ 70.6
1000 execs in s / relQ	0.35 / 46792.33	0.37 / 20423.87	0.41 / 5537.44	0.49 / 151687.77
Forgy 1x	734.1 $\pm$ 34.3	745 $\pm$ 41.4	751.3 $\pm$ 74.9	713.5 $\pm$ 69.7
1000 execs in s / relQ	0.35 / 46112.12	0.36 / 20150.75	0.41 / 5548.93	0.49 / 144663.49
MacQueen 1x	738.8 $\pm$ 33.8	739.2 $\pm$ 49.6	743.6 $\pm$ 77.5	710.9 $\pm$ 66.3
1000 execs in s / relQ	0.35 / 45493.36	0.36 / 19576.95	0.4 / 5587.02	0.46 / 137172.51
Hartigan-Wong 10x	49 $\pm$ 19.1	57 $\pm$ 24.6	59.9 $\pm$ 21.6	48.6 $\pm$ 33.5
1000 execs in s / relQ	1.12 / 145.96	1.65 / 191.09	2.8 / 192.09	5 / 122.97
Lloyd 10x	54.2 $\pm$ 22.3	55.4 $\pm$ 29	64.7 $\pm$ 22.2	41.3 $\pm$ 30.7
1000 execs in s / relQ	1.02 / 157.45	1.55 / 186.28	2.75 / 206.05	5.02 / 103.98
Forgy 10x	56 $\pm$ 24.3	59.6 $\pm$ 27.2	66.9 $\pm$ 26.2	43.9 $\pm$ 36.1
1000 execs in s / relQ	1.02 / 205.13	1.55 / 204.96	2.74 / 214.45	5.02 / 113.15
MacQueen 10x	52.1 $\pm$ 21.9	57.2 $\pm$ 28.4	62 $\pm$ 27.3	42.2 $\pm$ 29.9
1000 execs in s / relQ	1.01 / 147.45	1.51 / 190.38	2.61 / 199.76	4.76 / 106.24
Hartigan-Wong 20x	2.8 $\pm$ 3.2	3.4 $\pm$ 2.5	4.8 $\pm$ 3.5	2.8 $\pm$ 3
1000 execs in s / relQ	1.4 / 8.73	1.99 / 12.17	3.3 / 16.17	5.82 / 7.98
Lloyd 20x	3 $\pm$ 2.2	4.8 $\pm$ 5.8	5.5 $\pm$ 2.6	3.5 $\pm$ 4.7
1000 execs in s / relQ	1.22 / 9.45	1.81 / 17.56	3.23 / 18.34	5.89 / 9.85
Forgy 20x	3 $\pm$ 3	5.6 $\pm$ 6.4	5.2 $\pm$ 3.2	2.4 $\pm$ 3.5
1000 execs in s / relQ	1.22 / 9.35	1.81 / 20.08	3.22 / 17.34	5.9 / 6.94
MacQueen 20x	3.3 $\pm$ 3.9	3.4 $\pm$ 2.6	4.8 $\pm$ 2.3	2.9 $\pm$ 3.7
1000 execs in s / relQ	1.19 / 10.02	1.74 / 12.45	2.98 / 16.03	5.39 / 8.23
cmeans Fuzzy	89.6 $\pm$ 106.4	46.4 $\pm$ 112.3	19 $\pm$ 36.1	122.4 $\pm$ 127.9
1000 execs in s / relQ	0.98 / 250.43	1.38 / 136.32	2.22 / 58.89	4.13 / 313.25
ufcl Fuzzy	786 $\pm$ 20.1	806.5 $\pm$ 21.9	795.7 $\pm$ 20.3	801.8 $\pm$ 33
1000 execs in s / relQ	1.51 / 77257.56	2.38 / 33556.99	4.29 / 16111.26	8 / 232714.84
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.35 / 1	0.66 / 1	1.91 / 1	7.33 / 1
kmeans++	0.2 $\pm$ 0.4	0.3 $\pm$ 0.7	0.2 $\pm$ 0.4	0.6 $\pm$ 0.7
1000 execs in s / relQ	7.47 / 1.85	13.46 / 2.07	26.66 / 1.67	51.85 / 2.41
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	14.85 / 1	26.82 / 1	53.17 / 1	103.47 / 1

$breakgap =$	1	2	4	8
Hartigan-Wong 1x	737.8 $\pm$ 36.7	704.8 $\pm$ 62.6	712.5 $\pm$ 62.9	711.6 $\pm$ 41.7
1000 execs in s / relQ	0.49 / 14832.58	0.49 / 1527.1	0.47 / 874.07	0.47 / 221.09
Lloyd 1x	744.5 $\pm$ 29.1	724.8 $\pm$ 52.9	726.7 $\pm$ 53.6	731.8 $\pm$ 28.8
1000 execs in s / relQ	0.48 / 15121.58	0.47 / 1830.33	0.46 / 923.38	0.46 / 246.31
Forgy 1x	743.1 $\pm$ 30	717.1 $\pm$ 60.7	714.1 $\pm$ 48.2	729.5 $\pm$ 42
1000 execs in s / relQ	0.48 / 15136.66	0.48 / 1764	0.46 / 938.35	0.46 / 239.11
MacQueen 1x	736.7 $\pm$ 21.8	721.8 $\pm$ 58.6	721.6 $\pm$ 57.6	727.1 $\pm$ 31.7
1000 execs in s / relQ	0.48 / 14539.53	0.48 / 1824.64	0.48 / 946.43	0.45 / 235.61
Hartigan-Wong 10x	50.5 $\pm$ 19	36.2 $\pm$ 17.4	39.1 $\pm$ 20	44.4 $\pm$ 17.4
1000 execs in s / relQ	1.56 / 99.45	1.5 / 22.34	1.44 / 6.98	1.43 / 3.32
Lloyd 10x	51.8 $\pm$ 14.8	43.8 $\pm$ 25	45.4 $\pm$ 28.5	49.9 $\pm$ 19.2
1000 execs in s / relQ	1.41 / 101.54	1.39 / 27.1	1.32 / 7.92	1.34 / 3.57
Forgy 10x	55.2 $\pm$ 15.7	42.9 $\pm$ 20.1	43.3 $\pm$ 26.1	47.7 $\pm$ 20.2
1000 execs in s / relQ	1.41 / 108.27	1.36 / 26.29	1.32 / 7.58	1.33 / 3.49
MacQueen 10x	48.1 $\pm$ 19.2	39.6 $\pm$ 21.8	40.5 $\pm$ 25.7	41.8 $\pm$ 19.2
1000 execs in s / relQ	1.39 / 94.4	1.33 / 24.45	1.29 / 7.16	1.29 / 3.26
Hartigan-Wong 20x	1.9 $\pm$ 1.6	2.4 $\pm$ 1.9	1.8 $\pm$ 2.3	2.1 $\pm$ 2.1
1000 execs in s / relQ	1.96 / 4.68	1.85 / 2.4	1.8 / 1.27	1.8 / 1.11
Lloyd 20x	2.5 $\pm$ 2.3	2.9 $\pm$ 2.9	2.2 $\pm$ 3	2.4 $\pm$ 2.6
1000 execs in s / relQ	1.7 / 5.76	1.65 / 2.7	1.59 / 1.35	1.61 / 1.12
Forgy 20x	2.9 $\pm$ 2.6	2.5 $\pm$ 2	2.5 $\pm$ 3.1	2.5 $\pm$ 3.2
1000 execs in s / relQ	1.68 / 6.6	1.63 / 2.46	1.61 / 1.39	1.61 / 1.13
MacQueen 20x	2.9 $\pm$ 2	2.7 $\pm$ 2.7	2.4 $\pm$ 2.7	3.5 $\pm$ 3
1000 execs in s / relQ	1.65 / 6.61	1.56 / 2.58	1.53 / 1.36	1.52 / 1.18
cmeans Fuzzy	48 $\pm$ 90	49.6 $\pm$ 85.4	132 $\pm$ 140.6	146.3 $\pm$ 104.9
1000 execs in s / relQ	1.38 / 95.98	1.35 / 27.62	1.31 / 23.67	1.34 / 8.83
ufcl Fuzzy	784.5 $\pm$ 13.2	775.7 $\pm$ 18.6	800.6 $\pm$ 21.1	832 $\pm$ 34.2
1000 execs in s / relQ	2.17 / 27243.69	2.04 / 3671.76	2 / 1786.46	2 / 363.3
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.48 / 1	0.44 / 1	0.43 / 1	0.44 / 1
kmeans++	0.3 $\pm$ 0.7	0.9 $\pm$ 0.9	4.7 $\pm$ 2.4	12.1 $\pm$ 4.3
1000 execs in s / relQ	10.38 / 1.56	9.58 / 1.63	9.39 / 1.84	9.39 / 1.67
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0.2 $\pm$ 0.4
1000 execs in s / relQ	20.91 / 1	18.95 / 1	18.75 / 1	18.72 / 1.01

$𝔭$ =	0.1	0.2	0.3	0.4
Hartigan-Wong 1x	709.6 $\pm$ 40.7	739.7 $\pm$ 37.2	727.8 $\pm$ 40	727.3 $\pm$ 30.9
1000 execs in s / relQ	0.51 / 9331.26	0.53 / 6214.92	0.52 / 14411.23	0.49 / 11570.51
Lloyd 1x	730.6 $\pm$ 38.3	754.2 $\pm$ 37.9	725.1 $\pm$ 35.1	745.7 $\pm$ 37.2
1000 execs in s / relQ	0.49 / 10181.63	0.51 / 6489.79	0.51 / 14022.92	0.47 / 11998.16
Forgy 1x	723.8 $\pm$ 38.2	749.9 $\pm$ 46.6	725.1 $\pm$ 37	736 $\pm$ 30.3
1000 execs in s / relQ	0.49 / 10529.17	0.51 / 6483.41	0.51 / 15049.05	0.47 / 12702.78
MacQueen 1x	733.6 $\pm$ 32.4	745.6 $\pm$ 40.3	728 $\pm$ 37	737.4 $\pm$ 31.9
1000 execs in s / relQ	0.49 / 10804.58	0.5 / 6239.37	0.5 / 15511.64	0.48 / 12367.06
Hartigan-Wong 10x	38.3 $\pm$ 23.1	55.4 $\pm$ 21.6	47.5 $\pm$ 24.9	48.1 $\pm$ 12.9
1000 execs in s / relQ	1.57 / 81.12	1.72 / 131.9	1.65 / 133.48	1.6 / 158.6
Lloyd 10x	48.1 $\pm$ 25.7	56.7 $\pm$ 25.1	48.2 $\pm$ 23.9	55.3 $\pm$ 21.6
1000 execs in s / relQ	1.49 / 105.41	1.55 / 133.51	1.51 / 134.32	1.47 / 177.88
Forgy 10x	47.9 $\pm$ 19.8	57.8 $\pm$ 21.5	46.7 $\pm$ 23.7	52.8 $\pm$ 20.9
1000 execs in s / relQ	1.5 / 104.24	1.53 / 136.46	1.5 / 130.72	1.45 / 170.55
MacQueen 10x	46.6 $\pm$ 22.5	59.9 $\pm$ 19.6	44.5 $\pm$ 23.2	51.3 $\pm$ 20.2
1000 execs in s / relQ	1.49 / 102.86	1.52 / 142.14	1.49 / 124.33	1.45 / 164.45
Hartigan-Wong 20x	1.6 $\pm$ 3.1	2.5 $\pm$ 2.1	2.2 $\pm$ 2.9	1.9 $\pm$ 2.1
1000 execs in s / relQ	2.02 / 4.26	2.09 / 6.75	2.05 / 6.96	2 / 6.88
Lloyd 20x	2.7 $\pm$ 3.5	4.1 $\pm$ 2.6	1.6 $\pm$ 2.5	3.3 $\pm$ 2.3
1000 execs in s / relQ	1.78 / 6.54	1.83 / 10.53	1.78 / 5.15	1.72 / 11.47
Forgy 20x	2.6 $\pm$ 3.3	3.9 $\pm$ 2.8	1.7 $\pm$ 2.2	3.9 $\pm$ 2.8
1000 execs in s / relQ	1.76 / 6.32	1.84 / 10.02	1.79 / 5.56	1.73 / 13.36
MacQueen 20x	3.2 $\pm$ 2.9	4.8 $\pm$ 3.4	2.8 $\pm$ 3.8	2.4 $\pm$ 2.6
1000 execs in s / relQ	1.71 / 7.68	1.83 / 11.97	1.73 / 8.47	1.68 / 8.64
cmeans Fuzzy	43.1 $\pm$ 85.3	21.7 $\pm$ 37.8	41.7 $\pm$ 93.2	37.3 $\pm$ 80
1000 execs in s / relQ	1.4 / 89.96	1.48 / 54.06	1.44 / 116.27	1.4 / 116.36
ufcl Fuzzy	786.1 $\pm$ 26.5	796.3 $\pm$ 18	773.8 $\pm$ 22.8	772.9 $\pm$ 24.4
1000 execs in s / relQ	2.21 / 22096.93	2.39 / 11322.8	2.3 / 25817.26	2.25 / 31905.38
single link	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	0.49 / 1	0.54 / 1	0.51 / 1	0.5 / 1
kmeans++	0.6 $\pm$ 0.7	0 $\pm$ 0	0.4 $\pm$ 0.5	0.1 $\pm$ 0.3
1000 execs in s / relQ	10.55 / 2.4	11.75 / 1	11.07 / 2.6	10.98 / 1.31
kmeans++ 2x	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
1000 execs in s / relQ	21.23 / 1	23.47 / 1	22.01 / 1	21.63 / 1

Equations

Q(C) = \sum_{i=1}^{m} \sum_{j=1}^{k} u_{ij} \| x_{i} - \mu_{j} \|^{2} = \sum_{j=1}^{k} \frac{1}{n_{j}} \sum_{x_{i}, x_{l} \in C_{j}} \| x_{i} - x_{l} \|^{2}
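The two forms of the $k$-means cost above can be compared numerically. The following is a minimal sketch, not the author's code: the helper names and the toy data are hypothetical, and it assumes that the pairwise sum counts every unordered pair of a cluster once (if ordered pairs were meant, the factor would be $1/(2 n_{j})$ instead).

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def cost_centric(clusters):
    """Sum of squared distances of every point to its cluster centre."""
    total = 0.0
    for c in clusters:
        mu = centroid(c)
        total += sum(sq_dist(x, mu) for x in c)
    return total


def cost_pairwise(clusters):
    """Same cost from within-cluster pairwise distances.

    Assumption: each unordered pair is counted once and divided by n_j.
    """
    total = 0.0
    for c in clusters:
        total += sum(sq_dist(a, b) for a, b in combinations(c, 2)) / len(c)
    return total


# Hypothetical toy data: two clusters in the plane.
clusters = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(10.0, 10.0), (12.0, 10.0)],
]
print(cost_centric(clusters), cost_pairwise(clusters))
```

The pairwise form makes visible that the cost depends on cluster cardinalities $n_{j}$ as well as on distances, which is the property the paper relies on when arguing that gaps alone do not determine the optimum.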

V_{d} \leq \frac{\frac{x_{1}^{2} \cdot (n_{1} + n_{1}^{2} / n_{2} + n_{1} + n_{2})}{n_{1} + n_{2}} \cdot (n_{1} + n_{2}) + \frac{x_{3}^{2} \cdot (n_{3} + n_{3}^{2} / n_{4} + n_{3} + n_{4})}{n_{3} + n_{4}} \cdot (n_{3} + n_{4})}{n_{A}}

V_{d} \leq \frac{x_{1}^{2} \cdot (n_{1} + n_{1}^{2} / n_{2} + n_{1} + n_{2}) + x_{3}^{2} \cdot (n_{3} + n_{3}^{2} / n_{4} + n_{3} + n_{4})}{n_{A}}

V_{d} \leq (x_{1}^{2} \cdot (n_{1} + n_{1}^{2} \cdot (r - x_{1}) / n_{1} / x_{1} + n_{1} + n_{2}) + x_{3}^{2} \cdot (n_{3} + n_{3}^{2} \cdot (r - x_{3}) / n_{3} / x_{3} + n_{3} + n_{4})) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (2 \cdot n_{1} + n_{2}) + n_{1}^{2} \cdot (r - x_{1}) \cdot x_{1} / n_{1} + x_{3}^{2} \cdot (2 \cdot n_{3} + n_{4}) + n_{3}^{2} \cdot (r - x_{3}) \cdot x_{3} / n_{3}) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (2 \cdot n_{1} + n_{2}) + n_{1} \cdot (r - x_{1}) \cdot x_{1} + x_{3}^{2} \cdot (2 \cdot n_{3} + n_{4}) + n_{3} \cdot (r - x_{3}) \cdot x_{3}) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (n_{1} + n_{2}) + n_{1} \cdot r \cdot x_{1} + x_{3}^{2} \cdot (n_{3} + n_{4}) + n_{3} \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (n_{1} + n_{2}) + n_{1} \cdot r \cdot x_{1} + (x_{1} \cdot (n_{1} + n_{2}) / (n_{3} + n_{4}))^{2} \cdot (n_{3} + n_{4}) + n_{3} \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (n_{1} + n_{2}) + n_{1} \cdot r \cdot x_{1} + x_{1}^{2} \cdot (n_{1} + n_{2})^{2} / (n_{3} + n_{4}) + n_{3} \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (n_{1} + n_{2}) \cdot n_{A} / (n_{3} + n_{4}) + n_{1} \cdot r \cdot x_{1} + n_{3} \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{1}^{2} \cdot (n_{1} + n_{2}) \cdot n_{A} / (n_{3} + n_{4}) + (n_{1} + n_{2}) \cdot r \cdot x_{1} + (n_{3} + n_{4}) \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq ((x_{3} \cdot (n_{3} + n_{4}) / (n_{1} + n_{2}))^{2} \cdot (n_{1} + n_{2}) \cdot n_{A} / (n_{3} + n_{4}) + 2 \cdot (n_{3} + n_{4}) \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{3}^{2} \cdot (n_{3} + n_{4})^{2} / (n_{1} + n_{2}) \cdot n_{A} / (n_{3} + n_{4}) + 2 \cdot (n_{3} + n_{4}) \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{3}^{2} \cdot (n_{3} + n_{4})^{2} \cdot r / x_{3} / (n_{3} + n_{4}) \cdot n_{A} / (n_{3} + n_{4}) + 2 \cdot (n_{3} + n_{4}) \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq (x_{3} \cdot r \cdot n_{A} + 2 \cdot (n_{3} + n_{4}) \cdot r \cdot x_{3}) / n_{A}

V_{d} \leq x_{3} \cdot r \cdot (n_{A} + 2 \cdot (n_{3} + n_{4})) / n_{A}

V_{d} \leq 3 \cdot x_{3} \cdot r

x_{3} \geq V_{d} / 3 / r

SSC(P_{1}' \cup P_{6}') + SSC(P_{3}' \cup P_{7}') \leq SSC(C_{A}) - x_{3}^{2} n_{A} b + n_{B} x_{5}^{2}

Q(\{P_{1}' \cup P_{6}', P_{3}' \cup P_{7}'\}) = SSC(P_{1}' \cup P_{6}') + SSC(P_{3}' \cup P_{7}') \leq SSC(C_{A})

\leq SSC(C_{A}) + SSC(C_{B}) = Q(\{C_{A}, C_{B}\})

g \geq r_{max} k \frac{M + n}{m}

g \geq k r_{max} n_{p} /2 + n_{q} /2 + n /2 \frac{2 n}{n_{p} n_{q}}

g \geq r k (k + 1)

g \geq r k 2 k + k^{2}

\frac{(k - i) g^{2}}{(k - i) g^{2} + i r^{2}} \geq \frac{(k - i) r^{2} k^{2} (k + 1)^{2}}{(k - i) r^{2} k^{2} (k + 1)^{2} + i r^{2}}

= \frac{(k - i) k^{2} (k + 1)^{2}}{(k - i) k^{2} (k + 1)^{2} + i} \geq \frac{k^{2} (k + 1)^{2}}{k^{2} (k + 1)^{2} + (k - 1)}

PAS(k) \geq \left(\frac{k^{2} (k + 1)^{2}}{k^{2} (k + 1)^{2} + (k - 1)}\right)^{k - 1}

(1 - PAS(k))^{R} \leq \left(1 - \left(\frac{k^{2} (k + 1)^{2}}{k^{2} (k + 1)^{2} + (k - 1)}\right)^{k - 1}\right)^{R} < 1 - Pr_{succ}

R \geq \frac{\log(1 - Pr_{succ})}{\log\left(1 - \left(\frac{k^{2} (k + 1)^{2}}{k^{2} (k + 1)^{2} + (k - 1)}\right)^{k - 1}\right)}
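The last bound can be evaluated directly for concrete values. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, $PAS(k)$ is read as the per-run probability that seeding succeeds, $R$ as the number of independent repetitions, and $Pr_{succ}$ as the desired overall success probability.

```python
import math


def pas_lower_bound(k):
    """Lower bound on PAS(k) from the inequality above."""
    a = k ** 2 * (k + 1) ** 2
    return (a / (a + (k - 1))) ** (k - 1)


def min_restarts(k, pr_succ):
    """Smallest integer R with R >= log(1 - pr_succ) / log(1 - bound).

    Assumes 0 < pr_succ < 1. If the bound equals 1 (k = 1), one run suffices.
    """
    p = pas_lower_bound(k)
    if p >= 1.0:
        return 1
    ratio = math.log(1.0 - pr_succ) / math.log(1.0 - p)
    return max(1, math.ceil(ratio))


# Example: k = 2 clusters and a target success probability of 0.99.
print(pas_lower_bound(2), min_restarts(2, 0.99))
```

For $k=2$ the bound is $36/37$, so a single run fails with probability at most $1/37$ and two runs already push the failure probability below $0.01$.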

Full text

An Aposteriorical Clusterability Criterion for $k$-Means++ and Simplicity of Clustering - Extended Version

Mieczysław A. Kłopotek

Institute of Computer Science of the Polish Academy of Sciences

ul. Jana Kazimierza 5, 01-248 Warszawa Poland

Abstract

We define the notion of a well-clusterable data set combining the point of view of the objective of $k$ -means clustering algorithm (minimising the centric spread of data elements) and common sense (clusters shall be separated by gaps). We identify conditions under which the optimum of $k$ -means objective coincides with a clustering under which the data is separated by predefined gaps.

We investigate two cases: when the whole clusters are separated by some gap and when only the cores of the clusters meet some separation condition.

We overcome a major obstacle in using clusterability criteria: known approaches to clusterability checking are tied to the optimal clustering, which is NP-hard to identify.

Compared to other approaches to clusterability, the novelty consists in the possibility of an a posteriori (after running $k$-means) check of whether the data set is well-clusterable or not. Since the $k$-means algorithm applied for this purpose has polynomial complexity, so does the check. Additionally, if $k$-means++ fails to identify a clustering that meets the clusterability criteria, then with high probability the data is not well-clusterable.

1 Introduction

It is a commonly observed phenomenon that most clustering algorithms used in practice (like $k$-means) have a high theoretical computational complexity (the underlying problems are NP-hard), yet in many (though not all) practical applications they perform quite well (converge quickly enough), yielding more or less usable output. Apparently, then, some data sets are better clusterable than others.

Though a number of attempts have been made to capture formally the intuition behind clusterability, none of these efforts seems to have been successful, as Ben-David shows in depth in [10]. He points to three important shortcomings of the current state-of-the-art research results: clusterability cannot be checked prior to applying a potentially NP-hard clustering algorithm to the data, known clusterability criteria impose strong (impractical) separation constraints, and the research hardly addresses popular algorithms. A recent paper by Ackerman et al. [1] partially eliminates some of these problems, but regrettably at the expense of introducing user-defined parameters that do not seem to be intuitive (in terms of one's idea of what well-clusterable data are).

Therefore in this paper we try a different approach to defining what well-clusterable data are. As Ben-David mentioned, the research in the area does not address popular algorithms, except for the $\epsilon$-Separatedness clusterability criterion related to $k$-means proposed by Ostrovsky et al. [19]. We also want to contribute to the applicability of clusterability criteria in that, following Ostrovsky's example, we deal with $k$-means, and in particular with its special version called $k$-means++, as in fact Ostrovsky did. Furthermore, we share Ben-David's concern that shifting the NP-hardness from the data clustering algorithm to the clusterability checking algorithm is no solution, because the problem becomes even worse. Last but not least, we have to handle somehow the issue of the impractical gaps imposed by the clusterability criteria in the literature. Ben-David argues in [10] (see his Section 5) that the efforts in clusterability research to find support for the hypothesis that “clustering is hard only if the clustering does not matter” have apparently failed, mainly because the required gaps between clusters are too large for practical applications, as the popular algorithms behave reasonably even with significantly smaller gaps. But a closer look at the $k$-means objective shows (see Section 3) that it does not make sense to build a clusterability criterion solely on the grounds of the gaps, because the $k$-means criterion relies not only on the gaps between clusters, but also on their cardinalities and on their internal spread, which cannot be known in advance. Under changed cardinalities, $k$-means may prefer to split larger clusters and merge parts of them with smaller, though clearly separated, clusters. If the $k$-means criterion is to coincide with the separation of clusters, the gaps need to be large.

Therefore in our research, instead of seeking smaller gaps, we rather concentrate on redefining the goal of clusterability research efforts. We propose to change the perspective: instead of (or in addition to) seeking conditions for the easiness of clustering a given data set, let us look for a definition of a data set (data set generator) for which we know the optimal clustering in advance and for which some algorithm returns the ground truth nearly for sure. This change of perspective leads to a practical application of the clusterability concept: testing the algorithm's behaviour under varying degrees of violation of the clusterability conditions.

Having been freed from the need to seek the smallest possible gaps, we can also weaken the problems with the NP-hardness of establishing the clusterability of the data set. In particular, we do not require that one be able to say beforehand (before clustering) whether or not the data is well-clusterable. Instead we require that one be able to state a posteriori, in polynomial time, whether or not the data is well-clusterable according to the well-clusterability criteria that were assumed. Note that this is tremendous progress over the clusterability criteria defined so far. None of the clusterability criteria discussed by Ben-David [10] fits this requirement, and the criterion proposed in [1] can be shown to be invalid for simple data sets (see Section 3, Figure 1). We believe that in this way we resolve a serious bottleneck in clusterability research. We do not cover here the issue of measuring the deviation from well-separatedness, but we are convinced that by presenting a clusterability criterion that is verifiable ex post, and for which the data can be checked for clusterability by at least one popular algorithm, we open a way to attack this issue as well.

In this study we will restrict ourselves to the $k$-means family of algorithms. The restriction is not too serious, as the algorithms of this family are broadly used and there exists quite a large number of variants, from the early work of Lloyd, Forgy, MacQueen and Hartigan-Wong to $k$-means++, spherical $k$-means, their fuzzified versions, and many others (see e.g. Chapter 3 of [22] for a review of the $k$-means algorithm family).

Within the $k$ -means family we have to face the following challenges [10]:

• the clusterability criteria in the literature (e.g. [19]) refer to the optimal cost function value of $k$-means (see equation (1)), but the actual value of this optimal solution is not known

• people are accustomed to associating well-clusterable data with large gaps between clusters, but the optimal cost function of $k$-means is also influenced by cluster sizes, so that a gap sufficient for one set of clusters will prove insufficient for some other (see Section 3)

• the cost function of $k$-means usually has multiple local minima, and real-world $k$-means algorithms usually tend to get stuck at some local minimum (see e.g. [22, Chapter 3]).

For these reasons, when comparing the results of various $k$-means brands on real data, we have a hard time distilling the reason why their results differ: is it because the data are not clusterable, because the cost function optimum does not agree with the common-sense split into well-separated clusters, or because the algorithm is unable to discover the optimal clustering (systematically misses it)?

In order to enable making such distinctions, we decided to seek a clusterability criterion such that:

• the clusterability criterion is based on the gap size between clusters and on other cluster characteristics that can be computed by inspecting an obtained clustering (without referring to the optimal one),

• if the clustering obtained meets the clusterability criteria, then it is the real optimal clustering,

• if a special algorithm (here we mean $k$-means++) fails to find a clustering meeting our clusterability criteria, then with high probability the data is not well-clusterable at all by any algorithm,

• it is possible to generate a data set matching the clusterability conditions for various constellations of the cluster sizes (cardinality, spread), dimensionalities, numbers of clusters, etc.

With such a tool at our disposal, we can investigate an algorithm's capability to find the optimal clustering in the easy case, compare the performance of algorithms in the easy case, and then compare their relative performance when the clusterability property degenerates, for example when the size of the gap between clusters decreases.

In this research we confine ourselves to providing the tool in the form of the new clusterability criterion, and give only a small demonstration of how the degenerative behaviour of algorithms may be studied.

Our contribution encompasses:

• Two brands of well-clusterability criteria for data to be clustered via the $k$-means algorithm that can be verified ex post (both positively and negatively) without great computational burden (inequalities (2) and (3) in Section 4, and inequalities (15) and (16) in Section 5).

• A demonstration that the structure of well-clusterable data (according to these criteria) is easy to recover (see Theorems 1(i) and 5(i)).

• A demonstration that if a well-clusterable data structure (in that sense) was not discovered by $k$-means++, then with high probability there is no such structure in the data (see Theorems 1(ii) and 5(ii)).

• A demonstration that large gaps between data clusters are not sufficient to ensure well-clusterability by $k$-means (see Section 3).

The structure of this paper is as follows. In Section 2 we recall the previous work on the topic of clusterability and give a brief introduction to the $k$-means algorithm and its special case $k$-means++. In Section 3 we show that large gaps are not sufficient for well-clusterability. In Section 4 we introduce the first version of the well-clusterability concept and show that data well-clustered in this sense are easily learnable via $k$-means++. This concept has the drawback that no data points (outliers) can lie in the wide areas between the clusters. Therefore in Section 7 we propose a core-based well-clusterability concept and show that data well-clustered in this sense are also easily learnable via $k$-means++. The concept of a cluster core itself is introduced and investigated in Section 5, and a method for determining the proper gap size under these new conditions is derived in Section 6. In Section 8 we report some experimental results concerning the performance of various brands of $k$-means algorithms on data fulfilling the clusterability criteria proposed in this paper. The discussion section contains a brief comparison of our clusterability criteria with those discussed by Ben-David [10]. In the concluding section we draw some conclusions from this research.

2 The problem of clusterability in the previous work

Intuitively, clusterability shall be a function taking a set of points and returning a real value saying how “strong” or “conclusive” the clustering structure of the data is [2]. This intuition, however, has not been formalized in a uniform way, so quite a large number of formal definitions have been proposed. Ackerman and Ben-David in [2] studied several of these notions. They concluded that across the various formalizations two phenomena co-occur: on the one hand, well-clusterable data sets (with a high “clusterability” value) are computationally easy to cluster (in polynomial time), but on the other hand, identifying whether or not the data is well-clusterable is NP-hard.

Ben-David [10] performed an interesting investigation of the concepts of clusterability from the point of view of the capability of “not too complex” algorithms to discover the cluster structure, (negatively) verifying the working hypothesis that “Clustering is difficult only when it does not matter” (the $CDNM$ thesis).

He considered the following notions of clusterability, present in the literature:

• Perturbation Robustness, meaning that small perturbations of distances / positions in space of set elements do not result in a change of the optimal clustering for that data set. Two brands may be distinguished: additive [2] and multiplicative [12] (the limit of perturbation is upper-bounded either by an absolute value or by a coefficient).

• $\epsilon$-Separatedness, meaning that the cost of the optimal clustering into $k$ clusters is less than $\epsilon^{2}$ times the cost of the optimal clustering into $k-1$ clusters [19]; here an explicit reference to the $k$-means objective is made.

• $(c,\epsilon)$-Approximation-Stability* [8], meaning that if the cost function values of two partitions differ by the factor $c$, then the distance (in some space) between the partitions is at most $\epsilon$. As Ben-David recalls, this implies the uniqueness of the optimal solution.

• $\alpha$-Centre Stability* [7], meaning, for any centric clustering, that the distance of an element to its cluster centre is $\alpha$ times smaller than the distance to any other cluster centre under the optimal clustering.

• $(1+\alpha)$ Weak Deletion Stability [6], meaning that, given an optimal cost function value $OPT$ for $k$ centric clusters, the cost function of a clustering obtained by deleting one of the cluster centres and assigning the elements of that cluster to one of the remaining clusters should be bigger than $(1+\alpha)\cdot OPT$.

Under these notions of clusterability, algorithms have been developed that cluster the data nearly optimally in polynomial time, provided that the parameters mentioned satisfy certain constraints.

However, these conditions seem to be rather extreme. For example, given the $(c,\epsilon)$-Approximation-Stability [8], polynomial time clustering requires that, in the optimal clustering (besides its uniqueness), all but an $\epsilon$-fraction of the elements are 20 times closer to their own cluster centre than to every other cluster centre. $\epsilon$-Separatedness requires that an element be at least 200 times closer to its own cluster centre than to every other cluster element [19]. And this is still insufficient if the clusters are not balanced; a ratio of $10^{7}$ is deemed sufficient by these authors. $(1+\alpha)$ Weak Deletion Stability [6] demands that distances to other clusters be $\log(k)$ times the “average radius” of the own cluster. Perturbational stability [2] induces an exponential dependence on the sample size.

Anyway, we can draw a certain important conclusion from these concepts of clusterability mentioned above: People agree that a data set is well clusterable if each cluster is distant (widely separated) from the other clusters.

This idea occurs in many other clusterability concepts. Epter et al. [14] consider the data clusterable when the minimum between-cluster separation exceeds the maximum in-cluster distance (called elsewhere “perfect separation”). (It has been shown in the literature that under this notion of well-clusterability the single link algorithm can detect clusters separated in such a way. It has also been shown that centre-based algorithms like $k$-means may fail to detect such clusters, see e.g. [4].)
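The perfect separation condition can be checked ex post on any labelled partition. The following is a minimal sketch under the assumption that points are coordinate tuples compared with the Euclidean distance; the function names and the toy data are hypothetical and not taken from [14].

```python
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def is_perfectly_separated(points, labels):
    """True if the smallest between-cluster distance exceeds the
    largest within-cluster distance for the given labelling."""
    max_within = 0.0
    min_between = float("inf")
    for (a, la), (b, lb) in combinations(zip(points, labels), 2):
        d = dist(a, b)
        if la == lb:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between > max_within


# Hypothetical toy data on a line: two tight groups far apart.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(is_perfectly_separated(points, [0, 0, 1, 1]))
print(is_perfectly_separated(points, [0, 1, 0, 1]))
```

The check is quadratic in the number of points and only evaluates a given labelling; it says nothing about whether some other labelling of the same data would pass.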

Balcan et al. [9] propose to consider data as clusterable if each element is closer to all elements in its cluster than to all other data (called also “nice separation”). (It has been shown in the literature that this notion of well-clusterability is hard to decide in a data set, see e.g. [4].) Interestingly, $k$-means reflects the Balcan concept “on average”, that is, each element's average squared distance to elements of the same cluster is smaller than the minimum (over other clusters) average squared distance to elements of a different cluster. Kumar and Kannan [17], explicitly concentrating on the $k$-means objective, define clusterability via a proximity condition stating that any point projected on a line connecting its own cluster centre and some other cluster centre should be closer to its own cluster centre by a “sufficiently large” gap depending on the number of clusters and inverted squared cluster cardinalities.

Kushagra et al. [18] consider clusterability from the point of view of a structure in the data. They allow for noise in the data, but insist that the noise does not create structures by itself. They refrain from optimising a cost function. They show that without assumption of structure in the data or without assumption of structureless noise discovery of clusters is not possible.

Ackerman and Dasgupta [4] move the focus on clusterability from the clusterability as a property of the data alone to the pair of (data type, algorithm type). In that paper, they are interested in incremental algorithms only and show that an incremental version of $k$ -means performs poorly under perfect and nice separation.

In the same spirit Ben-David and Haghtala [11] investigated clusterability by $k$ -centroidal algorithms (a class of algorithms including $k$ -means) via robustifying an algorithm against noise in the data by either clustering the noise into separate clusters or cutting off too distant points.

Ackerman et al. [3] consider the clusterability from the perspective of distortion of clusters by malicious points. It turns out that from this perspective $k$ -means performs better than various other algorithms. With respect to our research they also insist that the proportions between cluster sizes play a significant role ensuring proper clustering.

Cohen-Addad [13] raises the claim that data are clusterable (in terms of various stability criteria) if the global clustering can be well approximated by a local one. Our work can be seen in this spirit in that we try to achieve coincidence of clusters based on separability with the global cost function minimum.

Tang [21] investigates a clusterability criterion for his own version of $k$ -means, based on the requirement that the cluster centres are separated by some distance, which is dependent upon ground truth optimal clustering.

Recently Ackerman et al. [1] derived a method for testing clusterability of data based on the large gap assumption. They investigate the histogram of (all) mutual dissimilarities between data points. If there is no data structure, the distribution should be unimodal. If there are distant clusters, then there will occur one mode for short distances (within clusters) and at least one for long distances (between clusters). Hence, to detect clusterability, they apply tests of multimodality, namely the Dip [15] and Silverman [20] tests.
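The input to such a test is the histogram of all pairwise dissimilarities. The sketch below only builds that histogram for a toy data set; it is not the procedure of [1], it does not implement the Dip or Silverman tests, and the function names, the bin count and the data are hypothetical choices.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """All Euclidean distances between unordered pairs of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def histogram(values, bins):
    """Counts of values in equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put the maximum in the last bin
        counts[idx] += 1
    return counts


# Hypothetical toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
counts = histogram(pairwise_distances(points), bins=3)
print(counts)
```

Computing the distances is quadratic in the sample size. Whether two modes in such a histogram really indicate cluster structure is exactly what the paper questions in its Section 3.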

But the criterion of a sufficiently large gap between clusters is not reflected in various clustering objectives, for example that of $k$-means, which may reach an optimum with poorly separated clusters even though there exists an alternative partition of the data with a clear separation between clusters, as we will demonstrate in Section 3. Also in Section 3 we will demonstrate that multimodal distributions can be detected by Ackerman's method even if there is no structure in the data.

Ben-David [10] raises a further important point: it is usually impossible (in practically all the above-mentioned methods except [1], which has a flaw of its own) to verify a priori whether the data fulfils the clusterability criterion. The conditions refer either to all possible clusterings or to the optimal clustering, so we cannot verify whether or not the data set is clusterable before we start clustering (and computing the optimum is usually NP-hard).

In this paper, however, we would like to stress that the situation is even worse. Even at the termination of the clustering algorithm we are unable to say whether or not the clustered data set turned out to be well-clusterable. For example, the $\epsilon$-Separatedness criterion requires that we know the nearly optimal solutions for clustering into $k$ and $k-1$ clusters. While we can usually get upper approximations of the cost function in both cases, we actually need a lower approximation for $k-1$ in order to decide ex post whether the data was well-clusterable, and hence whether or not we can say that we approximated the correct solution in some way. But we get it only for $k=2$; hence for higher $k$ the issue is not decidable. Tang's [21] criterion is certainly better, though it is also based on the solution of the optimality criterion, because we can sometimes decide ex post that the clusterability criterion was fulfilled (the distance between clusters needs to be greater than a product of the optimal clustering cost function and inverse square roots of cluster cardinalities, which may be upper-bounded using the actual clustering cost function and the number 2). Still, in this case, upon finding the optimal clustering we will remain unsure that it is optimal, even if the clusterability criterion is met.

The issue of an ex-post decision on clusterability seems nevertheless to be simpler to solve than the a priori one, therefore we will attack it in this paper. We are not aware of this issue having been raised in the past. The criteria of [14] and [9] can clearly be applied ex post to see that the clusterability criteria hold in the resulting clustering, but these approaches do not address the inverse issue: what if the clusterability criteria are not met by the resulting clustering? Is the data then unclusterable? Could no other algorithm discover the clusterable structure?

One shall note at this point that the approach in [1] is different in this respect. Compared to methods requiring finding the optimum first, Ackerman's approach seems to fulfil Ben-David's requirement that we can see whether there is clusterability in the data before starting the clustering process, as the clusterability test is computationally cheap: computing the histogram of dissimilarities is quadratic in the sample size. But on closer investigation, Ackerman's clusterability determination method misses one important point: it requires a user-defined parameter, and the user may or may not make the right guess. Furthermore, even if clusterability is decided by Ackerman's tests, it is still uncertain whether the $k$-means algorithm will find a clustering that fits Ackerman's clusterability criterion. Besides this, as visible in Figure 1, one can easily find counterexamples to their concept of clusterability. The left image shows a single cluster, but the histogram to the right has two modes, indicating a two-cluster structure.

So, in summary, determining a posteriori whether the data were clusterable remains an open issue.

Therefore it seems justified to restrict oneself to a problem as simple as possible in order to show that the issue is solvable at all. So in this paper we will limit ourselves to the issue of clusterability for the purposes of the $k$-means algorithm. (Footnote 4: The $k$-means algorithm seems to be quite popular in various variants in traditional, kernel and spectral clustering alike. Hence the results may still be of sufficiently broad importance.) Furthermore, we restrict ourselves to determining those cases in which clusterability is decidable "for sure".

The first problem to solve seems to be to get rid of the dependence on the undecidedness of optimality of the obtained solution.

But before proceeding let us recall the $k$ -means cost function definition.

[TABLE]

for a dataset $\mathbf{X}$ under some partition $\mathcal{C}=\{C_{1},\dots,C_{k}\}$ into the predefined number $k$ of clusters, $C_{1}\cup\dots\cup C_{k}=\mathbf{X}$ , where $u_{ij}$ is an indicator of the membership of data point $\textbf{x}_{i}$ in the cluster $C_{j}$ having the centre at $\boldsymbol{\mu}_{j}=\frac{1}{|C_{j}|}\sum_{\textbf{x}\in C_{j}}\textbf{x}$ .

The $k$-means algorithm starts with some initial guess of the positions of $\boldsymbol{\mu}_{j}$ for $j=1,\dots,k$ and then alternates two steps, cluster assignment and centre update, until some convergence criterion is reached, e.g. no changes in cluster membership. The cluster assignment step updates the $u_{ij}$ values so that each element $\textbf{x}_{i}$ is assigned to the cluster represented by the closest $\boldsymbol{\mu}_{j}$. The centre update step uses the update formula $\boldsymbol{\mu}_{j}=\frac{1}{|C_{j}|}\sum_{\textbf{x}\in C_{j}}\textbf{x}$.
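The two alternating steps and the cost they reduce can be sketched as follows. This is a minimal illustration in plain Python, assuming points are given as tuples of floats; the helper names `sq_dist`, `centroid`, `kmeans_cost` and `lloyd` are introduced here for illustration only and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(points: List[Point]) -> Point:
    """Gravity centre of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans_cost(data: List[Point], labels: List[int], k: int) -> float:
    """Sum of squared distances of every point to the centre of its cluster."""
    total = 0.0
    for j in range(k):
        members = [x for x, label in zip(data, labels) if label == j]
        if members:  # empty clusters contribute nothing
            mu = centroid(members)
            total += sum(sq_dist(x, mu) for x in members)
    return total


def lloyd(data: List[Point], centres: List[Point], max_iter: int = 100):
    """Alternate cluster assignment and centre update until labels stop changing."""
    centres = list(centres)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest current centre.
        new_labels = [
            min(range(len(centres)), key=lambda j: sq_dist(x, centres[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # convergence: no change in cluster membership
        labels = new_labels
        # Update step: move each non-empty cluster's centre to its gravity centre.
        for j in range(len(centres)):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:
                centres[j] = centroid(members)
    return labels, centres
```

For instance, `lloyd([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)], [(0.0, 0.0), (10.0, 0.0)])` separates the two pairs of points. The stopping rule used here (unchanged membership) is only one of the possible convergence criteria mentioned above.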

The $k$ -means++ algorithm is a special case of $k$ -means where the initial guess of cluster centres proceeds as follows. $\boldsymbol{\mu}_{1}$ is set to be a data point uniformly sampled from $\mathbf{X}$ . The subsequent cluster centres are data points picked from $\mathbf{X}$ with probability proportional to the squared distance to the closest cluster centre chosen so far. For details check [5]. Note that the algorithm proposed by [19] differs from the $k$ -means++ only by the non-uniform choice of the first cluster centre (the first pair of cluster centres should be distant, and the choice of this pair is proportional in probability to the squared distances between data elements).
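The seeding rule just described can be sketched as below. This is an illustrative sketch only, not the reference implementation of [5]; the function name `kmeanspp_seeds` and the `rng` parameter are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seeds(data: List[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k initial centres following the k-means++ seeding rule."""
    # First centre: a data point sampled uniformly.
    seeds = [rng.choice(data)]
    while len(seeds) < k:
        # Weight of each point: squared distance to the closest seed so far.
        weights = [min(sq_dist(x, s) for s in seeds) for x in data]
        # Sampling proportional to the weights; points coinciding with a
        # seed have weight 0. If every weight is 0 (fewer than k distinct
        # points), random.choices raises ValueError.
        seeds.append(rng.choices(data, weights=weights, k=1)[0])
    return seeds
```

With two tight, distant groups of points, the second seed falls into the group not yet hit with overwhelming probability, which is the effect exploited in the seeding analysis later in the paper.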

3 Non-suitability of gap-based clusterability criteria for $k$ -means

Let us discuss more closely the relationship between the gap-based well-clusterability concepts developed in the literature and the actual optimality criterion of $k$ -means. Specifically let us consider the approaches to clusterability of [1], [14], [7] and [9].

Human intuition will tell us that if the groups of data points occur in the data and there are large spaces between these groups, then it should be these groups that will be chosen as the actual clustering. On the other hand if there are no gaps between the groups of data points, then one would expect that the data are not considered as well-clusterable. Furthermore, if the data is well-clusterable, one would expect a reasonable clustering algorithm to discover easily such a well-clusterable data structure.

However, these intuitions prove wrong in case of $k$ -means.

Let us first point to the fact that [1] may indicate a clear bimodal structure in the data where there are no gaps in the data. We are unaware of anybody pointing at this weakness of well-clusterability in [1]. Imagine a thin ring uniformly covered with data points (see Figure 1(a)). We would be reluctant to say that there is a clustering structure in such data. Nonetheless we will see two obvious modes in the histogram of pairwise distances of such data. The thinner the ring (the closer to a circle), the more obvious the reason for the multimodality will be: we will get closer and closer to the following function. Consider the angle $\alpha$ centred at the centre of the circle ("thin ring"). As we are interested in calculating distances between points, we restrict ourselves to angles with measure $0^{\circ}\leq\alpha\leq 180^{\circ}$ (or $0\leq\alpha\leq\pi$). The number of elements within the angle would be approximately proportional to this angle. The distance between the cutting points of this angle on the circle, given a radius $r$ of the circle, will amount to $x=2r\sin\frac{\alpha}{2}$. Consequently $\alpha=2\arcsin\frac{x}{2r}$. To determine the density of distances we need to compute the derivative $\frac{d\alpha}{dx}=\frac{1}{dx/d\alpha}=\frac{1}{d(2r\sin\frac{\alpha}{2})/d\alpha}=\frac{1}{r\cos\frac{\alpha}{2}}=\frac{1}{r\sqrt{1-\sin^{2}\frac{\alpha}{2}}}=\frac{1}{r\sqrt{1-\frac{x^{2}}{4r^{2}}}}$. This function increases monotonically on $0\leq x<2r$: it takes its smallest value $1/r$ at $x=0$ and grows without bound towards $x=2r$. If it is actually not a circle, but a ring, more distances close to zero occur, hence the shape of the histogram. In our case the radius was 5, so these numbers need to be multiplied by 5 to obtain what is visible in the histogram in Figure 1(b).
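The closed-form expression for an ideal circle can be compared with a direct count of pairwise distances. The sketch below is an illustration under the idealising assumption of equally spaced points on a circle (not a ring of positive thickness); the function names are hypothetical, and the factor $1/\pi$ normalises $d\alpha/dx$ to a probability density over $0\leq\alpha\leq\pi$.

```python
import math


def chord_cdf(x: float, r: float) -> float:
    """Fraction of point pairs on a circle of radius r at distance <= x.

    This is alpha / pi with alpha = 2 * arcsin(x / (2 r)).
    """
    return 2.0 * math.asin(x / (2.0 * r)) / math.pi


def chord_density(x: float, r: float) -> float:
    """Normalised density (1 / pi) * d(alpha)/dx for 0 <= x < 2 r."""
    return 1.0 / (math.pi * r * math.sqrt(1.0 - x * x / (4.0 * r * r)))


def empirical_cdf(x: float, r: float, n: int = 2000) -> float:
    """Fraction of pairs among n equally spaced points with chord length <= x.

    By symmetry it suffices to fix one point and vary the angular offset.
    """
    count = 0
    for step in range(1, n):
        angle = 2.0 * math.pi * step / n
        chord = 2.0 * r * math.sin(angle / 2.0)
        if chord <= x:
            count += 1
    return count / (n - 1)
```

Calling `empirical_cdf(5.0, 5.0)` and `chord_cdf(5.0, 5.0)` for the radius used in Figure 1 gives two numbers that can be compared directly; evaluating `chord_density` on a grid of distances shows the shape of the limiting curve.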

On the other hand, even if there are gaps between groups of data, for example those required by [14], [7] or [9], the $k$-means optimum may not lie at the partition exhibiting the gap-based well-clusterability property, in spite of its existence, and this holds not only for these gaps but also for gaps arbitrarily many times larger. As far as [14] is concerned, it may be considered a special case of [9]. [7] may be viewed in turn as a strengthening of the concept of [14]. So let us discuss a situation in which both perfect and nice separation criteria are identical, that is, the case of two clusters. We will show that whatever $\alpha$ we assume in the $\alpha$-stability concept, the well-separated partition fails to be optimal for $k$-means under sufficiently unequal class cardinalities. Let these clusters, $C_{A},C_{B}$, each be enclosed in a ball of radius $r$, and let the distance between the ball centres be at least $4r$. We have demonstrated in [16] that under these circumstances the clustering of data into $C_{A},C_{B}$ reflects a local minimum of the $k$-means cost function. But it is not necessarily the global minimum, as we will show subsequently. So at least for $k$-means the criteria of Epter, Balcan and Awasthi cannot be viewed as realistic definitions of well-clusterability. Subsequently, whenever we say that a cluster is enclosed in a ball of radius $r$, we mean at the same time that the ball is centred at the gravity centre of the cluster.

For purposes of demonstration we assume that the two clusters are of different cardinalities $n_{A},n_{B}$ and let $n_{A}>n_{B}$. We show that whatever the distance between the two clusters, we can find a proportion $n_{A}/n_{B}$ such that the clustering into $C_{A},C_{B}$ is not optimal.

Let us consider a $d$-dimensional space. Let us select the dimension that contributes most to the variance in cluster $C_{A}$. So the variance along this direction amounts to at least the overall variance divided by $d$. Let us denote this variance component as $V_{d}$. Let the coordinate axis responsible for $V_{d}$ have its origin at the cluster centre of $C_{A}$. Project all the points of cluster $C_{A}$ on this axis. The variance of the projected points will be just $V_{d}$. Split the projected data set into two parts $P_{1},P_{3}$, one with coordinates above 0 and the rest. Let the centres of $P_{1},P_{3}$ lie $x_{1},x_{3}$ away from the cluster centre. Let $n_{1}$ data points of $P_{1}$ be at most $x_{1}$ distant from the origin, and $n_{2}$ more than $x_{1}$ from the origin. Let $n_{3}$ data points of $P_{3}$ be at most $x_{3}$ distant from the origin, and $n_{4}$ more than $x_{3}$ from the origin. Obviously, $n_{1}+n_{2}+n_{3}+n_{4}=n_{A}$, $|P_{1}|=n_{1}+n_{2}$, $|P_{3}|=n_{3}+n_{4}$. As zero is assumed to be the $C_{A}$ cluster centre on this line, also $x_{1}\cdot(n_{1}+n_{2})=x_{3}\cdot(n_{3}+n_{4})$ holds. Furthermore, as the cluster is enclosed in a ball of radius $r$ centred at its gravity centre, both $x_{1}\leq r$ and $x_{3}\leq r$. Under these circumstances, let us ask whether a given $V_{d}$ implies some minimal values of $x_{1},x_{3}$. If so, then by splitting the cluster $C_{A}$ into $P_{1},P_{3}$ and by increasing the cardinality of $C_{A}$, the split of data into $P_{1},P_{3}\cup C_{B}$ will deliver a lower $Q$ value, so that for sure the clustering into $C_{A},C_{B}$ will not be optimal.

Note that $V_{d}=\left((Var(P_{1})+x_{1}^{2})\cdot(n_{1}+n_{2})+(Var(P_{3})+x_{3}^{2})\cdot(n_{3}+n_{4})\right)/n_{A}$. The $n_{1}$ points of $P_{1}$ closer to the origin than $x_{1}$ are necessarily not more than $x_{1}$ distant from the gravity centre of $P_{1}$. Therefore, the remaining $n_{2}$ points cannot be more distant than $x_{1}\frac{n_{1}}{n_{2}}$. Hence $Var(P_{1})\leq x_{1}^{2}n_{1}+\left(x_{1}\frac{n_{1}}{n_{2}}\right)^{2}n_{2}$. By analogy $Var(P_{3})\leq x_{3}^{2}n_{3}+\left(x_{3}\frac{n_{3}}{n_{4}}\right)^{2}n_{4}$.

So we observe that

[TABLE]

that is

[TABLE]

Note that we can delimit $n_{2},n_{4}$ from below due to the relationship: $(r-x_{1})\cdot n_{2}\geq n_{1}\cdot x_{1}$ , $(r-x_{3})\cdot n_{4}\geq n_{3}\cdot x_{3}$ .

Therefore

[TABLE]

Hence

[TABLE]

Recall that $x_{1}\cdot(n_{1}+n_{2})=x_{3}\cdot(n_{3}+n_{4})$ . So we obtain equivalently

[TABLE]

which is equivalent to

[TABLE]

By rearranging the terms we have:

[TABLE]

Let us increase the right hand side by adding $n_{2}\cdot r\cdot x_{1}+n_{4}\cdot r\cdot x_{3}$ to the numerator. This implies

[TABLE]

Let us substitute $x_{1}=\frac{x_{3}\cdot(n_{3}+n_{4})}{n_{1}+n_{2}}$ .

[TABLE]

Hence

[TABLE]

We can delimit $n_{1}+n_{2}$ from below due to relationship $x_{3}\cdot(n_{3}+n_{4})=(n_{1}+n_{2})\cdot x_{1}\leq(n_{1}+n_{2})\cdot r$ because $x_{1}\leq r$ . It implies that $\frac{1}{n_{1}+n_{2}}\leq\frac{r}{x_{3}\cdot(n_{3}+n_{4})}$ . Therefore

[TABLE]

which simplifies to

[TABLE]

Clearly $n_{3}+n_{4}<n_{A}$ , so we obtain

[TABLE]

This means that

[TABLE]

Now let us show that when scaling up $n_{A}$ it pays off to split the first cluster and to attach the contents of the second one to one of the parts of the first. Let us increase the cardinality of $C_{A}$ by a factor of $b$, simply by replacing each data element with $b$ data elements collocated at the same place in space. In this way we keep $V_{d}$ while increasing $|C_{A}|$. So the sum of squared distances between the centre and the elements of the cluster $C_{A}$, $SSC(C_{A})$, will be kept below $V_{d}\cdot d\cdot n_{A}b$ (that is, $SSC(C_{A})\leq V_{d}\cdot d\cdot n_{A}b$).

Let $n_{1}+n_{2}$ be the minority among data points; then $x_{1}$ is the larger and $x_{3}$ the smaller of the two, because $x_{1}\cdot(n_{1}+n_{2})=x_{3}\cdot(n_{3}+n_{4})$. Let $P^{\prime}_{1},P^{\prime}_{3}$ be the subsets of $C_{A}$ yielding upon the aforementioned projection the sets $P_{1},P_{3}$. Then if we split $C_{A}$ into $P^{\prime}_{1},P^{\prime}_{3}$, the sum of squared distances to the respective cluster centres of $P^{\prime}_{1},P^{\prime}_{3}$ would decrease by at least $x_{3}^{2}n_{A}b$, because $SSC(P_{1}\cup P_{3})-x_{3}^{2}n_{A}b\geq SSC(P_{1}\cup P_{3})-x_{1}^{2}(n_{1}+n_{2})b-x_{3}^{2}(n_{3}+n_{4})b\geq SSC(P_{1})+SSC(P_{3})$, and the distances between elements of $P^{\prime}_{1}$ and $P^{\prime}_{3}$ (and so between the respective gravity centres) are at least as big as between $P_{1}$ and $P_{3}$, so that $SSC(C_{A})-x_{3}^{2}n_{A}b=SSC(P^{\prime}_{1}\cup P^{\prime}_{3})-x_{3}^{2}n_{A}b\geq SSC(P^{\prime}_{1})+SSC(P^{\prime}_{3})$.

On the other hand, combining $P^{\prime}_{1},P^{\prime}_{3}$ with disjoint parts $P^{\prime}_{6},P^{\prime}_{7}$ of $C_{B}$ will increase the sum of squared distances by at most $n_{B}x_{5}^{2}$, where $x_{5}$ is the distance between extreme elements of $C_{A}$ and $C_{B}$: $SSC(P^{\prime}_{1}\cup P^{\prime}_{6})+SSC(P^{\prime}_{3}\cup P^{\prime}_{7})\leq SSC(P^{\prime}_{1})+|P^{\prime}_{6}|x_{5}^{2}+SSC(P^{\prime}_{3})+|P^{\prime}_{7}|x_{5}^{2}=SSC(P^{\prime}_{1})+SSC(P^{\prime}_{3})+n_{B}x_{5}^{2}$.

Combining these two relations we get

[TABLE]

Therefore, as soon as we set $b\geq\frac{n_{B}x_{5}^{2}}{(V_{d}/3/r)^{2}n_{A}}\geq\frac{n_{B}x_{5}^{2}}{x_{3}^{2}n_{A}}$ , we will obtain

[TABLE]

that is, for suitably large $b$ it pays off to split $C_{A}$ and merge $C_{B}$ into parts of $C_{A}$; the optimum then lies at a partition other than the well-separated one (well-separated in the sense of a big distance between the centres of the cluster-enclosing balls). See also the discussion of Table 1 in Section 8.
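The effect can be reproduced on a toy one-dimensional instance. The sketch below is an illustrative construction, not one of the paper's experiments: $C_{A}$ consists of $b$ points at $-r$ and $b$ points at $+r$, $C_{B}$ is a single point at distance $10r$ from the centre of $C_{A}$ (so the centres are more than $4r$ apart), and the function names and default values are assumptions made for the example.

```python
from typing import List, Tuple


def sse(points: List[float]) -> float:
    """Sum of squared distances of 1-D points to their mean."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)


def costs(b: int, r: float = 1.0, centre_b: float = 10.0) -> Tuple[float, float]:
    """Return (Q of the gap-separated partition, Q of the split partition).

    C_A: b points at -r and b points at +r (inside a ball of radius r
    centred at the gravity centre 0); C_B: one point at centre_b.
    """
    left = [-r] * b       # half of C_A far from C_B
    right = [r] * b       # half of C_A close to C_B
    c_b = [centre_b]
    q_separated = sse(left + right) + sse(c_b)   # clusters C_A and C_B
    q_split = sse(left) + sse(right + c_b)       # C_A split, C_B merged in
    return q_separated, q_split
```

For a small multiplicity such as `costs(10)` the gap-separated partition is cheaper, while for `costs(100)` the split partition has the lower cost, although the geometry of the data, and hence the gap, is unchanged.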

4 Our basic approach to clusterability

Let us stress at this point that the issue of well-clusterability is not only a theoretical issue, but is of practical interest too, for example when we intend to create synthetic data sets for investigating the suitability of various clustering algorithms. But also after having performed the clustering process with whatever method we have, we need to answer one important question: whether or not the obtained clustering meets the expectations of the analyst.

These expectations may be divided into several categories:

• matching business goals,

• matching underlying algorithm assumptions,

• proximity to the optimal solutions.

Business goals of the clustering may be difficult to express in terms of data for an algorithm, or may not fit the algorithm domain or data may be too expensive to collect prior to performing an approximate clustering.

For example, when one seeks a clustering that would enable efficient collection of cars to be scrapped (a disassembly network), one has to match multiple goals, like covering the whole country, a maximum distance from client to disassembly station, and of course the number of prospective clients, which is known only with some degree of uncertainty. The distances to the clients are frequently not Euclidean in nature (due to geographical obstacles like rivers, mountains etc.), while the preferred $k$-means algorithm works best with geometrical distances, no upper bound on the distance can be imposed, etc. Other algorithms may induce the same or different problems. So a posteriori one has to check whether the obtained solution meets all criteria, does not violate constraints and is stable under fluctuation of the actual set of clients.

The other two problems are somehow related to one another. For example, you may have clustered a subsample of the proper data set, and the question may be raised how close the subsample cluster centres are to the cluster centres of the proper data set. Known methods allow us to estimate this discrepancy provided that we know that the cluster sizes do not differ too much. So prior to evaluating the correctness of the cluster centre estimation we have to check whether the cluster proportions are within a required range (or whether the subsample size is relevant for such a verification). As another example, consider methods of estimating the closeness to the optimal clustering solution under some general data distributions (as for $k$-means++ [5]): their guarantees are quite loose. At the same time, the guarantees can be much tighter if the clusters are well separated in some sense. So if we want to be sure with a reasonable probability that the obtained solution is sufficiently close to the optimum, we would need to check whether the obtained clusters are well separated in the defined sense.

With this in mind, as mentioned, a number of researchers developed the concept of data clusterability. The notion of clusterability should intuitively reflect the following idea: if it is easy to see that there are clear-cut clusters in the data, then one would say that the data set is clusterable. "Easy to see" may mean either a visual inspection or some algorithm that quickly identifies the clusters. A well-established notion of clusterability would improve our understanding of the concept of the cluster itself: a well-defined clustering would be a clustering of clusterable points. This would also be a foundation for an objective evaluation of clustering algorithms. An algorithm shall perform well for well-clusterable data; when the clusterability condition is violated to some degree, the performance of a clustering algorithm is allowed to deteriorate too, and the algorithm quality would be measured by how the clusterability violation impacts the deterioration of its performance.

However, the issue turns out not to be that simple. As is well known, each algorithm seeking to discover a clustering may be misled into failing to discover a clustering structure that is visible upon human inspection of the data. So instead of trying to reflect the human vision of clusterability of the data set independently of the algorithm, let us rather concentrate on finding a concept of clusterability that reflects both human perception and the minimum of the cost function of a concrete algorithm, in our case $k$-means. We will particularly concentrate on its version called $k$-means++.

So let us define:

Definition 1.

A data set is well-clusterable with respect to $k$ -means if (a) the data points may be split into subsets that are clearly separated by an appropriately chosen gap such that (b) the global minimum of $k$ -means cost function coincides with this split and (c) with high probability (over 0.95) the $k$ -means++ algorithm discovers this split and (d) if the split was found, it may be verified that the data subsets are separated by the abovementioned gap and (e) if the $k$ -means++ did not discover a split of the data fulfilling the requirement of the existence of the gap, then with high probability the split described by points (a) and (b) does not exist.

In the paper [16] we have investigated conditions under which one can ensure that the minimum of $k$ -means cost function is related to a clustering with (wide) gaps between clusters.

The conditions for clusterable data set therein are rather rigid, but serve the purpose of demonstration that it is possible to define properties of the data set that ensure this property of the minimum of $k$ -means. Let us recall below the main result in this respect.

So assume that the data set encompassing $n$ data points consists of $k$ subsets such that each subset $i=1,\dots,k$ can be enclosed in a ball of radius $r_{i}$. Let the gap (distance between surfaces of enclosing balls) between each pair of subsets amount to at least $g$, where $g$ is described below.

[TABLE]

and

[TABLE]

for any $p,q=1,\dots,k;p\neq q$, where $n_{i},i=1,\dots,k$ is the cardinality of the cluster $i$, $M=\max_{i}n_{i}$, $m=\min_{i}n_{i}$.

Please note that the quotient of the cardinality of the largest to the smallest cluster increases the size of the required gap, as may be expected from Section 3. From formula (2) we see that both ratios $M/m$ and $n/m$ matter. This formula gives the impression that the dependence may be like the square root of the sum of the two. But note that $g$ is also controlled by formula (3), where the dependence of $g$ on $n/m$ may become close to linear, while that on $M/m$ will still be close to a square root. As visible from Section 3, the sum of squared distances to the cluster centre within the cluster and between clusters decides on the point at which the shift in minimal costs occurs when the disproportion between cluster sizes grows. Hence $g$ needs to grow as the square root of this disproportion $M/m$. The impact of $n/m$ shall rather be viewed in the context of the number of clusters $k$, as with fixed $m$ and growing $n$, $n/m$ may be deemed a reflection of $k$. If one looks at formula (5), one sees that $g$ depends approximately quadratically on $k$. This relates probably to the fact that the number of possible misassignments between clusters grows quadratically with $k$.

It is claimed in [16] that the optimum of $k$ -means objective is reached when splitting the data into the aforementioned subsets.

What are the implications? The most fundamental one is that the problem is decidable.

Theorem 1.

(i) If the data set is well-clusterable with a gap defined by formulas (2) and (3), then with high probability $k$ -means++ (after an appropriately chosen number of repetitions) will discover the respective clustering. (ii) If $k$ -means++ (after an appropriately chosen number of repetitions) does not discover a clustering matching formulas (2) and (3), then with high probability the data set is not well clusterable with a gap defined by formulas (2) and (3).

The rest of the current section is devoted to the proof of the claims of this new theorem, proposed in the current paper.

If we obtained the split, then for each cluster we are able to compute the cluster centre, the radius of the ball containing all the data points of the cluster, and finally we can check if the gaps between the clusters meet the requirement of formulas (2) and (3). So we are able to decide that we have found that the data set is well-clusterable.
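This ex-post verification can be sketched as follows. The sketch is a minimal illustration: the threshold `required_gap` stands for the value prescribed by formulas (2) and (3) and has to be supplied by the caller, and the function names are hypothetical. Every cluster is assumed to be non-empty.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]
Ball = Tuple[Point, float]


def enclosing_balls(data: List[Point], labels: List[int], k: int) -> List[Ball]:
    """Gravity centre and radius of the ball enclosing each (non-empty) cluster."""
    balls = []
    for j in range(k):
        members = [x for x, label in zip(data, labels) if label == j]
        centre = tuple(sum(col) / len(members) for col in zip(*members))
        radius = max(math.dist(centre, x) for x in members)
        balls.append((centre, radius))
    return balls


def min_gap(balls: List[Ball]) -> float:
    """Smallest distance between the surfaces of two enclosing balls."""
    return min(
        math.dist(ca, cb) - ra - rb
        for i, (ca, ra) in enumerate(balls)
        for (cb, rb) in balls[i + 1:]
    )


def gaps_satisfied(data: List[Point], labels: List[int], k: int,
                   required_gap: float) -> bool:
    """True if every pair of clusters is separated by at least required_gap.

    required_gap is an assumption of this sketch: it is meant to be the
    value of g computed from formulas (2) and (3).
    """
    return min_gap(enclosing_balls(data, labels, k)) >= required_gap
```

The check needs one pass over the data to obtain the balls and a loop over the $k(k-1)/2$ pairs of clusters, so it is cheap compared to the clustering itself.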

So let us look at the claim (i). As we already know, the global minimum of $k$ -means coincides with the separation by abovementioned gaps. Hence if there exists a positive probability, that $k$ -means++ discovers the appropriate split, then by repeating independent runs of $k$ -means++ and picking the split minimising $k$ -means cost function we will increase the probability of finding the global minimum. We will show that we know the number of repetitions needed in advance, if we assume the maximum value of the quotient $M/m$ .

First consider the easiest case of all clusters being of equal sizes ( $M=m$ ). Then the above equations (2) and (3) can be reduced to ( $r=r_{max}$ )

[TABLE]

A diagram of the dependence of $g/r$ on $k$ is depicted in Figure 3.

Now let us turn to $k$ -means++ seeding. If already $i$ distinct clusters were seeded, then the probability that a new cluster will be seeded (under our assumptions) amounts to at least

[TABLE]

Hence the probability of accurate seeding ( $PAS(k)$ ) amounts to

[TABLE]

The diagram of dependence of this expression on $k$ is depicted in Figure 3.

Let us denote with $Pr_{succ}$ the required probability of success in finding the global minimum. To ensure that the seeding was successful in $Pr_{succ}$ (e.g. 95% ) of cases, we need to rerun $k$ -means++ at least $R$ times, with $R$ given by

[TABLE]
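The number of repetitions can be computed as sketched below. The sketch assumes independent runs, each seeding accurately with probability at least $PAS$, so that $R$ is the smallest integer with $1-(1-PAS)^{R}\geq Pr_{succ}$; this standard form is an assumption of the illustration, and the function name is hypothetical.

```python
import math


def repetitions_needed(pas: float, pr_succ: float = 0.95) -> int:
    """Smallest R with 1 - (1 - pas)**R >= pr_succ, for independent runs.

    pas     -- lower bound on the probability of accurate seeding in one run
    pr_succ -- required overall probability of success
    """
    if not 0.0 < pas <= 1.0:
        raise ValueError("pas must lie in (0, 1]")
    if not 0.0 <= pr_succ < 1.0:
        raise ValueError("pr_succ must lie in [0, 1)")
    if pas == 1.0:
        return 1  # a single run always succeeds
    # (1 - pas)**R <= 1 - pr_succ  <=>  R >= log(1 - pr_succ) / log(1 - pas)
    return max(1, math.ceil(math.log(1.0 - pr_succ) / math.log(1.0 - pas)))
```

For example, a per-run seeding probability of 0.5 requires `repetitions_needed(0.5, 0.95)` runs, whereas a per-run probability already above the target needs only one.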

But look at the following relationship:

[TABLE]

The exponent of the last expression rapidly approaches zero, so that with increasing $k$ the optimum is reached within a single pass of $k$-means++. In fact, already for $k=2$ we have an error of below 3%, for $k=8$ below 1%, and for $k=30$ below 0.1%. See Figure 3 for an illustration.

Let us discuss clusters with same radius, but different cardinalities. Let $m$ be the cluster minimum cardinality, and $M$ respectively the maximum.

[TABLE]

for any $p,q=1,\dots,k;p\neq q$, where $n_{i},i=1,\dots,k$ is the cardinality of the cluster $i$, $M=\max_{i}n_{i}$, $m=\min_{i}n_{i}$. Worst-case $g/r$ values are illustrated in Figure 5.

Now let us turn to $k$ -means++ seeding. If already $i$ distinct clusters were seeded, then the probability that a new cluster will be seeded (under our assumptions) amounts to at least

[TABLE]

So again the probability of successful seeding will amount to at least:

[TABLE]

Even if $M$ is 20 times as big as $m$, the convergence to 1 is still so rapid that already for $k=2$ the clustering success is achieved with $95\%$ success probability in a single repetition. An illustration is visible in Figure 5.

So far we have concentrated on showing that if the data is well-clusterable, then within practically a single clustering run the seeding will have the property that each cluster obtains a single seed. But what about the rest of the run of $k$-means? As $g\geq 2r$ in all these cases, the cluster centres will never switch to balls encompassing other clusters, as shown in [16], so that eventually the true cluster structure is detected and the minimum of $Q$ is reached. This completes the proof of claim (i). The demonstration of claim (ii) is straightforward. Note that if a clustering discovered by $k$-means fulfils the conditions of well-clusterability, then the data set is well-clusterable for sure, by definition. If the data were not well-clusterable, then $k$-means++ would for sure not find a clustering with the property of being well-clusterable, because it does not exist. If the data were well-clusterable, then $k$-means++ would fail to identify this with probability of at most $1-Pr_{succ}$.

So denote with $W$ the event that the data is well-clusterable. Further denote with $D$ the event that the $k$ -means++ algorithm states that the data is well-clusterable. We are now interested in approximating $P(\lnot W|\lnot D)$ , or more precisely stating that this probability is high.

[TABLE]

The last inequality is true because the well-clusterable data are in practice extremely rare, and for sure less frequent than not well-clusterable ones.
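The size of this probability can be illustrated numerically with Bayes' rule. The sketch below uses $P(\lnot D|\lnot W)=1$ and $P(\lnot D|W)\leq 1-Pr_{succ}$ from the preceding discussion; the prior `prior_w` for $P(W)$ is a hypothetical input, since the text only assumes that well-clusterable data are the rarer case.

```python
def prob_not_clusterable_given_not_detected(prior_w: float,
                                            pr_succ: float) -> float:
    """Lower bound on P(not W | not D) by Bayes' rule.

    prior_w -- assumed prior probability P(W) that the data is well-clusterable
    pr_succ -- probability that k-means++ detects well-clusterable data
    """
    miss = 1.0 - pr_succ           # upper bound on P(not D | W)
    prior_not_w = 1.0 - prior_w    # P(not W); here P(not D | not W) = 1
    return prior_not_w / (prior_not_w + miss * prior_w)
```

Even with the pessimistic prior $P(W)=0.5$ and $Pr_{succ}=0.95$, the bound is $0.5/0.525\approx 0.952$; for rarer well-clusterable data it is closer to 1.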

Please note that what we have discussed here is a kind of worst-case analysis. Already from this discussion it is obvious that the probability of seeding $k$ distinct clusters depends on the characteristics of the data. We always refer to the smallest gaps between clusters, but it may turn out that some clusters are more strongly separated. This will automatically increase their probability of being hit, so that the overall probability of hitting unhit clusters will increase significantly.

5 Smaller gaps between clusters

In the previous section we considered well-clusterability under the assumption of large areas between clusters where no data points of any cluster will occur. Subsequently we show that this assumption may be relaxed so that spurious points are allowed between the major concentrations of cluster points. But to ensure that the presence of such points will not lead the $k$-means procedure astray, we will distinguish core parts of the clusters and will ensure by the subsequent Theorem 3 that once a cluster core is hit by the $k$-means initialisation procedure, the cluster is preserved over subsequent $k$-means iterations.

In [16] we have proven that

Theorem 2.

Let $A,B$ be cluster centres. Let $\rho_{AB}$ be the radius of a ball centred at $A$ and enclosing its cluster and also the radius of a ball centred at $B$ and enclosing its cluster. If the distance between the cluster centres $A,B$ amounts to $2\rho_{AB}+g$, $g>0$ ($g$ being the "gap" between clusters), and if we pick any two points, $X$ from the cluster of $A$ and $Y$ from the cluster of $B$, and recluster both clusters around $X$ and $Y$, then the new clusters will preserve the balls centred at $A$ and $B$ of radius $g/2$ (called subsequently "cores") each ($X$ the core of $A$, $Y$ the core of $B$).

Here we shall demonstrate the validity of a complementary theorem.

Theorem 3.

Let $A,B$ be cluster centres. Let $\rho_{AB}$ be the radius of a ball centred at $A$ and enclosing its cluster and also the radius of a ball centred at $B$ and enclosing its cluster. Let $r_{cAB}$ be the radius of a ball centred at $A$ and enclosing the "vast majority" of its cluster and also the radius of a ball centred at $B$ and enclosing the "vast majority" of its cluster. If the distance between the cluster centres $A,B$ amounts to $2\rho_{AB}+g$, $g>0$ ($g=2r_{cAB}$ being the "gap" between clusters), and if we pick any two points, $X$ from the ball $B(A,r_{cAB})$ and $Y$ from the ball $B(B,r_{cAB})$, and recluster both clusters around $X$ and $Y$, then the new clusters will be identical to the original clusters around $A$ and $B$.

Definition 2.

If the gap between each pair of clusters fulfils the condition of either of the above two theorems, then we say that we have core-clustering.

Proof.

For the illustration of the proof see Figure 6.

The proof does not differ too much from the previous one and in fact the previous Theorem 2 is a special case of Theorem 3.

Consider the two points $A,B$ being the two centres of double balls. The inner ball represents the core of radius $r_{cAB}=g/2$; the outer ball, of radius $\rho$ ($\rho=\rho_{AB}$), encloses the whole cluster. Consider two points, $X,Y$, one being in each core ball (presumably the cluster centres at some stage of the $k$-means algorithm). To represent their distances faithfully, we need at most a 3D space.

Let us consider the plane established by the line $AB$ and parallel to the line $XY$ . Let $X^{\prime}$ and $Y^{\prime}$ be orthogonal projections of $X,Y$ onto this plane. Now let us establish that the hyperplane $\pi$ orthogonal to $XY$ , and passing through the middle of the line segment $XY$ , that is the hyperplane containing the boundary between clusters centred at $X$ and $Y$ does not cut any of the balls centred at $A$ and $B$ . This hyperplane will be orthogonal to the plane of the Figure 6 and so it will manifest itself as an intersecting line $l$ that should not cross outer circles around $A$ and $B$ , being projections of the respective balls. Let us draw two solid lines $k,m$ between circles $O(A,\rho_{AB})$ and $O(B,\rho_{AB})$ tangential to each of them. Line $l$ should lie between these lines, in which case the cluster centre will not jump to the other ball.

Let the line $X^{\prime}Y^{\prime}$ intersect with the circles $O(A,r_{cAB})$ and $O(B,r_{cAB})$ at points $C,D,E,F$ as in the figure.

It is obvious that the line $l$ would get closer to circle $A$ , if the points $X^{\prime},Y^{\prime}$ would lie closer to $C$ and $E$ , or closer to circle $B$ if they would be closer to $D$ and $F$ .

Therefore, to show that it does not cut the circle $O(A,\rho)$ it is sufficient to consider $X^{\prime}=C$ and $Y^{\prime}=E$ . (The case with ball $Ball(B,\rho)$ is symmetrical).

Let $O$ be the centre of the line segment $AB$ . Let us draw through this point a line parallel to $CE$ that cuts the circles at points $C^{\prime},D^{\prime},E^{\prime}$ and $F^{\prime}$ . Now notice that centric symmetry through point $O$ transforms the circles $O(A,r_{cAB})$ , $O(B,r_{cAB})$ into one another, and point $C^{\prime}$ in $F^{\prime}$ and $D^{\prime}$ in $E^{\prime}$ . Let $E^{*}$ and $F^{*}$ be images of points $E$ and $F$ under this symmetry.

In order for the line $l$ to lie between $m$ and $k$ , the middle point of the line segment $CE$ shall lie between these lines.

Let us introduce a planar coordinate system centred at $O$ with $\mathcal{X}$ axis parallel to lines $m,k$ , such that $A$ has both coordinates non-negative, and $B$ non-positive. Let us denote with $\alpha$ the angle between the lines $AB$ and $k$ . As we assume that the distance between $A$ and $B$ equals $2\rho+2r_{cAB}$ , then the distance between lines $k$ and $m$ amounts to $2((\rho+r_{cAB})\sin(\alpha)-\rho)$ . Hence the $\mathcal{Y}$ coordinate of line $k$ equals $((\rho+r_{cAB})\sin(\alpha)-\rho)$ .

So the $\mathcal{Y}$ coordinate of the centre of line segment $CE$ shall be not higher than this. Let us express this in the coordinate system:

[TABLE]

where $y_{OC}$ is the $y$ -coordinate of the vector $\overrightarrow{OC}$ , etc.

Note, however, that

[TABLE]

So let us examine the circle with centre at $A$ . Note that the lines $CD$ and $E^{*}F^{*}$ are at the same distance from the line $C^{\prime}D^{\prime}$ . Note also that the absolute values of the slopes of the tangents to circle $A$ at $C^{\prime}$ and $D^{\prime}$ are identical. The more distant these lines are, as line $CD$ gets closer to $A$ , the bigger $y_{AC}$ gets and the smaller $y_{E^{*}A}$ becomes. But from the properties of the circle we see that $y_{AC}$ increases at a decreasing rate, while $y_{E^{*}A}$ decreases at an increasing rate. So the sum $y_{AC}+y_{E^{*}A}$ has its biggest value when $C$ is identical with $C^{\prime}$ , and hence we need to prove only that

[TABLE]

Let $M$ denote the middle point of the line segment $C^{\prime}D^{\prime}$ . As point $A$ has the coordinates $((\rho+r_{cAB})\cos(\alpha),(\rho+r_{cAB})\sin(\alpha))$ , the point $M$ is at distance of $(\rho+r_{cAB})\cos(\alpha)$ from $A$ . But $C^{\prime}M^{2}=r_{cAB}^{2}-((\rho+r_{cAB})\cos(\alpha))^{2}$ .

So we need to show that

[TABLE]

In fact we get from the above

[TABLE]

which is obviously true, as $\sin$ never exceeds 1. ∎
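The geometric claim of this proof can be spot-checked numerically. The following is a minimal sketch, not part of the original argument: it assumes the configuration used above ( $|AB|=2\rho+2r_{cAB}$ , candidate centres $X,Y$ inside the two cores) and samples points $P$ of the outer ball around $A$ to confirm that none of them lies closer to $Y$ than to $X$ . The function names, the sample size and the choice of three dimensions are illustrative assumptions.

```python
import numpy as np

def sample_in_ball(rng, centre, radius, n):
    """Draw n points uniformly from the ball around `centre` (any dimension)."""
    d = centre.shape[0]
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.random(n) ** (1.0 / d)
    return centre + directions * radii[:, None]

def max_violation(rho, r_core, n=20000, dim=3, seed=0):
    """Largest sampled value of |P-X| - |P-Y| for P in Ball(A, rho).

    A non-positive result means every sampled P is at least as close to X
    as to Y, i.e. the bisector of XY did not cut Ball(A, rho) in any sample.
    """
    rng = np.random.default_rng(seed)
    half = rho + r_core                    # |AB| = 2*rho + 2*r_core
    a = np.zeros(dim)
    a[0] = -half
    b = np.zeros(dim)
    b[0] = half
    x = sample_in_ball(rng, a, r_core, n)  # candidate centres in the core of A
    y = sample_in_ball(rng, b, r_core, n)  # candidate centres in the core of B
    p = sample_in_ball(rng, a, rho, n)     # data points of the cluster around A
    diff = np.linalg.norm(p - x, axis=1) - np.linalg.norm(p - y, axis=1)
    return float(diff.max())
```

A call such as `max_violation(1.0, 0.5)` should return a non-positive number. Sampling can only fail to find a counterexample, so this illustrates the statement rather than proving it.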

6 Core based global $k$ -means minimum

In the paper [16] we have investigated conditions under which one can ensure that the minimum of $k$ -means cost function is related to a clustering with (wide) gaps between clusters.

Based on the result of the preceding Section 5, we want to weaken these conditions requiring only that the big gaps exist between cluster cores and the clusters themselves are separated by much smaller gaps, equal to the size of the core.

In particular, let us consider the set of $k$ clusters $\overline{\mathcal{C}}=\{\overline{C_{1}},\dots,\overline{C_{k}}\}$ of cardinalities $\overline{n_{1}},\dots,\overline{n_{k}}$ and with radii of balls enclosing the clusters (with centres located at cluster centres) $\overline{r_{1}},\dots,\overline{r_{k}}$ . Let each of these clusters $\overline{C_{i}}$ have a core $C_{i}$ around the cluster $\overline{C_{i}}$ centre of radius $r_{i}$ and cardinality $n_{i}$ such that for $\mathfrak{p}\in[0,1)$

[TABLE]

We are interested in a gap $g$ between cluster cores $C_{1},\dots,C_{k}$ such that it does not make sense to split each cluster $\overline{C_{i}}$ into subclusters $\overline{C_{i1}},\dots,\overline{C_{ik}}$ and to combine them into a set of new clusters $\mathcal{S}=\{S_{1},\dots,S_{k}\}$ such that $S_{j}=\cup_{i=1}^{k}\overline{C_{ij}}$ .

We seek a $g$ such that the highest possible central sum of squares combined over the clusters $\overline{C_{i}}$ would be lower than the lowest conceivable combined sums of squares around respective centres of clusters $S_{j}$ . Let $Var(C)$ be the variance of the cluster $C$ (average squared distance to set $C$ gravity centre; with one exception, however: if referring to the *core* of any of the clusters $\overline{C_{i}}$ , we compute against the cluster $\overline{C_{i}}$ gravity centre, not the core $C_{i}$ gravity centre, and likewise for the $Q$ function). Let $C_{ij}=\overline{C_{ij}}\cap C_{i}$ be the core part of the subcluster $\overline{C_{ij}}$ . Let $r_{ij}$ be the distance of the centre of core subcluster $C_{ij}$ to the centre of cluster $\overline{C_{i}}$ . Let $v_{ilj}$ be the distance of the centre of core subcluster $C_{ij}$ to the centre of core subcluster $C_{lj}$ . So the total $k$ -means function for the set of clusters $(C_{1},\dots,C_{k})$ will amount to:

[TABLE]

And the total $k$ -means function for the set of clusters $(S_{1},\dots,S_{k})$ will amount to:

[TABLE]
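The role of the inter-centre distances $v_{ilj}$ can be illustrated with a standard identity: the sum of squares of a merged set equals the sum of the parts' own sums of squares plus a term built from the pairwise squared distances between the parts' centres, weighted by the part sizes. The sketch below checks this identity numerically. It is an illustration under the usual definition of the $k$ -means cost and is not claimed to reproduce the exact form of the displayed formulas; the function names are hypothetical.

```python
import numpy as np

def q_cost(points):
    """Sum of squared distances of `points` to their gravity centre."""
    centre = points.mean(axis=0)
    return float(((points - centre) ** 2).sum())

def q_merged_from_parts(parts):
    """Q of the union of `parts`, from per-part costs and centre distances."""
    sizes = np.array([len(p) for p in parts], dtype=float)
    centres = np.array([p.mean(axis=0) for p in parts])
    within = sum(q_cost(p) for p in parts)
    between = 0.0
    for i in range(len(parts)):
        for l in range(i + 1, len(parts)):
            v_sq = ((centres[i] - centres[l]) ** 2).sum()  # squared centre distance
            between += sizes[i] * sizes[l] * v_sq
    return within + between / sizes.sum()
```

For any list of point arrays `parts`, `q_cost(np.vstack(parts))` and `q_merged_from_parts(parts)` agree up to rounding. This is why a lower bound on the distances between subcluster centres yields a lower bound on the cost of a merged cluster $S_{j}$ .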

Should $(\overline{C_{1}},\dots,\overline{C_{k}})$ constitute the absolute minimum of the $k$ -means target function, then $Q(\mathcal{S})\geq Q(\overline{\mathcal{C}})$ should hold, which is fulfilled if:

[TABLE]

Note that on the left-hand side of the inequality we ignored the portion of the data outside of the cores. This portion of the data would have made the left-hand side even bigger.

The above inequality is implied by:

[TABLE]

Note that $Var(C_{ij})\leq r_{ij}^{2}$ , so

[TABLE]

To maximise $\sum_{j=1}^{k}n_{ij}r_{ij}^{2}$ for a single cluster $C_{i}$ of enclosing ball radius $r_{i}$ , note that you should set $r_{ij}$ to $r_{i}$ . Let $m_{j}=\arg\max_{j\in\{1,\dots,k\}}n_{ij}$ . If we set $r_{ij}=r_{i}$ for all $j$ except $m_{j}$ , then the maximal $r_{i{m_{j}}}$ is delimited by the relation $\sum_{j=1;j\neq m_{j}}^{k}n_{ij}r_{ij}\geq n_{i{m_{j}}}r_{i{m_{j}}}$ . So

[TABLE]

So if we can guarantee that the gap between cluster balls (of clusters from $\mathcal{C}$ ) amounts to $g$ then surely

[TABLE]

because in such case $g\leq v_{ilj}$ for all $i,l,j$ .

By combining inequalities (10), (12) and (13) we see that the global minimum is granted if the following holds:

[TABLE]

One can distinguish two cases: either (1) there exists a cluster $S_{t}$ containing two subclusters $C_{pt}$ , $C_{qt}$ such that $t=\arg\max_{j}|C_{pj}|$ and $t=\arg\max_{j}|C_{qj}|$ (maximum cardinality subclusters of their respective original clusters $C_{p},C_{q}$ ), or (2) not.

Consider the first case. Let $C_{p},C_{q}$ be the two clusters for which $C_{pt}$ and $C_{qt}$ are the subclusters of highest cardinality within $C_{p},C_{q}$ respectively. This implies that $n_{pt}\geq\frac{1}{k}n_{p},n_{qt}\geq\frac{1}{k}n_{q}$ . It also implies that $n_{it}\leq n_{i}/2$ for $i\neq p,i\neq q$ .

[TABLE]

Note that

[TABLE]

So, in order to fulfil inequality (14), it is sufficient to require that

[TABLE]

The right-hand side is, of course, to be maximised over all combinations of $p,q$ .

Let us proceed to the second case. Here each cluster $S_{j}$ contains a subcluster of maximum cardinality of a different cluster $C_{i}$ . As the relation between $S_{j}$ and $C_{i}$ is unique, we can reindex $S_{j}$ in such a way that actually $C_{j}$ contains its maximum cardinality subcluster $C_{jj}$ . Let us rewrite the inequality (14).

[TABLE]

This is met if

[TABLE]

This is the same as:

[TABLE]

This is fulfilled if:

[TABLE]

Let $M$ be the maximum over $n_{1},\dots,n_{k}$ . The above holds if

[TABLE]

Let $m$ be the minimum over $n_{1},\dots,n_{k}$ . The above holds if

[TABLE]

This is the same as

[TABLE]

The above will hold, if for every $i=1,\dots,k$

[TABLE]

So the inequality (14) is fulfilled if both inequality (15) and inequality (16) are satisfied by an appropriately chosen $g$ .

In summary we have shown that

Theorem 4.

Let $\overline{\mathcal{C}}=\{\overline{C_{1}},\dots,\overline{C_{k}}\}$ be a partition of a data set into $k$ clusters of cardinalities $\overline{n_{1}},\dots,\overline{n_{k}}$ and with radii of balls enclosing the clusters (with centres located at cluster centres) $\overline{r_{1}},\dots,\overline{r_{k}}$ . Let each of these clusters $\overline{C_{i}}$ have a core $C_{i}$ of radius $r_{i}$ and cardinality $n_{i}$ around the cluster centre such that for $\mathfrak{p}\in[0,1)$

[TABLE]

Then if the gap $g$ between cluster cores $C_{1},\dots,C_{k}$ fulfils conditions expressed in formulas (15) and (16) then the partition $\overline{\mathcal{C}}$ coincides with the global minimum of the $k$ -means cost function for the data set.

7 Core based approach to clusterability

After the preceding preparatory work, we want to prove a theorem analogous to Theorem 1, but now allowing for smaller gaps between clusters.

Theorem 5.

(i) If the data set is well-clusterable with a gap defined by formulas (16) and (15), with $r_{i}$ replaced by their maxima, then with high probability $k$ -means++ (after an appropriately chosen number of repetitions) will discover the respective clustering. (ii) If $k$ -means++ (after an appropriately chosen number of repetitions) does not discover a clustering matching formulas (16) and (15) (with $r_{i}$ replaced by their maxima), then with high probability the data set is not well-clusterable with a gap defined by formulas (16) and (15).

The rest of the current section is devoted to the proof of the claims of this theorem.

If we obtained the split, then for each cluster we are able to compute the cluster centre, the radius of the ball containing all the data points of the cluster but the most distant ones, constituting at most $\mathfrak{p}$ of the quality function for the cluster, and finally we can check if the gaps between the cluster cores meet the requirement of formulas (16) and (15). So we are able to decide that we have found that the data set is well-clusterable.
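The a posteriori check described in this paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the core is taken to be what remains after dropping the most distant points for as long as the dropped points account for at most a fraction $\mathfrak{p}$ of the cluster's sum of squared distances, and the required gap of formulas (15) and (16) is not recomputed here but passed in as the hypothetical parameter `required_gap`. All function names are illustrative.

```python
import numpy as np

def core_radius(points, centre, p):
    """Radius of the core: drop the most distant points while they account
    for at most a fraction `p` of the cluster's sum of squared distances."""
    sq = np.sort(((points - centre) ** 2).sum(axis=1))[::-1]  # farthest first
    budget = p * sq.sum()
    dropped = np.cumsum(sq)            # cost removed after dropping 1, 2, ... points
    n_drop = int(np.searchsorted(dropped, budget, side="right"))
    n_drop = min(n_drop, len(sq) - 1)  # always keep at least one point
    return float(np.sqrt(sq[n_drop]))

def min_core_gap(X, labels, p):
    """Smallest gap between the core balls of any two clusters."""
    ids = np.unique(labels)
    centres = [X[labels == c].mean(axis=0) for c in ids]
    radii = [core_radius(X[labels == c], mu, p) for c, mu in zip(ids, centres)]
    gaps = [np.linalg.norm(centres[i] - centres[j]) - radii[i] - radii[j]
            for i in range(len(ids)) for j in range(i + 1, len(ids))]
    return float(min(gaps))

def passes_check(X, labels, p, required_gap):
    # `required_gap` stands in for the bound given by formulas (15) and (16);
    # it is a caller-supplied assumption, not computed in this sketch.
    return min_core_gap(X, labels, p) >= required_gap
```

Each step costs at most a sort per cluster plus a pass over all pairs of clusters, which is consistent with the remark that the check is computationally cheap once a split has been obtained.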

So let us look at the claim (i). As we already know from preceding Section 6, the global minimum of $k$ -means coincides with the separation by abovementioned gaps. Hence if there exists a positive probability, that $k$ -means++ discovers the appropriate split, then by repeating independent runs of $k$ -means++ and picking the split minimising $k$ -means cost function we will increase the probability of finding the global minimum. We will show that we know the number of repetitions needed in advance, if we assume the maximum value of the quotient $M/m$ .

We assume it is granted that

[TABLE]

for any $i=1,\dots,k$

[TABLE]

for any $p,q=1,\dots,k;p\neq q$ , where $n_{i},i=1,\dots,k$ is the cardinality of the cluster $i$ , $M=\max_{i}n_{i}$ , $m=\min_{i}n_{i}$ . For an illustration of this dependence see Figure 7.

So let us turn to $k$ -means++ seeding. If already $i$ distinct cluster cores were seeded, then the probability that a new cluster core will be seeded (under our assumptions) amounts to at least

[TABLE]

So again the probability of successful seeding will amount to at least:

[TABLE]

For an illustration of this dependence see Figure 9.

Apparently in the limit the above expression lies at about $(1-\mathfrak{p})^{k-1}$ .
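For readers who want to observe the seeding behaviour empirically, the sketch below implements the $D^{2}$ sampling step of $k$ -means++ as described by Arthur and Vassilvitskii [5] and estimates how often every cluster receives exactly one seed. It assumes the data contain at least $k$ distinct points; the helper `hit_rate` and its parameters are illustrative and not taken from the paper.

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """Return the indices of k seeds chosen by D^2 sampling."""
    n = len(X)
    seeds = [int(rng.integers(n))]                 # first seed: uniform
    d2 = ((X - X[seeds[0]]) ** 2).sum(axis=1)      # squared distance to nearest seed
    for _ in range(1, k):
        nxt = int(rng.choice(n, p=d2 / d2.sum()))  # probability proportional to D^2
        seeds.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return seeds

def hit_rate(X, labels, k, trials=200, seed=0):
    """Fraction of seedings that place exactly one seed in every cluster."""
    rng = np.random.default_rng(seed)
    hits = sum(len(set(labels[kmeanspp_seed(X, k, rng)])) == k
               for _ in range(trials))
    return hits / trials
```

On data with a known ground-truth labelling, `hit_rate(X, labels, k)` gives an empirical counterpart to the lower bound on the probability of successful seeding discussed above.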

So to achieve the identification of the clustering with probability of at least $Pr_{succ}$ (e.g. $95\%$ ), we will need $R$ runs of $k$ -means++ where

[TABLE]

Note that

[TABLE]

The effect of doubling $k$ is

[TABLE]

that is, it is sublinear in the expression $1-(1-\mathfrak{p})^{k-1}$ , hence $R$ grows slower than reciprocally logarithmically in $k$ and $\mathfrak{p}$ . For an illustration of this relation see Figure 9.
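The number of repetitions can be illustrated with the usual argument for independent runs: if a single run succeeds with probability at least $P$ , then $R$ runs all fail with probability at most $(1-P)^{R}$ , so it suffices to take the smallest $R$ with $(1-P)^{R}\leq 1-Pr_{succ}$ . The sketch below evaluates this bound and, as a simplifying assumption, plugs in the limiting value $(1-\mathfrak{p})^{k-1}$ mentioned above for $P$ . The function names are illustrative, and the exact finite-sample expression of the displayed formula is not reproduced.

```python
import math

def runs_needed(pr_succ, p_single):
    """Smallest R with 1 - (1 - p_single)**R >= pr_succ."""
    if not (0.0 < p_single < 1.0 and 0.0 < pr_succ < 1.0):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - pr_succ) / math.log(1.0 - p_single))

def limiting_seed_probability(k, p):
    """Limit value (1 - p)**(k - 1) quoted in the text for the seeding bound."""
    return (1.0 - p) ** (k - 1)
```

For example, with $k=5$ , $\mathfrak{p}=0.1$ and $Pr_{succ}=95\%$ , `runs_needed(0.95, limiting_seed_probability(5, 0.1))` evaluates to 3 under this simplification.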

So far we have concentrated on showing that if the data is well-clusterable, then within a practically reasonable number of $k$ -means++ runs the seeding will have the property that each cluster obtains a single seed. But what about the rest of the run of $k$ -means? As shown in Section 5, the cluster centres will never switch to balls encompassing other clusters, so that eventually the true cluster structure is detected and the minimum of $Q$ is reached. This completes the proof of claim (i). The demonstration of claim (ii) is straightforward. If the data were well-clusterable, then $k$ -means++ would have failed to identify this with probability of at most $1-Pr_{succ}$ . As well-clusterable data are in practice extremely rare, the failure of the algorithm to identify a well-clusterable structure implies with probability of at least $Pr_{succ}$ that no such structure exists in the data. A detailed proof follows the reasoning of the last part of the proof of Theorem 1.

8 Experimental results

In order to illustrate the issues raised in this paper, three types of experiments were performed. The first experiment, performed on synthetic data, is devoted to the mismatch of gaps between subsets of data and the clusterings obtained by common clustering algorithms. The results are shown in Table 1. The second, performed on synthetic data (Table 8), and third, performed on real data (Table LABEL:tab:real) are devoted to demonstration that $k$ -means++ is able to discover well-clusterable data. In particular it is shown that:

1) If a dataset is well-clusterable as defined in Theorem 1 or Theorem 5 (based on Definition 1), then $k$ -means++ is able to identify the best clustering (both for real-world datasets and synthetic ones).

2) If $k$ -means++ cannot find a clustering satisfying well-clusterability, there is no good clustering structure, fitting those definitions, hidden in the data (with high probability) for all $k$ -means style algorithms.

Bibliography

[1] M. Ackerman, A. Adolfsson, and N. Brownstein. An effective and efficient approach for clusterability evaluation. CoRR, abs/1602.06687, 2016.
[2] M. Ackerman and S. Ben-David. Clusterability: A theoretical study. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 1–8, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
[3] M. Ackerman, S. Ben-David, D. Loker, and S. Sabato. Clustering oligarchies. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, pages 66–74, 2013.
[4] M. Ackerman and S. Dasgupta. Incremental clustering: The case for extra clusters. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 307–315, 2014.
[5] D. Arthur and S. Vassilvitskii. $k$-means++: the advantages of careful seeding. In N. Bansal, K. Pruhs, and C. Stein, editors, Proc. of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pages 1027–1035, New Orleans, Louisiana, USA, 7-9 Jan. 2007. SIAM.
[6] P. Awasthi, A. Blum, and O. Sheffet. Stability yields a PTAS for k-median and k-means clustering. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 309–318, Washington, DC, USA, 2010. IEEE Computer Society.
[7] P. Awasthi, A. Blum, and O. Sheffet. Center-based clustering under perturbation stability. Inf. Process. Lett., 112(1-2):49–54, January 2012.
[8] M.-F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, pages 1068–1077, 2009.
