Analysis of Ward's Method

Anna Großwendt; Heiko Röglin; Melanie Schmidt

arXiv:1907.05094 · cs.DS · July 12, 2019

TL;DR

This paper analyzes Ward's hierarchical clustering method for the $k$-means problem, providing approximation guarantees under separation and balance conditions, and establishing bounds in various dimensions.

Contribution

It offers the first theoretical analysis of Ward's method for $k$-means, showing approximation ratios and bounds under specific data conditions.

Findings

01. Ward's method achieves a 2-approximation under separation.

02. Full optimal recovery with balance condition.

03. Lower bounds in high dimensions without separation.

Abstract

We study Ward's method for the hierarchical $k$-means problem. This popular greedy heuristic is based on the complete linkage paradigm: Starting with all data points as singleton clusters, it successively merges two clusters to form a clustering with one cluster less. The pair of clusters is chosen to (locally) minimize the $k$-means cost of the clustering in the next step. Complete linkage algorithms are very popular for hierarchical clustering problems, yet their theoretical properties have been studied relatively little. For the Euclidean $k$-center problem, Ackermann et al. show that the $k$-clustering in the hierarchy computed by complete linkage has a worst-case approximation ratio of $\Theta(\log k)$. If the data lies in $\mathbb{R}^{d}$ for constant dimension $d$, the guarantee improves to $\mathcal{O}(1)$, but the $\mathcal{O}$-notation hides a linear dependence on $d$.…

Equations

$\Delta_{k}(\mathcal{H})=\sum_{Q\in\mathcal{H}_{n-k}}\Delta(Q,\mu(Q))=\sum_{Q\in\mathcal{H}_{n-k}}\Delta(Q).$

$P_{d}=\{(x_{1},\dots,x_{d})\mid x_{1}\in\{-1,-(\sqrt{2}-1),\sqrt{2}-1,1\},\ x_{i}\in\{-z_{i},z_{i}\}\ \forall i\in\{2,\dots,d\}\}.$

$\operatorname{opt}_{k}(P_{d})=2^{d}\cdot(2-\sqrt{2})^{2}.$

$\frac{2^{i-1}\cdot 2^{i-1}}{2^{i-1}+2^{i-1}}\cdot(2z_{i})^{2}=2^{i}z_{i}^{2}.$

$2^{d}\cdot(1+z_{2}^{2}+\dots+z_{d}^{2})=2^{d+1}\cdot z_{d+1}^{2}=2\cdot 3^{d-1}.$

$\mathrm{Ward}_{k}(P_{d})=4\cdot 3^{d-1}+2^{d-1}(2-\sqrt{2})^{2}-2^{d}.$

$\frac{\mathrm{Ward}_{k}(P_{d})}{\operatorname{opt}_{k}(P_{d})}=\frac{4}{3(2-\sqrt{2})^{2}}\cdot\left(\frac{3}{2}\right)^{d}+\frac{1}{2}-\frac{1}{(2-\sqrt{2})^{2}}\in\Omega\left(\left(\frac{3}{2}\right)^{d}\right).$

$(A^{\prime},B^{\prime})=(A_{s+1},B_{s+1}),\dots,(A_{t-1},B_{t-1}),(A_{t+1},B_{t+1}),\dots,(A_{n-1},B_{n-1})$

$\Delta(A\cup B\cup C\cup D)\leq\Delta(A)+3\cdot\Delta(B\cup C)+\Delta(D)+4\cdot D(A,B)+4\cdot D(C,D)$

$D(A\cup B,C\cup D)\leq 3\cdot\Delta(B\cup C)+3\cdot D(A,B)+3\cdot D(C,D)-\Delta(B)-\Delta(C).$

$\Delta(N)\leq\sum_{h=a-1}^{m-1}D(x_{h},x_{h+1})+\sum_{h=m+1}^{b}D(x_{h},x_{h+1}).$

$\Delta(N)\leq 8\cdot(\Delta(\{x_{\ell},\dots,x_{m}\})+\Delta(\{x_{m+1},\dots,x_{r}\})).$

$\Delta(F)\leq 35\cdot(\Delta(\{x_{\ell},\dots,x_{m}\})+\Delta(\{x_{m+1},\dots,x_{r}\})).$

$\Delta(F)\leq 35\cdot(\Delta(\{x_{\ell},\dots,x_{m}\})+\Delta(\{x_{m+1},\dots,x_{r+1}\})),$

$\Delta(F)\leq 35\cdot(\Delta(\{x_{\ell-1},\dots,x_{m}\})+\Delta(\{x_{m+1},\dots,x_{r}\})).$

$\sum_{A\in\mathcal{C}_{5}}\Delta(A)\leq\mathcal{O}(1)\cdot\operatorname{opt}_{k}.$

$\Delta(W_{i}\cup W_{i+1})\leq\dots+4D(W_{i}\cap O_{j},W_{i}\cap O_{j+1})+4D(W_{i+1}\cap O_{j+1},W_{i+1}\cap O_{j+2}).$

Taxonomy

Topics: Facility Location and Emergency Management · Advanced Clustering Algorithms Research · Stochastic Gradient Optimization Techniques

Full text

Analysis of Ward’s Method

This research was supported by ERC Starting Grant 306465 (BeyondWorstCase).

Anna Großwendt, Heiko Röglin, Melanie Schmidt

Department of Computer Science, University of Bonn, Bonn, Germany

Abstract

We study Ward’s method for the hierarchical $k$-means problem. This popular greedy heuristic is based on the complete linkage paradigm: Starting with all data points as singleton clusters, it successively merges two clusters to form a clustering with one cluster fewer. The pair of clusters is chosen to (locally) minimize the $k$-means cost of the clustering in the next step.

Complete linkage algorithms are very popular for hierarchical clustering problems, yet their theoretical properties have been studied relatively little. For the Euclidean $k$-center problem, Ackermann et al. [1] show that the $k$-clustering in the hierarchy computed by complete linkage has a worst-case approximation ratio of $\Theta(\log k)$. If the data lies in $\mathbb{R}^{d}$ for constant dimension $d$, the guarantee improves to $\mathcal{O}(1)$ [23], but the $\mathcal{O}$-notation hides a linear dependence on $d$. Complete linkage for $k$-median or $k$-means has not been analyzed so far.

In this paper, we show that Ward’s method computes a $2$-approximation with respect to the $k$-means objective function if the optimal $k$-clustering is well separated. If the optimal clustering additionally satisfies a balance condition, then Ward’s method fully recovers the optimum solution. These results hold in arbitrary dimension. We accompany our positive results with a lower bound of $\Omega((3/2)^{d})$ for data sets in $\mathbb{R}^{d}$ that holds if no separation is guaranteed, and with lower bounds when the guaranteed separation is not sufficiently strong. Finally, we show that Ward produces an $\mathcal{O}(1)$-approximate clustering for one-dimensional data sets.

1 Introduction

Clustering is a fundamental tool in machine learning. As an unsupervised learning method, it provides an easy way to gain insight into the structure of data without the need for expert knowledge to start with. One of the most popular clustering objectives is $k$ -means: Given a set $P$ of points in the Euclidean space $\mathbb{R}^{d}$ , find $k$ centers that minimize the sum of the squared distances of each point in $P$ to its closest center. The objective is also called sum of squared errors, since the centers can serve as representatives, and then the sum of the squared distances becomes the squared error of this representation.

Theory has focused on metric objective functions for a long time: Facility location or $k$ -median are very well understood, with upper and lower bounds on the best possible approximation guarantee slowly approaching one another. The $k$ -means cost function is arguably more popular in practice, yet its theoretical properties were long not the topic of much analysis. In the last decade, considerable efforts have been made to close this gap.

We now know that $k$ -means is NP-hard, even in the plane [31] and also even for two centers [3]. The problem is also APX-hard [9], and the currently best approximation algorithm achieves an approximation ratio of 6.357 [2]. The best lower bound, though, is only 1.0013 [27]. A seminal paper on $k$ -means is the derivation of a practical approximation algorithm, $k$ -means++, which is as fast as the most popular heuristic for the problem (the local search algorithm due to Lloyd [29]), has an upper bound of $\mathcal{O}(\log k)$ on the expected approximation ratio, and has proven to significantly improve the performance on actual data [6]. Due to its simplicity and superior performance, it (or variants of it) can now be seen as the de facto standard initialization for Lloyd’s method.

From a practical point of view, however, there is still one major drawback of using $k$ -means++ and Lloyd’s method, and this has nothing to do with its approximation ratio or speed. Before using any method that strives to optimize $k$ -means, one has to determine the number $k$ of clusters. If one knows very little about the data at hand, then even this might pose a challenge. Indeed, there are several suggestions how to set $k$ , which usually look at the tradeoff between the number of clusters and the cost (which decreases if the number of clusters is increased). For example, the elbow method searches for a point where the cost decreases dramatically, arguing that this happens only at the point of the true number of clusters. However, there are many more methods to choose from (see for example the summary in §5 of [37]). Notice that one usually needs to compute multiple clusterings for different $k$ to use such a method.

However, there is a simpler and popular method available: hierarchical clustering. Instead of computing clusterings for several different numbers of clusters and comparing them, one computes one clustering tree (a dendrogram), which contains a clustering for every value of $k$ . For any $k\in[n-1]$ , the $k$ -clustering in such a tree results from the $(k+1)$ -clustering in the same tree by merging two clusters. The hierarchical clustering does not only provide an answer for every $k$ , it also allows the user to view the data at different levels of granularity. A hierarchical clustering is apparently something very desirable, but the question is: Can the solutions be good for all values of $k$ ? Do we lose much by forcing the hierarchical structure?

Dasgupta and Long [21] were the first to give positive and negative answers to this question. Their analysis revolves around the (metric) $k$-center problem, which is to minimize the maximum radius of any cluster. They compare the $k$-center cost on each level of a hierarchical clustering to an optimal clustering with the best possible radius with the same number of clusters and look for the level with the worst factor. It turns out that popular heuristics for hierarchical clustering can be off by a factor of $\log k$ or even $k$ compared to an optimal clustering. Dasgupta and Long also propose a clever adaptation of the $2$-approximation for $k$-center due to González [22], which results in a hierarchical clustering algorithm. For this algorithm, they can guarantee that the solution is an $8$-approximation of the optimum on every level of the hierarchy simultaneously.

In a series of works, Mettu and Plaxton [34], Plaxton [36], and finally Lin, Nagarajan, Rajaraman, and Williamson [28] develop and refine algorithms for the hierarchical $k$-median problem, which can be seen as the metric cousin of the hierarchical $k$-means problem. It consists of minimizing the sum of the distances of every point to its closest center, and is usually studied in metric spaces. The best known approximation guarantee is $20.06$. However, the quality guarantee vastly deteriorates for $k$-means: An $\mathcal{O}(1)$-approximation for the hierarchical $k$-means problem follows from [36, 34] as well as from [28], but the approximation ratios range between $961$ and $3662$.

On the practical side, however, there is a long known greedy algorithm for the hierarchical $k$ -means problem, named Ward’s method [39]. In the fashion of complete linkage algorithms, it does the following. It starts with singleton clusters, one for each data point from the input $P\subset\mathbb{R}^{d}$ . Then it performs $|P|-1$ iterations where two clusters in the current clustering are merged (this is called agglomerative clustering). In each iteration, it chooses the pair of clusters which results in the cheapest clustering. This is a locally optimal choice only, since the optimal merge in one operation may prove to be a poor choice with respect to a later level of the hierarchy.

To the best of the authors’ knowledge, the worst-case quality of Ward’s method has not been studied so far. In particular, it was not known whether the algorithm can be used to compute constant-factor approximations. We answer this question negatively by giving a family of examples with increasing $k$ and $d$ where the approximation factor of Ward is $\Omega((3/2)^{d})$ .

To explain the algorithm’s popularity, we then proceed to study it under different clusterability assumptions. Clustering problems are usually NP-hard and even APX-hard, yet clustering is routinely solved in practical applications. This discrepancy has led to the hypothesis that data sets are either easy to cluster, or they have little interesting structure to begin with. ‘Well-clusterable data sets are computationally easy to cluster’ [14] and ‘Clustering is difficult only when it does not matter’ [18] are two slogans summarizing this idea. Following it, many notions have been developed that strive to capture how well a data set is clusterable. One such notion is center separation [15]: A data set $P\subset\mathbb{R}^{d}$ is $\delta$-center separated for some number $k$ of clusters if the distance between any pair of clusters in the target clustering is at least $\delta$ times the maximal radius of one of the clusters. It satisfies the similar $\alpha$-center proximity [8] for $k$ if in the optimum $k$-clustering the distance of each data point to any center except for its own is larger by a factor of at least $\alpha$ than the distance to its own center.
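The proximity condition can be checked mechanically for a given clustering. The following is a minimal sketch, not part of the paper: the function name `satisfies_center_proximity` is hypothetical, the centers are assumed to be the cluster centroids, and the condition is read as a non-strict inequality on (unsquared) Euclidean distances. It tests a clustering supplied by the caller and does not compute the optimum clustering itself.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def satisfies_center_proximity(clusters, alpha):
    """Check alpha-center proximity for a given clustering (list of point lists).

    Assumption: every point x must satisfy
    dist(x, other center) >= alpha * dist(x, own center).
    """
    centers = [centroid(c) for c in clusters]
    for i, cluster in enumerate(clusters):
        for x in cluster:
            own = math.dist(x, centers[i])
            for j, c in enumerate(centers):
                if j != i and math.dist(x, c) < alpha * own:
                    return False
    return True

# Two tight, far-apart clusters on a line.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
print(satisfies_center_proximity(clusters, 3 + 2 * math.sqrt(2)))  # threshold used in the paper
print(satisfies_center_proximity(clusters, 25.0))                  # far stricter requirement
```

In this toy instance every point is at distance 0.5 from its own centroid and at least 9.5 from the other one, so the condition holds for any $\alpha\leq 19$.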

We apply these notions to hierarchical clustering by showing that if there is a well-separated optimum solution for a level, then the clustering computed by Ward on this level is a $2$-approximation. This means that Ward finds good clusterings for all levels of granularity that have a meaningful clustering; and these good clusterings have a hierarchical structure. For levels on which the sizes of the optimal clusters are additionally to some extent balanced, we prove that Ward even computes the optimum clustering.

Related work.

The design of hierarchical clustering algorithms that satisfy per-level guarantees started with the paper by Dasgupta and Long [21]. They give a deterministic $8$ -approximation and a randomized $2e$ -approximation for hierarchical $k$ -center. Their method turns González’ algorithm [22] into a hierarchical clustering algorithm. González’ algorithm is a $2$ -approximation not only for $k$ -center, but also for the incremental $k$ -center problem: Find an ordering of all points, such that for all $k$ , the first $k$ points in the ordering approximately minimize the $k$ -center cost. The idea to make an algorithm for incremental clustering hierarchical was picked up by Plaxton [36], who proves that this approach leads to a constant factor approximation for the hierarchical $k$ -median problem. He uses an incremental $k$ -median algorithm due to Mettu and Plaxton [34]. Finally, Lin, Nagarajan, Rajaraman and Williamson [28] propose a general framework for approximating incremental problems that also works for incremental variants of MST, vertex cover, and set cover. They also cast hierarchical $k$ -median and $k$ -means into their framework for incremental approximation. They get a randomized/deterministic $20.06/41.42$ -approximation for hierarchical $k$ -median and a randomized/deterministic $151.1\alpha/576\alpha$ -approximation for $k$ -means, where $\alpha$ is the approximation ratio of a $k$ -means approximation algorithm. Thus, applying [2] yields guarantees of $961$ and $3662$ , respectively.

Lattanzi, Leonardi, Mirrokni, and Razenshteyn [26] develop a constant factor algorithm for robust hierarchical $k$ -center, i.e., a variant with outliers. In a different line of work, Dasgupta recently developed a new cost function for similarity-based hierarchical clusterings [20]. Although it can be transferred to the setting of dissimilarity measures, this yields an objective for which any solution is a constant factor approximation [17]. Work on this new cost function includes [16, 17, 20]. Balcan et al. present an algorithm for computing hierarchical clusterings that clusters the data accurately in the presence of outliers if the data satisfies certain clusterability properties [11, 13].

In practice, $k$ -means and hierarchical $k$ -means are rather tackled by popular heuristics, but the properties of these algorithms are often unknown. The famous $k$ -means algorithm due to Lloyd [30] was analyzed about ten years ago and became the subject of many papers, including [4, 5, 7, 19, 33, 35, 38]. This has led to the development of $k$ -means++ [6], a practically efficient algorithm with a theoretical approximation guarantee of $\mathcal{O}(\log k)$ .

Hierarchical clustering algorithms work either top-down (divisive methods) or bottom-up (agglomerative methods). Agglomerative methods are more popular because they are usually faster, and the most popular agglomerative methods are based on the complete linkage strategy. Here, the clusters to be merged are those which minimize the cost of the clustering in the next step. Using complete linkage for $k$ -means yields Ward’s method [39].

There is a relatively small number of papers studying the performance of complete linkage algorithms. Dasgupta and Long [21] establish the above-mentioned $\log k$ lower bound for $k$-center. Ackermann, Blömer, Kuntze, and Sohler [1] study complete linkage for variants of $k$-center in the Euclidean space. The variants include minimizing the radius, the discrete radius and the diameter. They show that for constant dimension, complete linkage provides $\mathcal{O}(\log k)$-approximations for $k$-center as well as all variants of it. The drawback is that the approximation factor depends on the dimension of the space (the extent of the dependence goes from linear dependence to doubly exponential dependence, depending on the problem variant). Großwendt and Röglin [23] improve the analysis, showing that for constant dimension, complete linkage indeed provides an $\mathcal{O}(1)$-approximation. The dependencies on $d$ prevail.

Balcan, Liang, and Gupta [13] observe that Ward’s method cannot be used to recover a given target clustering.

There is a vast body of literature on clusterability assumptions, i.e., assumptions on the input that make clustering easier either in the sense that a target clustering can be (partially) recovered or that a good approximation of an objective function can be computed efficiently. A survey of recent work in this area can be found in [14]. Particularly relevant for our paper are the notions of $\delta$ -center separation [15] and $\alpha$ -center proximity [8] mentioned above. There are several papers showing that under these assumptions it is possible to recover the target/optimal clustering if $\delta$ and $\alpha$ are sufficiently large [8, 12, 25, 32]. Other notions include the strict separation property of Balcan, Blum, and Vempala [11], the $\epsilon$ -separation property of Ostrovsky et al. [35], and the weaker version of the proximity condition due to Kumar and Kannan [24] which Awasthi and Sheffet [10] proposed (it is based on the spectral norm of a matrix whose rows are the difference vectors between the points in the data set and their centers). For all these notions of clusterability, algorithms are developed that (partially) recover the target/optimal clustering.

Our results.

In §3, we analyze the approximation factor of Ward’s method on data sets that satisfy different well-known clusterability notions. It turns out that the assumption that the input satisfies a high $\delta$ -center separation [15] or $\alpha$ -center proximity [8] implies a very good bound on the approximation guarantee of Ward’s method. We show that Ward’s method computes a $2$ -approximation for all values of $k$ for which the input data set satisfies $(2+2\sqrt{2})$ -center separation or $(3+2\sqrt{2})$ -center proximity. We also show that on instances that satisfy $(2+2\sqrt{2\nu})$ -center separation and for which all clusters $O_{i}$ and $O_{j}$ in the optimal clustering satisfy $|O_{j}|\geq|O_{i}|/\nu$ , Ward even recovers the optimal clustering.

In §4 we show that, in general, Ward’s method does not achieve a constant approximation factor. We present a family of instances $(P_{d})_{d\in\mathbb{N}}$ with $P_{d}\subset\mathbb{R}^{d}$ on which the cost of the $2^{d}$-clustering computed by Ward is larger than the cost of the optimal $2^{d}$-means clustering of $P_{d}$ by a factor of $\Omega((3/2)^{d})$. Then we observe that the family of instances used for this lower bound satisfies the strict separation property of Balcan, Blum, and Vempala [11], the $\epsilon$-separation property of Ostrovsky et al. [35] for any $\epsilon>0$, and the separation condition from Awasthi and Sheffet [10]. Hence, none of these three notions of clusterability prevents the approximation factor of Ward’s method from growing exponentially with the dimension.

Finally in §5 we show that the approximation ratio of Ward’s method on one-dimensional inputs is $\mathcal{O}(1)$ . The one-dimensional case turns out to be more tricky than one would expect, and our analysis is quite complex and technically challenging.

Preliminaries.

We consider inputs in the Euclidean space $\mathbb{R}^{d}$ . The Euclidean distance of $x_{1},x_{2}\in\mathbb{R}^{d}$ is denoted by $||x_{1}-x_{2}||=||x_{1}-x_{2}||_{2}$ . Let $P\subset\mathbb{R}^{d}$ be a finite set of points. For any center $c\in\mathbb{R}^{d}$ , we denote the sum of the squared distances of each point in $P$ to $c$ by $\Delta(P,c)=\sum_{p\in P}||p-c||^{2}.$ This sum is minimized when the center is the centroid $\mu(P):=\frac{1}{|P|}\sum_{p\in P}p$ of $P$ . We set $\Delta(P):=\Delta(P,\mu(P))$ . For any set of $k$ centers $C\subset\mathbb{R}^{d}$ , the $k$ -means objective cost is $\Delta(P,C)={\sum_{p\in P}}\min_{c\in C}||p-c||^{2}.$ The $1$ -means cost of $P$ is $\Delta(P)$ . If $P$ is weighted with a weight function $w:P\to\mathbb{N}_{\geq 1}$ , then we denote the total weight by $w(P):=\sum_{x\in P}w(x)$ and extend the above notations by $\mu(P,w)=\frac{1}{w(P)}\sum_{x\in P}w(x)x$ , $\Delta(P,w,c)=\sum_{x\in P}w(x)||x-c||^{2}$ , and $\Delta(P,w)=\Delta(P,w,\mu(P,w))$ . The weighted $k$ -means objective is $\Delta(P,w,C)=\sum_{x\in P}\min_{c\in C}w(x)||x-c||^{2}$ . We denote by $\operatorname{opt}_{k}(P)$ / $\operatorname{opt}_{k}(P,w)$ the value of a solution that minimizes the (weighted) $k$ -means objective, i.e., $\operatorname{opt}_{k}(P)=\min_{C\subset\mathbb{R}^{d},|C|=k}\Delta(P,C)$ and $\operatorname{opt}_{k}(P,w)=\min_{C\subset\mathbb{R}^{d},|C|=k}\Delta(P,w,C)$ , respectively.
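As an illustration of the notation above, here is a minimal sketch of the unweighted quantities $\mu(P)$, $\Delta(P,c)$, $\Delta(P)$, and $\Delta(P,C)$ in plain Python. The function names are chosen for this example only and do not come from the paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance ||p - q||^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """mu(P): coordinate-wise mean of a non-empty point set."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def cost_to_center(points, c):
    """Delta(P, c): sum of squared distances of all points to c."""
    return sum(sq_dist(p, c) for p in points)

def one_means_cost(points):
    """Delta(P) = Delta(P, mu(P))."""
    return cost_to_center(points, centroid(points))

def kmeans_cost(points, centers):
    """Delta(P, C): every point pays for its closest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
print(centroid(P))                                  # (6.0, 0.0)
print(one_means_cost(P))                            # 104.0
print(kmeans_cost(P, [(1.0, 0.0), (11.0, 0.0)]))    # 4.0
```

Going from one center to two well-placed centers drops the cost from 104 to 4 on this toy input, which is the kind of cost/cluster-count tradeoff discussed in the introduction.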

We use the abbreviation $[i]=\{1,\ldots,i\}$ for $i\in\mathbb{N}$ .

Hierarchical clustering.

As described by Dasgupta and Long [21], a hierarchical clustering is a nested partitioning of a point set $P$ into $1,2,3,\ldots$ and finally $n$ clusters, where each intermediate clustering is a more fine-grained version of the previous clustering that results from dividing one cluster into two. This definition is ‘top-down’. Complete linkage algorithms build the hierarchical clustering ‘bottom-up’ by starting with $n$ singleton clusters and then subsequently merging two clusters into one until only one cluster remains. We will adopt this view and define a hierarchical clustering $\mathcal{H}$ as a sequence of partitionings $\mathcal{H}_{0},\ldots,\mathcal{H}_{n-1}$, where $\mathcal{H}_{0}=\{\{x\}\mid x\in P\}$ and $\mathcal{H}_{n-1}=\{P\}$, i.e., $\mathcal{H}_{i}$ shall be the clustering after $i$ merges. The intermediate partitionings satisfy that $\mathcal{H}_{i}=\mathcal{H}_{i-1}\backslash\{A_{i},B_{i}\}\cup\{A_{i}\cup B_{i}\}$ for two clusters $A_{i},B_{i}\in\mathcal{H}_{i-1}$. Note that we can fully describe $\mathcal{H}$ by the sequence of the $n-1$ merge operations $(A_{1},B_{1}),(A_{2},B_{2}),\ldots,(A_{n-1},B_{n-1})$ that it implicitly contains.

A hierarchical clustering contains a $k$ -clustering for any $k\in\{1,\ldots,n\}$ . The clusterings are given as partitionings, the centers are implicitly defined as the centroids. More precisely, the $k$ -clustering defined by a hierarchical clustering $\mathcal{H}$ has the centers $\{\mu(Q)\mid Q\in\mathcal{H}_{n-k}\}$ . We thus define the $k$ -means clustering cost of $\mathcal{H}$ for a given $k$ as

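$\Delta_{k}(\mathcal{H})=\sum_{Q\in\mathcal{H}_{n-k}}\Delta(Q,\mu(Q))=\sum_{Q\in\mathcal{H}_{n-k}}\Delta(Q).$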
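The merge-sequence view of a hierarchical clustering translates directly into code. The sketch below is an illustration under assumptions: points are referred to by their index, a merge is a pair of index sets, and the helper names `partition_after` and `hierarchy_cost` are made up for this example. It rebuilds $\mathcal{H}_{n-k}$ from the first $n-k$ merges and sums the $1$-means costs of its clusters.

```python
def delta(points):
    """1-means cost Delta(Q): squared distances to the centroid of Q."""
    n = len(points)
    mu = [sum(c) / n for c in zip(*points)]
    return sum(sum((a - m) ** 2 for a, m in zip(p, mu)) for p in points)

def partition_after(n, merges, i):
    """H_i: the partition of the indices 0..n-1 after the first i merges."""
    clusters = {frozenset([j]) for j in range(n)}
    for a, b in merges[:i]:
        a, b = frozenset(a), frozenset(b)
        if a not in clusters or b not in clusters:
            raise ValueError("a merge must join two clusters of the current partition")
        clusters -= {a, b}
        clusters.add(a | b)
    return clusters

def hierarchy_cost(points, merges, k):
    """k-means cost of the k-clustering H_{n-k} contained in the hierarchy."""
    n = len(points)
    return sum(delta([points[j] for j in Q]) for Q in partition_after(n, merges, n - k))

points = [(0.0,), (1.0,), (5.0,), (6.0,)]
merges = [({0}, {1}), ({2}, {3}), ({0, 1}, {2, 3})]
for k in (4, 3, 2, 1):
    print(k, hierarchy_cost(points, merges, k))
```

For this input the costs are 0, 0.5, 1 and 26 for $k=4,3,2,1$, so the cost grows as clusters are merged.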

Useful Facts about $k$ -means.

The following two facts are well known.

Lemma 1 (Relaxed triangle inequality).

For all $x,y,z\in\mathbb{R}^{d}$ , $||x-y||^{2}\leq 2(||x-z||^{2}+||z-y||^{2}).$
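For completeness, a short derivation of Lemma 1 (a standard argument, not taken from the paper): it combines the ordinary triangle inequality with $2ab\leq a^{2}+b^{2}$ for real numbers $a,b$.

```latex
\begin{align*}
\|x-y\|^{2}
  &\leq \bigl(\|x-z\|+\|z-y\|\bigr)^{2} \\
  &= \|x-z\|^{2}+2\,\|x-z\|\,\|z-y\|+\|z-y\|^{2} \\
  &\leq 2\bigl(\|x-z\|^{2}+\|z-y\|^{2}\bigr).
\end{align*}
```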

Lemma 2.

For any finite point set $P\subset\mathbb{R}^{d}$ and any $c\in\mathbb{R}^{d}$ , $\Delta(P,c)=\Delta(P)+|P|\cdot||c-\mu(P)||^{2}.$
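Lemma 2 follows from a direct expansion. The standard calculation is sketched here with the abbreviation $\mu=\mu(P)$; the mixed term vanishes because $\sum_{p\in P}(p-\mu)=0$ by the definition of the centroid.

```latex
\begin{align*}
\Delta(P,c)
  &= \sum_{p\in P}\|(p-\mu)+(\mu-c)\|^{2} \\
  &= \sum_{p\in P}\|p-\mu\|^{2}
     + 2\Bigl\langle \sum_{p\in P}(p-\mu),\, \mu-c \Bigr\rangle
     + |P|\cdot\|\mu-c\|^{2} \\
  &= \Delta(P)+|P|\cdot\|c-\mu\|^{2}.
\end{align*}
```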

Lemma 2 has the following important consequence. Whenever a set of points $P^{\prime}$ is clustered together, i.e., all points in it are assigned to the same center in a given solution, then the cost for this assignment can be computed by knowing only the centroid of the point set and $\Delta(P^{\prime})$ . Thus, we can treat such a $P^{\prime}$ as one weighted point with some additional constant cost. This view is very helpful to simplify the analysis of agglomerative hierarchical clustering strategies.
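The "weighted point plus constant" view can be made concrete. In this sketch (the names `summarize` and `cost_from_summary` are illustrative, not from the paper), a cluster is reduced to the triple $(|P^{\prime}|,\mu(P^{\prime}),\Delta(P^{\prime}))$, and Lemma 2 then gives its cost with respect to any center without touching the individual points again.

```python
def summarize(points):
    """Reduce a cluster to (size, centroid, 1-means cost)."""
    n = len(points)
    mu = tuple(sum(c) / n for c in zip(*points))
    delta = sum(sum((a - m) ** 2 for a, m in zip(p, mu)) for p in points)
    return n, mu, delta

def cost_from_summary(summary, c):
    """Delta(P', c) via Lemma 2: Delta(P') + |P'| * ||c - mu(P')||^2."""
    n, mu, delta = summary
    return delta + n * sum((a - b) ** 2 for a, b in zip(c, mu))

P = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0), (2.0, -3.0)]
c = (5.0, 1.0)

# Direct evaluation of Delta(P, c) for comparison.
direct = sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in P)

print(cost_from_summary(summarize(P), c), direct)
```

Both values agree (here $26+4\cdot 10=66$), which is why agglomerative methods only need to keep these three numbers per cluster.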

Ward’s method.

Ward’s method (or simply Ward in the following) is a greedy algorithm. To describe it, the easiest way is to define the following quantity that describes how much the sum of the $1$ -means costs increases when merging two clusters.

Definition 3.

Let $A,B\subset\mathbb{R}^{d}$ be two finite point sets. We define $D(A,B)=\Delta(A\cup B)-\Delta(A)-\Delta(B)$ . If a set contains only one point, e.g., $A=\{a\}$ , we slightly abuse notation and write $D(a,B)=D(\{a\},B)$ (similarly, if $A=\{a\}$ and $B=\{b\}$ , we write $D(a,b)=D(\{a\},\{b\})$ ).

Ward’s method is agglomerative. It starts with $n$ singleton clusters. Then in every step, it greedily chooses two clusters $A,B$ in the current clustering for which $D(A,B)$ is minimal. This choice is optimal for the next clustering, but subsequent merges and clusterings may suffer from it. We denote the costs of the $k$ -clustering computed by Ward’s method on data set $P$ by $\mathrm{Ward}_{k}(P)$ .
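The greedy rule just described can be sketched as follows. This is a deliberately naive illustration that evaluates $D(A,B)$ from Definition 3 for every pair in every step; it is not the authors' implementation and makes no attempt at efficiency. Ties are broken by pair order, which is an arbitrary choice of this sketch.

```python
from itertools import combinations

def delta(points):
    """1-means cost Delta(P): squared distances to the centroid."""
    n = len(points)
    mu = [sum(c) / n for c in zip(*points)]
    return sum(sum((a - m) ** 2 for a, m in zip(p, mu)) for p in points)

def merge_cost(A, B):
    """D(A, B) = Delta(A u B) - Delta(A) - Delta(B)."""
    return delta(A + B) - delta(A) - delta(B)

def ward(points, k):
    """Merge greedily until k clusters remain; return (clusters, merge costs)."""
    clusters = [[p] for p in points]
    costs = []
    while len(clusters) > k:
        # Pick the pair whose merge increases the total cost the least.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        costs.append(merge_cost(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters, costs

P = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters, costs = ward(P, 2)
ward_cost = sum(delta(c) for c in clusters)   # Ward_2(P)
print(clusters, costs, ward_cost)
```

Because singleton clusters cost nothing and each merge adds exactly $D(A,B)$, the final cost equals the sum of the recorded merge costs.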

2 Techniques and Observations

2.1 Upper Bounds: Proof Technique in a Nutshell

Let us give an overview of the basic idea underlying our proof that Ward’s method computes a $2$ -approximation for all values of $k$ for which the input data set satisfies $(2+2\sqrt{2})$ -center separation or $(3+2\sqrt{2})$ -center proximity. The main challenge is to relate the cost of the $k$ -clustering computed by Ward to the cost of an optimal $k$ -clustering. For this, we fix an arbitrary optimal $k$ -clustering $O_{1},\ldots,O_{k}$ . Consider an arbitrary cluster $O_{j}$ and let $P_{1}^{j},\ldots,P_{n_{j}}^{j}$ be the data points $O_{j}$ consists of (in the actual proof, $P_{i}^{j}$ is defined slightly differently). We consider the set $\mathcal{S}_{j}=\{\{P_{1}^{j},P_{2}^{j}\},\{P_{2}^{j},P_{3}^{j}\},\ldots,\{P_{n_{j}-1}^{j},P_{n_{j}}^{j}\}\}$ of merges. Observe that the merges in $\mathcal{S}_{j}$ cannot be applied one after another because after the first merge $\{P_{1}^{j},P_{2}^{j}\}$ the singleton point $P_{2}^{j}$ is gone, which is to be merged in the second merge $\{P_{2}^{j},P_{3}^{j}\}$ . Since it is possible to do every second merge of $\mathcal{S}_{j}$ , one can argue that all merges in $\mathcal{S}_{j}$ together cost at most $2\Delta(O_{j})$ . Now let $\mathcal{S}=\cup_{j}\mathcal{S}_{j}$ . Then all merges in $\mathcal{S}$ together cost at most $2\operatorname{opt}_{k}$ .

The next step is then to construct a bijection between the set $\mathcal{S}_{\mathrm{Ward}}$ of the $n-k$ merges performed by Ward to form a $k$-clustering and the set $\mathcal{S}$. This bijection has the property that every merge of Ward is at most as expensive as the merge in $\mathcal{S}$ assigned to it. This implies that Ward computes a clustering with cost at most $2\operatorname{opt}_{k}$. In order to construct this bijection, consider a step of Ward in which two clusters $A$ and $B$ are merged. Let $\mathcal{C}$ denote the current clustering directly before this merge happens, and let $\mathcal{S}_{\mathcal{C}}\subseteq\mathcal{S}$ denote the set of those merges from $\mathcal{S}$ that are feasible in $\mathcal{C}$ and unassigned, i.e., those merges for which both clusters are contained in $\mathcal{C}$ and that have not been assigned to any previous merge of Ward. We know that any merge from $\mathcal{S}_{\mathcal{C}}$ is at least as expensive as the merge of $A$ and $B$ because Ward chooses the next merge greedily. Hence, in the bijection we can map the merge of $A$ and $B$ to an arbitrary merge from $\mathcal{S}_{\mathcal{C}}$. This implies that if $\mathcal{S}_{\mathcal{C}}$ is non-empty in every step, the bijection can be constructed. Since $|\mathcal{S}|=|\mathcal{S}_{\mathrm{Ward}}|$, this can only be guaranteed if every merge of Ward decreases the number of available merges in $\mathcal{S}$ by only 1. One can show that this follows from the separation assumption.
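Written as a chain of inequalities, the charging argument reads as follows. The symbol $\varphi$ for the bijection is introduced here for illustration only. The first equality holds because singleton clusters have cost zero and each merge $(A_{i},B_{i})$ raises the total cost by exactly $D(A_{i},B_{i})$; the $k$-clustering is $\mathcal{H}_{n-k}$, so the sum ranges over the merges leading to it.

```latex
\begin{align*}
\mathrm{Ward}_{k}(P)
  &= \sum_{i=1}^{n-k} D(A_{i},B_{i})
   \leq \sum_{i=1}^{n-k} D\bigl(\varphi(A_{i},B_{i})\bigr) \\
  &= \sum_{\{X,Y\}\in\mathcal{S}} D(X,Y)
   \leq 2\operatorname{opt}_{k}.
\end{align*}
```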

For the one-dimensional case, the basic approach is similar. The main difference is that without separation, we can no longer guarantee that the number of available merges decreases by only $1$ with every step of Ward. Indeed, the original set $\mathcal{S}$ of good merges may be empty after $n-2k$ merges. To bound the cost of the remaining merge steps, we find a new set of (relatively) good merges, i.e., a set of merges whose costs can be bounded by a constant times $\operatorname{opt}_{k}$ . Again, this set may run dry, and we have to start again. Essentially, we show that after a constant number of phases (Ward merges that are charged against a specific set of good merges), Ward has obtained a $k$ -clustering.

Although the basic idea is similar, the technical implementation of the proof for $d=1$ is very different from our proof for well-clusterable data. Every time that Ward does not merge in a way compatible with the optimum clustering, we have to account for all possible consequences. Techniques like reordering help us to organize the proof. We also simplify the instance before the actual proof.

2.2 Useful Statements

Here we discuss some of the technical statements which we feel may be of interest for future work. All omitted proofs in this section can be found in the full version of this paper.

Cost of one step.

The value $D(A,B)$ plays a central role in the analysis of Ward’s method. By using Lemma 2, it is easy to show that $D(A,B)$ does not depend on $\Delta(A)$ or $\Delta(B)$ . The following lemma gives an explicit formula, which leads to convenient upper and lower bounds. These bounds say that the cost of merging two clusters is roughly equivalent to assigning the points of the smaller cluster to the centroid of the larger cluster.

Lemma 4.

Let $A$ and $B$ be two clusters. Then $D(A,B)=\frac{|A||B|}{|A|+|B|}\cdot||\mu_{A}-\mu_{B}||^{2}$ . Furthermore, $\frac{1}{2}\cdot\min\{|A|,|B|\}\cdot||\mu_{A}-\mu_{B}||^{2}\leq D(A,B)\leq\min\{|A|,|B|\}\cdot||\mu_{A}-\mu_{B}||^{2}.$ The left hand side is attained for $|A|=|B|$ , and the right hand side for $\frac{\max\{|A|,|B|\}}{\min\{|A|,|B|\}}\to\infty$ .
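As a quick sanity check of Lemma 4, the following sketch compares the closed form with the direct definition of $D(A,B)$ as the cost increase caused by a merge, and checks the two bounds. The function names and the toy clusters are illustrative assumptions, not part of the paper.

```python
import numpy as np

def cost(points: np.ndarray) -> float:
    """k-means cost Delta(Q): sum of squared distances to the centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def merge_cost(a: np.ndarray, b: np.ndarray) -> float:
    """D(A, B) via the closed form of Lemma 4."""
    na, nb = len(a), len(b)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return na * nb / (na + nb) * float(gap @ gap)

# Hypothetical toy clusters in the plane.
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[5.0, 1.0], [6.0, 1.0], [7.0, 4.0]])

d_formula = merge_cost(A, B)
# Direct definition: increase of the cost when A and B are merged.
d_direct = cost(np.vstack([A, B])) - cost(A) - cost(B)

smaller = min(len(A), len(B))
dist_sq = float(((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum())
lower, upper = 0.5 * smaller * dist_sq, smaller * dist_sq
```

For these clusters both computations give $34.8$, which lies between the lower bound $29$ and the upper bound $58$.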

How cost accumulates.

Notice that whenever Ward makes a decision, it is optimal for the clustering in the next step. Where does its error lie? The problem is that every merge forces the points of two clusters to be in the same cluster for any clustering to come. In later clusterings, the condition to cluster certain points together may induce error. We need a way to bound this error. We prove the following technical statement.

Corollary 5.

Let $A$ , $B$ , and $C$ be three disjoint sets of points with $|A|\leq|B|$ (or $w(A)\leq w(B)$ , for weighted sets). Then $\Delta(A\cup B\cup C)\leq\Delta(A)+3\cdot\Delta(B\cup C)+4\cdot D(A,B)$ and $D(A\cup B,C)\leq 3\cdot\Delta(B\cup C)+3\cdot D(A,B)-\Delta(B)-\Delta(C)$ .
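The two inequalities of Corollary 5 are closely linked. Assuming that $D(X,Y)$ denotes the cost increase of merging $X$ and $Y$, i.e., $\Delta(X\cup Y)=\Delta(X)+\Delta(Y)+D(X,Y)$, the first bound follows from the second one by the following short calculation (a sketch, not necessarily the derivation used in the full version):

```latex
\begin{align*}
\Delta(A\cup B\cup C)
  &= \Delta(A\cup B)+\Delta(C)+D(A\cup B,C)\\
  &= \Delta(A)+\Delta(B)+D(A,B)+\Delta(C)+D(A\cup B,C)\\
  &\leq \Delta(A)+\Delta(B)+D(A,B)+\Delta(C)
        +3\cdot\Delta(B\cup C)+3\cdot D(A,B)-\Delta(B)-\Delta(C)\\
  &= \Delta(A)+3\cdot\Delta(B\cup C)+4\cdot D(A,B).
\end{align*}
```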

To see how Corollary 5 can be used, assume that $A\subset O_{i}$ and $B\subset O_{j}$ belong to different optimum clusters which Ward merged during its execution. Now Corollary 5 tells us something about the compatibility of $A\cup B$ with the optimum clustering. We pick the smaller of the two clusters, say $A$. Assume that we still have some subset of $B$’s optimum cluster, i.e., there is a cluster $C\subset O_{j}$ that is still part of the clustering. Then we can merge $A\cup B$ with $C$. Corollary 5 says that what we lose is proportional to the optimum cost plus the cost that we already invested into our clustering at an earlier time: $\Delta(A)$ and $\Delta(B\cup C)$ are both part of the optimum cost, and $D(A,B)$ is what Ward (cumulatively) already paid for merging $A$ and $B$.

Monotonicity.

Notice that performing arbitrary merge operations is not monotone: Say that $a<b<c$ are one-dimensional points such that the centroid of $a$ and $c$ is $b$ . Then merging $a$ and $c$ first results in a point set where merging with $b$ costs nothing; clearly, this is not monotone. Indeed, when considering a natural variant of Ward’s method for the related $k$ -median problem, monotonicity is not true. Even for a simple isosceles triangle, greedily chosen merges result in non-monotone merge costs. However, Ward’s merges are indeed monotone. We show the following statement by proving a decomposition lemma for $D(A,B)$ .

Corollary 6.

[Monotonicity of Ward’s method] Let $D_{i}$ be the increase of the objective function in the $i$ -th step of Ward’s method. Then $D_{i}\leq D_{j}$ for $i\leq j$ .

Monotonicity is a very helpful property. In the argument discussed in §2.1 we use, e.g., that all merges that are possible in the final $k$ -clustering computed by Ward’s method are at least as expensive as all merges that are performed before by Ward’s method to obtain the $k$ -clustering.
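Corollary 6 can be observed numerically with a naive implementation of Ward's method. The sketch below is an illustrative cubic-time version (not an optimized implementation) that stores only the size and centroid of each cluster, uses the formula of Lemma 4 for the merge cost, and records the cost of every merge. The random input and the seed are arbitrary assumptions.

```python
import itertools
import numpy as np

def merge_cost(size_a, mu_a, size_b, mu_b):
    """D(A, B) from Lemma 4, using only sizes and centroids."""
    gap = mu_a - mu_b
    return size_a * size_b / (size_a + size_b) * float(gap @ gap)

def ward_merge_costs(points):
    """Run a naive Ward's method down to one cluster; return the cost of each merge."""
    clusters = [(1, np.asarray(p, dtype=float)) for p in points]  # (size, centroid)
    costs = []
    while len(clusters) > 1:
        # Greedy step: pick the pair whose merge increases the cost the least.
        i, j = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(*clusters[ij[0]], *clusters[ij[1]]),
        )
        (na, ma), (nb, mb) = clusters[i], clusters[j]
        costs.append(merge_cost(na, ma, nb, mb))
        merged = (na + nb, (na * ma + nb * mb) / (na + nb))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return costs

rng = np.random.default_rng(0)  # arbitrary seed, illustrative data only
points = rng.normal(size=(30, 3))
costs = ward_merge_costs(points)

# Merge costs should be non-decreasing (up to floating-point noise).
is_monotone = all(x <= y + 1e-9 for x, y in zip(costs, costs[1:]))
# The merge costs add up to the cost of the single final cluster.
total = float(((points - points.mean(axis=0)) ** 2).sum())
```

Since the singleton clusters have cost zero, the sum of all merge costs equals the $1$-means cost of the whole point set, which gives a second consistency check.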

Special structures in dimension one.

The following statements only hold for $d=1$ . First we observe that Ward satisfies the following convexity property.

Lemma 7 (Convexity in $\mathbb{R}^{1}$ ).

For any three finite convex clusters $A,B,C\subset\mathbb{R}^{1}$ with $\mu(A)<\mu(C)<\mu(B)$ , we have $D(A,C)<D(A,B)$ or $D(B,C)<D(A,B)$ .

Lemma 7 means that Ward will never merge $A$ and $B$ if a point or cluster lies between them on the line. This establishes that Ward’s clusters never overlap. It gives us a concept of neighbors on the line.
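The consequence of Lemma 7 that Ward only merges neighbors on the line can be checked experimentally. The following sketch keeps the clusters sorted by position and reports whether every greedy merge joins two adjacent clusters. The helper names and the random input are illustrative assumptions, and exact ties between merge costs are assumed not to occur.

```python
import random

def merge_cost_1d(a, b):
    """D(A, B) from Lemma 4 for clusters given as lists of numbers on the line."""
    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (mu_a - mu_b) ** 2

def ward_merges_neighbors_only(points):
    """Run Ward on the line; return True if every merge joins adjacent clusters."""
    clusters = [[p] for p in sorted(points)]  # kept in left-to-right order
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: merge_cost_1d(clusters[ij[0]], clusters[ij[1]]))
        if j != i + 1:
            return False  # a non-neighbor merge would contradict Lemma 7
        clusters[i:j + 1] = [clusters[i] + clusters[j]]  # order on the line is preserved
    return True

random.seed(1)  # arbitrary seed
sample = [random.uniform(0.0, 100.0) for _ in range(40)]
only_neighbors = ward_merges_neighbors_only(sample)
```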

We combine Lemma 7 with a convexity property of Ward (see Corollary 25). This allows us to prove a powerful technique that we call reordering. Say that Ward at some point merges two clusters $A$ and $B$ . Then $A$ and $B$ are neighbors on the line. This means that merging $A$ and $B$ will result in a centroid $\mu(A\cup B)$ which is further away from any other cluster than $\mu(A)$ and $\mu(B)$ are. So, clusters that did not want to merge with $A$ or $B$ would also not merge with $A\cup B$ (by Corollary 25). Thus, we could perform the merge $(A,B)$ earlier without distorting Ward’s course of action at all (except that the merge $(A,B)$ is at the wrong position). This allows us to reorder Ward’s merges for our analysis.

3 Ward on Well-Clusterable Data

Clustering suffers from a general gap between theoretical study and practical application; clustering objectives are usually NP-hard to optimize, and even NP-hard to approximate to arbitrary precision. On the other hand, heuristics like Lloyd’s algorithm, which can produce arbitrarily bad solutions, are known to work well or reasonably well in practice. One way of interpreting this situation is that data often has properties that make the problem computationally easier. Indeed, for clustering it is very natural to assume that the data has some structure – otherwise, what do we hope to achieve with our clustering? The challenge is to find good measures of structure that characterize what makes clustering easy (but non-trivial).

Many notions of clusterability have been introduced in the literature and there are also different ways to measure the quality of a clustering. While traditionally a clustering is evaluated on the basis of an objective function (e.g., the $k$ -means objective function), there has been an increased interest recently to study which notions of clusterability make it feasible to recover (partially) a target clustering, some true clustering of the data. For this, the niceness conditions imposed on the input data are usually some form of separation condition on the clusters of the target clustering. We study the effect of five well-studied clusterability notions on the quality of the solution computed by Ward’s method.

First we study the notions of $\delta$ -center separation and $\alpha$ -center proximity, which have been introduced by Ben-David and Haghtalab [15] and Awasthi, Blum, and Sheffet [8], respectively.

Definition 8 ([15]).

An input $P\subset\mathbb{R}^{d}$ satisfies $\delta$ -center separation with respect to some target clustering $C_{1},\ldots,C_{k}$ if there exist centers $c_{1}^{\ast},\ldots,c_{k}^{\ast}\in\mathbb{R}^{d}$ such that $||c_{j}^{\ast}-c_{i}^{\ast}||\geq\delta\cdot\max_{\ell\in[k]}\max_{x\in C_{\ell}}||x-c_{\ell}^{\ast}||$ for all $i\neq j$ . We say the input satisfies weak $\delta$ -center separation if for each cluster $C_{j}$ with $j\in[k]$ and for all $i\neq j$ , $||c_{j}^{\ast}-c_{i}^{\ast}||\geq\delta\cdot\max_{x\in C_{j}}||x-c_{j}^{\ast}||$ .
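To make the difference between the two variants of Definition 8 concrete, the following sketch computes the largest $\delta$ for which a given target clustering satisfies center separation and weak center separation. It assumes that the centers are the cluster centroids and that every cluster contains at least two distinct points. The function name and the toy clustering are hypothetical.

```python
import numpy as np

def separation_factors(clusters):
    """Return (delta, weak_delta): the largest factors for which center separation
    and weak center separation hold, using the cluster centroids as centers."""
    centers = [c.mean(axis=0) for c in clusters]
    radii = [np.linalg.norm(c - mu, axis=1).max() for c, mu in zip(clusters, centers)]
    # Distance from each center to its nearest other center.
    nearest = [
        min(np.linalg.norm(mu - other) for j, other in enumerate(centers) if j != i)
        for i, mu in enumerate(centers)
    ]
    delta = min(nearest) / max(radii)  # one global radius for all clusters
    weak_delta = min(gap / r for gap, r in zip(nearest, radii))  # per-cluster radius
    return delta, weak_delta

# Hypothetical target clustering on the line (each row is one point).
clusters = [np.array([[0.0], [2.0]]), np.array([[4.0], [6.0]]), np.array([[21.0], [29.0]])]
delta, weak_delta = separation_factors(clusters)
```

In this toy instance the wide cluster is far away from the other two, so the instance only satisfies $1$-center separation but weak $4$-center separation.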

Kushagra, Samadi, and Ben-David [25] show that single linkage and a pruning technique are sufficient to find the target clustering under the condition that the data satisfies $\delta$ -center separation for $\delta\geq 3$ .

While the goal of Ben-David and Haghtalab [15] is to recover a target clustering, we focus in this paper on approximating the $k$ -means objective function. Hence, in the following we will always assume that the target clustering $C_{1},\ldots,C_{k}$ is an optimal $k$ -means clustering (which we usually denote by $O_{1},\ldots,O_{k}$ ) and the centers $c_{1}^{\ast},\ldots,c_{k}^{\ast}\in\mathbb{R}^{d}$ are the optimal $k$ -means centers for this clustering. We will make this assumption also for all other notions of clusterability that are based on a target clustering and that we introduce in the following.

Definition 9 ([8]).

An instance $P$ satisfies $\alpha$-center proximity if there exists an optimal $k$-means clustering $O_{1},\ldots,O_{k}$ with centers $c_{1}^{\ast},\ldots,c_{k}^{\ast}\in\mathbb{R}^{d}$ such that for all $i,j\in[k]$ with $j\neq i$ and for any point $x\in O_{i}$ it holds that $||x-c_{j}^{\ast}||\geq\alpha||x-c_{i}^{\ast}||$.
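Definition 9 can be evaluated directly for a given clustering. The sketch below returns the largest $\alpha$ for which the proximity condition holds, again under the assumption that the centers are the cluster centroids. It does not verify that the clustering is an optimal $k$-means clustering, and the function name and toy data are hypothetical.

```python
import numpy as np

def proximity_factor(clusters):
    """Largest alpha with ||x - c_j|| >= alpha * ||x - c_i|| for all x in cluster i, j != i."""
    centers = [c.mean(axis=0) for c in clusters]
    alpha = float("inf")
    for i, cluster in enumerate(clusters):
        for x in cluster:
            own = np.linalg.norm(x - centers[i])
            if own == 0.0:
                continue  # a point on its own center imposes no constraint
            other = min(np.linalg.norm(x - c) for j, c in enumerate(centers) if j != i)
            alpha = min(alpha, other / own)
    return alpha

# Hypothetical clustering on the line (each row is one point).
clusters = [np.array([[0.0], [2.0]]), np.array([[4.0], [6.0]]), np.array([[21.0], [29.0]])]
alpha = proximity_factor(clusters)
```

Here the binding constraints come from the points $2$ and $4$, which are at distance $1$ from their own center and at distance $3$ from the nearest other center, so the factor is $3$.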

Awasthi, Blum, and Sheffet [8] introduced the notion of $\alpha$-perturbation resilience and showed that it implies $\alpha$-center proximity. They show that for $\alpha\geq 3$, the optimal clustering can be recovered if the data is $\alpha$-perturbation resilient. This was improved by Balcan and Liang [12] and finally by Makarychev and Makarychev [32], who show that it is possible to completely recover the optimal clustering for $\alpha=2$. The latter paper shows that the results even hold for a weaker property called metric perturbation resilience. We show that for large enough $\delta$ and $\alpha$, Ward’s method computes a $2$-approximation if the data satisfies $\delta$-center separation or $\alpha$-center proximity.

Theorem 10.

Let $P\subset\mathbb{R}^{d}$ be an instance that satisfies weak $(2+2\sqrt{2}+\epsilon)$ -center separation or $(3+2\sqrt{2}+\epsilon)$ -center proximity for some $\epsilon>0$ and some number $k$ of clusters. Then the $k$ -clustering computed by Ward on $P$ is a $2$ -approximation with respect to the $k$ -means objective function.

We also show that on instances that satisfy $(2+2\sqrt{2\nu}+\epsilon)$ -center separation and for which all clusters $O_{i}$ and $O_{j}$ in the optimal clustering satisfy $|O_{j}|\geq|O_{i}|/\nu$ , Ward even recovers the optimal clustering.

It is interesting to note that the example proposed by Arthur and Vassilvitskii [6] that shows that the famous $k$ -means++ algorithm has an approximation ratio of $\Omega(\log k)$ satisfies $\delta$ -center separation and $\alpha$ -center proximity for large values of $\delta$ and $\alpha$ , and has balanced clusters, i.e., $\nu=1$ .

Observation 11.

There is a family of examples where $k$ -means++ has an expected approximation ratio of $\Omega(\log k)$ , while Ward computes an optimal solution.

In contrast, we will see that the instances that we use to prove our exponential lower bound on the approximation factor of Ward’s method (Theorem 22) satisfy $\delta$-center separation and $\alpha$-center proximity for $\delta\leq 1+\sqrt{2}$ and $\alpha\leq 1+\sqrt{2}$. We will also see that even for arbitrarily large $\delta$ and $\alpha$ there are instances that satisfy $\delta$-center separation and $\alpha$-center proximity and on which Ward’s method does not compute an optimal solution. In addition to center separation and center proximity we study the following three other prominent notions of clusterability: the strict separation property due to Balcan, Blum, and Vempala [11], $\epsilon$-separation due to Ostrovsky et al. [35], and the separation condition from Awasthi and Sheffet [10]. We will see that the exponential lower bound instances satisfy these clusterability notions when the target clustering is the optimal $k$-means clustering. Hence, none of these notions guarantees that Ward’s method computes a good clustering.

Corollary 12.

For any $\epsilon>0$ , there is a family of point sets $(P_{d})_{d\in\mathbb{N}}$ with $P_{d}\subset\mathbb{R}^{d}$ that are $\epsilon$ -separated and that satisfy $1+\sqrt{2}$ -center separation, $1+\sqrt{2}$ -center proximity, the strict separation property and the AS-center separation property where $\mathrm{Ward}_{k}(P_{d})\in\Omega((3/2)^{d}\cdot\mathrm{\operatorname{opt}}_{k}(P_{d}))$ for $k=2^{d}$ . Furthermore, for any $\delta>1$ and any $\alpha>1$ , there exists a point set that satisfies $\delta$ -center separation and $\alpha$ -center proximity and for which Ward does not compute an optimal solution.

3.1 Upper Bounds

In this section, we analyze the behavior of Ward on $\delta$ -center separated instances and instances that satisfy $\alpha$ -center proximity for some number $k$ of clusters. We are only interested in the $k$ -clustering computed by Ward. Hence, in the following we assume that $k$ is fixed and that Ward stops as soon as it has obtained a $k$ -clustering. First we prove that center proximity implies weak center separation. Hence, it suffices to study instances that satisfy weak center separation. Deferred proofs can be found in the full version of this paper.

Lemma 13.

Let $P\subset\mathbb{R}^{d}$ be an instance that satisfies $\alpha$ -center proximity. Then $P$ also satisfies weak $(\alpha-1)$ -center separation.
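One way to see why Lemma 13 holds is a direct application of the triangle inequality. The following is a sketch of this argument and not necessarily the proof given in the full version. Fix $i\neq j$ and let $x\in O_{i}$ be a point of $O_{i}$ with maximum distance to $c_{i}^{\ast}$. Then

```latex
\begin{align*}
||c_{j}^{\ast}-c_{i}^{\ast}||
  &\geq ||x-c_{j}^{\ast}||-||x-c_{i}^{\ast}||
     && \text{(triangle inequality)}\\
  &\geq \alpha\,||x-c_{i}^{\ast}||-||x-c_{i}^{\ast}||
     && \text{($\alpha$-center proximity)}\\
  &= (\alpha-1)\cdot\max_{y\in O_{i}}||y-c_{i}^{\ast}||.
\end{align*}
```

This is exactly the condition of weak $(\alpha-1)$-center separation for the cluster $O_{i}$.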

In the following we call a cluster $A$ that is formed by Ward an inner cluster if $A$ is completely contained within an optimum cluster. We start our analysis with the following lemma, which states one very crucial property of Ward’s behavior on well-separated data. It implies that Ward does not merge inner clusters from two different optimal clusters as long as there exists more than one inner cluster in both of these optimal clusters.

Lemma 14.

Let $P\subset\mathbb{R}^{d}$ be an instance that satisfies weak $(2+2\sqrt{2}+\epsilon)$ -center separation for some $\epsilon>0$ . Assume we have two optimal clusters $O_{1}$ and $O_{2}$ and each of them contains at least two inner clusters $A_{1},B_{1}$ and $A_{2},B_{2}$ , respectively, directly after the $i$ -th step of Ward. Then, in step $i+1$ , Ward will not merge an inner cluster of $O_{1}$ with an inner cluster of $O_{2}$ .

Inner-cluster merges

In the following, assume that $P\subset\mathbb{R}^{d}$ is an arbitrary instance and that the clusters $O_{1},\dots,O_{k}$ are an optimal $k$-clustering of $P$ with objective value $\operatorname{opt}=\operatorname{opt}_{k}(P)$. Our goal is to show that the $k$-clustering $W_{1},\ldots,W_{k}$ computed by Ward on $P$ is worse by a factor of at most $2$ if $P$ satisfies weak $(2+2\sqrt{2}+\epsilon)$-center separation for some $\epsilon>0$.

Observe that Lemma 14 does not exclude the possibility that Ward performs inner-cluster merges on $P$ , i.e., it might merge two inner clusters from the same optimum cluster at some point during its execution. While we will see that in the one-dimensional case one can assume that such inner-cluster merges do not happen, we cannot make this assumption in general. In our analysis, we bound the costs of the inner-cluster merges separately from the costs of the other merges, which we call non-inner merges in the following.

We define an equivalence relation $r$ on $P$ as follows: two points $x_{1}$ and $x_{2}\in P$ are equivalent if and only if there exists an inner cluster $C$ constructed by Ward at some point of time with $x_{1},x_{2}\in C$ . We denote the equivalence classes of $r$ by $P/r=\{C_{1},\dots,C_{m}\}$ . The following observation is immediate.

Observation 15.

If Ward merges in any step an inner cluster $C$ with another cluster that is not an inner cluster of the same optimal cluster, then $C\in P/r$ is an equivalence class.

This means that the equivalence classes represent inner clusters of Ward right before they are merged with points from outside their optimal cluster. In other words, if we perform all inner-cluster merges that are performed by Ward and leave out all non-inner merges, we get the clustering represented by $P/r$.

Consider an arbitrary optimal cluster $O_{j}$ and let $P_{1}^{j},\ldots,P_{n_{j}}^{j}$ denote the inner clusters of $O_{j}$ in $P/r$ . We assume that these inner clusters are indexed in the order in which they are merged with other clusters by Ward. To illustrate this definition, consider the step in which $P_{i}^{j}$ is merged by Ward with some other cluster $Q$ . Since $P_{i}^{j}\in P/r$ , this step is a non-inner merge and in particular $Q$ is not equal to any of the clusters $P_{i+1}^{j},\ldots,P_{n_{j}}^{j}$ . At the time this merge happens, the indexing guarantees that the cluster $P_{i+1}^{j}$ is either present or there exist multiple parts $C_{1},\ldots,C_{\ell}$ of $P_{i+1}^{j}$ that are only later merged by inner-cluster merges to $P_{i+1}^{j}$ . Since Ward merges $P_{i}^{j}$ and $Q$ , we know that $D(P_{i}^{j},Q)\leq D(P_{i}^{j},C_{h})$ for any $h\in[\ell]$ . We will use this fact to give an upper bound for the costs of the clustering $W_{1},\ldots,W_{k}$ .

It might be that some inner clusters of $O_{j}$ in $P/r$ are not merged at all by Ward and contained in the clustering $W_{1},\ldots,W_{k}$ . These inner clusters are the last in the ordering, i.e., they are $P_{a}^{j},\ldots,P_{n_{j}}^{j}$ where $n_{j}-a+1$ is the number of such clusters.

Potential graph

In order to bound the costs of the clustering $W_{1},\ldots,W_{k}$ produced by Ward we introduce the potential graph $G=(V,E)$ with vertex set $V=P/r$. The edges $E$ of $G$ are directed and there are only edges between inner clusters of the same optimal cluster. Consider an arbitrary optimal cluster $O_{j}$ with $j\in[k]$ and let $P_{1}^{j},\ldots,P_{n_{j}}^{j}$ be the inner clusters of $O_{j}$ in $P/r$ indexed as above in the order in which they are merged with other clusters by Ward. Then for every $i\in[n_{j}-1]$ the set $E$ contains the edge $(P_{i}^{j},P_{i+1}^{j})$. Both the vertices and the edges are weighted and we denote the sum of all vertex and edge weights by $w(G)$.

The weight of a vertex $Q\in P/r$ is defined as $w(Q)=\Delta(Q)$ , i.e., the weight of vertex $Q$ equals the costs of forming the inner cluster $Q$ . We will now define weights for the edges such that the sum of all vertex and edge weights in the potential graph is at most $2\operatorname{opt}_{k}$ . After that we prove that there is a one-to-one correspondence between the non-inner merges of Ward and the edges in the graph such that the costs of each non-inner merge of Ward are at most the weight of the associated edge. Together this proves that Ward computes a solution with costs at most $2\operatorname{opt}_{k}$ .

To define the weight of the edge $(P_{i}^{j},P_{i+1}^{j})$, we first consider the case that $P_{i}^{j}$ is merged at some point of time with another cluster $Q$ by Ward. Then let $C_{1},\ldots,C_{\ell}$ again denote the parts of $P_{i+1}^{j}$ that are present at that point of time. The edge weight $w(P_{i}^{j},P_{i+1}^{j})$ is defined as $\max_{h\in[\ell]}D(P_{i}^{j},C_{h})$. (Footnote 4: When reading the proof, the reader might notice that our definition of $w(P_{i}^{j},P_{i+1}^{j})$ is to some extent arbitrary. Instead of defining it as $\max_{h\in[\ell]}D(P_{i}^{j},C_{h})$, we could also define it as $\min_{h\in[\ell]}D(P_{i}^{j},C_{h})$ or as $D(P_{i}^{j},C_{h})$ for any $h$.) Observe that since Ward performs greedy merges, this definition guarantees that the merge of $P_{i}^{j}$ and $Q$ costs at most the edge weight $w(P_{i}^{j},P_{i+1}^{j})$. If $P_{i}^{j}$ is not merged at all by Ward, we set the weight $w(P_{i}^{j},P_{i+1}^{j})$ to $D(P_{i}^{j},P_{i+1}^{j})$.

Lemma 16.

Let $P\subset\mathbb{R}^{d}$ be a finite point set and let $Q_{1},\ldots,Q_{\ell}$ denote an arbitrary partition of $P$ into pairwise disjoint parts. Then $\Delta(P)\geq\Delta(Q_{1})+\ldots+\Delta(Q_{\ell})$ .
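Lemma 16 can be seen from the standard fact that the centroid $\mu(Q)$ minimizes the sum of squared distances to the points of $Q$. The following short calculation is a sketch of this argument under that assumption:

```latex
\begin{align*}
\sum_{i=1}^{\ell}\Delta(Q_{i})
  &= \sum_{i=1}^{\ell}\sum_{x\in Q_{i}}||x-\mu(Q_{i})||^{2}\\
  &\leq \sum_{i=1}^{\ell}\sum_{x\in Q_{i}}||x-\mu(P)||^{2}
     && \text{($\mu(Q_{i})$ is the best center for $Q_{i}$)}\\
  &= \sum_{x\in P}||x-\mu(P)||^{2}
   = \Delta(P).
\end{align*}
```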

Lemma 17.

The weights in the potential graph satisfy $w(G)\leq 2\operatorname{opt}_{k}$ .

Bijection between non-inner merges and edges

We have seen that the sum of the weights in the potential graph is at most $2\operatorname{opt}_{k}$ . Our goal is now to find a bijection between the non-inner merges of Ward and the edges of the potential graph such that the costs of any non-inner merge are bounded from above by the weight of the edge assigned to it in the bijection. The existence of such a bijection implies that also the costs of the solution $W_{1},\ldots,W_{k}$ computed by Ward are at most $2\operatorname{opt}_{k}$ .

Now we construct this bijection. Let us first consider non-inner merges in which at least one of the clusters is an inner cluster contained in $P/r$ . Let this be the inner cluster $P_{i}^{j}$ of some optimal cluster $O_{j}$ and assume further that $i<n_{j}$ . Then $P_{i}^{j}$ has an outgoing edge to $P_{i+1}^{j}$ . We denote by $Q$ the cluster with which $P_{i}^{j}$ is merged and we assign the merge of $P_{i}^{j}$ with $Q$ to the edge $(P_{i}^{j},P_{i+1}^{j})$ in the bijection.

Lemma 18.

Let $P\subset\mathbb{R}^{d}$ be an instance that satisfies weak $(2+2\sqrt{2}+\epsilon)$ -center separation for some $\epsilon>0$ . Consider a non-inner merge of Ward between two inner clusters from $P/r$ . Then at most one of these inner clusters has an outgoing edge in $G$ .

Observe that it cannot happen that the same edge is assigned to two different merges by the construction described above because an edge $(P_{i}^{j},P_{i+1}^{j})$ can only be assigned to a step in which $P_{i}^{j}$ is merged with some other cluster and there can only be one such merge.

Let $L\subseteq E$ denote the set of edges that are not assigned to a step of Ward by the above construction. The potential graph $G$ contains $|V|=|P/r|$ vertices and $|V|-k$ edges. Since the number of non-inner merges of Ward is also $|V|-k$ , there are also $|L|$ non-inner merges that are not yet assigned to an edge. We finish the construction of the bijection by assigning the unassigned non-inner merges arbitrarily bijectively to $L$ .
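The two counts used in this argument can be spelled out as follows. This is a short worked calculation under the definitions above, where $n_{j}$ is the number of inner clusters of $O_{j}$ in $P/r$ and $n=|P|$:

```latex
\begin{align*}
|E| &= \sum_{j=1}^{k}(n_{j}-1) = \Big(\sum_{j=1}^{k}n_{j}\Big)-k = |V|-k,\\
\#\{\text{non-inner merges}\}
    &= \underbrace{(n-k)}_{\text{all merges of Ward}}
       -\underbrace{(n-|V|)}_{\text{inner-cluster merges}}
     = |V|-k.
\end{align*}
```

The second line uses that the $|V|$ equivalence classes arise from the $n$ singleton clusters by exactly $n-|V|$ inner-cluster merges.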

Lemma 19.

The costs of each non-inner merge of Ward are bounded from above by the weight of the assigned edge in the potential graph.

Now the following theorem follows easily.

Theorem 20.

Let $P\subset\mathbb{R}^{d}$ be an instance that satisfies weak $(2+2\sqrt{2}+\epsilon)$ -center separation or $(3+2\sqrt{2}+\epsilon)$ -center proximity for some $\epsilon>0$ . Then Ward computes a $2$ -approximation on $P$ .

Theorem 21.

Let $P\subset\mathbb{R}^{d}$ be an instance with optimal $k$ -means clustering $O_{1},\ldots,O_{k}$ with centers $c_{1}^{\ast},\ldots,c_{k}^{\ast}\in\mathbb{R}^{d}$ . Assume that $P$ satisfies $(2+2\sqrt{2\nu}+\epsilon)$ -center separation for some $\epsilon>0$ , where $\nu=\max_{i,j\in[k]}\frac{|O_{i}|}{|O_{j}|}$ is the largest factor between the sizes of any two optimum clusters. Then Ward computes the optimal $k$ -means clustering $O_{1},\ldots,O_{k}$ .

In the full version of this paper we show that Theorem 20 does not hold for significantly smaller $\delta$ and $\alpha$ .

4 Exponential Lower Bound in High Dimension

In the following, we describe a family of instances of increasing dimension $d$ where Ward computes for some number $k=k(d)$ of clusters a $k$ -clustering that costs $\Omega((3/2)^{d}\operatorname{opt}_{k})$ . Here and in all other worst-case examples, we assume that given a choice between equally expensive merges, Ward chooses the action that leads to a worse outcome. This is without loss of generality because we can always slightly move the points to ensure the outcome we want. However, it greatly simplifies the exposition.

To further simplify the exposition, the definitions below use points of infinite weight and assume that the optimal cluster centers coincide with these infinite weight points. For any finite realization of the example, that is not the case. To ensure that Ward actually behaves as described in the following, we have to move the high weight points by an infinitesimal distance. We do this in the full version of this paper, but for the sake of clarity, omit it in the exposition here. Notice that merging a cluster $H$ of infinite weight with a cluster $A$ of finite weight costs $|A|\cdot||\mu(A)-\mu(H)||^{2}$ by Lemma 4.

Let $d$ be given. We construct an instance $P_{d}\subseteq\mathbb{R}^{d}$ with $2^{d+1}$ points. For $i\geq 2$ let $z_{i}^{2}=\frac{3^{i-2}}{2^{i-1}}$ and define

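$P_{d}=\{(x_{1},\ldots,x_{d})\mid x_{1}\in\{-1,-(\sqrt{2}-1),\sqrt{2}-1,1\},\ x_{i}\in\{-z_{i},z_{i}\}\ \forall i\in\{2,\ldots,d\}\}.$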

All points from $P_{d}$ whose first coordinate is $-1$ or $1$ have weight $\infty$ (we call these heavy points). All other points have weight $1$ (we call these light points). For an illustration of $P_{2}$ and $P_{3}$ , see Figure 1.
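The construction can be written down in a few lines. The sketch below assumes that the first coordinate ranges over $\{-1,-(\sqrt{2}-1),\sqrt{2}-1,1\}$ and the $i$-th coordinate over $\{-z_{i},z_{i}\}$, consistent with the description of the phases of Ward on this instance. The function name is hypothetical, and the infinite weights are represented only by returning heavy and light points separately.

```python
import itertools
import math

def build_instance(d):
    """Return (heavy, light) point lists of P_d; heavy points carry infinite weight."""
    # z[0] is z_2, z[1] is z_3, and so on, with z_i^2 = 3^(i-2) / 2^(i-1).
    z = [math.sqrt(3 ** (i - 2) / 2 ** (i - 1)) for i in range(2, d + 1)]
    first = [-1.0, -(math.sqrt(2) - 1), math.sqrt(2) - 1, 1.0]
    heavy, light = [], []
    for x1 in first:
        for signs in itertools.product((-1, 1), repeat=d - 1):
            point = (x1,) + tuple(s * zi for s, zi in zip(signs, z))
            (heavy if abs(x1) == 1.0 else light).append(point)
    return heavy, light

heavy, light = build_instance(3)
# Distance from every light point to its closest heavy point.
closest = [min(math.dist(p, h) for h in heavy) for p in light]
```

For $d=3$ this yields $8$ heavy and $8$ light points, and every light point is at distance $2-\sqrt{2}$ from its closest heavy point.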

We show the following theorem.

Theorem 22.

The family of point sets $(P_{d})_{d\in\mathbb{N}}$ satisfies $\mathrm{Ward}_{k}(P_{d})\in\Omega((3/2)^{d}\cdot\mathrm{\operatorname{opt}}_{k}(P_{d}))$ for $k=2^{d}$ .

In the theorem, we use $k=k(d)=2^{d}$ , i.e., we are interested in finding a $2^{d}$ -clustering of $P_{d}$ . Observe that in the optimal $2^{d}$ -clustering of $P_{d}$ , the heavy points are in separate clusters. Due to their infinite weight, they also determine the cluster centers. Hence, in the optimal solution each light point is in the same cluster as its closest heavy point. Since each light point is within distance $2-\sqrt{2}$ of a heavy point, the cost of the optimal solution is

$\operatorname{opt}_{k}(P_{d})=2^{d}\cdot(2-\sqrt{2})^{2}.$

Now we look at a run of Ward’s method on $P_{d}$ . We say that phase $1$ lasts as long as there is at least one light point that forms its own cluster. We prove by induction that during phase $1$ the only clusters that occur are singleton clusters consisting of one light or one heavy point and clusters that consist of two light points that differ only in the first coordinate. We call the latter pair clusters. At the beginning this is clearly the case. Now assume that the induction hypothesis holds at some point of time in phase $1$ . Merging two heavy points has infinite cost and merging a heavy point with a light point or a pair cluster has cost at least $(2-\sqrt{2})^{2}\approx 0.343$ because $2-\sqrt{2}$ is the minimum distance between a light and a heavy point. Merging two singleton light points that differ only in the first coordinate costs $\frac{1}{2}\cdot(2\sqrt{2}-2)^{2}=(2-\sqrt{2})^{2}$ (observe that the induction hypothesis guarantees that for any singleton light point the light point that differs only in the first coordinate is also a singleton point). Merging two singleton light points that differ in any other coordinate costs at least $\frac{1}{1+1}\cdot(2z_{2})^{2}=1$ , merging a singleton light point with a pair cluster costs at least $\frac{1\cdot 2}{1+2}\cdot(2z_{2})^{2}=\frac{4}{3}$ , and merging two pair clusters costs at least $\frac{2\cdot 2}{2+2}\cdot(2z_{2})^{2}=2$ . Hence, we can assume that Ward merges two singleton light points that differ only in the first coordinate. After that the induction hypothesis is still true. Hence, in phase $1$ all $2^{d-1}$ pairs of points of the form $(-(\sqrt{2}-1),x_{2},\ldots,x_{d})$ and $(\sqrt{2}-1,x_{2},\ldots,x_{d})$ will be merged. We call the clusters that consist of these points the $(*,x_{2},\ldots,x_{d})$ -clusters in the following.

Then phase $2$ starts. Phase $2$ lasts as long as there are pair clusters. We show by induction that the only clusters that occur in phase $2$ are singleton heavy points, pair clusters, and clusters with four points that result from merging two pair clusters that differ only in the second coordinate. We call the latter quadruple clusters. Merging two pair clusters of the form $(*,-z_{2},x_{3},\ldots,x_{d})$ and $(*,z_{2},x_{3},\ldots,x_{d})$ to form a quadruple cluster costs $\frac{2\cdot 2}{2+2}(2z_{2})^{2}=2$. Merging two pair clusters that differ in any other coordinate than the second is more expensive because their centers are further apart than $2z_{2}$. Merging the $(*,x_{2},\ldots,x_{d})$-cluster with a heavy point costs at least $2$ because the center of this cluster is $(0,x_{2},\ldots,x_{d})$, which is at distance $1$ from the heavy points. Similarly, merging a quadruple cluster (whose center is $(0,0,x_{3},\ldots,x_{d})$) with a heavy point costs at least $2+z_{2}^{2}\geq 2$. Merging a quadruple cluster with a pair cluster costs at least $\frac{2\cdot 4}{2+4}(2z_{3})^{2}>2$ and merging two quadruple clusters costs at least $\frac{4\cdot 4}{4+4}(2z_{3})^{2}>2$. Hence, we can assume that Ward merges two pair clusters that differ only in the second coordinate. After that the induction hypothesis is still true. Hence, in phase $2$ all $2^{d-2}$ pairs of clusters of the form $(*,-z_{2},x_{3},\ldots,x_{d})$ and $(*,z_{2},x_{3},\ldots,x_{d})$ will be merged. We call the clusters that consist of these points the $(*,*,x_{3},\ldots,x_{d})$-clusters in the following.

At the beginning of phase $i\geq 2$ , there are $2^{d}$ singleton heavy points and $2^{d-i+1}$ clusters of the form $(*,\ldots,*,x_{i},\ldots,x_{d})$ with $2^{i-1}$ points each. Phase $i$ ends when there is no cluster of the form $(*,\ldots,*,x_{i},\ldots,x_{d})$ left. One can show again by induction that Ward merges in phase $i$ all pairs of clusters of the form $(*,\ldots,*,-z_{i},x_{i+1},\ldots,x_{d})$ and $(*,\ldots,*,z_{i},x_{i+1},\ldots,x_{d})$ . The center of the cluster $(*,\ldots,*,x_{i},\ldots,x_{d})$ is $(0,\ldots,0,x_{i},\ldots,x_{d})$ , which is at distance $\sqrt{1+z_{2}^{2}+\ldots+z_{i-1}^{2}}$ from the heavy points. Hence, merging such a cluster with a heavy point costs at least $2^{i-1}\cdot(1+z_{2}^{2}+\ldots+z_{i-1}^{2})=2^{i}z_{i}^{2}$ , where the equation follows from the following observation.

Observation 23.

It holds that $1+z_{2}^{2}+\ldots+z_{i-1}^{2}=2z_{i}^{2}$ .
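Observation 23 reflects that the values $z_{i}^{2}=3^{i-2}/2^{i-1}$ form a geometric sequence with ratio $3/2$. The identity can be confirmed with exact rational arithmetic. The helper name `z_sq` and the tested range of $i$ are arbitrary choices for this sketch.

```python
from fractions import Fraction

def z_sq(i):
    """z_i^2 = 3^(i-2) / 2^(i-1) as an exact fraction, for i >= 2."""
    return Fraction(3 ** (i - 2), 2 ** (i - 1))

# Check 1 + z_2^2 + ... + z_{i-1}^2 == 2 * z_i^2 for i = 2, ..., 29.
checks = [1 + sum(z_sq(j) for j in range(2, i)) == 2 * z_sq(i) for i in range(2, 30)]
```

For $i=2$ the sum is empty and the identity reads $1=2\cdot\frac{1}{2}$. The induction step uses $2z_{i}^{2}+z_{i}^{2}=3z_{i}^{2}=2z_{i+1}^{2}$.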

Merging the clusters $(*,\ldots,-z_{i},x_{i+1},\ldots,x_{d})$ and $(*,\ldots,z_{i},x_{i+1},\ldots,x_{d})$ costs

$\frac{2^{i-1}\cdot 2^{i-1}}{2^{i-1}+2^{i-1}}\cdot(2z_{i})^{2}=2^{i}z_{i}^{2}.$

Merging two clusters that differ in one of the $d-i$ last coordinates costs at least $\frac{2^{i-1}\cdot 2^{i-1}}{2^{i-1}+2^{i-1}}(2z_{i+1})^{2}=2^{i}\cdot z_{i+1}^{2}>2^{i}z_{i}^{2}$. Hence, in phase $i$ all $2^{d-i}$ pairs of clusters of the form $(*,\ldots,*,-z_{i},x_{i+1},\ldots,x_{d})$ and $(*,\ldots,*,z_{i},x_{i+1},\ldots,x_{d})$ will merge, which costs in total $2^{d-i}\cdot 2^{i}z_{i}^{2}$.

Phases $2$ until $d$ together cost $\sum_{i=2}^{d}2^{d-i}\cdot 2^{i}z_{i}^{2}=2^{d}\cdot(2z_{d+1}^{2}-1)=2\cdot 3^{d-1}-2^{d}$ , where we used Observation 23. After phase $d$ , all light points will be in the same cluster. Then the number of clusters is $2^{d}+1$ and in the last step the cluster of light points, whose center is the origin, will be merged with one heavy point. This costs

$2^{d}\cdot(1+z_{2}^{2}+\ldots+z_{d}^{2})=2^{d}\cdot 2z_{d+1}^{2}=2\cdot 3^{d-1}.$

Phase $1$ costs in total $2^{d-1}(2-\sqrt{2})^{2}$ . Thus, the overall cost of Ward’s solution is

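$\mathrm{Ward}_{k}(P_{d})=2^{d-1}(2-\sqrt{2})^{2}+2\cdot 3^{d-1}-2^{d}+2\cdot 3^{d-1}=4\cdot 3^{d-1}-2^{d}+2^{d-1}(2-\sqrt{2})^{2}.$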

This implies

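$\frac{\mathrm{Ward}_{k}(P_{d})}{\operatorname{opt}_{k}(P_{d})}\geq\frac{4\cdot 3^{d-1}-2^{d}}{2^{d}\cdot(2-\sqrt{2})^{2}}\in\Omega((3/2)^{d}).$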
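The growth of the ratio can be tabulated from the phase costs. In the sketch below, the costs of phase $1$ and of phases $2$ to $d$ are taken from the text. The cost of the last merge, $2^{d}\cdot(1+z_{2}^{2}+\ldots+z_{d}^{2})=2\cdot 3^{d-1}$, and the optimal cost, $2^{d}\cdot(2-\sqrt{2})^{2}$, are derived here from Lemma 4, Observation 23, and the distance $2-\sqrt{2}$ between a light point and its closest heavy point, so they should be read as assumptions of this sketch. All function names are hypothetical.

```python
import math

def ward_cost(d):
    """Cost of Ward's 2^d-clustering of P_d, summed over the phases."""
    phase_one = 2 ** (d - 1) * (2 - math.sqrt(2)) ** 2
    phases_two_to_d = 2 * 3 ** (d - 1) - 2 ** d
    last_merge = 2 * 3 ** (d - 1)  # derived: 2^d * (1 + z_2^2 + ... + z_d^2)
    return phase_one + phases_two_to_d + last_merge

def opt_cost(d):
    """Derived optimal cost: 2^d light points at distance 2 - sqrt(2) from a center."""
    return 2 ** d * (2 - math.sqrt(2)) ** 2

ratios = {d: ward_cost(d) / opt_cost(d) for d in range(2, 11)}
# Dividing by (3/2)^d should give values bounded away from zero.
normalized = {d: r / 1.5 ** d for d, r in ratios.items()}
```

The normalized values increase with $d$ and stay above $2.8$ on this range, which matches a ratio in $\Omega((3/2)^{d})$.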

5 Ward’s Method in Dimension One

In this section, we discuss the approximation ratio of Ward’s method for inputs $P\subset\mathbb{R}^{1}$ and show the following theorem.

Theorem 24.

Let $P\subset\mathbb{R}$ be an arbitrary one-dimensional instance. Then, for every $k$, the $k$-clustering computed by Ward on $P$ is an $\mathcal{O}(1)$-approximation with respect to the $k$-means objective function.

For the purpose of analyzing the worst-case behavior of Ward’s method, an instance sometimes also contains an integer $k\in\mathbb{N}$ in addition to $P$ (even though Ward itself only takes $P$ as the input). If we specify $P$ and $k$ , then we are interested in the quality of the $k$ -clustering produced by Ward on $P$ .

We will usually denote the hierarchical clustering computed by Ward on $P$ by $\mathcal{W}=(\mathcal{W}_{0},\ldots,\mathcal{W}_{n-1})$ . Ward’s method always chooses greedily a cheapest merge to perform. We say that a merge is a greedy merge if it is a cheapest merge; if all merges are greedy, we call $\mathcal{W}$ greedy. Ward’s method computes a greedy hierarchical clustering, and every greedy hierarchical clustering can be the output of Ward’s method.

5.1 Prelude: Reordering

Recall the following statement from §2.2:

Lemma 7 (Convexity in $\mathbb{R}^{1}$).

For any three finite convex clusters $A,B,C\subset\mathbb{R}^{1}$ with $\mu(A)<\mu(C)<\mu(B)$, we have $D(A,C)<D(A,B)$ or $D(B,C)<D(A,B)$.

Lemma 7 means that Ward will always merge $A$ and $C$ or $B$ and $C$, and never $A$ and $B$. This gives us a convexity property: If Ward forms a cluster $M$, then no other point or cluster lies within the convex hull of $M$. Clusters can thus also never overlap, and we get a concept of neighbors on the line. Thus, the clusterings $\mathcal{W}_{i}$ consist of non-overlapping clusters, which we can view as ordered by their position on the line. Ward’s method always merges neighbors on the line. We will combine this with the following useful corollary of Lemma 4. It gives a condition under which merging a cluster $A$ with a subcluster $B^{\prime}\subset B$ is cheaper than merging $A$ with $B$. Notice that without the condition, the statement is not true: Imagine that $A$ and $B$ have the same centroid (merging them is free), but $\mu(B^{\prime})\neq\mu(B)$. Then clearly, merging $A$ with $B^{\prime}$ is more expensive than merging $A$ and $B$.

Corollary 25.

Assume we have two finite clusters $B^{\prime}\subseteq B\subset\mathbb{R}^{d}$ and a third finite cluster $A\subset\mathbb{R}^{d}$ such that $||\mu(A)-\mu(B^{\prime})||^{2}\leq||\mu(A)-\mu(B)||^{2}$ . Then $D(A,B^{\prime})\leq D(A,B)$ .
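Corollary 25 follows from the formula in Lemma 4 in two steps. The calculation below is a sketch that uses only $|B^{\prime}|\leq|B|$, the fact that $t\mapsto\frac{|A|\cdot t}{|A|+t}$ is non-decreasing for $t>0$, and the assumption on the centroids:

```latex
\begin{align*}
D(A,B^{\prime})
  &= \frac{|A||B^{\prime}|}{|A|+|B^{\prime}|}\cdot||\mu(A)-\mu(B^{\prime})||^{2}\\
  &\leq \frac{|A||B|}{|A|+|B|}\cdot||\mu(A)-\mu(B^{\prime})||^{2}
     && \text{(since $|B^{\prime}|\leq|B|$)}\\
  &\leq \frac{|A||B|}{|A|+|B|}\cdot||\mu(A)-\mu(B)||^{2}
     && \text{(assumption on the centroids)}\\
  &= D(A,B).
\end{align*}
```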

Corollary 25 holds in arbitrary dimension. However, for $d=1$ , it is much easier to benefit from it. We get a very convenient tool that we call reordering. Say that Ward at some point merges two clusters $A$ and $B$ . By Lemma 7, that means that $\mu(A)$ and $\mu(B)$ are neighbors on the line (at the time of the merge). Now assume that $A$ and $B$ are present for a while before they are merged. Then during all this time, they are neighbors. Notice that this means that merging $A$ and $B$ will result in a centroid $\mu(A\cup B)$ which is further away from any other cluster than $\mu(A)$ and $\mu(B)$ are. So, clusters that did not want to merge with $A$ or $B$ would also not merge with $A\cup B$ by Corollary 25. Thus, we could perform the merge $(A,B)$ earlier without distorting Ward’s course of action at all (except that the merge $(A,B)$ is at the wrong position). Lemma 26 below formulates this idea.

Recall that a hierarchical clustering can also be described by the $n-1$ merge operations that produce it. We usually denote the sequence of merges by $(A,B)(\mathcal{W})=((A_{1},B_{1}),\ldots,(A_{n-1},B_{n-1}))$ . We say that a cluster $Q\subset P$ exists in $\mathcal{W}$ after merge $t$ if $Q\in\mathcal{W}_{t}$ . If $Q$ is the result of the merge $(A_{i},B_{i})$ (i.e., $Q=A_{i}\cup B_{i})$ , and it is later merged with another cluster in merge $(A_{j},B_{j})$ (i.e., $A_{j}=Q$ or $B_{j}=Q$ ), then $Q$ exists as long as merge $i$ has happened and merge $j$ has not yet happened. All singleton clusters exist in $\mathcal{W}_{0}$ . After merge $n-1$ , $P$ is the only remaining existing cluster.
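The notion of existence can be read off the merge sequence directly. The sketch below is one possible way to do this, assuming that clusters are represented as frozensets of hashable point labels; the function name `existence_intervals` is our own and not part of the paper.

```python
def existence_intervals(points, merges):
    """Map each cluster to (first, last), meaning it exists in W_first, ..., W_last.

    `merges` is the sequence ((A_1, B_1), ..., (A_m, B_m)) of frozenset pairs.
    """
    born = {frozenset([p]): 0 for p in points}  # all singletons exist in W_0
    last = {}
    for i, (a, b) in enumerate(merges, start=1):
        for q in (a, b):
            if q not in born or q in last:
                raise ValueError(f"merge {i} uses a cluster that does not exist")
            last[q] = i - 1                      # q is consumed by merge i
        born[a | b] = i                          # the union is created by merge i
    final = len(merges)
    return {q: (born[q], last.get(q, final)) for q in born}


fs = frozenset
points = [1, 2, 3, 4]
merges = [(fs({1}), fs({2})), (fs({3}), fs({4})), (fs({1, 2}), fs({3, 4}))]
intervals = existence_intervals(points, merges)
```

Here $\{1,2\}$ is created by merge $1$ and consumed by merge $3$, so it exists after merges $1$ and $2$, and the full point set exists only after the last merge.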

Lemma 26 (Reordering Lemma).

Let $P\subset\mathbb{R}^{d}$ be an input for which Ward computes the clustering $\mathcal{W}$ with merge operations $(A,B)(\mathcal{W})$. Consider the merge $(A_{t},B_{t})$ for $t\in[n-1]$. If both $A_{t}$ and $B_{t}$ exist after merge $s<t$, then

1. The sequence of merge operations

[TABLE]

results in a valid hierarchical clustering $\mathcal{W}^{\prime}$.

2. $\mathcal{W}^{\prime}_{j}=\mathcal{W}_{j}$ for all $j\geq t$.

3. All merges except the moved merge $(A^{\prime}_{s+1},B^{\prime}_{s+1})=(A_{t},B_{t})$ are greedy merges.

Proof.

(1) and (2) hold because performing merges in a different order does not change the resulting clustering, and after merge $t$, all deviations from the original order are done. For (3), we have to argue that inserting $(A_{t},B_{t})$ as step $s+1$ does not create cheaper merges. For this, we observe that by Lemma 7, $A_{t}$ and $B_{t}$ are neighbors on the line. In the original sequence, no cluster was merged with $A_{t}$ or $B_{t}$ up to point $t$. The cluster $A_{t}\cup B_{t}$ is a superset of $A_{t}$ and of $B_{t}$, and its centroid is further away from all other clusters than the centroids of $A_{t}$ and $B_{t}$. Thus by Corollary 25, up to point $t$, merging with $A_{t}\cup B_{t}$ cannot be cheaper than the merges we do. After $(A_{t-1},B_{t-1})$, the clustering is identical to $\mathcal{W}_{t}$ by (2), thus all remaining merges are also greedy merges. ∎
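Properties (1) and (2) are purely combinatorial and can be checked mechanically. The following sketch moves a merge to an earlier position and replays both sequences; it assumes clusters are frozensets of point labels, the names `replay` and `move_merge_forward` are hypothetical, and the greedy property (3) is not checked here because it depends on the actual merge costs.

```python
def replay(points, merges):
    """Return [W_0, W_1, ...] as sets of frozensets; raise if a merge is invalid."""
    current = {frozenset([p]) for p in points}
    history = [set(current)]
    for a, b in merges:
        if a not in current or b not in current:
            raise ValueError("merge uses a cluster that does not exist")
        current = (current - {a, b}) | {a | b}
        history.append(set(current))
    return history


def move_merge_forward(merges, s, t):
    """Move merge t (1-indexed) to position s + 1, keeping all other merges in order."""
    if not 0 <= s < t <= len(merges):
        raise ValueError("need 0 <= s < t <= number of merges")
    moved = merges[t - 1]
    return merges[:s] + [moved] + merges[s:t - 1] + merges[t:]


fs = frozenset
points = range(6)
merges = [(fs({0}), fs({1})), (fs({0, 1}), fs({2})), (fs({4}), fs({5})),
          (fs({3}), fs({4, 5})), (fs({0, 1, 2}), fs({3, 4, 5}))]
original = replay(points, merges)
# Both {4} and {5} exist after merge s = 0, so merge t = 3 can be moved to the front.
reordered = replay(points, move_merge_forward(merges, s=0, t=3))
```

In this example the two hierarchies differ after the first and second merge but agree from merge $t=3$ onwards. If the moved merge used a cluster that does not yet exist after merge $s$, `replay` would raise an error.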

Lemma 26 is a crucial observation that allows us to systematically analyze Ward’s steps: We can sort them into steps that depend on each other, and then analyze them in batches / phases.

In $\mathbb{R}^{d}$ for $d>1$ , reordering does not work. Also, we cannot assume that there are no inner-cluster merges.

5.2 Prelude: No Inner-cluster Merges

Reordering also gives us a nice simplification tool. Assume that $A$ and $B$ are in fact singleton clusters, $A=\{a\}$ and $B=\{b\}$ , and they are from the same optimum cluster. Then they are present from the start; we can reorder the merge $(A,B)$ to be the first merge Ward does. Indeed, instead of actually doing this merge, we can also simply forget about it and replace $a$ and $b$ by a weighted point. How does this affect the approximation ratio? Both Ward’s cost and the optimal cost decrease by $\Delta(\{a,b\})$ , meaning that the approximation ratio can only get worse. We can now assume that there are no merges between inner clusters, since inner clusters arise from merging input points that belong to the same optimum cluster. We formalize our observation in Lemma 27.

We directly apply Lemma 26 in order to achieve a simplification method. Recall that (given an optimal $k$ -clustering) we call a merge $(A_{i},B_{i})$ an inner-cluster merge if $A_{i}$ and $B_{i}$ are inner clusters from the same optimum cluster. For a worst-case instance $(P,k)$ we can always assume that such inner-cluster merges do not happen, as they are only helpful for Ward’s method. We formally see this in the next lemma, where we relocate inner-cluster merges to the front of the hierarchical clustering and then eliminate them.

Recall that $\Delta_{k}(\mathcal{W})=\sum_{Q\in\mathcal{W}_{n-k}}\Delta(Q)$ is the cost of the $k$ -clustering contained in $\mathcal{W}$ . For an instance $(P,k)$ and Ward’s resulting clustering $\mathcal{W}$ , the approximation ratio of Ward’s method is $\Delta_{k}(\mathcal{W})/\operatorname{opt}_{k}(P)$ .
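For small one-dimensional instances, this ratio can be computed explicitly. The sketch below is an illustration under stated assumptions: `ward_1d` only compares neighboring clusters (which relies on Lemma 7), and `opt_1d` uses a dynamic program over contiguous clusters, using the standard fact that optimal $k$-means clusters on the line are contiguous. All function names and the example instance are hypothetical.

```python
def cost(xs):
    """1-means cost Delta(Q): sum of squared distances to the centroid."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)


def ward_1d(points, k):
    """Greedy Ward merges on sorted 1D points until k clusters remain."""
    clusters = [[x] for x in sorted(points)]
    while len(clusters) > k:
        def increase(i):
            left, right = clusters[i], clusters[i + 1]
            return cost(left + right) - cost(left) - cost(right)
        i = min(range(len(clusters) - 1), key=increase)  # cheapest neighbor merge
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def opt_1d(points, k):
    """Optimal k-means cost on the line via DP over contiguous clusters."""
    xs = sorted(points)
    n = len(xs)
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for end in range(1, n + 1):
            best[j][end] = min(
                (best[j - 1][start] + cost(xs[start:end]) for start in range(j - 1, end)),
                default=inf,
            )
    return best[k][n]


P = [0.0, 2.0, 3.0, 5.0]
ward_cost = sum(cost(c) for c in ward_1d(P, 2))
ratio = ward_cost / opt_1d(P, 2)
```

On this instance Ward first merges $2$ and $3$ and then has to attach one of the outer points, which yields cost $14/3$, whereas the optimal $2$-clustering $\{0,2\},\{3,5\}$ costs $4$.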

Lemma 27.

Let $(P,k)$ be an instance with $P\subset\mathbb{R}^{d}$ and $k\in\mathbb{N}$ , for which $\mathcal{O}=\{O_{1},\ldots,O_{k}\}$ is an optimal $k$ -clustering and for which Ward computes the hierarchical clustering $\mathcal{W}$ with merge operations $(A,B)(\mathcal{W})$ . Then there exists a weighted point set $P^{\prime}$ and a hierarchical clustering $\mathcal{W}^{\prime}$ for $P^{\prime}$ with merges $(A^{\prime},B^{\prime})(\mathcal{W}^{\prime})$ with the following properties:

1. $\mathcal{W}^{\prime}$ is greedy.

2. No $(A^{\prime}_{i},B_{i}^{\prime})$ is an inner-cluster merge with respect to $\mathcal{O}$.

3. For some $\alpha\geq 0$, $\Delta_{k}(\mathcal{W}^{\prime})=\Delta_{k}(\mathcal{W})-\alpha$ and $\operatorname{opt}_{k}(P^{\prime})\leq\operatorname{opt}_{k}(P)-\alpha$.

Proof.

Assume that $P$ is weighted; this will be necessary to iterate the following process. Let $(\{x\},\{y\})$ be a merge operation in $(A,B)(\mathcal{W})$ that merges two points $x,y\in O_{j}$ for $j\in[k]$, i.e., two points from the same cluster in the optimal solution. Let their weights be $w(x)$ and $w(y)$. By Lemma 26, we can move the merge $(\{x\},\{y\})$ to the front. Then we replace $x$ and $y$ in $P$ by one point $z=\frac{w(x)x+w(y)y}{w(x)+w(y)}$ with weight $w(z):=w(x)+w(y)$. By Lemma 4, $z$ behaves identically to $\{x,y\}$ in Ward’s method. Thus, we can adjust ${\mathcal{W}}^{\prime}$ by removing the merge operation $(\{x\},\{y\})$, and replacing $x$ and $y$ by $z$ in all further merge operations of the cluster $\{x,y\}$. We see that (1) holds for the new hierarchical clustering. Our adjustment decreases the cost by $\alpha:=\Delta(\{x,y\})$. Similarly, we can replace $x$ and $y$ in $O_{j}$ by $z$, which decreases the cost of the clustering induced by $O_{1},\ldots,O_{k}$ by $\alpha$. Since this is still a possible clustering, the optimal clustering can cost at most $\operatorname{opt}_{k}(P)-\alpha$. Thus, (3) holds for the new clustering.

Observe that if (2) is not true, then there has to be a merge operation where two points from the same cluster in the optimum are merged. Thus, we can complete the proof by repeating the above process until we have removed all pairs with this property. Then (2) holds. ∎
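The effect of replacing $x$ and $y$ by the weighted point $z$ can be verified numerically. The following sketch assumes one-dimensional weighted points represented as `(position, weight)` pairs; the names `weighted_cost` and `contract` and the concrete numbers are hypothetical.

```python
def weighted_cost(points):
    """Weighted 1-means cost of a cluster given as [(position, weight), ...]."""
    total = sum(w for _, w in points)
    mu = sum(x * w for x, w in points) / total
    return sum(w * (x - mu) ** 2 for x, w in points)


def contract(x, y):
    """Replace weighted points x and y by one point z at their weighted centroid."""
    (px, wx), (py, wy) = x, y
    return ((wx * px + wy * py) / (wx + wy), wx + wy)


x, y = (1.0, 1.0), (3.0, 3.0)
z = contract(x, y)                  # weighted centroid of x and y, weight w(x) + w(y)
alpha = weighted_cost([x, y])       # Delta({x, y})
others = [(0.0, 2.0), (6.0, 1.0)]   # remaining points of the same cluster
before = weighted_cost(others + [x, y])
after = weighted_cost(others + [z])
```

The cost of any cluster that contains both $x$ and $y$ drops by exactly $\alpha$ when they are replaced by $z$, which is what the proof uses for both Ward’s clustering and the clustering induced by $O_{1},\ldots,O_{k}$.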

Now if Ward performs inner-cluster merges on an instance, we apply Lemma 27. If this changes the optimum solution, we just apply Lemma 27 again, and repeat this until Ward does not do any inner-cluster merges. We explicitly note the following trivial corollary.

Corollary 28.

Assume that $\mathcal{W}^{\prime}$ and $(A^{\prime},B^{\prime})(\mathcal{W^{\prime}})$ result from applying Lemma 27 until Ward does not do inner cluster merges. If a merge $(A_{i}^{\prime},B_{i}^{\prime})$ for $i\in[n-1]$ contains an inner cluster, then this inner cluster is a (weighted) input point.

Proof.

If an inner cluster taking part in a merge resulted from a previous merge, then that previous merge was an inner-cluster merge, which is a contradiction. ∎

Corollary 28 implies that we can use the terms inner cluster and input point interchangeably.

5.3 Prelude: Clustering points together

Crucial in showing the approximation factors of the good merges is the following lemma. To see its usage, assume that $A$ and $B$ belong to one optimum cluster, and $C$ and $D$ belong to another. Then the lemma implies that if Ward has already merged $B$ and $C$ , but $\Delta(B\cup C)$ is small, say $\Delta(B\cup C)\leq c\cdot(\Delta(B)+\Delta(C))$ , then we can still obtain a $7c$ -approximation. Its proof is deferred to the full version of this paper.

Lemma 29.

Let $A,B,C,D\subset\mathbb{R}^{d}$ be disjoint sets with $|A|\leq|B|$ and $|C|\geq|D|$ . Then

[TABLE]

and

[TABLE]

5.4 The analysis

We now analyze the worst-case behavior of Ward’s method on the line by fixing an arbitrary worst-case example that does not contain inner-cluster merges (we can assume this by Lemma 27).

The general plan is the following. Whenever Ward merges two clusters, it does so greedily, meaning that the cost of the merge is always bounded by the cost of any other merge. Thus, if we can find a merge with low cost, then the merge actually performed can only be cheaper. We can clearly find cheap merges in the beginning, however, Ward’s decisions may lead us to a situation where we run out of the originally good options. The idea of the proof is to find a point during Ward’s execution where:

•

We still know a bound on the costs produced so far.

•

We know a set $\mathcal{S}$ of good merges that can still be performed and lead to a good $k$ -clustering.

•

We can ensure that no merge can possibly destroy two merges from $\mathcal{S}$ .

At such a point in time, we can use $\mathcal{S}$ to charge the remaining merges that Ward does to compute a $k$ -clustering. We find this point in time by sorting specific merges of Ward into the front, and bounding their cost. There will be five phases of merges which we need to pull forward and charge.

The phases

We will use the reordering lemma (Lemma 26) to sort the merges into phases and then analyze the cost of the solution after each phase.

In the following, we call a cluster that contains points from more than one optimum cluster composed. More precisely, we call it an $\ell$-composed cluster if it contains points from $\ell$ different optimum clusters. Most of the time, we are interested in $2$-composed clusters, and we call such a cluster a $2$-composed cluster from $O_{j}$ and $O_{j+1}$ if these are the involved optimum clusters.
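As a small illustration of this terminology, the sketch below classifies a cluster given a map from points to the index of their optimum cluster. The names `composition`, `describe` and `optimum_of` are hypothetical, and we assume that an inner cluster is one whose points all come from a single optimum cluster.

```python
def composition(cluster, optimum_of):
    """Number of distinct optimum clusters that `cluster` intersects."""
    return len({optimum_of[x] for x in cluster})


def describe(cluster, optimum_of):
    """Return 'inner' or 'l-composed' for the given cluster."""
    ell = composition(cluster, optimum_of)
    return "inner" if ell == 1 else f"{ell}-composed"


# Hypothetical instance: points 0..5 on the line, three optimum clusters.
optimum_of = {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3}
labels = [describe(c, optimum_of) for c in ([0, 1], [1, 2], [1, 2, 3, 4])]
```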

The goal of the reordering is simple in nature; we want to collect all merges that create $2$ -composed clusters and that grow $2$ -composed clusters. We can think of the phases as different stages of development of $2$ -composed clusters. A $2$ -composed cluster may become part of the $k$ -clustering computed by Ward’s method, or it may at some point become $i$ -composed for $i>2$ , at which time we are no longer interested in it. By the final stage of a $2$ -composed cluster we either mean how it looks in the $k$ -clustering, or how it looked in the last step before it became more than $2$ -composed.

Consider the example in Figure 2, where we depict the development of a $2$-composed cluster from $O_{j}$ and $O_{j+1}$ which in its final stage consists of the input points $x_{\ell},\ldots,x_{r}$. It undergoes five principal phases: It is created by merging a point from $O_{j}$ with a point from $O_{j+1}$ (phase $P1$). Then it grows; it is merged with points left and right of itself (phase $P2$). We add extra phases for the last points on both sides. In phase $P3$, the first side is completed; in the example, it is the left side. This merge is again followed by a growth phase (phase $P4$). The final phase $P5$ consists of the final merge on the other side; the right side in the example. (We skip some merges in $P5$; the details of $P5$ are discussed much later in this proof.)

So, we use reordering to pull the following phases of merges to the front.

P1 (Creation phase)

We create $2$-composed clusters by collecting the merges $(\{a_{i}\},\{b_{i}\})$ with $a_{i}\in O_{j}$, $b_{i}\in O_{j+1}$ for some $j\in[k]$. The collected merges constitute phase $P1$. For technical reasons, we make one exception. If the $2$-composed cluster only consists of two input points in its final stage (i.e., the creating merge is also the last merge), then we defer the merge to phase $P5$.

P2 (Main growth phase)

We now grow the $2$-composed clusters initialized during phase $P1$. For each $2$-composed cluster, we move the growth merges to phase $P2$, preserving their original order. We stop right before one side of the $2$-composed cluster is done. There may be many growth merges for a cluster, or none.

P3 (First side elimination phase)

This phase consists of at most one merge for each $2$-composed cluster, and this merge is the last merge on the first side. After phase $P3$, every $2$-composed cluster thus has one side where it will not be merged with further input points. Notice that a cluster may skip phase $P3$ if it only shares one point with $O_{j}$ or $O_{j+1}$ in its final stage anyway.

P4 (Second growth phase)

This phase resembles phase $P2$; however, the growth is now one-sided. For each $2$-composed cluster, we move the growth merges to phase $P4$, preserving their original order, and stopping right before the final merge.

P5 (Second side elimination phase)

The last phase consists of at most one merge for each cluster. If the final stage of a $2$-composed cluster contains only two points, then the merging of these two points is done in phase $P5$. Otherwise, phase $P5$ may contain the last merge for the cluster, resulting in its final state. For technical reasons, we have to exclude some merges; we postpone the details to Definition 32.

We now analyze the sum of the $1$ -means costs of all clusters in the clustering after each phase. The proofs of the lemmata are deferred to the full version of the paper. We start with phases $P1$ and $P2$ .

Lemma 30.

Let $N=\{x_{a},\ldots,x_{b}\}$ with $x_{a},\ldots,x_{m}\in O_{j}$ and $x_{m+1},\ldots,x_{b}\in O_{j+1}$ be a $2$ -composed cluster after phases $P1$ and $P2$ . Then

[TABLE]

Furthermore, $D(N\cap O_{j},N\cap O_{j+1})\leq D(x_{a-1},x_{a})+D(x_{b},x_{b+1})$ .

In phase $P3$ , Ward’s method faces the first situation where it may run out of good merge options and has to resort to more expensive merges. Notice that by the definition of our phases, each cluster has one side where after phase $P2$ , there is exactly one point left which has not been added to the cluster. The key technical observation that we use again and again during the (omitted) proofs is the following corollary.

See Corollary 5.

We need the following interpretation of Corollary 5. If we have a $2$-composed cluster $M=A\cup B$ which consists of a lighter cluster $A\subseteq O^{\prime}$ for an optimum cluster $O^{\prime}$ and a heavier cluster $B\subset O^{\prime\prime}$ for another optimum cluster $O^{\prime\prime}$, then merging $A\cup B$ with another cluster $C\subset O^{\prime\prime}$ basically costs as much as $A\subseteq O^{\prime}$ and $B\cup C\subseteq O^{\prime\prime}$ cost individually, plus what merging $A$ and $B$ has cost us already (up to constant factors). Corollary 5 allows us to analyze the $1$-means costs of the clusters after phase $P4$.

Lemma 31.

Let $F=\{x_{\ell},\ldots,x_{r}\}$ be the final state of a $2$ -composed cluster, with $x_{\ell},\ldots,x_{m}\in O_{j}$ and $x_{m+1},\ldots,x_{r}\in O_{j+1}$ . The state of the cluster after phase $P4$ is either $N=\{x_{\ell},\ldots,x_{r-1}\}$ or $N=\{x_{\ell-1},\ldots,x_{r}\}$ . In both cases,

[TABLE]

Now we come to phase $P5$, which we have not completely defined yet. The problem with phase $P5$ is that we can no longer charge all clusters ‘internally’. To see this, we first need a definition. We say that a $2$-composed cluster $F$ from $O_{j}$ and $O_{j+1}$ points to cluster $A$ if

•

$w(F\cap O_{j})\geq w(F\cap O_{j+1})$ holds and $A$ is the cluster left of $F$ , or

•

$w(F\cap O_{j})\leq w(F\cap O_{j+1})$ holds and $A$ is the cluster right of $F$ .

We define a lopsided cluster to be a $2$ -composed cluster $F=\{x_{\ell},\ldots,x_{r}\}$ for which the last merge is $\{F\backslash\{x\},\{x\}\}$ , but at the time of this merge, $F^{\prime}=F\backslash\{x\}$ does not point to $\{x\}$ . This means that we cannot use Corollary 5 (directly) to charge this merge. As a technicality, we also call a $2$ -composed cluster lopsided if it only contains two points in its final state; again, we cannot use Corollary 5 in this case.

We have to pay attention to one more detail when defining phase $P5$ . When charging $2$ -composed clusters internally, we could always be sure that the clusters that are involved are part of one of the two optimum clusters that the $2$ -composed cluster intersects. That is because the $2$ -composed cluster by definition only contains points from two optimum clusters, and we only dealt with points and subclusters of such a $2$ -composed cluster. However, in the following arguments, we will have to argue about clusters neighboring a $2$ -composed cluster. These may or may not belong to one of the optimum clusters. Let $A$ and $B$ be two clusters that are neighbors on the line such that $A$ lies left of $B$ . We say that there is an opt change between $A$ and $B$ if the last point in $A$ and the first point in $B$ belong to different optimum clusters.
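The three notions just introduced (pointing, lopsidedness, and opt changes) are simple predicates. The following sketch spells them out under our own representation: a $2$-composed cluster is described by the weights of its parts in $O_{j}$ and $O_{j+1}$, clusters are lists of positions on the line, and `optimum_of` maps a point to the index of its optimum cluster. All names are hypothetical.

```python
def points_to(weight_in_left_opt, weight_in_right_opt):
    """Sides a 2-composed cluster F points to, given w(F ∩ O_j) and w(F ∩ O_{j+1})."""
    sides = set()
    if weight_in_left_opt >= weight_in_right_opt:
        sides.add("left")
    if weight_in_left_opt <= weight_in_right_opt:
        sides.add("right")
    return sides


def is_lopsided(weight_in_left_opt, weight_in_right_opt, side_of_x, final_size):
    """F = F' ∪ {x} is lopsided if F' does not point to {x} or F has only two points.

    The two weights describe F' = F \\ {x}; side_of_x is 'left' or 'right' of F'.
    """
    if final_size == 2:
        return True
    return side_of_x not in points_to(weight_in_left_opt, weight_in_right_opt)


def opt_change(left_cluster, right_cluster, optimum_of):
    """True if the last point of the left cluster and the first point of the right
    cluster belong to different optimum clusters."""
    return optimum_of[max(left_cluster)] != optimum_of[min(right_cluster)]
```

Note that with equal weights both conditions of the definition hold, so such a cluster points to both of its neighbors.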

Now we define phase $P5$. Let $Y$ be the cluster that lies on the other side of $F^{\prime}$ than $x$ at the time of the merge $\{F^{\prime},\{x\}\}$. Let $Z$ be the cluster that lies ‘behind’ $x$ from the point of view of $F^{\prime}$ at the time of the merge $\{F^{\prime},\{x\}\}$. By behind from the point of view of $F^{\prime}$ we mean that if $x$ lies left of $F^{\prime}$, then $Z$ lies left of $x$, and if $x$ lies right of $F^{\prime}$, then $Z$ lies right of $x$.

Definition 32 (Phase P5).

Phase $P5$ contains the final merge $\{F^{\prime},\{x\}\}$ of a cluster $F=F^{\prime}\cup\{x\}$ if any of the following conditions applies.

1. $F$ is not lopsided (phase $P5a$),

2. $F$ is lopsided, there is no opt change between $Y$ and $F^{\prime}$, and $Y$ is an inner cluster (phase $P5b$),

3. $F$ is lopsided, there is no opt change between $\{x\}$ and $Z$, and $Z$ is an inner cluster (phase $P5c$),

4. $F$ is lopsided, there is no opt change between $\{x\}$ and $Z$, $Z$ is $2$-composed, and $Z$ points to $\{x\}$ (phase $P5d$).

The next lemma deals with merges in $P5a$ .

Lemma 33.

Let $F=\{x_{\ell},\ldots,x_{r}\}$ be the final state of a $2$ -composed cluster, with $x_{\ell},\ldots,x_{m}\in O_{j}$ and $x_{m+1},\ldots,x_{r}\in O_{j+1}$ . Assume that $F$ is not lopsided. Then

[TABLE]

Now we consider the merges in phase $P5b$ .

Lemma 34.

Let $F=\{x_{\ell},\ldots,x_{r}\}$ be the final state of a $2$ -composed cluster, with $x_{\ell},\ldots,x_{m}\in O_{j}$ and $x_{m+1},\ldots,x_{r}\in O_{j+1}$ . Assume that $F$ is lopsided. Assume that at the time of the merge $\{F\backslash\{x\},\{x\}\}$ , the cluster on the other side of $F^{\prime}=F\backslash\{x\}$ is an inner cluster $Y$ , and there is no opt change between $F^{\prime}$ and $Y$ . Then if $x=x_{\ell}$ , we have

[TABLE]

and if $x=x_{r}$ , then

[TABLE]

The following lemma is the main lemma about the phases and summarizes our findings: After phase $P5$, the error is still bounded by a constant times the optimum value.

Lemma 35.

Let $C_{5}$ be the clustering after phase $P5$ . Then

[TABLE]

Good merges for the final analysis

In general, the clustering of Ward after phase $P5$ still has more than $k$ clusters. It remains to analyze the merges after phase $P5$ that reduce the number of clusters to $k$. For the final charging argument, we need four types of good merges. Good merges are not necessarily merges that Ward’s method does; instead, they form a collection of merges that are possible and can be used for charging. Indeed, good merges include merges that would not be present anymore if Ward did them, since then we would move them to the phases. But if Ward never uses them, they may still be present for us to charge against.

The whole point of the phases is to ensure that any merge that Ward may still do does not destroy two good merges. The final arguments of the proof will be to count good merges and to show that no two good merges can be invalidated simultaneously by one of Ward’s merges.

Recall that $W_{1},\ldots,W_{\ell}$ is the current Ward solution, and $O_{1},\ldots,O_{k}$ is a fixed optimal solution, numbered from left to right. The following merges are good merges in the sense that we can bound the increase in cost. Of course, the result of the merge only forms a cluster of low cost if the participating clusters had low cost beforehand.

T1:

Two inner clusters $W_{i},W_{i+1}$ of the same optimal cluster $O_{j}$ , i.e., $W_{i},W_{i+1}\subset O_{j}$ . This type of merge is never actually applied by Ward on simplified examples, but we need it for charging.

T2:

A $2$ -composed cluster $W_{i}\subset O_{j}\cup O_{j+1}$ for some $j$ and an inner cluster $W_{i+1}\subset O_{j+1}$ , with the condition that $W_{i+2}$ is an inner cluster of $O_{j+1}$ as well. Also: The symmetric situation of a $2$ -composed cluster $W_{i}\subset O_{j}\cup O_{j+1}$ for some $j$ and an inner cluster $W_{i-1}\subset O_{j}$ with the condition that $W_{i-2}\subset O_{j}$ .

T3:

A $2$ -composed cluster $W_{i}\subset O_{j}\cup O_{j+1}$ for some $j$ and an inner cluster $W_{i-1}\subset O_{j}$ , with the condition that $W_{i}$ points to $W_{i-1}$ . Also: The symmetric situation of a $2$ -composed cluster $W_{i}\subset O_{j}\cup O_{j+1}$ for some $j$ and an inner cluster $W_{i+1}\subset O_{j+1}$ with the condition that $W_{i}$ points to $W_{i+1}$ .

T4:

Two $2$ -composed clusters $W_{i}\subset O_{j}\cup O_{j+1}$ and $W_{i+1}\subset O_{j+1}\cup O_{j+2}$ that point at each other.

We already know T1 merges (inner-cluster merges), T2 merges (growth phases and phase $P5c$) and T3 merges (merges chargeable with Corollary 5). We know that applying them increases the cost by at most a constant factor. We also know that these merges cannot happen anymore: T1 merges are inner-cluster merges, which Ward does not do on our example. T2 merges happen either in a growth phase or in phase $P5c$. T3 merges merge non-lopsided clusters, which happens in phase $P5a$.

T4 is a type of merge that we did not yet consider, and which Ward can still do. Indeed, to charge it, we need the general charging statement of Lemma 29, from which Corollary 5 follows.

Let $W_{i}$ and $W_{i+1}$ constitute a T4 merge as described above. Then Lemma 29 with $A=W_{i}\cap O_{j}$ , $B=W_{i}\cap O_{j+1}$ , $C=W_{i+1}\cap O_{j+1}$ and $D=W_{i+1}\cap O_{j+2}$ implies that

[TABLE]

Thus, if $\Delta(W_{i})+\Delta(W_{i+1})$ was bounded by a constant factor times the optimal cost of the points in $W_{i}\cup W_{i+1}$ , then this is still true after the merge of $W_{i}$ and $W_{i+1}$ (with a higher factor).

Counting inner clusters

Observe that the only merges that delete more than one inner cluster are the merges in phase $P1$. All other merges remove either exactly one inner cluster, or none at all. In phases $P2$ to $P5$, every merge eliminates exactly one inner cluster. In the beginning, there are $n$ inner clusters. So if phase $P1$ has $n_{1}$ merges and phases $P2$ to $P5$ together have $n_{r}$ merges, then we have $n-2n_{1}-n_{r}$ inner clusters after phase $P5$, and we have $n_{1}$ $2$-composed clusters. The total number of all clusters is $n-n_{1}-n_{r}$.

Consider the Ward clustering $W_{1},\ldots,W_{t}$ after phase $P5$. We split the clustering into blocks, based on the inner clusters. More precisely, we get $n-2n_{1}-n_{r}-1$ blocks, each of which starts with an inner cluster, possibly contains some $2$-composed clusters, and ends with another inner cluster. The blocks overlap in the inner clusters.

We argue that there is at least one good merge in every block except for $k-n_{1}-1$ blocks. The exceptions are the blocks where the optimum cluster changes between start and end, but the change happens between the clusters (not in a $2$-composed cluster). This can only happen $k-n_{1}-1$ times because $n_{1}$ of the $k-1$ cluster borders are within $2$-composed clusters. For the remaining blocks, we argue the following. If there are no $2$-composed clusters in the block, then the two inner clusters are neighbors and form a T1 merge. If there is only one $2$-composed cluster in the block, then it has to point at an inner cluster and thus there is a T2 or a T3 merge. If there are multiple $2$-composed clusters, we argue as follows. The first $2$-composed cluster either points left and thus there is a T2 or a T3 merge, or it points to the right. Any further $2$-composed cluster either points to the one before it, forming a T4 merge, or it points to the right. This goes on until we either find a merge, or we find the last $2$-composed cluster, which then has to point at the second inner cluster, forming a T2 or T3 merge.
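The left-to-right scan in this argument can be sketched as follows. This is a simplified illustration, not the full case analysis: we assume that the block is not one of the exceptional blocks, that each $2$-composed cluster points in exactly one direction (`"L"` or `"R"`), and that positions inside a block are numbered $0$ (left inner cluster), $1,\ldots,m$ ($2$-composed clusters) and $m+1$ (right inner cluster). The function name `find_good_merge` is hypothetical.

```python
def find_good_merge(directions):
    """Return (type, (position, position + 1)) of the first good merge in a block.

    `directions[i - 1]` is the direction the i-th 2-composed cluster points to.
    """
    m = len(directions)
    if m == 0:
        return ("T1", (0, 1))                  # two neighboring inner clusters
    for i, direction in enumerate(directions, start=1):
        if direction == "L":
            if i == 1:
                return ("T2 or T3", (0, 1))    # points at the first inner cluster
            return ("T4", (i - 1, i))          # predecessor points right, so they face each other
    return ("T2 or T3", (m, m + 1))            # last cluster points at the second inner cluster


examples = [find_good_merge(d) for d in ([], ["L"], ["R", "R", "L"], ["R", "R"])]
```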

We collect one good merge from every block and call the resulting set of merges $\mathcal{S}$ . Observe that the cost of all merges in $\mathcal{S}$ together is a constant factor of the cost that we have so far, so all merges together cost $\mathcal{O}(1)\operatorname{opt}_{k}$ .

This argument alone is not enough. The main feature of $\mathcal{S}$ is that every merge that Ward actually performs can make at most one merge from our set invalid. This means that we can charge $n-2n_{1}-n_{r}-1-(k-n_{1}-1)$ merges to $\mathcal{S}$ .

Notice that our merges are disjoint except for possible overlap at inner clusters. Assume that a merge of Ward invalidates two merges from our set. There are two ways this can happen. Case one is that Ward’s merge is one of the two good merges that are invalidated. Say this merge is called $(A,B)$. Then the second merge involves either $A$ or $B$, say it involves $B$. Thus, there is another cluster $C$ next to $B$, and the merge $(A,B)$ invalidates itself and $(B,C)$. This in particular means that $(A,B)$ is a good merge. Since Ward does not do inner-cluster merges, either $A$ or $B$ has to be $2$-composed, since $(A,B)$ is a merge of Ward. If they are both $2$-composed clusters, then $A$ and $B$ are in the same block, thus $(A,B)$ and $(B,C)$ cannot both be in $\mathcal{S}$. Thus, one is $2$-composed and the other is an inner cluster, i.e., they form a T2 or T3 merge, since $(A,B)$ is supposed to be a good merge. If it is a T3 merge, then $(A,B)$ is not lopsided, and would have happened in phase $P5a$. If it is a T2 merge, then it is either not lopsided (phase $P5a$), or it is lopsided, but has an inner cluster behind its inner cluster (phase $P5c$). We conclude that a good merge $(A,B)$ cannot invalidate another good merge.

Case two is that the two good merges are disjoint, and Ward does a merge that overlaps with both of them. Thus, we have two good merges $(A,B)$ and $(C,D)$, and Ward performs merge $(B,C)$. Since Ward does not do inner-cluster merges, either $B$ or $C$ is $2$-composed, w.l.o.g. say that $C$ is $2$-composed. If $B$ is $2$-composed as well, then $(A,B)$ and $(C,D)$ are in the same block, so they would not both be in $\mathcal{S}$. So $B$ is an inner cluster. If $C$ points to $B$, then $(B,C)$ is not lopsided and would have happened in phase $P5a$. Thus, $C$ points to $D$. If $A$ is an inner cluster, then $(B,C)$ is a T2 merge and would have happened in phase $P5c$. So say that $A$ is $2$-composed. $(A,B)$ is a good merge. It is not a T2 merge since $C$ is $2$-composed. It has to be a T3 merge, thus, $A$ points to $B$. Thus, $(B,C)$ would have happened in phase $P5d$: It is a lopsided merge with $B$ left of $C$, and the $2$-composed cluster $A$ left of $B$ points to $B$.

We have seen that no merge of Ward can invalidate two merges from $\mathcal{S}$ . Thus, we can now charge in the following way. The cost of the performed merge is bounded by the cost of any available merge. For each Ward step, we look whether it invalidates a merge from $\mathcal{S}$ . If so, then we charge the performed merge to this good merge. If Ward’s merge does not invalidate any merge from $\mathcal{S}$ , we just arbitrarily charge a merge in $\mathcal{S}$ and mark it as invalid. In this manner, we can pay for $n-2n_{1}-n_{r}-(k-n_{1}-1)-1$ merges, i.e., we can pay until the number of clusters is reduced to $n-n_{1}-n_{r}-(n-2n_{1}-n_{r}-(k-n_{1}-1)-1)=k$ . That completes the proof of Theorem 24.
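For convenience, the final count can be expanded step by step. The only ingredients are the number $n-n_{1}-n_{r}$ of clusters after phase $P5$ and the number of merges in $\mathcal{S}$, which is the number of blocks minus the number of exceptional blocks; the notation $|\mathcal{S}|$ is our shorthand.

```latex
\begin{align*}
|\mathcal{S}| &= (n-2n_{1}-n_{r}-1)-(k-n_{1}-1) = n-n_{1}-n_{r}-k,\\
(n-n_{1}-n_{r})-|\mathcal{S}| &= (n-n_{1}-n_{r})-(n-n_{1}-n_{r}-k) = k.
\end{align*}
```

So $\mathcal{S}$ contains exactly as many good merges as Ward still has to perform to go from $n-n_{1}-n_{r}$ clusters down to $k$ clusters.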

6 Conclusions

We have initiated the theoretical study of the approximation guarantee of Ward’s method. In particular, we have shown that Ward computes a 2-approximation on well-separated instances, which can be seen as the first theoretical explanation for its popularity in applications. We have also seen that its worst-case approximation guarantee increases exponentially with the dimension of the input and that it computes an $\mathcal{O}(1)$ -approximation on one-dimensional instances.

These results leave room for further research. It would be particularly interesting to better understand the worst-case behavior of Ward’s method. It is not clear, for example, if it computes a constant-factor approximation if the dimension is constant. Our analysis of the one-dimensional case is very complex and the factor hidden in the $\mathcal{O}$ -notation is large. It would be interesting to simplify our analysis and to improve the approximation factor.



This research was supported by ERC Starting Grant 306465 (BeyondWorstCase).
