Parameterized k-Clustering: The distance matters!
Fedor V. Fomin, Petr A. Golovach, Kirill Simonov

TL;DR
This paper investigates the parameterized complexity of the k-Clustering problem under different Minkowski distances, revealing tractability for p in (0,1] and hardness for p=0 and p=∞.
Contribution
It establishes the fixed-parameter tractability of k-Clustering for p in (0,1], and proves hardness results for p=0 and p=∞, highlighting the importance of distance choice.
Findings
FPT algorithm for p in (0,1] with runtime 2^{O(D log D)}(nd)^{O(1)}.
Hardness results for p=0 and p=∞, unless FPT=W[1].
Distance order p critically affects the complexity of k-Clustering.
Abstract
We consider the k-Clustering problem, which is for a given multiset \(X\) of \(n\) vectors in \(\mathbb{Z}^d\) and a nonnegative number \(D\), to decide whether \(X\) can be partitioned into \(k\) clusters \(C_1, \dots, C_k\) such that the cost \[\sum_{i=1}^k \min_{c_i\in \mathbb{R}^d}\sum_{x \in C_i} \|x-c_i\|_p^p \leq D,\] where \(\|\cdot\|_p\) is the Minkowski (\(L_p\)) norm of order \(p\). For \(p=1\), k-Clustering is the well-known k-Median. For \(p=2\), the case of the Euclidean distance, k-Clustering is k-Means. We show that the parameterized complexity of k-Clustering strongly depends on the distance order \(p\). In particular, we prove that for every \(p\in(0,1]\), k-Clustering is solvable in time \(2^{O(D \log D)}(nd)^{O(1)}\), and hence is fixed-parameter tractable when parameterized by \(D\). On the other hand, we prove that for distances of orders \(p=0\) and \(p=\infty\), no such…
Parameterized k-Clustering: The distance matters!
Fedor V. Fomin
Department of Informatics, University of Bergen, Norway.
Petr A. Golovach
Kirill Simonov
Abstract
We consider the k-Clustering problem, which is for a given multiset \(X\) of \(n\) vectors in \(\mathbb{Z}^d\) and a nonnegative number \(D\), to decide whether \(X\) can be partitioned into \(k\) clusters \(C_1, \dots, C_k\) such that the cost
\[\sum_{i=1}^k \min_{c_i\in \mathbb{R}^d}\sum_{x \in C_i} \|x-c_i\|_p^p \leq D,\]
where \(\|\cdot\|_p\) is the Minkowski (\(L_p\)) norm of order \(p\). For \(p=1\), k-Clustering is the well-known k-Median. For \(p=2\), the case of the Euclidean distance, k-Clustering is k-Means. We show that the parameterized complexity of k-Clustering strongly depends on the distance order \(p\). In particular, we prove that for every \(p\in(0,1]\), k-Clustering is solvable in time \(2^{O(D \log D)}(nd)^{O(1)}\), and hence is fixed-parameter tractable when parameterized by \(D\). On the other hand, we prove that for distances of orders \(p=0\) and \(p=\infty\), no such algorithm exists, unless \(\mathrm{FPT}=\mathrm{W}[1]\).
1 Introduction
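Recall that for \(p>0\), the Minkowski or \(L_p\)-norm of a vector \(x=(x[1],\dots,x[d])\in\mathbb{R}^d\) is defined as
\[\|x\|_p=\Big(\sum_{i=1}^d |x[i]|^p\Big)^{1/p}.\]
Respectively, we define the (\(L_p\)-norm) distance between two vectors \(x\) and \(y\) as
\[\operatorname{dist}_p(x,y)=\|x-y\|_p^p=\sum_{i=1}^d |x[i]-y[i]|^p.\]
We also consider \(\operatorname{dist}_p\) for \(p=0\) and \(p=\infty\). For \(p=0\), \(\operatorname{dist}_0\) is \(L_0\) (or the Hamming) distance, that is the number of different coordinates in \(x\) and \(y\):
\[\operatorname{dist}_0(x,y)=|\{\, i : x[i]\neq y[i] \,\}|.\]
For \(p=\infty\), \(\operatorname{dist}_\infty\) is \(L_\infty\)-distance, which is defined as
\[\operatorname{dist}_\infty(x,y)=\max_{1\le i\le d}|x[i]-y[i]|.\]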
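The distances of the different orders can be evaluated directly from their definitions. The following is a minimal Python sketch, not part of the paper; the function name `dist_p` and the convention of passing `math.inf` for the order infinity are assumptions made for illustration.

```python
import math

def dist_p(x, y, p):
    """Distance of order p between two equal-length vectors (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == 0:
        # Hamming distance: number of coordinates where x and y differ.
        return sum(1 for t in diffs if t != 0)
    if p == math.inf:
        # L_infinity distance: largest coordinate-wise difference.
        return max(diffs, default=0)
    # Finite p > 0: sum of |x[i] - y[i]|^p, with no p-th root taken.
    return sum(t ** p for t in diffs)

x, y = (0, 3, 5), (4, 3, 2)
print(dist_p(x, y, 0), dist_p(x, y, 1), dist_p(x, y, 2), dist_p(x, y, math.inf))
# prints: 2 7 25 4
```

Note that for a finite order the sketch returns the p-th power of the norm of the difference, which is the quantity summed in the clustering cost.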
The k-Clustering problem is defined as follows. For a given (multi) dataset \(X\) of \(n\) vectors (points), the task is to find a partition of \(X\) into \(k\) clusters \(C_1,\dots,C_k\) minimizing the cost
\[\sum_{i=1}^k \min_{c_i\in\mathbb{R}^d}\sum_{x\in C_i}\operatorname{dist}_p(x,c_i).\]
In particular, for \(p=1\), \(\operatorname{dist}_1\) is the \(L_1\)-distance and the corresponding clustering problem is known as k-Median. (Often in the literature, k-Median is also used for clustering minimizing the sums of the Euclidean distances.) For \(p=2\), \(\operatorname{dist}_2\) is the \(L_2\) (Euclidean) distance, and then the clustering problem becomes k-Means.
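For a single cluster the inner minimization has a simple closed form in these two cases: it is a standard fact that a coordinate-wise median minimizes the sum of \(L_1\) distances and the coordinate-wise mean minimizes the sum of squared Euclidean distances. The sketch below illustrates this under those assumptions; the function names are hypothetical and not taken from the paper.

```python
from statistics import mean, median_low

def cost_k_median(cluster):
    """Optimal L_1 cost of one cluster, using a coordinate-wise median centroid."""
    d = len(cluster[0])
    c = [median_low([x[i] for x in cluster]) for i in range(d)]
    return sum(abs(x[i] - c[i]) for x in cluster for i in range(d))

def cost_k_means(cluster):
    """Optimal squared-Euclidean cost of one cluster, using the mean as centroid."""
    d = len(cluster[0])
    c = [mean([x[i] for x in cluster]) for i in range(d)]
    return sum((x[i] - c[i]) ** 2 for x in cluster for i in range(d))

cluster = [(0, 0), (1, 0), (10, 2)]
print(cost_k_median(cluster))   # 12
print(cost_k_means(cluster))    # about 63.33
```

The outlier (10, 2) dominates the squared cost far more than the \(L_1\) cost, which hints at why the two objectives can prefer different partitions.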
Let us note that optimal clusterings for the same set of vectors can be drastically different for various values of \(p\), as shown in Figure 1. The main conceptual contribution of this paper is that the complexity of k-Clustering also strongly depends on the choice of \(p\).
k-Clustering, and especially k-Median and k-Means, are among the most prevalent problems occurring in virtually every subarea of data science. We refer to the survey of Jain [22] for an extensive overview. While in practice the most common approaches to clustering are based on different variations of Lloyd’s heuristic [25], the problem is interesting from the theoretical perspective as well. In particular, there is a vast amount of literature on approximation algorithms for k-Clustering whose behavior can be analyzed rigorously, see e.g. [1, 2, 6, 8, 9, 16, 17, 19, 24, 13, 23, 10, 30].
When it comes to exact solutions, the complexity of k-Clustering is less understood. The k-Clustering problem is naturally “multivariate”: in addition to the input size, there are also parameters like space dimension \(d\), number of clusters \(k\) or the cost of clustering \(D\). The problem is known to be NP-complete for \(k=2\) [3, 15] and for \(d=2\) [28, 26]. By the classical work of Inaba et al. [21], in the case when both \(d\) and \(k\) are constants, k-Clustering is solvable in polynomial time \(O(n^{dk+1})\). Under ETH, the lower bound of \(n^{\Omega(k)}\), even when \(d=4\), was shown by Cohen-Addad et al. in [11] for the settings where the set of potential candidate centers is explicitly given as input. However, the lower bound of Cohen-Addad et al. does not generalize to the settings of this paper when any point in Euclidean space can serve as a center. For the special case, when the input consists of binary vectors and the distance is Hamming, the problem is solvable in time \(2^{O(D\log D)}(nd)^{O(1)}\) [18].
Our results and approaches. In this paper we investigate the dependence of the complexity of k-Clustering on the cost of clustering \(D\). It appears that adding this new “dimension” makes the complexity landscape of k-Clustering intricate and interesting. More precisely, we consider the following problem.
Input:
A multiset \(X\) of \(n\) vectors in \(\mathbb{Z}^d\), a positive integer \(k\), and a nonnegative number \(D\).
Task:
Decide whether there is a partition of \(X\) into \(k\) clusters \(C_1,\dots,C_k\) and \(k\) vectors \(c_1,\dots,c_k\), called centroids, in \(\mathbb{R}^d\) such that \(\sum_{i=1}^k\sum_{x\in C_i}\operatorname{dist}(x,c_i)\le D\).
k-Clustering with distance \(\operatorname{dist}\) parameterized by \(D\)
Let us remark that the vector set \(X\) (like the column set of a matrix) can contain many equal vectors. Also we consider the situation when vectors from \(X\) are integer vectors, while centroid vectors are not necessarily from \(X\). Moreover, coordinates of centroids can be reals.
Our main algorithmic result is the following theorem.
Theorem 1.
k-Clustering with distance \(\operatorname{dist}_p\) is solvable in time \(2^{O(D \log D)}(nd)^{O(1)}\) for every \(p\in(0,1]\).
Thus k-Clustering when parameterized by \(D\) is fixed-parameter tractable (FPT) for Minkowski distance \(\operatorname{dist}_p\) of order \(0<p\le 1\). Superficially, the general idea of the proof of Theorem 1 is similar to the idea behind the algorithm for clustering binary vectors from [18]. However, there are several differences; the main one is that the proof in [18] is crucially based on the fact that the clustering is performed on binary vectors. Thus the reductions from [18] cannot be applied in our case. Moreover, as we will see in Theorem 2, the existence of an FPT algorithm for k-Clustering in \(L_0\) is highly unlikely.
In the first step of our algorithm we use color coding to reduce the problem to the Cluster Selection problem, which we find interesting on its own. In Cluster Selection we have \(t\) groups of weighted vectors and the task is to select exactly one vector from each group such that the weighted cost of the composite cluster is at most \(D\). More formally,
Input:
A set of \(m\) vectors \(X\) given together with a partition \(X=X_1\cup\dots\cup X_t\) into \(t\) disjoint sets, a weight function \(w\colon X\to\mathbb{Z}_+\), and a nonnegative number \(D\).
Task:
Decide whether it is possible to select exactly one vector \(x_i\) from each set \(X_i\) such that the total cost of the composite cluster formed by \(x_1\), …, \(x_t\) is at most \(D\): \(\min_{c\in\mathbb{R}^d}\sum_{i=1}^t w(x_i)\cdot\operatorname{dist}(x_i,c)\le D\).
Cluster Selection with distance \(\operatorname{dist}\) parameterized by \(D\)
Informally (see Theorem 9 for the precise statement), our reduction shows that if the distance norm satisfies some specific properties (which \(\operatorname{dist}_p\) satisfies for all \(p\)) and if Cluster Selection is FPT parameterized by \(D\), then so is k-Clustering. Therefore, in order to prove Theorem 1, all we need is to show that Cluster Selection is FPT parameterized by \(D\) when \(p\in(0,1]\). This is the most difficult part of the proof. Here we invoke the theorem of Marx [27] on the number of subhypergraphs in hypergraphs of bounded fractional edge cover.
Interestingly, Theorem 1 does not hold for distance \(\operatorname{dist}_0\). More precisely, for clustering in \(L_0\) we prove the following theorem.
Theorem 2.
With distance , -Clustering parameterized by and Cluster Selection parameterized by are -hard.
In particular, this means that up to a widely-believed assumption in complexity that , Theorem 2 rules out algorithms solving -Clustering in time and algorithms solving Cluster Selection in in time for any functions and . Similar hardness result holds for .
Theorem 3.
With distance , -Clustering parameterized by and Cluster Selection parameterized by are -hard.
This naturally brings us to the question: What happens with k-Clustering for \(p\in(1,\infty)\), especially for the Euclidean distance, that is \(p=2\)? Unfortunately, we are not able to answer this question when the parameter is \(D\) only. However, we can prove that
Theorem 4.
-Clustering* and Cluster Selection with distance are when parameterized by .*
Thus in particular, Theorem 4 implies that -Clustering with distance is parameterized by . On the other hand, we prove that
Theorem 5.
Cluster Selection* with distance is -hard for every when parameterized by .*
In particular, Theorem 5 yields that the approach we used to establish the tractability (with parameter \(D\)) of k-Clustering for \(p\in(0,1]\) will not work for \(p>1\).
We summarize our and previously known algorithmic and hardness results for k-Clustering and Cluster Selection with different distances in Table 1.
The remaining part of this paper is organized as follows. Section 2 contains preliminaries. In Section 3 we prove Theorem 9 which provides us with Turing reduction from -Clustering to Cluster Selection. Theorem 9 appears to be a handy tool to establish tractability of -Clustering. In Section 4 we collect the results on clustering with -norm for . In particular, in Subsection 4.1, we prove Theorem 1, the main algorithmic result of this work, stating that when , -Clustering and Cluster Selection admit FPT algorithms with parameter . In Subsection 4.2 we complement the algorithmic upper bounds with lower bounds by proving that Cluster Selection is -hard when and parameter is (Theorem 12). In Section 5, we consider the case and prove Theorem 2 establishing -hardness of -Clustering and Cluster Selection. Section 6 is devoted to the case . Here we establish two hardness results about -Clustering: -hardness when parameterized by and -hardness in the case . In Section 7, we look at the case , with the particular emphasis on the most commonly used case . We show that when is the parameter, then Cluster Selection and -Clustering in the distance are . We also show that Cluster Selection is -hard when parameterized by for all . We conclude with open problems in Section 8.
2 Preliminaries and notation
Cluster notation. By a cluster we always mean a multiset of vectors in \(\mathbb{Z}^d\). For distance \(\operatorname{dist}\), the cost of a given cluster \(C\) is the total distance from all vectors in the cluster to the optimally selected cluster centroid, \(\min_{c\in\mathbb{R}^d}\sum_{x\in C}\operatorname{dist}(x,c)\). An optimal cluster centroid for a given cluster \(C\) is any \(c\in\mathbb{R}^d\) minimizing \(\sum_{x\in C}\operatorname{dist}(x,c)\). For most of the considered distances, we argue that an optimal cluster centroid could always be chosen among a selected family of vectors (e.g. integral). Whenever we show this, we only consider optimal cluster centroids of the stated form afterwards.
Complexity. A parameterized problem is a language \(L\subseteq\Sigma^*\times\mathbb{N}\) where \(\Sigma^*\) is the set of strings over a finite alphabet \(\Sigma\). Respectively, an input of \(L\) is a pair \((I,k)\) where \(I\in\Sigma^*\) and \(k\in\mathbb{N}\); \(k\) is the parameter of the problem. A parameterized problem \(L\) is fixed-parameter tractable (FPT) if it can be decided whether \((I,k)\in L\) in time \(f(k)\cdot|I|^{O(1)}\) for some function \(f\) that depends on the parameter \(k\) only. Respectively, the parameterized complexity class FPT is composed of fixed-parameter tractable problems. The W-hierarchy is a collection of computational complexity classes: we omit the technical definitions here. The following relation is known amongst the classes in the W-hierarchy: \(\mathrm{FPT}=\mathrm{W}[0]\subseteq\mathrm{W}[1]\subseteq\mathrm{W}[2]\subseteq\dots\subseteq\mathrm{W}[P]\). It is widely believed that \(\mathrm{FPT}\neq\mathrm{W}[1]\), and hence if a problem is hard for the class \(\mathrm{W}[i]\) (for any \(i\ge1\)) then it is considered to be fixed-parameter intractable. We refer to books [12, 14] for the detailed introduction to parameterized complexity.
We also provide conditional lower bounds by making use of the following complexity hypothesis formulated by Impagliazzo, Paturi, and Zane [20].
Exponential Time Hypothesis (ETH): There is a positive real \(s\) such that 3-CNF-SAT with \(n\) variables and \(m\) clauses cannot be solved in time \(2^{sn}(n+m)^{O(1)}\).
Graphs. For proving -hardness, we need to consider graphs. Whenever we work with a graph , we always fix some ordering on the vertices and on the edges . We drop and to simplify notation, so when we consider a vertex or an edge , and also denote integers—numbers of and according to the orderings and correspondingly.
3 From k-Clustering to Cluster Selection
In this section we present a general scheme for obtaining an FPT algorithm parameterized by \(D\), which is later applied to various distances.
First, we formalize the following intuition: there is no reason to assign equal vectors to different clusters.
Definition 6 (Initial cluster and regular partition).
For a multiset of vectors \(X\), an inclusion-wise maximal multiset \(I\subseteq X\) such that all vectors in \(I\) are equal is called an initial cluster.
We say that a clustering \(\{C_1,\dots,C_k\}\) of \(X\) is regular if for every initial cluster \(I\) there is an \(i\in\{1,\dots,k\}\) such that \(I\subseteq C_i\).
Now we prove that it suffices to look only for regular solutions.
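Proposition 1.
Let \((X,k,D)\) be a yes-instance to k-Clustering. Then there exists a solution of \((X,k,D)\) which is a regular clustering.
Proof.
Let us assume that the instance has a solution. There are clusters \(C_1,\dots,C_k\) and vectors \(c_1,\dots,c_k\) in \(\mathbb{R}^d\) such that
\[\sum_{i=1}^k\sum_{x\in C_i}\operatorname{dist}(x,c_i)\le D.\]
Note that for every \(x\in C_i\), \(\operatorname{dist}(x,c_i)\ge\min_{1\le j\le k}\operatorname{dist}(x,c_j)\). So if we consider a new clustering \(\{C'_1,\dots,C'_k\}\) with the same centroids, where \(C'_i\) are all vectors from \(X\) for which \(c_i\) is the closest centroid, the total distance does not increase. If we also break ties in favor of the lower index, then for any initial cluster \(I\) the same centroid \(c_i\) will be the closest, and all vectors from \(I\) will end up in \(C'_i\), so \(\{C'_1,\dots,C'_k\}\) is a regular clustering. ∎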
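The reassignment used in this proof can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the \(L_1\) distance is used as a default purely as an example, since the argument works for any distance.

```python
def l1(x, c):
    """L_1 distance, used here only as an example distance."""
    return sum(abs(a - b) for a, b in zip(x, c))

def reassign_to_closest(vectors, centroids, dist=l1):
    """Send every vector to its closest centroid; ties go to the lower index."""
    clusters = [[] for _ in centroids]
    for x in vectors:
        # The pair (distance, index) makes the tie-breaking rule explicit.
        best = min(range(len(centroids)), key=lambda i: (dist(x, centroids[i]), i))
        clusters[best].append(x)
    return clusters

vectors = [(0, 0), (2, 0), (2, 0), (4, 0)]
centroids = [(1, 0), (3, 0)]
print(reassign_to_closest(vectors, centroids))
# prints: [[(0, 0), (2, 0), (2, 0)], [(4, 0)]]
```

Both copies of (2, 0) are equidistant from the two centroids, and the deterministic tie-breaking keeps them in the same cluster, which is exactly what makes the resulting clustering regular.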
From now on, we consider only regular solutions.
Definition 7 (Simple and composite clusters).
We say that a cluster is simple if it is an initial cluster. Otherwise, the cluster is composite.
Next we state a property of k-Clustering with a particular distance, which is required for the algorithm. Intuitively, each unique vector adds at least some constant to the cluster cost. In the subsequent sections we show that the property holds for all distances in our consideration.
Definition 8 (\(\alpha\)-property).
We say that a distance \(\operatorname{dist}\) has the \(\alpha\)-property for some \(\alpha>0\) if for any \(s\ge2\) the cost of any composite cluster which consists of \(s\) initial clusters is at least \(\alpha(s-1)\).
The following problem is a key subroutine in our algorithm. In some cases it is solvable trivially, but it presents the main challenge for our main algorithmic result in the distance.
Input:
Family of \(t\) disjoint sets of vectors \(X_1,\dots,X_t\subseteq\mathbb{Z}^d\), containing \(m\) vectors in total, a weight function \(w\colon X_1\cup\dots\cup X_t\to\mathbb{Z}_+\), and a nonnegative number \(D\)
Task:
Determine whether it is possible to choose one vector \(x_i\) from each set \(X_i\) such that the total cost of forming a composite cluster out of \(x_1\), …, \(x_t\) is at most \(D\): \(\min_{c\in\mathbb{R}^d}\sum_{i=1}^t w(x_i)\cdot\operatorname{dist}(x_i,c)\le D\).
Cluster Selection parameterized by \(D\)
The intuition behind the weight function in the definition of Cluster Selection is that it represents sizes of initial clusters, that is, how many equal vectors there are.
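To make the problem statement concrete, here is a brute-force sketch of Cluster Selection for the \(L_1\) distance. It is not the algorithm of the paper: it simply tries every way of picking one vector per set, which is exponential in the number of sets. It relies on the standard fact that a weighted median is an optimal centroid coordinate for \(L_1\). The function names and the representation of the weight function as a dictionary keyed by vectors are assumptions made for illustration.

```python
from itertools import product

def weighted_l1_cost(points, weight):
    """Optimal weighted L_1 cost of a cluster: one weighted median per coordinate."""
    total = 0
    for i in range(len(points[0])):
        vals = sorted((x[i], weight[x]) for x in points)
        half = sum(w for _, w in vals) / 2
        acc = 0
        for v, w in vals:
            acc += w
            if acc >= half:          # first value reaching half of the weight
                med = v
                break
        total += sum(w * abs(v - med) for v, w in vals)
    return total

def cluster_selection(groups, weight, D):
    """Brute force: is there one vector per group with composite cost at most D?"""
    return any(weighted_l1_cost(choice, weight) <= D for choice in product(*groups))

groups = [[(0, 0), (5, 5)], [(1, 0), (9, 9)], [(0, 1), (7, 7)]]
weight = {x: 1 for g in groups for x in g}
print(cluster_selection(groups, weight, 2))   # True: pick (0,0), (1,0), (0,1)
print(cluster_selection(groups, weight, 1))   # False
```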
We also need a procedure to enumerate all possible optimal cluster costs which are less than \(D\). This may not be straightforward since not all distances in our consideration are integer. So we assume that the set of all possible optimal cluster costs which are less than \(D\) is also given in the input. Now we are ready to state the result formally.
Theorem 9.
Assume that the -property holds, Cluster Selection is solvable in time , where is a non-decreasing function of its arguments, and we are given the set of all possible optimal cluster costs which are at most . Then -Clustering is solvable in time
[TABLE]
Proof.
By the \(\alpha\)-property, in any solution there are at most \(D/\alpha\) composite clusters, since each contains at least two initial clusters. Moreover, there are at most \(2D/\alpha\) initial clusters in all composite clusters.
Thus by Proposition 1, solving k-Clustering is equivalent to selecting at most \(2D/\alpha\) initial clusters and grouping them into composite clusters such that the total cost of these clusters is at most \(D\). We design an algorithm which, taking as a subroutine an algorithm for Cluster Selection, solves k-Clustering. The algorithm is sketched in Figure 3, an example is shown in Figure 2.
To perform the selection and grouping, our algorithm uses the color coding technique of Alon, Yuster, and Zwick from [4]. Consider the input as a family of initial clusters \(\mathcal{I}\). We color initial clusters from \(\mathcal{I}\) independently and uniformly at random by \(2D/\alpha\) colors 1, 2, …, \(2D/\alpha\). Consider any solution, and the particular set of at most \(2D/\alpha\) initial clusters which are included into composite clusters in this solution. These initial clusters are colored by distinct colors with probability at least \(e^{-2D/\alpha}\). Now we construct an algorithm for finding a colorful solution.
We consider all possible ways to split colors between clusters (some colors may be unused). Hence we consider all possible families of pairwise disjoint non-empty subsets of . Each family corresponds to a partition of the set of colors if we add one fictitious subset for colors which are not used in the composite clusters. The total number of partitions does not exceed .
When partition is fixed, we form clusters by solving instances of Cluster Selection: For each , we take initial clusters colored by elements of , bundle together those with the same color, and pass the resulting family to Cluster Selection. First note that there cannot be of size at most one, since then Cluster Selection has to make a simple cluster while we assume that all clusters obtained from are composite. Second, the total number of clusters has to be , the number of clusters is . For each we check that both conditions hold, and if not, we discard the choice of and move to the next one, before calling the Cluster Selection subroutine.
Next, we formalize how we call the Cluster Selection subroutine. We fix the set of colors , then take the sets for . We turn each set of initial clusters into a set of weighted vectors naturally: For each , we put one vector into , and . The family of sets of vectors , …, and the weight function are the input for Cluster Selection. Then we search for the minimum cluster cost bound from , for which the instance of Cluster Selection is a yes-instance, running each time the algorithm for Cluster Selection.
If for some setting to leads to a no-instance, or if , then we discard the choice of the partition and move to the next one. Otherwise, we report that -Clustering has a solution and stop. Next, we prove that in this case the solution indeed exists.
We reconstruct the solution to -Clustering as follows: For each the corresponding to instance of Cluster Selection has a solution . For each , consider the corresponding initial cluster consisting of vectors equal to . For each we obtain a composite cluster , all other clusters are simple. So the total cost is , which is at most . Thus, if the algorithm finds a solution, then is a yes-instance.
In the opposite direction. If there is a solution to -Clustering, then there is a regular solution, and with probability at least initial clusters which are parts of composite clusters in this solution are colored by distinct colors. Then, there is a partition which corresponds to this solution. This partition is obtained as follows: put into colors from the first composite cluster, into from the second and so on. At some point our algorithm checks the partition , and as it finds the optimal cost value for each cluster, then it is at most the cost of the corresponding cluster of the solution from which we started.
To analyze the running time, we consider partitions , for each we times search for optimal in time . And for each possible value of we make one call to the Cluster Selection algorithm, which takes time at most .
To amplify the error probability to be at least , we do iterations of the algorithm, each time with a new random coloring. As each iteration succeeds with probability at least , the probability of not finding a colorful solution after iterations is at most . So the total running time is .
The algorithm could be derandomized by the standard derandomization technique using perfect hash families [4, 29]. So k-Clustering is solvable in the same deterministic time. ∎
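The probabilistic part of the color coding argument rests on a standard fact: if a fixed set of \(t\) items is colored uniformly at random with \(t\) colors, all items get distinct colors with probability \(t!/t^t \ge e^{-t}\). The sketch below, with hypothetical function names and \(t\) standing for the number of items that must be colored distinctly, computes this probability and the number of independent repetitions that pushes the failure probability below a chosen threshold.

```python
import math

def colorful_probability(t):
    """Probability that t fixed items receive pairwise distinct colors out of t."""
    return math.factorial(t) / t ** t

def repetitions_needed(t, failure=0.01):
    """Independent random colorings needed so that all of them fail w.p. <= failure."""
    q = colorful_probability(t)
    if q >= 1.0:
        return 1
    # (1 - q)^r <= failure  <=>  r >= log(failure) / log(1 - q)
    return math.ceil(math.log(failure) / math.log1p(-q))

for t in (2, 4, 8):
    q = colorful_probability(t)
    assert q >= math.exp(-t)          # the bound t!/t^t >= e^{-t}
    print(t, q, repetitions_needed(t))
```

The number of repetitions grows roughly like \(e^{t}\), which is why the bound on the number of initial clusters in composite clusters matters for the running time.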
4 Algorithms and complexity for \(L_p\) distances with \(p\in(0,1]\)
The main motivation for the results in this section is the study of k-Clustering with the \(L_1\) distance, the case widely known as k-Medians. However, our main algorithmic result also extends to distances of order \(p\in(0,1)\) since in some sense they behave similarly to the \(L_1\) distance.
4.1 FPT algorithm when parameterized by \(D\)
In this subsection, we prove Theorem 1: when \(p\in(0,1]\), k-Clustering admits an FPT algorithm with parameter \(D\). First we state basic geometrical observations for the cases \(p=1\) and \(p\in(0,1)\). Then we propose a general algorithm for Cluster Selection which relies only on these properties. Finally, we show how Theorem 9 could be applied.
The next two claims deal with the structure of optimal cluster centroids. We state and prove them in the case of weighted vectors where each vector has a positive integer weight given by a weight function \(w\). The unweighted case is just a special case when the weight of each vector is one.
First, we show that coordinates of cluster centroids could always be selected among the values present in the input, which helps greatly in enumerating cluster centroids that may be optimal.
Claim 4.1.
Let \(C\) be a cluster and \(w\) be a weight function. Then there is an optimal (subject to the weighted distance \(\operatorname{dist}_p\)) centroid \(c\) of \(C\) such that for each \(i\in\{1,\dots,d\}\), the \(i\)-th coordinate of the centroid is from the values present in the input in this coordinate, that is \(c[i]\in\{x[i] : x\in C\}\). Moreover, for \(p=1\) we may assume that the optimal value is a weighted median of the values present in the \(i\)-th coordinate.
Proof.
For cluster , consider the corresponding multiset of unweighted vectors , where each vector is repeated times. We define for . Assume that . Let us consider an optimal cluster centroid for and denote . Figure 4 shows how the cluster cost behaves with respect to on a concrete set of values for and .
For the formal proof, we start with the case . The total cost of contributed by the -the coordinate is
[TABLE]
If for , then the derivative with respect to is
[TABLE]
And when for , analogously the derivative is . So if is odd, then the derivative is zero at , strictly negative before and strictly positive after, so , which is the only median, is the optimal value for . If is even, then the derivative is zero on , strictly negative before and strictly positive after. So any value from is optimal, and we may assume that it is one of the two medians , .
Now to the case , the contribution of the coordinate is
[TABLE]
When is between and , then the derivative of the above with respect to is equal to
[TABLE]
It is monotone on : when increases, the sum decreases, as terms of the form decrease and terms of the form increase, because . Thus, the optimal value on this interval is achieved at one of its ends. Doing the same for all intervals, we conclude that the optimal value for must be in . ∎
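The statement of Claim 4.1 can be checked numerically on a small example. The sketch below is only an illustration under assumptions: the function names are hypothetical, a single coordinate is considered, and the real line is sampled on a grid of step 0.01. It compares the best cost attained at an input value against the sampled real candidates for \(p=1\) and \(p=1/2\).

```python
def coordinate_cost(values, weights, y, p):
    """Weighted cost contributed by one coordinate when the centroid value is y."""
    return sum(w * abs(v - y) ** p for v, w in zip(values, weights))

def best_input_value(values, weights, p):
    """Best centroid coordinate among the values present in the input."""
    return min(sorted(set(values)), key=lambda y: coordinate_cost(values, weights, y, p))

values, weights = [0, 3, 4, 10], [1, 2, 1, 1]
grid = [-2 + 0.01 * j for j in range(1401)]        # sampled reals in [-2, 12]
for p in (1, 0.5):
    best = best_input_value(values, weights, p)
    best_cost = coordinate_cost(values, weights, best, p)
    # No sampled real value does better than the best input value.
    assert all(coordinate_cost(values, weights, y, p) >= best_cost - 1e-9 for y in grid)
    print(p, best, best_cost)
```

For \(p=1\) the selected value 3 is the weighted median of the example, with cost 11.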
In particular, by Claim 4.1 we may assume that the coordinates of optimal cluster centroids are integers. Then, the \(\alpha\)-property holds with \(\alpha=1\) since at most one of the initial clusters could have distance zero to the cluster centroid, and all others have distance at least one since the cluster centroid is integral. Namely, let \(x\) be a vector in the cluster, and \(c\) be the cluster centroid. If \(x\neq c\), then there is a coordinate \(i\) where \(x\) and \(c\) differ, and since they are both integral, \(|x[i]-c[i]|\ge1\), and
\[\operatorname{dist}_p(x,c)=\sum_{j=1}^d|x[j]-c[j]|^p\ge|x[i]-c[i]|^p\ge1.\]
In what follows, the expression half of vectors by weight means that the total weight of the corresponding set of vectors is at least half of the total weight of the cluster.
Claim 4.2.
If at least half of the vectors by weight in the cluster \(C\) have the same value \(a\) in some coordinate \(i\), then the optimal cluster centroid \(c\) is also equal to \(a\) in this coordinate.
Proof.
Let be the weight-respecting multiset of values which vectors from have in the -th coordinate: . Consider the difference between selecting and some other value as the -th coordinate of the centroid:
[TABLE]
The inequality holds since at least half of the elements of are equal to , and so for any value there is a term in corresponding to one of the values from equal to . The last sum is non-positive because in every term
[TABLE]
as . This concludes the proof. ∎
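Claim 4.2 can also be illustrated on a small weighted example. In the sketch below, which uses hypothetical function names and a single coordinate, a value carrying at least half of the total weight is located and its cost is compared against every integer candidate in a range around the data, for several orders \(p\le1\).

```python
from collections import Counter

def majority_value(values, weights):
    """Return a value carrying at least half of the total weight, or None."""
    tally = Counter()
    for v, w in zip(values, weights):
        tally[v] += w
    value, weight = tally.most_common(1)[0]
    return value if 2 * weight >= sum(weights) else None

def coordinate_cost(values, weights, y, p):
    """Weighted cost contributed by one coordinate when the centroid value is y."""
    return sum(w * abs(v - y) ** p for v, w in zip(values, weights))

values, weights = [7, 7, 2, 9, 30], [2, 1, 1, 1, 1]
a = majority_value(values, weights)        # 7 carries weight 3 out of 6
for p in (0.3, 0.5, 1):
    cost_a = coordinate_cost(values, weights, a, p)
    assert all(cost_a <= coordinate_cost(values, weights, y, p) + 1e-9
               for y in range(-5, 41))
print(a)   # 7
```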
In order to apply Theorem 9, we need an FPT algorithm for Cluster Selection. Before obtaining it, we state some properties of hypergraphs, which we need for the algorithm.
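A hypergraph \(G\) is a set of vertices \(V(G)\) and a collection of hyperedges \(E(G)\), each hyperedge is a subset of \(V(G)\). If \(H\) and \(G\) are hypergraphs, we say that \(H\) appears at \(V'\subseteq V(G)\) as a subhypergraph if there is a bijection \(\pi\colon V(H)\to V'\) with a property that for any \(E\in E(H)\) there is \(E'\in E(G)\) such that \(\pi(E)=E'\cap V'\), the action of \(\pi\) is extended to subsets of \(V(H)\) in a natural way.
A fractional edge cover of a hypergraph \(H\) is an assignment \(\psi\colon E(H)\to[0,1]\) such that for every \(v\in V(H)\), \(\sum_{E\in E(H),\, v\in E}\psi(E)\ge1\). The fractional cover number \(\rho^*(H)\) is the minimum of \(\sum_{E\in E(H)}\psi(E)\) taken over all fractional edge covers \(\psi\).
We need the following result of Marx [27] about finding occurrences of one hypergraph in another.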
Lemma 10 ([27]).
Let be a hypergraph with fractional cover number , and let be a hypergraph where each hyperedge has size at most . There is an algorithm that enumerates in time every subset where appears in as a subhypergraph.
Also, the following version of the Chernoff Bound will be of use.
Proposition 2 ([5]).
Let , , …, be independent 0-1 random variables. Denote and . Then for ,
[TABLE]
We are ready to proceed with the proof that Cluster Selection with \(p\in(0,1]\) is FPT when parameterized by \(D\).
Theorem 11.
For every , Cluster Selection with distance is solvable in time .
Proof.
First we check if any of the given vectors could be the centroid of the resulting composite cluster. When the centroid is fixed, we find the optimal solution in polynomial time by just selecting the cheapest vector with respect to this centroid from each set. If at some point we find a suitable centroid, then we return that the solution exists. If not, we may assume that the centroid is not equal to any of the given vectors. As a consequence, any vector selected into the solution cluster contributes at least its weight to the total distance, since the centroid must be integral by Claim 4.1. So we may now consider only vectors of weight at most \(D\) and, moreover, the total weight of the resulting cluster is at most \(D\).
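The first step described above, trying each given vector as the centroid and greedily picking the cheapest vector from every set, can be sketched as follows. The function names and the dictionary representation of the weights are assumptions for illustration. The check is one-sided: a positive answer certifies a yes-instance, while a negative answer only means that no given vector works as a centroid, so the remaining steps of the proof are still needed.

```python
def select_for_centroid(groups, weight, c, p):
    """For a fixed centroid c, pick the cheapest vector from every group."""
    def weighted_dist(x):
        return weight[x] * sum(abs(a - b) ** p for a, b in zip(x, c))
    chosen = [min(g, key=weighted_dist) for g in groups]
    return chosen, sum(weighted_dist(x) for x in chosen)

def solvable_with_input_centroid(groups, weight, D, p):
    """True if some given vector, used as the centroid, yields cost at most D."""
    candidates = [x for g in groups for x in g]
    return any(select_for_centroid(groups, weight, c, p)[1] <= D for c in candidates)

groups = [[(0, 0), (5, 5)], [(0, 1), (9, 9)], [(1, 0), (7, 7)]]
weight = {x: 1 for g in groups for x in g}
print(select_for_centroid(groups, weight, (0, 0), 1))
# prints: ([(0, 0), (0, 1), (1, 0)], 2)
print(solvable_with_input_centroid(groups, weight, 2, 1))   # True
```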
Consider a resulting cluster with the centroid . There is some in from , and . So if we try all possible from (there are at most of them), any feasible centroid is at distance at most from at least one of them. Since and are integral, they could be different in at most coordinates, as .
We try all possible . After is fixed, we enumerate all subsets of coordinates where and could differ, we show how to do it efficiently afterwards. When the subset of coordinates is fixed, we consider all possible centroids, which are integral, equal to in all coordinates except , and differ from by at most in each of coordinates from . If for some coordinate , then , so can not be a centroid. With restrictions stated above, there are at most possible centroids.
It remains to show that we could enumerate all possible coordinate subsets efficiently. We reduce this task to the task of finding a specific subhypergraph and then apply Lemma 10.
Claim 4.3.
There are coordinate subsets where and an optimal cluster centroid could differ. There exists an algorithm which enumerates all of them in time .
Proof.
Let be a hypergraph with , one vertex for each coordinate, and for each vector in we take multiple hyperedges which contains exactly the coordinates where and differ. We add an edge only if there are at most such coordinates, otherwise can not be in the same cluster as . So hyperdeges in are of size at most . Since we consider only vectors of weight at most , .
For a solution, let be the vector selected from the corresponding , for , be the solution cluster and be the centroid. All vectors in are identical in all coordinates except at most , since if there are different values in at least coordinates, the cost is at least . Denote this subset of coordinates as , could also differ from only at . Denote the subset of coordinates where differs from as , and so . The solution induces a subhypergraph of in the following way. Leave only hyperedges corresponding to the vectors in , and restrict them to vertices in . There are at most vertices and at most hyperedges in , since the total weight is at most . An example of the correspondence between input vectors and hypergraphs is given in Figure 5.
The next claim shows that the fractional cover number of \(H\) is bounded by a constant.
Claim 4.4.
Each vertex in \(H\) is covered by at least half of the hyperedges of \(H\), and \(\rho^*(H)\le2\).
Proof.
Consider a vertex \(v\in V(H)\), and assume that less than half of the hyperedges cover \(v\). It means that in the \(v\)-th coordinate the centroid \(c\) differs from \(x\), but less than half of the vectors in \(C\) by weight differ from \(x\) in this coordinate. This contradicts Claim 4.2.
So each vertex is covered by at least half of the hyperedges, and setting \(\psi(E)=2/|E(H)|\) for every hyperedge \(E\) leads to \(\rho^*(H)\le2\). ∎
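The counting argument behind this bound is easy to verify mechanically: when every vertex lies in at least half of the hyperedges, assigning the same weight of two divided by the number of hyperedges to each hyperedge covers every vertex, and the weights sum to two. The sketch below uses a hypothetical function name and represents hyperedges as Python sets; the weight is capped at one only to keep the assignment within \([0,1]\) when there is a single hyperedge.

```python
from fractions import Fraction

def uniform_fractional_cover(vertices, hyperedges):
    """Total weight of the uniform assignment, after checking it is a cover."""
    m = len(hyperedges)
    psi = min(Fraction(1), Fraction(2, m))     # same weight on every hyperedge
    for v in vertices:
        covering = sum(1 for e in hyperedges if v in e)
        if 2 * covering < m:
            raise ValueError(f"vertex {v} lies in fewer than half of the hyperedges")
        assert covering * psi >= 1             # v is fractionally covered
    return psi * m

vertices = {1, 2, 3}
hyperedges = [{1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
print(uniform_fractional_cover(vertices, hyperedges))   # 2
```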
In order to enumerate all possible subsets of coordinates , we try all hypergraphs with at most vertices and at most hyperedges, and if each vertex is covered by at least half of the hyperedges, we find all places where appears in by Lemma 10. The last step is done in time. However, the number of possible could be up to . The following claim, which is analogous to Proposition 6.3 in [27], shows that we could consider only hypergraphs with a logarithmic number of hyperedges.
Claim 4.5.
If , it is possible to delete all except at most hyperedges from so that in the resulting hypergraph each vertex is covered by at least of the hyperedges, and .
Proof.
Denote , construct a new hypergraph on the same vertex set by independently selecting each hyperedge of with probability . Applying Proposition 2 with , probability of selecting more than hyperedges is at most . By Claim 4.4, each vertex of is covered by at least hyperedges, and the expected number of hyperedges covering in is at least . By Proposition 2 with , the probability that is covered by less than hyperedges in is at most . By the union bound, with probability at least we select at most hyperedges and each vertex is covered by at least hyperedges. So the claim holds, and by setting . ∎
So if there is a subhypergraph in corresponding to a solution, then there is also a subhypergraph in appearing at the same subset of with at most hyperedges and . Since we only need to enumerate possible coordinate subsets, it suffices to consider only hypergraphs with at most hyperedges, and there are of them. Since the fractional cover number is still bounded by a constant, the total running time is , as desired. ∎
With Claim 4.3 proven, the proof of the theorem is complete. The pseudocode given in Figure 6 summarizes the main steps of the algorithm. ∎
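Combining Theorem 9 and Theorem 11, we obtain an FPT algorithm for k-Clustering. This proves Theorem 1, which we recall here.
Theorem 1. k-Clustering with distance \(\operatorname{dist}_p\) is solvable in time \(2^{O(D \log D)}(nd)^{O(1)}\) for every \(p\in(0,1]\).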
Proof.
We have an algorithm for Cluster Selection whose running time is specified by Theorem 11. By Claim 4.1, the \(\alpha\)-property holds. The only missing part is to describe the way of producing the set of all possible cluster costs which are at most \(D\).
In the case \(p=1\) all distances are integral, so we can take the set \(\{0,1,\dots,\lfloor D\rfloor\}\).
For the general case, let . Consider a cluster and the corresponding optimal cluster centroid . For any , is a combination of elements of with nonnegative integer coefficients. This is because and are integral and the cluster cost is at most , hence for each . Since weights are also integral, the whole cluster cost is a combination of distances between cluster vectors and the centroid with nonnegative integer coefficients, and so also a combination of elements of with nonnegative integer coefficients. This means that we can take
[TABLE]
the sum of coefficients is at most since all elements of are at least 1. The size of is at most . ∎
Note that another widely studied version of -Clustering is where centroids could be selected only among the set of given vectors. Naturally, our algorithm also works in this setting since the set of possible centroids is only restricted further.
4.2 W[1]-hardness of Cluster Selection parameterized by for
In this subsection, we restrict our attention to the case. What happens when is not bounded, but the dimension and the number of clusters are parameters? There is a trivial XP-algorithm in time , as by Claim 4.1 it suffices to try all possible combinations of the values present in coordinates as possible cluster centroids. There are at most distinct values in each coordinate, so at most candidates for a cluster centroid. After the cluster centroids are fixed, each vector goes to the cluster with the closest centroid.
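The trivial XP-algorithm described above can be sketched as follows. This is a minimal illustration, assuming the L1 distance (sum of absolute coordinate differences) and unweighted vectors; the function names `l1` and `brute_force_k_median` are our own and do not come from the paper. Empty clusters are allowed, so using exactly `min(k, number of candidates)` distinct centroids loses nothing.

```python
from itertools import combinations, product


def l1(x, c):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(x, c))


def brute_force_k_median(vectors, k):
    """Try every k-subset of candidate centroids built from input values.

    Candidate centroids take, in each coordinate, one of the values
    present in that coordinate among the input vectors.
    Returns (best_cost, best_centroids).
    """
    d = len(vectors[0])
    # Distinct values occurring in each coordinate.
    values = [sorted({x[j] for x in vectors}) for j in range(d)]
    candidates = list(product(*values))
    best = None
    for centroids in combinations(candidates, min(k, len(candidates))):
        # Once centroids are fixed, each vector goes to the closest one.
        cost = sum(min(l1(x, c) for c in centroids) for x in vectors)
        if best is None or cost < best[0]:
            best = (cost, centroids)
    return best
```

The number of candidate centroids grows exponentially with the dimension, so the sketch is only practical for very small inputs, in line with the XP running time mentioned above.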
We do not know of a lower bound for -Clustering complementing this algorithm. However, we are able to show the hardness of Cluster Selection with respect to the dimension.
Theorem 12.
Cluster Selection with distance is W[1]-hard when parameterized by .
Proof.
We construct a reduction from Multicolored Clique with the input and . We set to , for each pair of colors and each edge between a vertex of color and a vertex of color we add a vector to the set , such that , and all other coordinates are set to zero, and a vector to the set which is the same as , only coordinates other than and are set to . We will refer to 0 and as boundary values. The sets and are the input to Cluster Selection, so is , and we set to . Intuitively, the set corresponds to the choice of the clique edge between -th and -th color, and mirrors it. All vectors have weight one. An example is given in Figure 7.
Note that in any feasible cluster, each coordinate has exactly values in , one from each of the sets and for . Out of all other values, exactly half are zero and half are . So the median is always in , and the boundary values in each column contribute exactly to the total distance.
Assume there is a colorful -clique in , with vertices , , …, . We form the resulting cluster by choosing the vector corresponding to the clique’s edge between its -th and -th vertices from , and also from , for all . For this cluster, in the -th coordinate we have all non-boundary values equal to . So the median is also , and the total distance is , since non-boundary values do not contribute anything.
In the other direction, if we are able to select a cluster of cost exactly , then all non-boundary values in each coordinate must be equal, denote this common value in the -th coordinate as . We claim that vertices , , …, form a colorful clique in . Indeed, since we have times in the -th column, then we have of them from the sets , one from each, and in the -th column the only non-boundary value is . So must have an edge to each for . By construction, vertices in the -th coordinate are of color .
∎
5 The distance
In this section, we consider the case . It is a natural measure of difference to consider since observation parameters are often incomparable, and we may well be interested in counting only the number of different entries. From another point of view, the distance gives the -Clustering problem a more combinatorial flavor, since the input vectors could be viewed as strings and we are interested in how close they are according to the Hamming distance. However, in comparison to a number of problems on strings, the size of the alphabet is unbounded.
First, note that there is a simple rule of finding the optimal cluster centroid for a given cluster.
Observation 1.
For a given cluster , the coordinates of the optimal cluster centroid could be set as
[TABLE]
breaking ties in favor of the lowest values.
By Observation 1, we may assume that optimal cluster centroids could never have values not present in the input, and in particular that they are integral.
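The rule of Observation 1 can be illustrated with a short sketch. It assumes unweighted vectors given as tuples and the Hamming distance as the cost; the names `hamming_centroid` and `hamming_cost` are hypothetical helpers introduced here.

```python
from collections import Counter


def hamming_centroid(cluster):
    """Coordinate-wise most frequent value, ties broken toward the lowest."""
    d = len(cluster[0])
    centroid = []
    for j in range(d):
        counts = Counter(x[j] for x in cluster)
        top = max(counts.values())
        # Among the most frequent values pick the smallest one.
        centroid.append(min(v for v, c in counts.items() if c == top))
    return tuple(centroid)


def hamming_cost(cluster, centroid):
    """Total number of entries that differ from the centroid."""
    return sum(
        sum(1 for a, b in zip(x, centroid) if a != b) for x in cluster
    )
```

Because every centroid coordinate is taken from the input, the centroid never contains a value absent from the cluster.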
We prove W[1]-hardness of -Clustering with the distance by showing a reduction from Clique. The reduction also shows hardness of Cluster Selection.
Note that when is fixed, we could apply Theorem 9 to obtain an FPT algorithm: Cluster Selection is solved trivially by trying every value present in each coordinate as a value for the centroid, and there are only variants. The -property holds for distance with , since at most one initial cluster could coincide with the cluster centroid, and all others are at distance at least one.
We restate Theorem 2, which we prove next.
See Theorem 2.
Proof.
First we show how to obtain an FPT reduction from Clique parameterized by the clique size to -Clustering.
Given an instance (, ) of Clique, for each pair of indices , , we make vectors in , assume . For each , we add a vector : two coordinates are set to vertex values, , , and in all other coordinates is set to the special padding value . In total, there are vectors and different values, since there are vertex values, all padding values are distinct from vertex values and from each other.
Finally, we set and . An example of the reduction is shown in Figure 8.
Now we prove that the original instance has a -clique iff the transformed instance has a -clustering of cost at most .
If there is a -clique, there is a clustering with cost : we take one nontrivial cluster of size and all other clusters are of size 1. Let ,…, be the vertices of the clique, for each , we take into the cluster. The cluster centroid is , each vector in the cluster has distance to the centroid of exactly .
Now to the opposite direction. Assume that there is a clustering of cost at most , and there are composite clusters: , …, . In each cluster and each coordinate, by Observation 1 we may assume that we select the most frequent vertex there as the value of the centroid, since all padding values are distinct. If there are no vertex values in this cluster in this coordinate, we may assume that we select any of the occurring padding values. For a cluster , denote the number of vertex-containing coordinates as , and the total number of vertex-valued entries which do not match with the centroid value in the corresponding coordinate as . We could write the total cost of the clustering as
[TABLE]
That holds since in each cluster each of the padding values is not matched with the cluster centroid and increases the total distance by one, except for the vertex-free coordinates, where exactly one of the padding values is selected as the value of the centroid. Also, each vertex-valued entry which is not matched with the centroid increases the total distance by one; there are of them.
There are clusters in total, of them are simple. We may assume that in the optimal clustering there are no empty clusters, since we could always move a vector from a composite cluster to an empty one without increasing the cost. So there are vectors in the composite clusters, which is equal to . We could rewrite the total cost as
[TABLE]
Now we show that for any clustering the value is at least , and it is equal to only in the -clique clustering. It suffices to prove the following lemma.
Lemma 13.
For any cluster such that , , where , and the equality holds only when is a -clique.
The lemma implies
[TABLE]
and also that the equality holds only when each term is equal to , so each is a -clique, but then since . So must contain a -clique if there is a clustering of cost at most , and the reduction is correct. Note that none of the could have size larger than since there are clusters in total.
Proof of Lemma 13.
First, we consider the case , so in each coordinate all vertex values are equal.
Claim 5.1.
If is a cluster of vectors obtained by applying the reduction described in the proof of Theorem 2 to any graph , , and , then .
Proof.
The proof is by induction on . The base is , and each non-empty cluster contains at least one vector and so at least 2 coordinates with vertices, we assume .
For the general case, if there are at least occurrences of a vertex in a coordinate , then there are at least coordinates with vertices. Each vector with in the -th coordinate has also some other vertex in some other coordinate. As in each coordinate all vertex values are equal, it could not be that two of the vectors with the value in the -th coordinate share the second vertex-valued coordinate, since then they would represent the same edge.
So each coordinate has at most vertex occurrences, otherwise the claim holds. Select a coordinate which contains some vertex value and remove the -th coordinate and all vectors which have the value in the -th coordinate. That corresponds to the natural restriction of the cluster to a subgraph . The size of is at least , and by induction there are at least coordinates which contain vertex values, so the original cluster has at least such coordinates, since there is also the -th coordinate with the vertex value . ∎
Now consider a cluster with . Let be the largest value with , so . Since , . By Claim 5.1, , then
[TABLE]
and so if , the inequality is strict. It is also strict if and , as the denominator becomes larger in the first step. Thus the only possibility of getting exactly is when .
But then we have exactly vertex values across coordinates, and each coordinate has at most vertex values by the argument in Claim 5.1, so each coordinate must have exactly vertex values. Since , they must be all equal. Denote the common vertex value in the -th coordinate as . Since each occurrence of in the -th coordinate corresponds to an edge to a different , vertices , …, form a clique in .
In the case , consider a new cluster which is obtained from by removing all vectors which have a vertex-valued entry not equal to the centroid value. Assume for now that . By the proof above, , since . The value could be obtained from by adding to the numerator and to the denominator. Removing vectors could not increase , so , and since each of the removed vectors has at least one vertex value not equal to the centroid value. If , then the new fraction is also at least 1 and so strictly greater than . If , then since and . If , then the new fraction became strictly larger, and so strictly larger than . In all cases, the inequality is strict when .
∎
Now to Cluster Selection: the reduction is almost the same, only we start from Multicolored Clique, and for each pair of indices , we obtain the set of vectors from edges in starting in color and ending in color . The vectors are constructed in the same way as in the previous reduction. All weights are set to one. The value of is the same, .
Since vectors are constructed in the same way, all statements about the cost of grouping them remain valid, in particular Lemma 13. Only now the statement of Cluster Selection already guarantees that we select exactly one cluster and exactly one vector from each , so exactly one edge between each pair of colors. And by Lemma 13 only the proper -clique has the optimal cost.
∎
Note that Cluster Selection with the distance is very similar to the known problem Consensus String With Outliers, studied e.g. in [7]. The only difference is that in Cluster Selection we have to select one point from each of the given sets, whereas in Consensus String With Outliers the goal is to select an arbitrary subset of size . The construction from Theorem 2 also shows W[1]-hardness of Consensus String With Outliers with respect to in the case of unbounded alphabet.
6 The distance
In this section, we consider the case . We prove two hardness results for -Clustering: W[1]-hardness when parameterized by and NP-hardness in the case .
First, we prove some useful facts about the structure of optimal cluster centroids. The one respect in which the distance is harder than all other distances we consider is that, even when the cluster is given, we cannot find the optimal cluster centroid by optimizing the value in each coordinate independently. So there seems to be no simple rule for finding the optimal cluster centroid of a given cluster. However, one can still do that in polynomial time by solving a linear program.
Claim 6.1.
Given a multiset of vectors in , there is a polynomial time algorithm to find minimizing
[TABLE]
Proof.
We reduce to solving a linear program, which we define next. Denote , introduce variables , …, corresponding to coordinates of the cluster centroid and variables , …, , where corresponds to the value . The optimum of the following linear program equals the minimum total distance.
[TABLE]
∎
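A linear program of the kind described in this proof can be written as follows. This is a sketch under assumed notation, not a reproduction of the paper's display: $x^1,\dots,x^n$ denote the cluster vectors, $c_1,\dots,c_d$ the centroid coordinates, and $t_i$ an upper bound on the maximum-coordinate distance between $x^i$ and $c$; all of these symbol names are our own choices.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{i=1}^{n} t_i \\
\text{subject to}\quad & t_i \ge x^i_j - c_j, && 1 \le i \le n,\ 1 \le j \le d,\\
                       & t_i \ge c_j - x^i_j, && 1 \le i \le n,\ 1 \le j \le d.
\end{align*}
```

At an optimum each $t_i$ equals $\max_j |x^i_j - c_j|$, because lowering $t_i$ to that value keeps all constraints satisfied and does not increase the objective. The program has $n + d$ variables and $2nd$ constraints, so it is of polynomial size.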
The next claim shows that we could only consider half-integral cluster centroids.
Claim 6.2.
For any multiset of vectors in , the vector which minimizes
[TABLE]
could always be chosen from (coordinates are either integer or half-integer).
Proof.
Assume that we have an optimal solution which has at least one coordinate not of the form , . For we denote , and
[TABLE]
calling this value the remainder of .
We could partition all coordinates into equivalence classes by remainder of . One could also define a partition of all vectors by the remainder of the distance to . These two partitions are related in the following sense: if has remainder then each coordinate where also has remainder , and vice versa. Now we take one particular remainder and show that we can shift it without losing optimality.
There are two kinds of vectors with the particular remainder : call bottom those vectors for which , and call top those vectors for which . Similarly, there are also two kinds of coordinates of , which we also call bottom and top depending on the value of .
Consider a bottom coordinate . Increasing increases for all bottom vectors , and decreases for all top vectors . Decreasing does the opposite, as does increasing a top coordinate. So if we take some sufficiently small value and simultaneously increase all bottom coordinates and decrease all top coordinates by , then for all bottom vectors their distance will become larger by , and for all top vectors, smaller by . And if we do the opposite, the bottom vectors will cost less and the top vectors will cost more. Then, we could just take the group which has more vectors (bottom or top) and choose the action which decreases the distance for these vectors. The larger group has at least as many vectors as the smaller group, so the total distance does not increase.
It remains to see which value of we could take. We could safely shift until we either reach a value in or another remainder. In any case, we reduce the number of distinct remainders by one, and so we conclude the proof by doing this inductively over the number of distinct remainders.
∎
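Claim 6.2 suggests a simple, if slow, way to find an optimal centroid for a small cluster: search only half-integral points. The sketch below assumes integer input vectors and the maximum-coordinate (L-infinity) distance; the names `linf` and `best_half_integral_centroid` are hypothetical. The search is restricted to the bounding box of the cluster, since clamping a centroid coordinate into the range of the input values never increases any distance.

```python
from fractions import Fraction
from itertools import product


def linf(x, c):
    """Maximum absolute coordinate difference between x and c."""
    return max(abs(a - b) for a, b in zip(x, c))


def best_half_integral_centroid(cluster):
    """Exhaustively search half-integral centroids inside the bounding box.

    Returns (cost, centroid) with exact Fraction arithmetic.
    """
    d = len(cluster[0])
    axes = []
    for j in range(d):
        lo = min(x[j] for x in cluster)
        hi = max(x[j] for x in cluster)
        # All multiples of 1/2 between lo and hi, inclusive.
        axes.append([Fraction(t, 2) for t in range(2 * lo, 2 * hi + 1)])
    best_cost, best_c = None, None
    for c in product(*axes):
        cost = sum(linf(x, c) for x in cluster)
        if best_cost is None or cost < best_cost:
            best_cost, best_c = cost, c
    return best_cost, best_c
```

For the cluster `[(0, 0), (1, 0), (0, 1)]` every integral centroid costs at least 2, while the half-integral point (1/2, 1/2) costs 3/2, so half-integrality cannot be strengthened to integrality in general.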
By Claim 6.2, the -property holds with , since at most one vector could be equal to the cluster centroid, and all others have distance at least due to half-integrality. We can also see that when the problem is parameterized by , it is FPT.
Claim 6.3.
-Clustering with the distance is FPT when parameterized by .
Proof.
We use Theorem 9. We have the -property, and for the set of all possible cluster costs not exceeding we could take all half-integral values not exceeding by Claim 6.2. All that remains is to solve Cluster Selection in FPT time.
For that, we try all possible , and then try each possible resulting cluster centroid . Since and is half-integral by Claim 6.2, we can try only vectors of this form, and that is done in time . ∎
6.1 W[1]-hardness when parameterized by
Knowing that -Clustering with the distance is FPT when parameterized by , the next natural question is whether the problem is FPT or W[1]-hard when parameterized only by . We show that W[1]-hardness is the case, proving Theorem 3, which we recall here for convenience.
See Theorem 3.
Proof.
First, we show a reduction from Clique to -Clustering. Given a graph and a clique size , we construct the following instance of the clustering problem.
We set the dimension to . We take vectors corresponding to vertices. For the vertex , first coordinates are set to zero, except -th coordinate, which is set to 2.
The last coordinates correspond to non-edges, vertex pairs which are not connected by an edge. For each vertex pair in the coordinate we set to , to , the order on , is chosen arbitrarily, and all other vectors to zero.
Finally, we set the number of clusters to and the total distance to . We show an example of how the reduction works in Figure 9.
If there is a clique of size in , then we have a solution of cost : take vectors corresponding to the clique vertices in one cluster, and make all other clusters trivial. For the only nontrivial cluster , we can always choose so that for any and for any coordinate . Each vertex coordinate has only 0 and , so setting to 1 there suffices. As in we have an edge between any two vertices, in any non-edge coordinate there are either all zeroes, or zeroes and , or zeroes and . In each of the cases there is a suitable value for : [math], or correspondingly.
Next, we prove that any solution has cost at least , and any solution which is not a -clique has strictly larger cost. For that, we prove the following claim.
Claim 6.4.
In the instance above, the cost of any cluster containing at least two vectors is at least . If there is at least one non-edge in , then the cost is at least .
Proof.
Denote the cluster centroid as . If each vector in has , the first statement is trivial. So assume that there is a vector in such that . Consider the coordinate which corresponds to the same vertex as the vector , , and all other vectors are zero in the coordinate . As , . Then, for any other , . The total cost of the cluster is at least , as .
Now to the second part of the claim. Assume there are only two vectors in and they do not have an edge; then there is a coordinate where one is 2 and the other is . No matter what we choose for , the cost is at least , and the statement follows. So assume that and there is a coordinate corresponding to a non-edge in . One vector from has 2 in the coordinate , another , and all others have 0. Then there is a vector in with distance to of at least 2, as either and or and . Let us just forget about this vector and consider all other vectors in . There are of them, and by the reasoning in the proof of the first statement, their cost is at least . In this proof we considered only vertex coordinates, so the vector we forgot and the -th coordinate (which is a non-edge coordinate) do not affect it. So, the total cost is at least . ∎
Assume that we have nontrivial clusters of sizes , nontrivial means that the size is at least two, for . By Claim 6.4, the total cost is at least
[TABLE]
as there are clusters in total, trivial clusters, and the total number of vectors is , from which it follows that . So no solution has cost less than .
Also, if there are at least two nontrivial clusters, then . So if a solution has cost , it must have only one nontrivial cluster, and its size must be .
Finally, assume that the solution indeed has only one nontrivial cluster, but there is a non-edge in it. Then, as the size is , by Claim 6.4 its cost is at least . So only a -clique has cost , which proves the correctness of the reduction.
Now, to Cluster Selection. We consider essentially the same reduction, only we start from Multicolored Clique. We obtain sets of vectors , …, in the same way as in the reduction above, only vectors obtained from vertices of color are put into . The total distance parameter is also set to . So parameters and of the obtained instance have the same value as the starting parameter .
Since vectors are constructed in the same way, Claim 6.4 still works. And now the statement of Cluster Selection enforces that exactly one cluster of vectors is selected. By Claim 6.4 it could be done with the cost if and only if there is a colorful -clique in the original graph.
∎
6.2 NP-hardness when
In this subsection we prove NP-hardness of -Clustering with the distance when . Intuitively, if we consider the previous reduction, partitioning the vectors optimally into two clusters loosely corresponds to partitioning the vertices into two sets such that as many vertices as possible have no edges inside their set. This, in turn, is Odd Cycle Transversal: the problem of removing the smallest number of vertices so that the remaining graph is bipartite. However, to make everything really work, we need to consider a modified version of Odd Cycle Transversal which we call Half-Integral Odd Cycle Transversal.
Half-Integral Odd Cycle Transversal parameterized by
Input: An undirected graph , an integer .
Task: Is there an assignment , such that and is bipartite, where ?
First we show that Half-Integral Odd Cycle Transversal is also NP-hard by constructing a reduction from 3-SAT.
Lemma 14.
There is a polynomial time reduction from 3-SAT to Half-Integral Odd Cycle Transversal.
Proof.
Given an instance of 3-SAT with variables and clauses, make a graph as follows. An example of the reduction is given in Figure 10. For each variable , introduce two vertices and , and connect them with an edge. Also introduce vertices and connect them to both and .
For each clause introduce four vertices ,…,. Consider the following seven vertices: , …, , and three variable vertices which are present in : if then we consider the vertex , and if then we consider the vertex . Connect all these seven vertices in a cycle such that each variable vertex is adjacent to two clause vertices. Finally, set to .
First, assume there is a satisfying assignment. Consider the following : if is true, , otherwise , on all other vertices . Clearly, .
Since does not take value , deleting edges with is equivalent to deleting vertices on which is 2. From each vertex gadget we deleted either or , so the remaining part is a star with leaves and center or . Since the assignment we started from is satisfying, from each clause cycle we deleted at least one vertex. So each cycle present in lost at least one vertex, and what remains is bipartite.
Now assume there is a solution to the Half-Integral Odd Cycle Transversal instance. We claim that for each variable . Consider a 2-coloring of : either and have the same color or not. In the former case, since the edge must be removed.
If and have different colors, assume that and . Then, each of the vertices takes one of the two colors, and so has an incident edge to or which needs to be deleted. But then, for each , and the total cost on these vertices is already . Then either or .
So we have variables and is at least on each pair of variable vertices, and in total is at most . Then has to be exactly on each variable pair, and zero on all other vertices. Now we claim that on each clause cycle there is a variable vertex with . If not, then none of the cycle edges gets deleted, as is equal to zero on clause vertices. But then the remaining graph could not be bipartite, since it contains an odd cycle.
To get a satisfying assignment, set to true if , or to false otherwise. In particular, if , is set to false, since . Each clause is satisfied since each clause cycle contains a variable vertex on which is equal to . ∎
Now we prove -hardness of -Clustering with and by constructing a reduction from Half-Integral Odd Cycle Transversal.
Theorem 15.
-Clustering with distance is NP-hard when .
Proof.
Consider an instance of Half-Integral Odd Cycle Transversal, if , we have a yes-instance since deletes all edges from the graph, so we may assume . Remove all isolated vertices in and add isolated edges to , it clearly does not change the type of the instance. The number of clusters is , set the dimension to , each coordinate corresponds to an edge. For each vertex add a vector to with all coordinates set to zero. Then, for each edge set to and to , the order on is chosen arbitrarily. Finally, set to . An example is given in Figure 11, additional isolated edges are dropped out for clarity.
If is a yes-instance of Half-Integral Odd Cycle Transversal, consider the solution . Split vectors into clusters according to any proper 2-coloring of . Now we show the way to select cluster centroids so that each vertex has distance at most to the corresponding centroid. We consider separately each of two clusters and each coordinate, indexed by an edge . For a cluster , there are three cases on how and are present in the cluster, for each of them we assign a particular value to the cluster centroid in the coordinate .
- If and are both not in , for vectors in all entries in the coordinate are zero, and we set also to zero. Each vector is at distance zero to the centroid in this coordinate.
- If only one of and is in , for vectors in all entries in the corresponding coordinate are zero, except one entry corresponding to the edge’s endpoint belonging to , which is either or . Set to or , correspondingly, then each vector is at distance in this coordinate.
- If both and are in , w.l.o.g. is and is , and all other points are zero. It must hold that , either or w.l.o.g. and . In the former case, set to zero, then all vectors have distance zero, and have distance in this coordinate. In the latter case, set to , then is at distance , and all other vectors, including , are at distance .
For any , since it holds for all coordinates that distance from to the corresponding cluster centroid is at most , then the distance is also at most , and the total cost of the clustering defined above is at most
[TABLE]
In the other direction, assume there is a clustering , with centroids , such that the total cost is at most . By Claim 6.2 we may assume that centroids are integral, and for any vector the distance to the nearest centroid is also an integer. We also may assume that centroids are between and in each coordinate since all the input vectors have entries in this range, and so we could move the centroids to the same range without increasing distances.
So, each vector has distance in to the closest centroid. We claim that it could not be that a vector has distance zero: in this case w.l.o.g , and so is equal to or in some coordinate, since each vertex has at least one incident edge. But then each vector in has distance at least to . And since at most two vectors could be equal to the centroids, each of the remaining vectors has distance at least 1. Consider isolated edges, at least of them do not have any endpoint equal to one of and . For these edges, the total distance of their endpoints is at least : either their endpoints are in different clusters, and so the endpoint in costs at least , or both endpoints are in the same cluster, and in total they cost since there are simultaneously values and in the coordinate corresponding to this edge. So each of the edges increases the cost by additional , and the total cost is at least .
Since each vector has distance at least , we may assume that the centroids are in . If we have (or ) we could change it to (or ), all vectors which could become farther from the centroid have in this coordinate. But then the distance for these vectors is still at most . We also may assume that distances are in , since distance could be only from to .
We claim that if we set , is a solution to Half-Integral Odd Cycle Transversal. Remove all edges with , and consider 2-coloring of induced by the partition . Assume that we have an edge such that and and are in the same cluster (w.l.o.g ). Then we have a coordinate such that w.l.o.g and , but due to and so , which is a contradiction. So is also a yes-instance. ∎
Note that the reduction from Theorem 15 also implements -Coloring, if we set to the number of colors and to , since with such a small budget we cannot allow any same-colored neighbors in the optimal clustering.
7 The case
In this section we consider the case , with the particular emphasis on the most commonly used case . With the distance, the -Clustering problem is widely studied under the name -Means.
7.1 FPT when parameterized by for
When we consider both and as the parameters, Cluster Selection in the distance becomes FPT, and so -Clustering is also FPT by Theorem 9.
Note that in any composite cluster, each vector except at most one is at distance at least from the centroid, so the -property holds with . Consider two different vectors, they have different values in some coordinate, and in this coordinate at least one of them is at distance at least from the centroid.
Now we prove Theorem 4, which we restate here.
See Theorem 4.
Proof.
We start with the proof that Cluster Selection is FPT. Distance enjoys the -property. Hence if then any composite cluster costs more than and the instance is clearly a no-instance. So we may assume that .
We claim that there are at most possible total weights of the resulting composite cluster. First, in the resulting cluster there could be at most one vector with weight strictly larger than . Otherwise, let us consider two such vectors and the coordinate in which they differ. No matter which value the centroid has there, it is at distance of at least from at least one of the vectors, so the total cost is larger than . So there are at most possibilities for the largest weight, and all of the other weights are at most .
We fix the total resulting cluster weight , the vector in the resulting cluster with the largest weight , and the coordinate . Since the centroid is the mean of the vectors in the resulting cluster, is of the form , where . We claim that the distance from to is bounded by a function of , and so each possible could be enumerated in FPT time. Moreover, all possible centroids could also be enumerated in FPT time since is a parameter.
Let be the resulting cluster, for all . The difference between and could be written as
[TABLE]
The absolute value of the numerator is since , gets multiplied by zero, and all other weights are at most . Also, for any , , since
[TABLE]
The total running time is at most
[TABLE]
since we try all possible cluster weights, all possible out of the input vectors, then all possible centroids which differ from by in each coordinate. Then for each centroid we check whether the optimal cluster for it has cost at most by selecting the best for each . This concludes the proof that Cluster Selection is FPT when parameterized by .
Now we proceed with the proof that -Clustering is FPT when parameterized by . For that we employ Theorem 9. We already have the -property and an algorithm for Cluster Selection. Hence the only thing left is to enumerate the set of all possible optimal cluster costs not exceeding .
Since there are vectors in total, each cluster contains from to vectors. For each possible cluster size the centroid is of the form , where . Since input vectors have integer coordinates, the cost of any cluster of size is of form , where . And since the cost is at most , . We enumerate all possible cluster sizes in , and for each cluster size all possible cluster costs in . In this way we obtain , and . ∎
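The enumeration of possible cluster costs can be illustrated with exact rational arithmetic. The sketch below assumes the squared Euclidean (k-Means) cost, unweighted integer vectors, and an integer budget; `l2_cluster_cost` and `candidate_costs` are hypothetical names, and the parameters `n` and `budget` stand for the number of vectors and the cost bound.

```python
from fractions import Fraction


def l2_cluster_cost(cluster):
    """Exact squared-Euclidean cost of a cluster around its mean."""
    s = len(cluster)
    d = len(cluster[0])
    # The centroid is the mean, so each coordinate is an integer over s.
    mean = [Fraction(sum(x[j] for x in cluster), s) for j in range(d)]
    return sum((x[j] - mean[j]) ** 2 for x in cluster for j in range(d))


def candidate_costs(n, budget):
    """All values a/s with 1 <= s <= n and 0 <= a <= s * budget."""
    return sorted(
        {Fraction(a, s) for s in range(1, n + 1) for a in range(s * budget + 1)}
    )
```

For a cluster of size s the cost equals the sum of squared entries minus the squared coordinate sums divided by s, which is why it is always an integer multiple of 1/s.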
7.2 W[1]-hardness when parameterized by
In our setting, -Clustering for seems to be harder than for , since we do not have the nice property that if many vectors have the same value in some coordinate then the centroid must also have this value. On the contrary, even if only one vector diverges from the rest, the optimal centroid also diverges. So the approach with enumerating nontrivial coordinate sets, which we successfully used in the case, is not likely to work.
We are able to prove that Cluster Selection for is W[1]-hard parameterized by . It remains open whether -Clustering for or specifically for is W[1]-hard or not, but our result shows that at least the approach we used to obtain an algorithm in the case would not yield an algorithm for .
First we state and prove two technical claims about the geometrical properties of clustering zero-one valued vectors in the case.
Claim 7.1.
If we have a cluster of size where vectors have zero and vectors have one in the coordinate , then the optimal centroid value in this coordinate is equal to
[TABLE]
and the coordinate contributes
[TABLE]
to the total cost.
Proof.
Assume that the centroid value in the coordinate is equal to , then the cost is
[TABLE]
It is easy to see that is worse than , and similarly is worse than , so we could restrict to . The derivative with respect to is
[TABLE]
as , the derivative is zero if and only if
[TABLE]
The derivative increases monotonically: when we increase , increases and decreases as . So the optimal value must be at its unique root defined by the expression above. Thus, the optimal cost is equal to
[TABLE]
∎
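For reference, the computation in the proof of Claim 7.1 can be written out as below. The symbol names are assumptions made for this sketch, since the claim's own notation is not reproduced here: a > 0 is the number of zeroes, b > 0 the number of ones, x the centroid value in the coordinate, and p > 1 the order of the distance.

```latex
\begin{align*}
f(x) &= a\,x^{p} + b\,(1-x)^{p}, \qquad x \in [0,1],\\
f'(x) &= p\,a\,x^{p-1} - p\,b\,(1-x)^{p-1},\\
f'(x^*) = 0 &\iff \frac{x^*}{1-x^*} = \left(\frac{b}{a}\right)^{\frac{1}{p-1}}
  \iff x^* = \frac{b^{\frac{1}{p-1}}}{a^{\frac{1}{p-1}} + b^{\frac{1}{p-1}}},\\
f(x^*) &= \frac{a\,b}{\left(a^{\frac{1}{p-1}} + b^{\frac{1}{p-1}}\right)^{p-1}}.
\end{align*}
```

As a sanity check, for p = 2 this gives x* = b/(a+b), the mean of the coordinate values, and cost ab/(a+b).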
Now we prove that it is optimal to have as many ones in the same coordinate as possible. For that, we calculate how much each one adds to the total cost depending on how many ones are there in a coordinate.
Claim 7.2.
Consider a cluster of zero-one valued vectors, and denote by the contribution of a coordinate in which there are ones and zeroes. The function is strictly decreasing for .
Proof.
Denote the number of zeroes in the coordinate as . By Claim 7.1, the contribution of the coordinate per each one is
[TABLE]
Let us denote , . The derivative of the above with respect to is equal to
[TABLE]
which is strictly positive for , hence proving the claim. ∎
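Both claims can be checked numerically. The sketch below assumes p > 1 and uses the closed form that follows from setting the derivative in Claim 7.1 to zero; the names `optimal_centroid`, `coordinate_cost`, `cost_at` and `per_one_costs` are hypothetical, with a the number of zeroes and b the number of ones.

```python
def optimal_centroid(a, b, p):
    """Centroid value minimising a*x**p + b*(1-x)**p for p > 1."""
    q = 1.0 / (p - 1.0)
    return b ** q / (a ** q + b ** q)


def coordinate_cost(a, b, p):
    """Optimal contribution of a coordinate with a zeroes and b ones."""
    q = 1.0 / (p - 1.0)
    return a * b / (a ** q + b ** q) ** (p - 1.0)


def cost_at(x, a, b, p):
    """Contribution of the coordinate when the centroid value is x."""
    return a * x ** p + b * (1.0 - x) ** p


def per_one_costs(t, p):
    """Cost per one in a cluster of size t, for b = 1, ..., t ones."""
    return [coordinate_cost(t - b, b, p) / b for b in range(1, t + 1)]


# Claim 7.1: the closed form agrees with a direct grid search.
a, b, p = 5, 2, 3.0
grid_min = min(cost_at(i / 10000, a, b, p) for i in range(10001))
print(abs(grid_min - coordinate_cost(a, b, p)) < 1e-5)

# Claim 7.2: the cost per one strictly decreases as ones are added.
costs = per_one_costs(6, p)
print(all(c1 > c2 for c1, c2 in zip(costs, costs[1:])))
```

A grid search is used here only as an independent check; it plays no role in the proofs.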
Now we are ready to prove the hardness result, which was stated in the introduction as Theorem 5. We recall the statement here.
See Theorem 5.
Proof.
We construct a reduction from Multicolored Clique. Given a graph and a clique size , we construct the following instance of Cluster Selection.
We set to . Each input set of vectors represents a choice of an edge of the clique between two particular colors, so we number the sets by unordered pairs of indices from 1 to . We set the dimension to ; coordinates are indexed by vertices.
The set consists of the following vectors: for each edge between a vertex of color and a vertex of color , we add a vector with one in the coordinate and one in the coordinate , and all other coordinates set to zero. All vectors have weight one. Finally, we set
[TABLE]
In Figure 12, we show the intuition behind the reduction by considering a simple example.
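The construction can also be explored on small graphs with the sketch below. It is an illustration under stated assumptions: colors are numbered from 0, each vector is stored as the set of its two coordinates equal to one, the budget is taken to be the cost of the clique solution as computed via Claim 7.1, and Cluster Selection is solved by brute force. All function names are hypothetical.

```python
from itertools import combinations, product


def coordinate_cost(a, b, p):
    """Optimal cost of a coordinate with a zeroes and b ones (p > 1)."""
    if a == 0 or b == 0:
        return 0.0
    q = 1.0 / (p - 1.0)
    return a * b / (a ** q + b ** q) ** (p - 1.0)


def build_instance(edges, color, k):
    """One vector set per unordered color pair, one vector per edge."""
    sets = {pair: [] for pair in combinations(range(k), 2)}
    for u, v in edges:
        cu, cv = color[u], color[v]
        if cu != cv:
            sets[tuple(sorted((cu, cv)))].append(frozenset((u, v)))
    return sets


def cluster_cost(chosen, p):
    """Cost of the cluster formed by the chosen vectors."""
    ones = {}
    for vec in chosen:
        for coord in vec:
            ones[coord] = ones.get(coord, 0) + 1
    t = len(chosen)
    # Coordinates with no ones contribute nothing.
    return sum(coordinate_cost(t - b, b, p) for b in ones.values())


def clique_budget(k, p):
    """Cost when k coordinates each hold k - 1 ones."""
    t = k * (k - 1) // 2
    return k * coordinate_cost(t - (k - 1), k - 1, p)


def has_cheap_selection(edges, color, k, p):
    """Brute force: is there a selection of cost at most the budget?"""
    sets = build_instance(edges, color, k)
    if any(not s for s in sets.values()):
        return False
    best = min(cluster_cost(c, p) for c in product(*sets.values()))
    return best <= clique_budget(k, p) + 1e-9


color = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
with_clique = [(0, 2), (2, 4), (0, 4), (1, 3)]  # triangle 0-2-4
without_clique = [(0, 2), (3, 4), (0, 4)]       # no colorful triangle
print(has_cheap_selection(with_clique, color, 3, 2.0))
print(has_cheap_selection(without_clique, color, 3, 2.0))
```

The brute force is exponential in the number of sets and serves only to illustrate the equivalence on toy inputs.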
If there is a colorful -clique in , then we construct a solution to our instance of Cluster Selection. Assume the clique is formed by vertices , , …, , where each vertex is of color . From each choose the vector corresponding to the edge . Among the chosen vectors, in every coordinate of the form there are ones from edges to and zeroes. All other coordinates are zero in the chosen vectors, so they do not contribute anything to the total distance. By Claim 7.1, the total distance is
[TABLE]
In the other direction, we prove that only the solution described above can have the cost ; all others have strictly larger cost. First notice that in any resulting cluster there are at most ones in each coordinate, since for any vertex , if we denote its color by , only vectors from sets of the form () have ones in the coordinate , and we take one vector from each set by the definition of Cluster Selection.
Each vector has exactly two ones, so in any resulting cluster there are ones in total. By Claim 7.2, any resulting cluster which does not have ones in coordinates has strictly larger cost, since only coordinates with exactly ones attain the optimal cost per one.
So, if the resulting cluster has the cost , then there are coordinates such that in each of them exactly of the chosen vectors have a one. We show that in this case the original instance of Clique has a -clique. For any color there are at most ones in all coordinates indexed by vertices of color in the resulting cluster. So all of these ones are in the same coordinate for some . We claim that the vertices , …, form a clique. Consider vertices and . We have taken some vector from , and this vector must have added a one to the coordinates and , so by construction the edge is in .
∎
8 Conclusion and open problems
In this paper, we presented an FPT algorithm for k-Clustering with parameterized by . However, for the case we were only able to show the W[1]-hardness of Cluster Selection. While the intractability of Cluster Selection does not exclude that k-Clustering could be FPT with , it indicates that a proof of this (if it is true at all) would require an approach completely different from ours. Thus an interesting and very concrete open question concerns the parameterized complexity of k-Clustering with and parameter .
Another open question is about the fine-grained complexity of k-Clustering when parameterized by . For several distances, we know XP-algorithms: an algorithm by Inaba et al. [21] for , as well as trivial algorithms for . For the case when the possible cluster centroids are given in the input, the matching lower bound is shown in [11]. However, we are not aware of a lower bound complementing the algorithmic results in the case when any point in Euclidean space can serve as a centroid.
Finally, let us note that our W[1]-hardness reductions could be easily adapted to obtain ETH-hardness results. Our reductions are from Clique and, assuming ETH, there is no algorithm for Clique. In most of our results, the ETH lower bounds derived from our reductions can be complemented by matching upper bounds through a trivial algorithm for Cluster Selection in time or and, consequently, an algorithm for k-Clustering obtained by Theorem 9. However, the reduction in Theorem 5 excludes only a algorithm for Cluster Selection with under ETH. Both the trivial algorithm in time and the algorithm from Theorem 4 in time (which could also be turned into a -time algorithm) fail to match this lower bound. So another open question is whether a better reduction exists or a subexponential algorithm can be obtained in this case.
References
- [1] M. R. Ackermann, J. Blömer, and C. Sohler, Clustering for metric and nonmetric distance measures, ACM Trans. Algorithms, 6 (2010), pp. 59:1–59:26.
- [2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan, Approximating extent measures of points, J. ACM, 51 (2004), pp. 606–635.
- [3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Machine Learning, 75 (2009), pp. 245–248.
- [4] N. Alon, R. Yuster, and U. Zwick, Color-coding, J. ACM, 42 (1995), pp. 844–856.
- [5] D. Angluin and L. Valiant, Fast probabilistic algorithms for Hamiltonian circuits and matchings, J. Computer and System Sciences, 18 (1979), pp. 155–193.
- [6] M. Badoiu, S. Har-Peled, and P. Indyk, Approximate clustering via core-sets, in Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), ACM, 2002, pp. 250–257.
- [7] C. Boucher, C. Lo, and D. Lokshtanov, Outlier detection for DNA fragment assembly, CoRR, abs/1111.0376 (2011).
- [8] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, Randomized dimensionality reduction for k-means clustering, IEEE Trans. Information Theory, 61 (2015), pp. 1045–1062.
