Parameterized k-Clustering: The distance matters!

Fedor V. Fomin; Petr A. Golovach; Kirill Simonov

arXiv:1902.08559·cs.DS·February 25, 2019


TL;DR

This paper investigates the parameterized complexity of the k-Clustering problem under different Minkowski distances, revealing tractability for p in (0,1] and hardness for p=0 and p=∞.

Contribution

It establishes the fixed-parameter tractability of k-Clustering for p in (0,1], and proves hardness results for p=0 and p=∞, highlighting the importance of distance choice.

Findings

01

FPT algorithm for p in (0,1] with runtime 2^{O(D log D)}(nd)^{O(1)}.

02

W[1]-hardness for p=0 and p=∞: no FPT algorithm parameterized by D exists unless FPT=W[1].

03

Distance order p critically affects the complexity of k-Clustering.

Abstract

We consider the $k$-Clustering problem, which is, for a given multiset of $n$ vectors $X \subset \mathbb{Z}^{d}$ and a nonnegative number $D$, to decide whether $X$ can be partitioned into $k$ clusters $C_{1}, \dots, C_{k}$ such that the cost \[\sum_{i=1}^k \min_{c_i\in \mathbb{R}^d}\sum_{x \in C_i} \|x-c_i\|_p^p \leq D,\] where $\|\cdot\|_{p}$ is the Minkowski ($L_{p}$) norm of order $p$. For $p = 1$, $k$-Clustering is the well-known $k$-Median. For $p = 2$, the case of the Euclidean distance, $k$-Clustering is $k$-Means. We show that the parameterized complexity of $k$-Clustering strongly depends on the distance order $p$. In particular, we prove that for every $p \in (0, 1]$, $k$-Clustering is solvable in time $2^{O(D \log D)} (nd)^{O(1)}$, and hence is fixed-parameter tractable when parameterized by $D$. On the other hand, we prove that for distances of orders $p = 0$ and $p = \infty$, no such…

Tables

Table 1: Complexity of $k$-Clustering and Cluster Selection. In the table, known $\operatorname{{\sf NP}}$-completeness results are for $p=1$ and $p=2$ only.

| $\operatorname{dist}_{p}$ | $k$-Clustering | Cluster Selection |
|---|---|---|
| $p=0$ | $\operatorname{{\sf W}}[1]$-hard param. $d+D$ [Thm 2]; $\operatorname{{\sf NP}}$-c for $k=2$ [15] | $\operatorname{{\sf W}}[1]$-hard param. $d+t+D$ [Thm 2] |
| $0<p\leq 1$ | $2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}$ [Thm 1]; $\operatorname{{\sf NP}}$-c for $k=2$ [15]; $\operatorname{{\sf NP}}$-c for $d=2$ [28] | $2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}$ [Thm 11]; $\operatorname{{\sf W}}[1]$-hard param. $t+d$ for $p=1$ [Thm 12] |
| $1<p<+\infty$ | $\operatorname{{\sf FPT}}$ param. $d+D$ for $p=2$ [Thm 4]; $\operatorname{{\sf NP}}$-c for $k=2$ [3]; $\operatorname{{\sf NP}}$-c for $d=2$ [26] | $\operatorname{{\sf FPT}}$ param. $d+D$ for $p=2$ [Thm 4]; $\operatorname{{\sf W}}[1]$-hard param. $t+D$ [Thm 5] |
| $p=\infty$ | $\operatorname{{\sf W}}[1]$-hard param. $D$ [Thm 3]; $\operatorname{{\sf NP}}$-c for $k=2$ [Thm 15] | $\operatorname{{\sf W}}[1]$-hard param. $t+D$ [Thm 3] |

Equations

$\sum_{i=1}^{k}\min_{c_{i}\in\mathbb{R}^{d}}\sum_{x\in C_{i}}\|x-c_{i}\|_{p}^{p}\leq D,$

$\|x\|_{p}=\big(\sum_{i=1}^{d}|x[i]|^{p}\big)^{1/p}.$

$\operatorname{dist}_{p}(x,y)=\|x-y\|_{p}^{p}=\sum_{i=1}^{d}|x[i]-y[i]|^{p}.$

$\operatorname{dist}_{0}(x,y)=|\{i\in\{1,\dots,d\} : x[i]\neq y[i]\}|.$

$\operatorname{dist}_{\infty}(x,y)=\max_{i\in\{1,\dots,d\}}|x[i]-y[i]|.$

$\sum_{i=1}^{k}\min_{c_{i}\in\mathbb{R}^{d}}\sum_{x\in C_{i}}\operatorname{dist}_{p}(x,c_{i}).$

$\sum_{i=1}^{k}\sum_{x\in C_{i}}\operatorname{dist}(x,c_{i})\leq D.$

$\min_{c\in\mathbb{R}^{d}}\sum_{i=1}^{t}w(x_{i})\cdot\operatorname{dist}(x_{i},c)\leq D.$

$\sum_{i=1}^{k}\sum_{x\in C_{i}}\operatorname{dist}(x,c_{i})\leq D.$

$\min_{c\in\mathbb{R}^{d}}\sum_{i=1}^{t}w(x_{i})\operatorname{dist}(x_{i},c)\leq D.$

$2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}|\mathcal{D}|\Phi(n,d,2D/\alpha,D).$

$|y_{1}-z|+|y_{2}-z|+\dots+|y_{t}-z|.$

$((z-y_{1})+\dots+(z-y_{i})+(y_{i+1}-z)+\dots+(y_{t}-z))^{\prime}=i-(t-i).$

$|y_{1}-z|^{p}+|y_{2}-z|^{p}+\dots+|y_{t}-z|^{p}.$

$p\cdot((z-y_{1})^{p-1}+\dots+(z-y_{i})^{p-1}-(y_{i+1}-z)^{p-1}-\dots-(y_{t}-z)^{p-1}).$

$\operatorname{dist}_{p}(x,c)=\sum_{i=1}^{d}|x[i]-c[i]|^{p}\geq|x[j]-c[j]|^{p}\geq 1^{p}=1.$

$\sum_{y\in S}|y-z|^{p}-\sum_{y\in S}|y-z^{\prime}|^{p}\leq\sum_{y\in S,\,y\neq z}(|y-z|^{p}-|y-z^{\prime}|^{p}-|z-z^{\prime}|^{p}).$

$|y-z|^{p}\leq|y-z^{\prime}|^{p}+|z-z^{\prime}|^{p},$

$P[X\leq(1-\beta)\mu]\leq\exp(-\beta^{2}\mu/2),$

$P[X\geq(1+\beta)\mu]\leq\exp(-\beta^{2}\mu/3).$

$\mathcal{D}=\{\sum_{b\in B}a_{b}\cdot b : a_{b}\in\mathbb{Z},\ a_{b}\geq 0,\ \sum_{b\in B}a_{b}\leq D\},$

$c[i]=\text{the most frequent element of the multiset }\{x[i]\}_{x\in C},\quad 1\leq i\leq d,$

$\sum_{i=1}^{t}(|C_{i}|(k-2)-(k-\beta(C_{i}))+\gamma(C_{i})).$

$(t+\binom{k}{2}-1)(k-2)-tk+\sum_{i=1}^{t}(\beta(C_{i})+\gamma(C_{i}))=\binom{k}{2}(k-2)-(k-2)+\sum_{i=1}^{t}(\beta(C_{i})-2+\gamma(C_{i})).$

$\sum_{i=1}^{t}(\beta(C_{i})-2+\gamma(C_{i}))=\sum_{i=1}^{t}\frac{\beta(C_{i})-2+\gamma(C_{i})}{|C_{i}|-1}(|C_{i}|-1)\geq\kappa\sum_{i=1}^{t}(|C_{i}|-1)=\kappa(\binom{k}{2}-1)=k-2,$

$\frac{\beta(C)-2}{|C|-1}\geq\frac{l-1}{\binom{l+1}{2}-1}=\frac{2}{l+2}\geq\frac{2}{k+1}=\kappa,$

$\sum_{x\in C}\operatorname{dist}_{\infty}(x,c).$

$\sum_{i=1}^{n}d_{i}\to\min$

$x_{i}[j]-c_{j}\leq d_{i}\quad\forall i,j:\ 1\leq i\leq n,\ 1\leq j\leq d$

$c_{j}-x_{i}[j]\leq d_{i}\quad\forall i,j:\ 1\leq i\leq n,\ 1\leq j\leq d$

$\sum_{x\in C}\operatorname{dist}_{\infty}(x,c)$

$\operatorname{rem}(a)=\begin{cases}\operatorname{frac}(a),&\text{if }\operatorname{frac}(a)<1/2\\ 1-\operatorname{frac}(a),&\text{otherwise,}\end{cases}$


Full text

Parameterized $k$ -Clustering: The distance matters!

Fedor V. Fomin

Department of Informatics, University of Bergen, Norway.

Petr A. Golovach

Kirill Simonov

Abstract

We consider the $k$ -Clustering problem, which is for a given multiset of $n$ vectors $X\subset\mathbb{Z}^{d}$ and a nonnegative number $D$ , to decide whether $X$ can be partitioned into $k$ clusters $C_{1},\dots,C_{k}$ such that the cost

$\sum_{i=1}^{k}\min_{c_{i}\in\mathbb{R}^{d}}\sum_{x\in C_{i}}\|x-c_{i}\|_{p}^{p}\leq D,$

where $\|\cdot\|_{p}$ is the Minkowski ( $L_{p}$ ) norm of order $p$ . For $p=1$ , $k$ -Clustering is the well-known $k$ -Median. For $p=2$ , the case of the Euclidean distance, $k$ -Clustering is $k$ -Means. We show that the parameterized complexity of $k$ -Clustering strongly depends on the distance order $p$ . In particular, we prove that for every $p\in(0,1]$ , $k$ -Clustering is solvable in time $2^{\mathcal{O}(D\log{D})}(nd)^{\mathcal{O}(1)}$ , and hence is fixed-parameter tractable when parameterized by $D$ . On the other hand, we prove that for distances of orders $p=0$ and $p=\infty$ , no such algorithm exists, unless $\operatorname{{\sf FPT}}=\operatorname{{\sf W}}[1]$ .

1 Introduction

Recall that for $p>0$ , the Minkowski or $L_{p}$ -norm of a vector $x=(x[1],\dots,x[d])\in\mathbb{R}^{d}$ is defined as

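$\|x\|_{p}=\big(\sum_{i=1}^{d}|x[i]|^{p}\big)^{1/p}.$

Respectively, we define the ($L_{p}$-norm) distance between two vectors $x=(x[1],\dots,x[d])$ and $y=(y[1],\dots,y[d])$ as

$\operatorname{dist}_{p}(x,y)=\|x-y\|_{p}^{p}=\sum_{i=1}^{d}|x[i]-y[i]|^{p}.$

We also consider $\operatorname{dist}_{p}$ for $p=0$ and $p=\infty$. For $p=0$, $\operatorname{dist}_{p}$ is $L_{0}$ (or the Hamming) distance, that is the number of different coordinates in $x$ and $y$:

$\operatorname{dist}_{0}(x,y)=|\{i\in\{1,\dots,d\} : x[i]\neq y[i]\}|.$

For $p=\infty$, $\operatorname{dist}_{p}$ is $L_{\infty}$-distance, which is defined as

$\operatorname{dist}_{\infty}(x,y)=\max_{i\in\{1,\dots,d\}}|x[i]-y[i]|.$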
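The three cases of $\operatorname{dist}_{p}$ can be collected in one small function. This is a minimal sketch; the function name `dist_p` and the convention of passing `float("inf")` for $p=\infty$ are choices made here for illustration, not part of the paper.

```python
from typing import Sequence


def dist_p(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Distance dist_p(x, y) for p = 0, 0 < p < inf, or p = float('inf')."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    if p < 0:
        raise ValueError("the order p must be nonnegative")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == 0:
        # Hamming distance: number of coordinates where x and y differ.
        return float(sum(1 for t in diffs if t != 0))
    if p == float("inf"):
        # Largest coordinate-wise difference.
        return float(max(diffs, default=0.0))
    # Note: this is the p-th power of the L_p norm, not the norm itself.
    return float(sum(t ** p for t in diffs))
```

For example, for $x=(0,0)$ and $y=(3,4)$ the function gives 7 for $p=1$, 25 for $p=2$, 2 for $p=0$, and 4 for $p=\infty$.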

The $k$ -Clustering problem is defined as follows. For a given (multi) dataset of $n$ vectors (points) $X\subset\mathbb{Z}^{d}$ , the task is to find a partition of $X$ into $k$ clusters $C_{1},\dots,C_{k}$ minimizing the cost

$\sum_{i=1}^{k}\min_{c_{i}\in\mathbb{R}^{d}}\sum_{x\in C_{i}}\operatorname{dist}_{p}(x,c_{i}).$

In particular, for $p=1$ , $\operatorname{dist}_{p}$ is the $L_{1}$ -distance and the corresponding clustering problem is known as $k$ -Median. (Often in the literature, $k$ -Median is also used for clustering minimizing the sums of the Euclidean distances.) For $p=2$ , $\operatorname{dist}_{p}$ is the $L_{2}$ (Euclidean) distance, and then the clustering problem becomes $k$ -Means.
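For these two classical cases the inner minimization over the centroid has a closed form, which the following sketch uses: for $p=1$ a coordinate-wise median is an optimal centroid, and for $p=2$ the coordinate-wise mean is. These are standard facts rather than statements taken from this section, and the function names are hypothetical.

```python
from statistics import fmean, median


def cluster_cost_l1(cluster):
    """Cost of one cluster for p = 1 (k-Median): use coordinate-wise medians."""
    total = 0.0
    for i in range(len(cluster[0])):
        column = [x[i] for x in cluster]
        m = median(column)  # a median minimizes the sum of absolute deviations
        total += sum(abs(v - m) for v in column)
    return total


def cluster_cost_l2(cluster):
    """Cost of one cluster for p = 2 (k-Means): use coordinate-wise means."""
    total = 0.0
    for i in range(len(cluster[0])):
        column = [x[i] for x in cluster]
        m = fmean(column)  # the mean minimizes the sum of squared deviations
        total += sum((v - m) ** 2 for v in column)
    return total
```

The cost of a whole clustering is then the sum of the per-cluster costs.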

Let us note that optimal clusterings for the same set of vectors can be drastically different for various values of $p$ , as shown in Figure 1. The main conceptual contribution of this paper is that the complexity of $k$ -Clustering also strongly depends on the choice of $p$ .

$k$ -Clustering, and especially $k$ -Median and $k$ -Means, are among the most prevalent problems occurring in virtually every subarea of data science. We refer to the survey of Jain [22] for an extensive overview. While in practice the most common approaches to clustering are based on different variations of Lloyd’s heuristic [25], the problem is interesting from the theoretical perspective as well. In particular, there is a vast amount of literature on approximation algorithms for $k$ -Clustering whose behavior can be analyzed rigorously, see e.g. [1, 2, 6, 8, 9, 16, 17, 19, 24, 13, 23, 10, 30].

When it comes to exact solutions, the complexity of $k$ -Clustering is less understood. The $k$ -Clustering problem is naturally “multivariate”: in addition to the input size $n$ , there are also parameters like space dimension $d$ , number of clusters $k$ or the cost of clustering $D$ . The problem is known to be $\operatorname{{\sf NP}}$ -complete for $k=2$ [3, 15] and for $d=2$ [28, 26]. By the classical work of Inaba et al. [21], in the case when both $d$ and $k$ are constants, $k$ -Clustering is solvable in polynomial time $\mathcal{O}(n^{dk+1})$ . Under ETH, the lower bound of $n^{\Omega(k)}$ , even when $d=4$ , was shown by Cohen-Addad et al. in [11] for the settings where the set of potential candidate centers is explicitly given as input. However the lower bound of Cohen-Addad et al. does not generalize to the settings of this paper when any point in Euclidean space can serve as a center. For the special case, when the input consists of binary vectors and the distance is Hamming, the problem is solvable in time $2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}$ [18].

Our results and approaches. In this paper we investigate the dependence of the complexity of $k$-Clustering on the cost of clustering $D$. It turns out that adding this new “dimension” makes the complexity landscape of $k$-Clustering intricate and interesting. More precisely, we consider the following problem.

$k$-Clustering with distance $\operatorname{dist}$ parameterized by $D$

Input: A multiset $X$ of $n$ vectors in $\mathbb{Z}^{d}$, a positive integer $k$, and a nonnegative number $D$.

Task: Decide whether there is a partition of $X$ into $k$ clusters $\{C_{i}\}_{i=1}^{k}$ and $k$ vectors $\{c_{i}\}_{i=1}^{k}$, called centroids, in $\mathbb{R}^{d}$ such that

$\sum_{i=1}^{k}\sum_{x\in C_{i}}\operatorname{dist}(x,c_{i})\leq D.$

Let us remark that vector set $X$ (like the column set of a matrix) can contain many equal vectors. Also we consider the situation when vectors from $X$ are integer vectors, while centroid vectors are not necessarily from $X$ . Moreover, coordinates of centroids can be reals.

Our main algorithmic result is the following theorem.

Theorem 1.

$k$-Clustering with distance $\operatorname{dist}_{p}$ is solvable in time $2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}$ for every $p\in(0,1]$.

Thus $k$-Clustering when parameterized by $D$ is fixed-parameter tractable ($\operatorname{{\sf FPT}}$) for Minkowski distance $\operatorname{dist}_{p}$ of order $0<p\leq 1$. Superficially, the general idea of the proof of Theorem 1 is similar to the idea behind the algorithm for Binary $r$-Means for $L_{0}$ from [18]. However, there are several differences; the main one is that the proof in [18] crucially relies on the fact that the clustering is performed on binary vectors. Thus the reductions from [18] cannot be applied in our case. Moreover, as we will see in Theorem 2, the existence of an $\operatorname{{\sf FPT}}$ algorithm for $k$-Clustering in $L_{0}$ is highly unlikely.

In the first step of our algorithm we use color coding to reduce solution of the problem to the Cluster Selection problem, which we find interesting on its own. In Cluster Selection we have $t$ groups of weighted vectors and the task is to select exactly one vector from each group such that the weighted cost of the composite cluster is at most $D$ . More formally,

Cluster Selection with distance $\operatorname{dist}$ parameterized by $D$

Input: A set of $m$ vectors $X$ given together with a partition $X=X_{1}\cup\cdots\cup X_{t}$ into $t$ disjoint sets, a weight function $w:X\to\mathbb{Z}_{+}$, and a nonnegative number $D$.

Task: Decide whether it is possible to select exactly one vector $x_{i}$ from each set $X_{i}$ such that the total cost of the composite cluster formed by $x_{1}$, …, $x_{t}$ is at most $D$:

$\min_{c\in\mathbb{R}^{d}}\sum_{i=1}^{t}w(x_{i})\cdot\operatorname{dist}(x_{i},c)\leq D.$
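To make the problem statement concrete, here is a brute-force sketch for the $L_{1}$ distance that simply tries every selection. It is exponential in $t$ and is not the algorithm of the paper; it only illustrates what a yes-instance looks like. The names are hypothetical, and the cost routine relies on the standard fact that for $p=1$ an optimal centroid coordinate can be taken among the input values (a weighted median).

```python
from itertools import product


def weighted_l1_cost(vectors, weights):
    """min over c of sum_i w_i * dist_1(x_i, c), computed per coordinate."""
    total = 0
    for i in range(len(vectors[0])):
        column = [v[i] for v in vectors]
        # For p = 1 some input value (a weighted median) is optimal.
        total += min(
            sum(w * abs(y - z) for y, w in zip(column, weights))
            for z in column
        )
    return total


def cluster_selection_bruteforce(groups, weight, D):
    """Return one valid selection (a tuple with one vector per group) or None."""
    for choice in product(*groups):
        ws = [weight[x] for x in choice]
        if weighted_l1_cost(choice, ws) <= D:
            return choice
    return None
```

Here `groups` is a list of the sets $X_{1},\dots,X_{t}$ (vectors as tuples) and `weight` is a dictionary playing the role of $w$.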

Informally (see Theorem 9 for the precise statement), our reduction shows that if the distance norm satisfies some specific properties (which $\operatorname{dist}_{p}$ satisfies for all $p$ ) and if Cluster Selection is $\operatorname{{\sf FPT}}$ parameterized by $D$ , then so is $k$ -Clustering. Therefore, in order to prove Theorem 1, all we need is to show that Cluster Selection is $\operatorname{{\sf FPT}}$ parameterized by $D$ when $p\in(0,1]$ . This is the most difficult part of the proof. Here we invoke the theorem of Marx [27] on the number of subhypergraphs in hypergraphs of bounded fractional edge cover.

Interestingly, Theorem 1 does not hold for distance $\operatorname{dist}_{0}$ . More precisely, for clustering in $L_{0}$ we prove the following theorem.

Theorem 2.

With distance $\operatorname{dist}_{0}$ , $k$ -Clustering parameterized by $d+D$ and Cluster Selection parameterized by $d+t+D$ are $\operatorname{{\sf W}}[1]$ -hard.

In particular, this means that under the widely believed complexity assumption $\operatorname{{\sf FPT}}\neq\operatorname{{\sf W}}[1]$, Theorem 2 rules out algorithms solving $k$-Clustering in time $f(d,D)\cdot n^{\mathcal{O}(1)}$ and algorithms solving Cluster Selection in $L_{0}$ in time $g(t,d,D)\cdot n^{\mathcal{O}(1)}$ for any functions $f(d,D)$ and $g(t,d,D)$. A similar hardness result holds for $L_{\infty}$.

Theorem 3.

With distance $\operatorname{dist}_{\infty}$ , $k$ -Clustering parameterized by $D$ and Cluster Selection parameterized by $t+D$ are $\operatorname{{\sf W}}[1]$ -hard.

This naturally brings us to the question: what happens with $k$-Clustering for $p\in(1,\infty)$, especially for the Euclidean distance, that is, $p=2$? Unfortunately, we are not able to answer this question when the parameter is $D$ only. However, we can prove that

Theorem 4.

$k$-Clustering and Cluster Selection with distance $\operatorname{dist}_{2}$ are $\operatorname{{\sf FPT}}$ when parameterized by $d+D$.

Thus in particular, Theorem 4 implies that $k$ -Clustering with distance $\operatorname{dist}_{2}$ is $\operatorname{{\sf FPT}}$ parameterized by $d+D$ . On the other hand, we prove that

Theorem 5.

Cluster Selection with distance $\operatorname{dist}_{p}$ is $\operatorname{{\sf W}}[1]$-hard for every $p\in(1,\infty)$ when parameterized by $t+D$.

In particular, Theorem 5 yields that the approach we used to establish the tractability (with parameter $D$ ) of $k$ -Clustering for $p=1$ will not work for $p>1$ .

We summarize our and previously known algorithmic and hardness results for $k$ -Clustering and Cluster Selection with different distances in Table 1.

The remaining part of this paper is organized as follows. Section 2 contains preliminaries. In Section 3 we prove Theorem 9, which provides us with an $\operatorname{{\sf FPT}}$ Turing reduction from $k$-Clustering to Cluster Selection. Theorem 9 appears to be a handy tool to establish tractability of $k$-Clustering. In Section 4 we collect the results on clustering with the $L_{p}$-norm for $p\in(0,1]$. In particular, in Subsection 4.1, we prove Theorem 1, the main algorithmic result of this work, stating that when $p\in(0,1]$, $k$-Clustering and Cluster Selection admit FPT algorithms with parameter $D$. In Subsection 4.2 we complement the algorithmic upper bounds with lower bounds by proving that Cluster Selection is $\operatorname{{\sf W}}[1]$-hard when $p=1$ and the parameter is $t+d$ (Theorem 12). In Section 5, we consider the case $p=0$ and prove Theorem 2, establishing $\operatorname{{\sf W}}[1]$-hardness of $k$-Clustering and Cluster Selection. Section 6 is devoted to the case $p=\infty$. Here we establish two hardness results about $k$-Clustering: $\operatorname{{\sf W}}[1]$-hardness when parameterized by $D$ and $\operatorname{{\sf NP}}$-hardness in the case $k=2$. In Section 7, we look at the case $p\in(1,\infty)$, with particular emphasis on the most commonly used case $p=2$. We show that when $d+D$ is the parameter, Cluster Selection and $k$-Clustering in the $L_{2}$ distance are $\operatorname{{\sf FPT}}$. We also show that Cluster Selection is $\operatorname{{\sf W}}[1]$-hard when parameterized by $t+D$ for all $p\in(1,\infty)$. We conclude with open problems in Section 8.

2 Preliminaries and notation

Cluster notation. By a cluster we always mean a multiset of vectors in $\mathbb{Z}^{d}$ . For distance $\operatorname{dist}$ , the cost of a given cluster $C$ is the total distance from all vectors in the cluster to the optimally selected cluster centroid, $\min_{c\in\mathbb{R}^{d}}\sum_{x\in C}\operatorname{dist}(x,c)$ . An optimal cluster centroid for a given cluster $C$ is any $c\in\mathbb{R}^{d}$ minimizing $\sum_{x\in C}\operatorname{dist}(x,c)$ . For most of the considered distances, we argue that an optimal cluster centroid could always be chosen among selected family of vectors (e.g. integral). Whenever we show this, we only consider optimal cluster centroids of the stated form afterwards.

Complexity. A parameterized problem is a language $Q\subseteq\Sigma^{*}\times\mathbb{N}$, where $\Sigma^{*}$ is the set of strings over a finite alphabet $\Sigma$. Accordingly, an input of $Q$ is a pair $(I,k)$, where $I\in\Sigma^{*}$ and $k\in\mathbb{N}$; $k$ is the parameter of the problem. A parameterized problem $Q$ is fixed-parameter tractable ($\operatorname{{\sf FPT}}$) if it can be decided whether $(I,k)\in Q$ in time $f(k)\cdot|I|^{\mathcal{O}(1)}$ for some function $f$ that depends on the parameter $k$ only. The parameterized complexity class $\operatorname{{\sf FPT}}$ consists of all fixed-parameter tractable problems. The $\operatorname{{\sf W}}$-hierarchy is a collection of computational complexity classes; we omit the technical definitions here. The following relation is known amongst the classes in the $\operatorname{{\sf W}}$-hierarchy: $\operatorname{{\sf FPT}}=\operatorname{{\sf W}}[0]\subseteq\operatorname{{\sf W}}[1]\subseteq\operatorname{{\sf W}}[2]\subseteq\ldots\subseteq\operatorname{{\sf W}}[P]$. It is widely believed that $\operatorname{{\sf FPT}}\neq\operatorname{{\sf W}}[1]$, and hence if a problem is hard for the class $\operatorname{{\sf W}}[i]$ (for any $i\geq 1$) then it is considered to be fixed-parameter intractable. We refer to the books [12, 14] for a detailed introduction to parameterized complexity.

We also provide conditional lower bounds by making use of the following complexity hypothesis formulated by Impagliazzo, Paturi, and Zane [20].

Exponential Time Hypothesis (ETH): There is a positive real $s$ such that 3-CNF-SAT with $n$ variables and $m$ clauses cannot be solved in time $2^{sn}(n+m)^{\mathcal{O}(1)}$ .

Graphs. For proving $\operatorname{{\sf W}}[1]$ -hardness, we need to consider graphs. Whenever we work with a graph $G$ , we always fix some ordering on the vertices $\pi_{V}:V(G)\to\{1,\dots,|V(G)|\}$ and on the edges $\pi_{E}:E(G)\to\{1,\dots,|E(G)|\}$ . We drop $\pi_{V}$ and $\pi_{E}$ to simplify notation, so when we consider a vertex $v\in V(G)$ or an edge $e\in E(G)$ , $v$ and $e$ also denote integers—numbers of $v$ and $e$ according to the orderings $\pi_{V}$ and $\pi_{E}$ correspondingly.

3 From $k$ -Clustering to Cluster Selection

In this section we present a general scheme for obtaining an FPT algorithm parameterized by $D$ , which is later applied to various distances.

First, we formalize the following intuition: there is no reason to assign equal vectors to different clusters.

Definition 6 (Initial cluster and regular partition).

For a multiset of vectors $X$ , an inclusion-wise maximal multiset $I\subset X$ such that all vectors in $I$ are equal is called an initial cluster.

We say that a clustering $\{C_{1},\dots,C_{k}\}$ of $X$ is regular if for every initial cluster $I$ there is an $i\in\{1,\dots,k\}$ such that $I\subset C_{i}$.
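Definition 6 translates directly into code. In this sketch the initial clusters of a multiset are represented as a `Counter` mapping each distinct vector to its multiplicity, and a clustering is regular when no vector value appears in two different clusters. The function names are illustrative only.

```python
from collections import Counter


def initial_clusters(X):
    """Map each distinct vector to the size of its initial cluster."""
    return Counter(tuple(x) for x in X)


def is_regular(clusters):
    """True if all copies of each vector lie in a single cluster."""
    owner = {}
    for idx, cluster in enumerate(clusters):
        for x in cluster:
            # setdefault records the first cluster that contains this vector.
            if owner.setdefault(tuple(x), idx) != idx:
                return False
    return True
```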

Now we prove that it suffices to look only for regular solutions.

Proposition 1.

Let $(X,k,D)$ be a yes-instance to $k$ -Clustering. Then there exists a solution of $(X,k,D)$ which is a regular clustering.

Proof.

Let us assume that the instance $(X,k,D)$ has a solution. There are $k$ clusters $\{C_{i}\}_{i=1}^{k}$ and $k$ vectors $\{c_{i}\}_{i=1}^{k}$ in $\mathbb{R}^{d}$ such that

$\sum_{i=1}^{k}\sum_{x\in C_{i}}\operatorname{dist}(x,c_{i})\leq D.$

Note that for every $x\in C_{j}$ , $\operatorname{dist}(x,c_{j})\geq\min_{1\leq i\leq k}\operatorname{dist}(x,c_{i})$ . So if we consider a new clustering $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ with the same centroids, where $C_{j}^{\prime}$ are all vectors from $X$ for which $c_{j}$ is the closest centroid, the total distance does not increase. If we also break ties in favor of the lower index, then for any initial cluster $I$ the same centroid $c_{i}$ will be the closest, and all vectors from $I$ will end up in $C_{i}^{\prime}$ , so $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ is a regular clustering. ∎
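The reassignment step in this proof can be sketched as follows. The function name and the `dist` callback are assumptions made for illustration. Python's `min` returns the first minimizer, which gives the "lower index wins" tie-breaking used in the proof. Note that in this sketch some of the returned clusters may be empty.

```python
def reassign_to_nearest(X, centroids, dist):
    """Assign every vector to its closest centroid, ties to the lowest index."""
    clusters = [[] for _ in centroids]
    for x in X:
        # min() scans indices in increasing order and keeps the first best one.
        j = min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))
        clusters[j].append(x)
    return clusters


def l1(x, y):
    """L_1 distance, used here only as an example distance."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

Equal vectors always see the same distances to all centroids, so they receive the same index, and the resulting clustering is regular.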

From now on, we consider only regular solutions.

Definition 7 (Simple and composite clusters).

We say that a cluster $C$ is simple if it is an initial cluster. Otherwise, the cluster is composite.

Next we state a property of $k$ -Clustering with a particular distance, which is required for the algorithm. Intuitively, each unique vector adds at least some constant to the cluster cost. In the subsequent sections we show that the property holds for all distances in our consideration.

Definition 8 ( $\alpha$ -property).

We say that a distance has the $\alpha$ -property for some $\alpha>0$ if for any $s$ the cost of any composite cluster which consists of $s$ initial clusters is at least $\alpha(s-1)$ .

The following problem is a key subroutine in our algorithm. In some cases it is solvable trivially, but it presents the main challenge for our main algorithmic result in the $L_{1}$ distance.

Cluster Selection parameterized by $D$

Input: Family of $t$ disjoint sets of vectors $X_{1},\dots,X_{t}$, containing $m$ vectors in total, a weight function $w:\cup_{i=1}^{t}X_{i}\to\mathbb{Z}_{+}$, and a nonnegative number $D$.

Task: Determine whether it is possible to choose one vector $x_{i}$ from each set $X_{i}$ such that the total cost of forming a composite cluster out of $x_{1}$, …, $x_{t}$ is at most $D$:

$\min_{c\in\mathbb{R}^{d}}\sum_{i=1}^{t}w(x_{i})\operatorname{dist}(x_{i},c)\leq D.$

The intuition behind the weight function in the definition of Cluster Selection is that it represents the sizes of initial clusters, that is, how many equal vectors there are.

We also need a procedure to enumerate all possible optimal cluster costs which are at most $D$. This may not be straightforward, since not all distances in our consideration are integer. So we assume that the set of all possible optimal cluster costs which are at most $D$ is also given in the input. Now we are ready to state the result formally.

Theorem 9.

Assume that the $\alpha$ -property holds, Cluster Selection is solvable in time $\Phi(m,d,t,D)$ , where $\Phi$ is a non-decreasing function of its arguments, and we are given the set $\mathcal{D}$ of all possible optimal cluster costs which are at most $D$ . Then $k$ -Clustering is solvable in time

$2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}|\mathcal{D}|\Phi(n,d,2D/\alpha,D).$

Proof.

By the $\alpha$ -property, in any solution there are at most $D/\alpha$ composite clusters, since each contains at least two initial clusters. Moreover, there are at most $2D/\alpha$ initial clusters in all composite clusters.
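The counting behind these two bounds can be spelled out as follows. The symbols $h$ (the number of composite clusters in the solution) and $s_{j}\geq 2$ (the number of initial clusters in the $j$-th composite cluster $C_{j}$) are notation introduced here for illustration only.

```latex
\begin{align*}
D \;\ge\; \sum_{j=1}^{h} \operatorname{cost}(C_j)
  \;&\ge\; \alpha \sum_{j=1}^{h} (s_j - 1) \;\ge\; \alpha h
  && \text{(by the $\alpha$-property and $s_j \ge 2$)},\\
\sum_{j=1}^{h} s_j \;&=\; \sum_{j=1}^{h} (s_j - 1) + h
  \;\le\; \frac{D}{\alpha} + \frac{D}{\alpha} \;=\; \frac{2D}{\alpha}.
\end{align*}
```

The first line gives $h\leq D/\alpha$, and the second line bounds the total number of initial clusters inside composite clusters.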

Thus by Proposition 1, solving $k$ -Clustering is equivalent to selecting at most $T:=\lceil 2D/\alpha\rceil$ initial clusters and grouping them into composite clusters such that the total cost of these clusters is at most $D$ . We design an algorithm which, taking as a subroutine an algorithm for Cluster Selection, solves $k$ -Clustering. The algorithm is sketched in Figure 3, an example is shown in Figure 2.

To perform the selection and grouping, our algorithm uses the color coding technique of Alon, Yuster, and Zwick from [4]. Consider the input as a family of initial clusters $\mathcal{I}$ . We color initial clusters from $\mathcal{I}$ independently and uniformly at random by $T$ colors 1, 2, …, $T$ . Consider any solution, and the particular set of at most $T$ initial clusters which are included into composite clusters in this solution. These initial clusters are colored by distinct colors with probability at least $\frac{T!}{T^{T}}\geq e^{-T}$ . Now we construct an algorithm for finding a colorful solution.
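The probability bound can be checked numerically. The sketch below computes the exact probability that $s\leq T$ fixed initial clusters receive pairwise distinct colors under a uniform random coloring with $T$ colors; the worst case $s=T$ gives $T!/T^{T}$. The function name is hypothetical, and `math.perm` requires Python 3.8 or later.

```python
import math


def prob_distinct_colors(s: int, T: int) -> float:
    """Probability that s fixed items get distinct colors out of T."""
    if s > T:
        return 0.0
    # T * (T - 1) * ... * (T - s + 1) favourable colorings out of T ** s.
    return math.perm(T, s) / T ** s
```

For instance, with $T=3$ the worst-case probability is $3!/3^{3}=6/27$, which is above $e^{-3}$.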

We consider all possible ways to split colors between clusters (some colors may be unused). Hence we consider all possible families $\mathcal{P}=\{P_{1},\dots,P_{h}\}$ of pairwise disjoint non-empty subsets of $\{c\in\{1,\dots,T\}:\exists J\in\mathcal{I}\text{ colored by }c\}$ . Each family $\mathcal{P}$ corresponds to a partition of the set of colors $\{1,\dots,T\}$ if we add one fictitious subset for colors which are not used in the composite clusters. The total number of partitions does not exceed $T^{T}=2^{\mathcal{O}(D\log D)}$ .

When the partition $\mathcal{P}$ is fixed, we form clusters by solving instances of Cluster Selection: for each $i\in\{1,\dots,h\}$, we take the initial clusters colored by elements of $P_{i}$, bundle together those with the same color, and pass the resulting family to Cluster Selection. Two conditions must hold. First, there cannot be $P\in\mathcal{P}$ of size at most one, since then Cluster Selection would have to produce a simple cluster, while we assume that all clusters obtained from $\mathcal{P}$ are composite. Second, the total number of clusters, which equals $|\mathcal{I}|-\sum_{P\in\mathcal{P}}|P|+|\mathcal{P}|$, has to be $k$. For each $\mathcal{P}$ we check both conditions before calling the Cluster Selection subroutine; if either fails, we discard this choice of $\mathcal{P}$ and move to the next one.
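The two checks performed before calling the subroutine amount to a few lines. In this sketch `partition` is a list of color sets $P_{1},\dots,P_{h}$ and `num_initial` stands for $|\mathcal{I}|$; both names, and the function name, are hypothetical.

```python
def is_admissible(partition, num_initial, k):
    """Check the two conditions on a family P of color sets."""
    # Every part must contain at least two colors (composite clusters only).
    if any(len(part) < 2 for part in partition):
        return False
    # Each part merges len(part) initial clusters into one cluster.
    used = sum(len(part) for part in partition)
    return num_initial - used + len(partition) == k
```

With 7 initial clusters and the family $\{\{1,2\},\{3,4,5\}\}$, five initial clusters are merged into two composite clusters, leaving $7-5+2=4$ clusters in total.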

Next, we formalize how we call the Cluster Selection subroutine. We fix the set of colors $P_{i}=\{c_{1},\dots,c_{t}\}$ , then take the sets $I_{j}=\{J\in\mathcal{I}:J\text{ is colored by }c_{j}\}$ for $j\in\{1,\dots,t\}$ . We turn each set of initial clusters $I_{j}$ into a set of weighted vectors $X_{j}$ naturally: For each $J\in I_{j}$ , we put one vector $x\in J$ into $X_{j}$ , and $w(x):=|J|$ . The family of sets of vectors $X_{1}$ , …, $X_{t}$ and the weight function $w$ are the input for Cluster Selection. Then we search for the minimum cluster cost bound $d_{i}\leq D$ from $\mathcal{D}$ , for which the instance $(X_{1},\dots,X_{t},d_{i})$ of Cluster Selection is a yes-instance, running each time the algorithm for Cluster Selection.
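The construction of a Cluster Selection instance from one part $P_{i}$ can be sketched as below. The representation is an assumption made for illustration: `coloring` maps each distinct vector (that is, each initial cluster) to its color, and `colors` lists the colors $c_{1},\dots,c_{t}$ of $P_{i}$.

```python
from collections import Counter


def build_selection_instance(X, coloring, colors):
    """Return (groups, weight) for the colors of one part P_i."""
    sizes = Counter(X)  # distinct vector -> size of its initial cluster
    # Group j collects one representative per initial cluster of color c_j.
    groups = [[v for v in sizes if coloring[v] == c] for c in colors]
    weight = dict(sizes)  # w(x) := |J| for the initial cluster J of x
    return groups, weight
```

The returned `groups` play the role of $X_{1},\dots,X_{t}$, and `weight` plays the role of $w$.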

If for some $i$ setting $d_{i}$ to $D$ leads to a no-instance, or if $\sum_{i=1}^{h}d_{i}>D$ , then we discard the choice of the partition $\mathcal{P}$ and move to the next one. Otherwise, we report that $k$ -Clustering has a solution and stop. Next, we prove that in this case the solution indeed exists.

We reconstruct the solution to $k$-Clustering as follows. For each $i\in\{1,\dots,h\}$, the instance of Cluster Selection corresponding to $P_{i}=\{c_{1},\dots,c_{t}\}$ has a solution $\{x_{1},\dots,x_{t}\}$. For each $j\in\{1,\dots,t\}$, consider the corresponding initial cluster $J_{j}$ consisting of $w(x_{j})$ vectors equal to $x_{j}$. For each $i\in\{1,\dots,h\}$ we thus obtain a composite cluster $\cup_{j=1}^{t}J_{j}$; all other clusters are simple. So the total cost is $\sum_{i=1}^{h}d_{i}$, which is at most $D$. Thus, if the algorithm finds a solution, then $(X,k,D)$ is a yes-instance.

In the opposite direction, if there is a solution to $k$-Clustering, then there is a regular solution, and with probability at least $e^{-T}$ the initial clusters which are parts of composite clusters in this solution are colored by distinct colors. Then there is a partition $\mathcal{P}=\{P_{1},\dots,P_{h}\}$ which corresponds to this solution. This partition is obtained as follows: put into $P_{1}$ the colors from the first composite cluster, into $P_{2}$ those from the second, and so on. At some point our algorithm checks the partition $\mathcal{P}$, and since it finds the optimal cost value for each cluster, this value is at most the cost of the corresponding cluster of the solution from which we started.

To analyze the running time, note that we consider $2^{\mathcal{O}(D\log D)}$ partitions $\mathcal{P}$. For each $\mathcal{P}$, we search for the optimal $d_{i}$ for each of its $|\mathcal{P}|=\mathcal{O}(D)$ parts, trying at most $|\mathcal{D}|$ candidate values. For each candidate value of $d_{i}$ we make one call to the Cluster Selection algorithm, which takes time at most $\Phi(n,d,T,D)$.

To boost the success probability to at least $1-1/e$, we do $N=\lceil e^{T}\rceil$ iterations of the algorithm, each time with a new random coloring. As each iteration succeeds with probability at least $e^{-T}$, the probability of not finding a colorful solution after $N$ iterations is at most $(1-e^{-T})^{e^{T}}\leq e^{-1}<1$. So the total running time is $2^{\mathcal{O}(D\log D)}\cdot(nd)^{\mathcal{O}(1)}|\mathcal{D}|\Phi(n,d,2D/\alpha,D)$.
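The bound on the failure probability follows from $(1-q)^{N}\leq e^{-qN}$ with $q=e^{-T}$ and $N\geq e^{T}$. A small numerical sketch, with hypothetical function names, makes this visible for concrete values of $T$:

```python
import math


def num_iterations(T: int) -> int:
    """Number of independent colorings, N = ceil(e^T)."""
    return math.ceil(math.exp(T))


def failure_bound(T: int) -> float:
    """Upper bound (1 - e^{-T})^N on missing a colorful solution N times."""
    q = math.exp(-T)
    # log1p keeps precision when q is tiny.
    return math.exp(num_iterations(T) * math.log1p(-q))
```

For $T=1$ this gives $N=3$ iterations and a failure bound of about $0.25$, below $1/e\approx 0.37$.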

The algorithm could be derandomized by the standard derandomization technique using perfect hash families [4, 29]. So $k$ -Clustering is solvable in the same deterministic time.∎

4 Algorithms and complexity for distances with $p\in(0,1]$

The main motivation for the results in this section is the study of $k$ -Clustering with the $L_{1}$ distance, the case widely known as $k$ -Medians. However, our main algorithmic result also extends to distances of order $p\in(0,1)$ since in some sense they behave similarly to the $L_{1}$ distance.

4.1 FPT algorithm when parameterized by $D$

In this subsection, we prove Theorem 1: when $p\in(0,1]$ , $k$ -Clustering admits an FPT algorithm with parameter $D$ . First we state basic geometrical observations for the cases $p=1$ and $p\in(0,1)$ . Then we propose a general algorithm for Cluster Selection which relies only on these properties. Finally, we show how Theorem 9 can be applied.

The next two claims deal with the structure of optimal cluster centroids. We state and prove them in the case of weighted vectors where each vector has a positive integer weight given by a weight function $w$ . The unweighted case is just a special case when the weight of each vector is one.

First, we show that coordinates of cluster centroids could always be selected among the values present in the input, which helps greatly in enumerating cluster centroids that may be optimal.

*Claim 4.1**.*

Let $C=\{x_{1},\dots,x_{t}\}$ be a cluster and $w:\{x_{1},\cdots,x_{t}\}\to\mathbb{Z}_{+}$ be a weight function. Then there is an optimal (subject to the weighted distance $w(x_{i})\cdot\operatorname{dist}_{p}(x_{i},c)$ ) centroid $c$ of $C$ such that for each $i\in\{1,\dots,d\}$ , the $i$ -th coordinate $c[i]$ of the centroid is from the values present in the input in this coordinate, that is $c[i]\in\{x_{1}[i],\dots,x_{t}[i]\}$ . Moreover, for $p=1$ we may assume that the optimal value is a weighted median of the values present in the $i$ -th coordinate.

Proof.

For cluster $C$ , consider the corresponding multiset of unweighted vectors $C^{\prime}=\{x_{1},\dots,x_{t}\}$ , where each vector $x\in C$ is repeated $w(x)$ times. We define $y_{j}=x_{j}[i]$ for $j\in\{1,\dots,t\}$ . Assume that $y_{1}\leq y_{2}\leq\dots\leq y_{t}$ . Let us consider an optimal cluster centroid $c$ for $C$ and denote $z=c[i]$ . Figure 4 shows how the cluster cost behaves with respect to $z$ on a concrete set of values $\{y_{i}\}$ for $p=1$ and $p=1/2$ .

For the formal proof, we start with the case $p=1$ . The total cost of $C$ contributed by the $i$ -th coordinate is

\[\sum_{j=1}^{t}|y_{j}-z|.\]

If $z\in(y_{i},y_{i+1})$ for $i\in\{1,\dots,t-1\}$ , then the derivative with respect to $z$ is

\[\sum_{j=1}^{i}1-\sum_{j=i+1}^{t}1=i-(t-i).\]

And when $z=y_{i}$ for $i\in\{1,\dots,t\}$ , analogously the derivative is $i-1-(t-i)$ . So if $t$ is odd, then the derivative is zero at $y_{\lceil t/2\rceil}$ , strictly negative before and strictly positive after, so $y_{\lceil t/2\rceil}$ , which is the only median, is the optimal value for $z$ . If $t$ is even, then the derivative is zero on $[y_{t/2},y_{t/2+1}]$ , strictly negative before and strictly positive after. So any value from $[y_{t/2},y_{t/2+1}]$ is optimal, and we may assume that it is one of the two medians $y_{t/2}$ , $y_{t/2+1}$ .

Now to the case $p\in(0,1)$ , the contribution of the coordinate $i$ is

\[\sum_{j=1}^{t}|y_{j}-z|^{p}.\]

When $z$ is between $y_{i}$ and $y_{i+1}$ , then the derivative of the above with respect to $z$ is equal to

\[\sum_{j=1}^{i}p\,(z-y_{j})^{p-1}-\sum_{j=i+1}^{t}p\,(y_{j}-z)^{p-1}.\]

It is monotone on $(y_{i},y_{i+1})$ : when $z$ increases, the sum decreases, as terms of the form $(z-y_{j})^{p-1}$ decrease and terms of the form $(y_{j}-z)^{p-1}$ increase, because $p-1<0$ . Thus, the optimal value on this interval is achieved at one of its ends. Doing the same for all intervals, we conclude that the optimal value for $z$ must be in $\{y_{1},\dots,y_{t}\}$ . ∎
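Claim 4.1 can be illustrated on a single coordinate. The sketch below assumes hypothetical toy data (`values`, `weights`) and illustrative function names; it only evaluates the weighted cost $\sum_{j}w_{j}|y_{j}-z|^{p}$ at the values present in the input and returns the cheapest one.

```python
from typing import Sequence

def coordinate_cost(values: Sequence[int], weights: Sequence[int],
                    z: float, p: float) -> float:
    """Weighted cost sum_j w_j * |y_j - z|^p contributed by one coordinate."""
    return sum(w * abs(y - z) ** p for y, w in zip(values, weights))

def best_input_value(values: Sequence[int], weights: Sequence[int],
                     p: float) -> int:
    """Cheapest centroid coordinate among the values present in the input."""
    return min(sorted(set(values)),
               key=lambda z: coordinate_cost(values, weights, z, p))

# Hypothetical toy data: four values in one coordinate with integer weights.
values, weights = [0, 3, 4, 10], [1, 2, 1, 1]

if __name__ == "__main__":
    for p in (1.0, 0.5):
        z = best_input_value(values, weights, p)
        print(p, z, coordinate_cost(values, weights, z, p))
```

For $p=1$ the returned value is the weighted median of the toy data, as the claim predicts; for $p<1$ the search over input values suffices because the cost is concave between consecutive input values.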

In particular, by Claim 4.1 we may assume that the coordinates of optimal cluster centroids are integers. Then, the $\alpha$ -property holds with $\alpha=1$ since at most one of the initial clusters could have distance zero to the cluster centroid, and all others have distance at least one since the cluster centroid is integral. Namely, let $x$ be a vector in the cluster, and $c$ be the cluster centroid, if $x\neq c$ , then there is a coordinate $j$ where $x$ and $c$ differ, and since they are both integral, $|x[j]-c[j]|\geq 1$ , and

\[\operatorname{dist}_{p}(x,c)=\sum_{i=1}^{d}|x[i]-c[i]|^{p}\geq|x[j]-c[j]|^{p}\geq 1.\]

In what follows, the expression half of vectors by weight means that the total weight of the corresponding set of vectors is at least half of the total weight of $C$ .

*Claim 4.2**.*

If at least half of the vectors by weight in the cluster $C$ have the same value $z$ in some coordinate $i$ , then the optimal cluster centroid is also equal to $z$ in this coordinate.

Proof.

Let $S$ be the weight-respecting multiset of values which vectors from $C$ have in the $i$ -th coordinate: $S=\{x[i]:x\in C,w(x)\text{ times}\}$ . Consider the difference between selecting $z$ and some other value $z^{\prime}$ as the $i$ -th coordinate of the centroid:

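\[\sum_{y\in S}|y-z|^{p}-\sum_{y\in S}|y-z^{\prime}|^{p}\leq\sum_{y\in S,\,y\neq z}\left(|y-z|^{p}-|y-z^{\prime}|^{p}-|z-z^{\prime}|^{p}\right).\]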

The inequality holds since at least half of the elements of $S$ are equal to $z$ , and so for any value $y\neq z$ there is a term $|z-z^{\prime}|^{p}$ in $\sum_{y\in S}|y-z^{\prime}|^{p}$ corresponding to one of the values from $S$ equal to $z$ . The last sum is non-positive because in every term

\[|y-z|^{p}\leq|y-z^{\prime}|^{p}+|z^{\prime}-z|^{p},\]

as $p\in(0,1]$ . This concludes the proof. ∎
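As a sanity check of Claim 4.2, the following sketch compares the cost of the majority value with the cost of every other integer candidate in a window. The multiset `S`, the candidate window, and the function names are assumptions chosen for illustration.

```python
def cost_1d(values, z, p):
    """Unweighted cost sum |y - z|^p of picking z in one coordinate."""
    return sum(abs(y - z) ** p for y in values)

def majority_value_is_optimal(values, z, p, candidates, eps=1e-9):
    """True if z is at least as cheap as every candidate (up to eps)."""
    base = cost_1d(values, z, p)
    return all(base <= cost_1d(values, other, p) + eps for other in candidates)

# Hypothetical multiset: the value 5 is held by exactly half of the entries.
S = [5, 5, 5, 0, 9, 12]

if __name__ == "__main__":
    for p in (0.25, 0.5, 1.0):
        print(p, majority_value_is_optimal(S, 5, p, range(-5, 21)))
```

Restricting the candidates to integers is justified here by Claim 4.1, which allows the centroid coordinate to be taken among the input values.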

In order to apply Theorem 9, we need an FPT algorithm for Cluster Selection. Before obtaining it, we state some properties of hypergraphs, which we need for the algorithm.

A hypergraph $G$ is a set of vertices $V(G)$ and a collection of hyperedges $E(G)$ , each hyperedge is a subset of $V(G)$ . If $G$ and $H$ are hypergraphs, we say that $H$ appears at $V^{\prime}\subset V(G)$ as a subhypergraph if there is a bijection $\pi:V(H)\to V^{\prime}$ with a property that for any $E\in E(H)$ there is $E^{\prime}\in E(G)$ such that $\pi(E)=E^{\prime}\cap V^{\prime}$ , the action of $\pi$ is extended to subsets of $V(H)$ in a natural way.

A fractional edge cover of a hypergraph $H$ is an assignment $\psi:E(H)\to[0,1]$ such that for every $v\in V(H)$ , $\sum_{E\in E(H):v\in E}\psi(E)\geq 1$ . The fractional cover number $\rho^{*}(H)$ is the minimum of $\sum_{E\in E(H)}\psi(E)$ taken over all fractional edge covers $\psi$ .

We need the following result of Marx [27] about finding occurrences of one hypergraph in another.

Lemma 10 ([27]).

Let $H$ be a hypergraph with fractional cover number $\rho^{*}(H)$ , and let $G$ be a hypergraph where each hyperedge has size at most $\ell$ . There is an algorithm that enumerates in time $|V(H)|^{\mathcal{O}(|V(H)|)}\cdot\ell^{|V(H)|\rho^{*}(H)+1}\cdot|E(G)|^{\rho^{*}(H)+1}\cdot|V(G)|^{2}$ every subset $V^{\prime}\subset V(G)$ where $H$ appears in $G$ as a subhypergraph.

Also, the following version of the Chernoff Bound will be of use.

*Proposition 2** ([5]).*

Let $X_{1}$ , $X_{2}$ , …, $X_{n}$ be independent 0-1 random variables. Denote $X=\sum_{i=1}^{n}X_{i}$ and $\mu=E[X]$ . Then for $0<\beta\leq 1$ ,

\[\Pr[X\leq(1-\beta)\mu]\leq\exp(-\beta^{2}\mu/2),\qquad\Pr[X\geq(1+\beta)\mu]\leq\exp(-\beta^{2}\mu/3).\]

We are ready to proceed with the proof that Cluster Selection with $p\in(0,1]$ is $\operatorname{{\sf FPT}}$ when parameterized by $D$ .

Theorem 11.

For every $p\in(0,1]$ , Cluster Selection with distance $\operatorname{dist}_{p}$ is solvable in time $2^{\mathcal{O}(D\log D)}(md)^{\mathcal{O}(1)}$ .

Proof.

First we check if any of the given vectors could be the centroid of the resulting composite cluster. When the centroid is fixed, we find the optimal solution in polynomial time by just selecting the cheapest vector with respect to this centroid from each set. If at some point we find a suitable centroid, then we return that the solution exists. If not, we may assume that the centroid is not equal to any of the given vectors. As a consequence, any vector $x$ selected into the solution cluster contributes at least $w(x)$ to the total distance, since the centroid must be integral by Claim 4.1. So we may now consider only vectors of weight at most $D$ and, moreover, the total weight of the resulting cluster is at most $D$ .
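The first step of the proof, selecting the cheapest vector from each set once the centroid is fixed, can be sketched as follows. The representation of each set as a list of `(vector, weight)` pairs and the function names are assumptions made for this illustration.

```python
def dist_p(x, c, p):
    """dist_p(x, c) = sum_i |x[i] - c[i]|^p, as used throughout the section."""
    return sum(abs(a - b) ** p for a, b in zip(x, c))

def select_for_centroid(sets, c, p):
    """For a fixed centroid c, pick the cheapest weighted vector from each set.

    sets: list of lists of (vector, weight) pairs (hypothetical encoding).
    Returns the chosen vectors and the total weighted cost.
    """
    chosen, total = [], 0.0
    for candidates in sets:
        x, w = min(candidates, key=lambda vw: vw[1] * dist_p(vw[0], c, p))
        chosen.append(x)
        total += w * dist_p(x, c, p)
    return chosen, total

if __name__ == "__main__":
    sets = [[((0, 0), 1), ((5, 5), 1)],
            [((1, 0), 2), ((0, 3), 1)]]
    print(select_for_centroid(sets, (0, 0), 1))
```

The instance is a yes-instance for this centroid exactly when the returned total is at most $D$.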

Consider a resulting cluster $C$ with the centroid $c$ . There is some $x_{1}$ in $C$ from $X_{1}$ , and $\operatorname{dist}_{p}(x_{1},c)\leq D$ . So if we try all possible $x_{1}$ from $X_{1}$ (there are at most $m$ of them), any feasible centroid is at distance at most $D$ from at least one of them. Since $x_{1}$ and $c$ are integral, they could be different in at most $D$ coordinates, as $\operatorname{dist}_{p}(x_{1},c)=\sum_{i=1}^{d}|x_{1}[i]-c[i]|^{p}\leq D$ .

We try all possible $x_{1}\in X_{1}$ . After $x_{1}$ is fixed, we enumerate all subsets $P$ of coordinates $\{1,\dots,d\}$ where $x_{1}$ and $c$ could differ; we show how to do this efficiently afterwards. When the subset of coordinates $P$ is fixed, we consider all possible centroids, which are integral, equal to $x_{1}$ in all coordinates except those in $P$ , and differ from $x_{1}$ by at most $D^{1/p}$ in each of the coordinates from $P$ . If $|x_{1}[i^{*}]-c[i^{*}]|>D^{1/p}$ for some coordinate $i^{*}$ , then $\operatorname{dist}_{p}(x_{1},c)=\sum_{i=1}^{d}|x_{1}[i]-c[i]|^{p}\geq|x_{1}[i^{*}]-c[i^{*}]|^{p}>D$ , so $c$ cannot be a centroid. With the restrictions stated above, there are at most $2^{\mathcal{O}(D\log D)}$ possible centroids.

It remains to show that we could enumerate all possible coordinate subsets efficiently. We reduce this task to the task of finding a specific subhypergraph and then apply Lemma 10.

*Claim 4.3**.*

There are $2^{\mathcal{O}(D\log D)}$ coordinate subsets where $x_{1}$ and an optimal cluster centroid $c$ could differ. There exists an algorithm which enumerates all of them in time $2^{\mathcal{O}(D\log D)}(md)^{\mathcal{O}(1)}$ .

Proof.

Let $G$ be a hypergraph with $V(G)=\{1,\dots,d\}$ , one vertex for each coordinate, and for each vector $x$ in $\cup_{j=1}^{t}X_{j}$ we take $w(x)$ copies of the hyperedge $E_{x}$ which contains exactly the coordinates where $x$ and $x_{1}$ differ. We add a hyperedge only if there are at most $D$ such coordinates, otherwise $x$ cannot be in the same cluster as $x_{1}$ . So hyperedges in $G$ are of size at most $D$ . Since we consider only vectors of weight at most $D$ , $|E(G)|\leq Dm$ .
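The construction of $G$ can be sketched directly. In the sketch below, coordinates are 0-indexed, vectors are given as hypothetical `(vector, weight)` pairs, and the function name `build_hypergraph` is illustrative.

```python
def build_hypergraph(x1, vectors, D):
    """Hyperedges of G relative to a fixed vector x1.

    vectors: list of (vector, weight) pairs (hypothetical encoding).
    Each kept vector x contributes w(x) copies of the set of coordinates
    (0-indexed here) where x and x1 differ. Vectors of weight above D, or
    differing from x1 in more than D coordinates, are skipped.
    """
    edges = []
    for x, w in vectors:
        if w > D:
            continue  # such vectors were already discarded in the proof
        diff = frozenset(i for i, (a, b) in enumerate(zip(x, x1)) if a != b)
        if len(diff) <= D:
            edges.extend([diff] * w)
    return edges

if __name__ == "__main__":
    vecs = [((0, 1, 0), 2), ((1, 1, 1), 1), ((0, 0, 0), 1)]
    print(build_hypergraph((0, 0, 0), vecs, 2))
```

Every hyperedge returned has size at most $D$, matching the bound used when Lemma 10 is applied.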

For a solution, let $x_{j}$ be the vector selected from the corresponding $X_{j}$ , for $j\in\{1,\dots,t\}$ , $C=\{x_{1},\dots,x_{t}\}$ be the solution cluster and $c$ be the centroid. All vectors in $C$ are identical in all coordinates except at most $D$ , since if there are different values in at least $D+1$ coordinates, the cost is at least $D+1$ . Denote this subset of coordinates as $Q$ , $c$ could also differ from $x_{1}$ only at $Q$ . Denote the subset of coordinates where $c$ differs from $x_{1}$ as $P$ , $P\subset Q$ and so $|P|\leq D$ . The solution $(C,c)$ induces a subhypergraph $H$ of $G$ in the following way. Leave only hyperedges corresponding to the vectors in $C$ , and restrict them to vertices in $P$ . There are at most $D$ vertices and at most $D$ hyperedges in $H$ , since the total weight is at most $D$ . An example of the correspondence between input vectors and hypergraphs is given in Figure 5.

The next claim shows that the fractional cover number of $H$ is bounded by a constant.

*Claim 4.4**.*

Each vertex in $H$ is covered by at least half of the hyperedges of $H$ , and $\rho^{*}(H)\leq 2$ .

Proof.

Consider a vertex $p\in P$ , and assume that less than half of the hyperedges cover $p$ . It means that in the $p$ -th coordinate the centroid $c$ differs from $x_{1}$ , but less than half of the vectors in $C$ by weight differ from $x_{1}$ in this coordinate. This contradicts Claim 4.2.

So each vertex is covered by at least half of the hyperedges, and setting $\psi\equiv\frac{2}{|E(H)|}$ leads to $\rho^{*}(H)\leq 2$ . ∎
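The uniform assignment $\psi\equiv 2/|E(H)|$ can be verified on a small example. The hypergraph below is hypothetical, and exact rational arithmetic is used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction

def is_fractional_edge_cover(vertices, edges, psi):
    """Check that every vertex receives total weight at least 1."""
    return all(
        sum(psi[k] for k, e in enumerate(edges) if v in e) >= 1
        for v in vertices
    )

def uniform_cover(edges):
    """The assignment psi = 2 / |E(H)| on every hyperedge, as in Claim 4.4."""
    return [Fraction(2, len(edges))] * len(edges)

if __name__ == "__main__":
    # Hypothetical hypergraph: each vertex lies in 3 of the 4 hyperedges.
    vertices = {0, 1, 2}
    edges = [{0, 1}, {1, 2}, {0, 2}, {0, 1, 2}]
    psi = uniform_cover(edges)
    print(is_fractional_edge_cover(vertices, edges, psi), sum(psi))
```

The total weight of the uniform assignment is always 2, which gives the bound $\rho^{*}(H)\leq 2$ whenever every vertex lies in at least half of the hyperedges.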

In order to enumerate all possible subsets of coordinates $P$ , we try all hypergraphs $H$ with at most $D$ vertices and at most $D$ hyperedges, and if each vertex is covered by at least half of the hyperedges, we find all places where $H$ appears in $G$ by Lemma 10. The last step is done in $2^{\mathcal{O}(D\log D)}\cdot(md)^{\mathcal{O}(1)}$ time. However, the number of possible $H$ could be up to $2^{\Omega(D^{2})}$ . The following claim, which is analogous to Proposition 6.3 in [27], shows that we could consider only hypergraphs with a logarithmic number of hyperedges.

*Claim 4.5**.*

If $D\geq 2$ , it is possible to delete all except at most $160\ln D$ hyperedges from $H$ so that in the resulting hypergraph $H^{*}$ each vertex is covered by at least $1/4$ of the hyperedges, and $\rho^{*}(H^{*})\leq 4$ .

Proof.

Denote $s=|E(H)|$ , construct a new hypergraph $H^{*}$ on the same vertex set $V(H)$ by independently selecting each hyperedge of $H$ with probability $(120\ln D)/s$ . Applying Proposition 2 with $\beta=1/3$ , probability of selecting more than $160\ln D$ hyperedges is at most $\exp((-120\ln D)/27)<1/D^{2}$ . By Claim 4.4, each vertex $v$ of $H$ is covered by at least $s/2$ hyperedges, and the expected number of hyperedges covering $v$ in $H^{*}$ is at least $60\ln D$ . By Proposition 2 with $\beta=1/3$ , the probability that $v$ is covered by less than $40\ln D$ hyperedges in $H^{*}$ is at most $\exp(-60\ln D/18)\leq 1/D^{3}$ . By the union bound, with probability at least $1-1/D^{2}-D\cdot 1/D^{3}>0$ we select at most $160\ln D$ hyperedges and each vertex is covered by at least $40\ln D$ hyperedges. So the claim holds, and $\rho^{*}(H^{*})\leq 4$ by setting $\psi\equiv\frac{4}{|E(H^{*})|}$ . ∎
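The constants in this proof can be checked numerically. The sketch assumes the standard multiplicative Chernoff bounds $\Pr[X\geq(1+\beta)\mu]\leq\exp(-\beta^{2}\mu/3)$ and $\Pr[X\leq(1-\beta)\mu]\leq\exp(-\beta^{2}\mu/2)$ with $\beta=1/3$; the function name is illustrative.

```python
import math

def claim_4_5_bounds(D: int):
    """Evaluate the two tail bounds used in the proof for a given D >= 2.

    Assumes the multiplicative Chernoff bounds with beta = 1/3:
      upper tail exponent beta^2 * mu / 3, lower tail exponent beta^2 * mu / 2.
    """
    beta = 1.0 / 3.0
    mu_edges = 120 * math.log(D)   # expected number of selected hyperedges
    mu_vertex = 60 * math.log(D)   # lower bound on expected cover of a vertex
    too_many = math.exp(-beta ** 2 * mu_edges / 3)   # more than 160 ln D edges
    too_few = math.exp(-beta ** 2 * mu_vertex / 2)   # fewer than 40 ln D covers
    success = 1 - too_many - D * too_few             # union bound over D vertices
    return too_many, too_few, success

if __name__ == "__main__":
    for D in (2, 3, 10, 50):
        print(D, claim_4_5_bounds(D))
```

For each tested $D$ the success probability is positive, so a suitable $H^{*}$ exists by the probabilistic argument.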

So if there is a subhypergraph $H$ in $G$ corresponding to a solution, then there is also a subhypergraph $H^{*}$ in $G$ appearing at the same subset of $V(G)$ with at most $160\ln D$ hyperedges and $\rho^{*}(H^{*})\leq 4$ . Since we only need to enumerate possible coordinate subsets, it suffices to consider only hypergraphs with at most $160\ln D$ hyperedges, and there are $2^{\mathcal{O}(D\log D)}$ of them. Since the fractional cover number is still bounded by a constant, the total running time is $2^{\mathcal{O}(D\log D)}\cdot(md)^{\mathcal{O}(1)}$ , as desired. ∎

With Claim 4.3 proven, the proof of the theorem is complete. The pseudocode given in Figure 6 summarizes the main steps of the algorithm. ∎

Combining Theorem 9 and Theorem 11, we obtain an $\operatorname{{\sf FPT}}$ algorithm for $k$ -Clustering. This proves Theorem 1, which we recall here.

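Theorem 1. For every $p\in(0,1]$ , $k$ -Clustering with distance $\operatorname{dist}_{p}$ is solvable in time $2^{\mathcal{O}(D\log D)}(nd)^{\mathcal{O}(1)}$ .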

Proof.

We have an algorithm for Cluster Selection whose running time is specified by Theorem 11. By Claim 4.1, the $\alpha$ -property holds. The only missing part is to describe the way of producing the set $\mathcal{D}$ of all possible cluster costs which are at most $D$ .

In the case $p=1$ all distances are integral so we can take $\mathcal{D}=\{0,\dots,D\}$ .

For the general case, let $\mathcal{B}=\{a^{p}:a\in\{1,\dots,\lceil D^{1/p}\rceil\}\}$ . Consider a cluster $C=\{x_{1},\dots,x_{t}\}$ and the corresponding optimal cluster centroid $c$ . For any $x_{j}\in C$ , $\operatorname{dist}_{p}(x_{j},c)=\sum_{i=1}^{d}|x_{j}[i]-c[i]|^{p}$ is a combination of elements of $\mathcal{B}$ with nonnegative integer coefficients. This is because $x_{j}$ and $c$ are integral and the cluster cost is at most $D$ , hence $|x_{j}[i]-c[i]|\leq D^{1/p}$ for each $i\in\{1,\dots,d\}$ . Since weights are also integral, the whole cluster cost is a combination of distances between cluster vectors and the centroid with nonnegative integer coefficients, and so also a combination of elements of $\mathcal{B}$ with nonnegative integer coefficients. This means that we can take

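\[\mathcal{D}=\left\{\sum_{b\in\mathcal{B}}a_{b}\cdot b\;:\;a_{b}\in\mathbb{Z}_{\geq 0},\ \sum_{b\in\mathcal{B}}a_{b}\cdot b\leq D\right\};\]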

the sum of coefficients $a_{b}$ is at most $D$ since all elements of $\mathcal{B}$ are at least 1. The size of $\mathcal{D}$ is at most $|\mathcal{B}|^{D}=2^{\mathcal{O}(D\log D)}$ . ∎
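The set $\mathcal{D}$ of candidate cluster costs can be enumerated for small $D$ as sketched below. Floating-point sums are deduplicated by rounding to `ndigits` decimals, which is an assumption of this sketch and not part of the proof; the function name is illustrative.

```python
import math

def possible_costs(D, p, ndigits=9):
    """All sums of elements a^p (a = 1..ceil(D^(1/p))) that are at most D.

    Sums are built by repeatedly adding one base element; values are
    rounded to `ndigits` decimals to merge floating-point duplicates
    (a heuristic assumed for this sketch).
    """
    top = math.ceil(D ** (1 / p))
    base = [a ** p for a in range(1, top + 1) if a ** p <= D + 1e-12]
    costs = {0.0}
    frontier = {0.0}
    while frontier:
        new = set()
        for c in frontier:
            for b in base:
                s = round(c + b, ndigits)
                if s <= D and s not in costs:
                    new.add(s)
        costs |= new
        frontier = new
    return sorted(costs)

if __name__ == "__main__":
    print(possible_costs(4, 1))
    print(possible_costs(2, 0.5))
```

For $p=1$ the result is $\{0,\dots,D\}$, matching the special case stated before the general construction.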

Note that another widely studied version of $k$ -Clustering is where centroids $c_{i}$ could be selected only among the set of given vectors. Naturally, our algorithm also works in this setting since the set of possible centroids is only restricted further.

4.2 W[1]-hardness of Cluster Selection parameterized by $t+d$ for $p=1$

In this subsection, we restrict our attention to the $p=1$ case. What happens when $D$ is not bounded, but the dimension $d$ and the number of clusters $k$ are parameters? There is a trivial XP-algorithm in time $n^{\mathcal{O}(kd)}$ , as by Claim 4.1 it suffices to try all possible combinations of the values present in coordinates as possible cluster centroids. There are at most $n$ distinct values in each coordinate, so at most $n^{d}$ candidates for a cluster centroid. After the cluster centroids are fixed, each vector goes to the cluster with the closest centroid.
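The XP algorithm described above can be sketched as a brute-force search for $p=1$. The function name and the toy input are illustrative, and the search is only practical for very small instances.

```python
from itertools import combinations_with_replacement, product

def k_median_bruteforce(X, k):
    """Try every k-tuple of candidate centroids built from input coordinates.

    By Claim 4.1 (p = 1), each centroid coordinate may be taken among the
    values present in that coordinate, giving at most n^d candidates.
    Each vector is charged the L1 distance to its closest centroid.
    """
    d = len(X[0])
    coord_values = [sorted({x[i] for x in X}) for i in range(d)]
    candidates = list(product(*coord_values))
    best_cost, best_centers = None, None
    for centers in combinations_with_replacement(candidates, k):
        cost = sum(
            min(sum(abs(a - b) for a, b in zip(x, c)) for c in centers)
            for x in X
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_cost, best_centers

if __name__ == "__main__":
    X = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(k_median_bruteforce(X, 2))
```

The number of centroid tuples examined is at most $n^{dk}$, in agreement with the $n^{\mathcal{O}(kd)}$ bound.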

We do not know of a lower bound for $k$ -Clustering complementing this algorithm. However, we are able to show the hardness of Cluster Selection with respect to the dimension.

Theorem 12.

*Cluster Selection with distance $\operatorname{dist}_{1}$ is $\operatorname{{\sf W}}[1]$ -hard when parameterized by $t+d$ .*

Proof.

We construct a reduction from Multicolored Clique with the input $G$ and $k$ . We set $d$ to $k$ . For each pair of colors $1\leq i<j\leq k$ and each edge $e=\{u,v\}$ between a vertex $u$ of color $i$ and a vertex $v$ of color $j$ , we add a vector $x_{e}$ to the set $X_{i,j}$ , such that $x_{e}[i]=u$ , $x_{e}[j]=v$ and all other coordinates are set to zero, and a vector $y_{e}$ to the set $Y_{i,j}$ which is the same as $x_{e}$ , except that the coordinates other than $i$ and $j$ are set to $|V(G)|+1$ . We will refer to 0 and $|V(G)|+1$ as boundary values. The sets $X_{i,j}$ and $Y_{i,j}$ are the input to Cluster Selection, so $t$ is $2\binom{k}{2}$ , and we set $D$ to $k(|V(G)|+1)\binom{k-1}{2}$ . Intuitively, the set $X_{i,j}$ corresponds to the choice of the clique edge between $i$ -th and $j$ -th color, and $Y_{i,j}$ mirrors it. All vectors have weight one. An example is given in Figure 7.

Note that in any feasible cluster, each coordinate $i$ has exactly $2(k-1)$ values in $[1,|V(G)|]$ , one from each of the sets $X_{i,j}$ and $Y_{i,j}$ for $j\neq i$ . Out of all $2(\binom{k}{2}-k+1)=2\binom{k-1}{2}$ other values, exactly half are zero and half are $|V(G)|+1$ . So the median is always in $[1,|V(G)|]$ , and the boundary values in each column contribute exactly $(|V(G)|+1)\binom{k-1}{2}$ to the total distance.

Assume there is a colorful $k$ -clique in $G$ , with vertices $v_{1}$ , $v_{2}$ , …, $v_{k}$ . We form the resulting cluster by choosing the vector corresponding to the clique’s edge between its $i$ -th and $j$ -th vertices from $X_{i,j}$ , and also from $Y_{i,j}$ , for all $1\leq i<j\leq k$ . For this cluster, in the $i$ -th coordinate we have all non-boundary values equal to $v_{i}$ . So the median is also $v_{i}$ , and the total distance is $D$ , since non-boundary values do not contribute anything.

In the other direction, if we are able to select a cluster of cost exactly $D$ , then all non-boundary values in each coordinate must be equal, denote this common value in the $i$ -th coordinate as $v_{i}$ . We claim that vertices $v_{1}$ , $v_{2}$ , …, $v_{k}$ form a colorful clique in $G$ . Indeed, since we have $2(k-1)$ times $v_{i}$ in the $i$ -th column, then we have $(k-1)$ of them from the sets $X_{i,j}$ , one from each, and in the $j$ -th column the only non-boundary value is $v_{j}$ . So $v_{i}$ must have an edge to each $v_{j}$ for $j\neq i$ . By construction, vertices in the $i$ -th coordinate are of color $i$ .

∎
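The reduction of Theorem 12 can be sketched on a small graph. The encoding below (vertices numbered $1,\dots,|V(G)|$, a `color` dictionary, 1-indexed colors) and the function names are assumptions made for illustration.

```python
from math import comb

def reduce_multicolored_clique(n_vertices, color, edges, k):
    """Build the sets X_{i,j}, Y_{i,j} and the budget D of the reduction.

    Vertices are assumed to be the integers 1..n_vertices; color[v] is in
    1..k. Boundary values are 0 (in X) and n_vertices + 1 (in Y).
    """
    X = {(i, j): [] for i in range(1, k + 1) for j in range(i + 1, k + 1)}
    Y = {(i, j): [] for i in range(1, k + 1) for j in range(i + 1, k + 1)}
    for u, v in edges:
        if color[u] > color[v]:
            u, v = v, u
        i, j = color[u], color[v]
        if i == j:
            continue  # edges inside a color class play no role
        x = [0] * k
        y = [n_vertices + 1] * k
        x[i - 1] = y[i - 1] = u
        x[j - 1] = y[j - 1] = v
        X[(i, j)].append(tuple(x))
        Y[(i, j)].append(tuple(y))
    D = k * (n_vertices + 1) * comb(k - 1, 2)
    return X, Y, D

def l1_cluster_cost(cluster):
    """L1 cost of a cluster, using a median in every coordinate."""
    total = 0
    for i in range(len(cluster[0])):
        vals = sorted(x[i] for x in cluster)
        med = vals[len(vals) // 2]
        total += sum(abs(v - med) for v in vals)
    return total

if __name__ == "__main__":
    # Triangle on vertices 1, 2, 3, one vertex per color.
    X, Y, D = reduce_multicolored_clique(
        3, {1: 1, 2: 2, 3: 3}, [(1, 2), (1, 3), (2, 3)], 3)
    cluster = [vs[0] for vs in X.values()] + [vs[0] for vs in Y.values()]
    print(D, l1_cluster_cost(cluster))
```

On the triangle, the cluster built from the clique edges has cost exactly $D$, which is the forward direction of the proof on this instance.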

5 The $L_{0}$ distance

In this section, we consider the case $p=0$ . It is a natural measure of difference to consider since observation parameters are often incomparable, and we very well may be interested in counting only the number of different entries. From another point of view, the $L_{0}$ distance gives the $k$ -Clustering problem a more combinatorial flavor, since the input vectors could be viewed as strings and we are interested in how close they are according to the Hamming distance. However, in comparison to a number of problems on strings, the size of the alphabet is unbounded.

First, note that there is a simple rule of finding the optimal cluster centroid for a given cluster.

*Observation 1**.*

For a given cluster $C$ , the coordinates of the optimal cluster centroid $c$ could be set as

\[c[i]=\operatorname*{arg\,max}_{a}\,|\{x\in C:x[i]=a\}|\quad\text{for }i\in\{1,\dots,d\},\]

breaking ties in favor of the lowest values.

By Observation 1, we may assume that optimal cluster centroids could never have values not present in the input, and in particular that they are integral.
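The rule of Observation 1 amounts to taking the most frequent value in each coordinate. A minimal sketch, with illustrative function names and a hypothetical toy cluster:

```python
from collections import Counter

def l0_centroid(cluster):
    """Most frequent value per coordinate, ties broken toward the lowest value."""
    centroid = []
    for i in range(len(cluster[0])):
        counts = Counter(x[i] for x in cluster)
        top = max(counts.values())
        centroid.append(min(v for v, c in counts.items() if c == top))
    return tuple(centroid)

def l0_cost(cluster, c):
    """Total Hamming distance from the cluster vectors to c."""
    return sum(sum(1 for a, b in zip(x, c) if a != b) for x in cluster)

if __name__ == "__main__":
    cluster = [(1, 7, 3), (1, 8, 3), (2, 8, 4)]
    c = l0_centroid(cluster)
    print(c, l0_cost(cluster, c))
```

Since the coordinates are optimized independently, the resulting centroid minimizes the total Hamming distance to the cluster.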

We prove W[1]-hardness of $k$ -Clustering with the $L_{0}$ distance by showing a reduction from Clique. The reduction also shows hardness of Cluster Selection.

Note that when $d$ is fixed, we could apply Theorem 9 to obtain an FPT algorithm: Cluster Selection can be solved trivially by trying every present value in each coordinate as a value for the centroid, as there are only $n^{d}$ candidates. The $\alpha$ -property holds for the $L_{0}$ distance with $\alpha=1$ since at most one initial cluster could coincide with the cluster centroid, and all others have distance at least one.

We restate Theorem 2, which we prove next.

See Theorem 2.

Proof.

First we show how to obtain an FPT reduction from Clique parameterized by the clique size to $k$ -Clustering.

Given an instance ( $G$ , $k$ ) of Clique, for each pair of indices $\{i,j\}$ , $1\leq i<j\leq k$ , we make $|E(G)|$ vectors in $\mathbb{Z}^{k}$ , assume $k\geq 3$ . For each $e=\{u,v\}\in E(G)$ , we add a vector $x_{i,j,e}$ : two coordinates are set to vertex values, $x_{i,j,e}[i]=u$ , $x_{i,j,e}[j]=v$ , and in all other coordinates $x_{i,j,e}$ is set to the special padding value $c_{i,j,e}=|V(G)|+(k\cdot i+j)\cdot|E(G)|+e$ . In total, there are $n=\binom{k}{2}|E(G)|$ vectors and $|V(G)|+\binom{k}{2}|E(G)|$ different values, since there are $|V(G)|$ vertex values, all padding values are distinct from vertex values and from each other.

Finally, we set $k^{\prime}=n-\binom{k}{2}+1$ and $D=\binom{k}{2}(k-2)$ . An example of the reduction is shown in Figure 8.
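The reduction from Clique can be sketched as follows. The encoding is an assumption of this sketch: vertices are the integers $1,\dots,|V(G)|$, edges are numbered from 1, and each edge is oriented so that its smaller endpoint goes to coordinate $i$; the function names are illustrative.

```python
from itertools import combinations
from math import comb

def reduce_clique(n_vertices, edges, k):
    """Vectors x_{i,j,e}, the number of clusters k', and the budget D.

    Padding value: n_vertices + (k * i + j) * |E| + e, with e the 1-based
    edge index. Orientation u < v is an assumption of this sketch.
    """
    m = len(edges)
    vectors = {}
    for i, j in combinations(range(1, k + 1), 2):
        for e_idx, edge in enumerate(edges, start=1):
            u, v = min(edge), max(edge)
            pad = n_vertices + (k * i + j) * m + e_idx
            x = [pad] * k
            x[i - 1], x[j - 1] = u, v
            vectors[(i, j, e_idx)] = tuple(x)
    n = comb(k, 2) * m
    return vectors, n - comb(k, 2) + 1, comb(k, 2) * (k - 2)

def hamming(x, c):
    """L0 (Hamming) distance between two vectors."""
    return sum(1 for a, b in zip(x, c) if a != b)

if __name__ == "__main__":
    vectors, k_prime, D = reduce_clique(3, [(1, 2), (1, 3), (2, 3)], 3)
    clique_cluster = [vectors[(1, 2, 1)], vectors[(1, 3, 2)], vectors[(2, 3, 3)]]
    print(k_prime, D, sum(hamming(x, (1, 2, 3)) for x in clique_cluster))
```

On the triangle with $k=3$, the cluster formed by the three clique edges with centroid $(1,2,3)$ has cost exactly $D$, and the remaining vectors form clusters of size one.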

Now we prove that the original instance has a $k$ -clique iff the transformed instance has a $k^{\prime}$ -clustering of cost at most $D$ .

If there is a $k$ -clique, there is a clustering with cost $D$ : we take one nontrivial cluster of size $\binom{k}{2}$ and all other clusters are of size 1. Let $v_{1}$ ,…, $v_{k}$ be the vertices of the clique, for each $\{i,j\}$ , $1\leq i<j\leq k$ we take $x_{i,j,\{v_{i},v_{j}\}}$ into the cluster. The cluster centroid is $(v_{1},...,v_{k})$ , each vector in the cluster has distance to the centroid of exactly $(k-2)$ .

Now to the opposite direction. Assume that there is a clustering of cost at most $D$ , and there are $t$ composite clusters: $C_{1}$ , …, $C_{t}$ . In each cluster and each coordinate, by Observation 1 we may assume that we select the most frequent vertex there as the value of the centroid, since all padding values are distinct. If there are no vertex values in this cluster in this coordinate, we may assume that we select any of the occurring padding values. For a cluster $C$ , denote the number of vertex-containing coordinates as $\beta(C)$ , and the total number of vertex-valued entries which do not match the centroid value in the corresponding coordinate as $\gamma(C)$ . We could write the total cost of the clustering as

\[\sum_{i=1}^{t}\Big(|C_{i}|(k-2)-(k-\beta(C_{i}))+\gamma(C_{i})\Big).\]

That holds since in each cluster $C_{i}$ each of the $|C_{i}|(k-2)$ padding values is not matched with the cluster centroid and increases the total distance by one, except for the $(k-\beta(C_{i}))$ vertex-free coordinates, where exactly one of the padding values is selected as the value of the centroid. Also, each vertex-valued entry which is not matched with the centroid increases the total distance by one, and there are $\gamma(C_{i})$ of them.

There are $n-\binom{k}{2}+1$ clusters in total, $n-\binom{k}{2}+1-t$ of them are simple. We may assume that in the optimal clustering there are no empty clusters, since we could always move a vector from a composite cluster to an empty one without increasing the cost. So there are $n-(n-\binom{k}{2}+1-t)=t+\binom{k}{2}-1$ vectors in the composite clusters, which is equal to $\sum_{i=1}^{t}|C_{i}|$ . We could rewrite the total cost as

\[(k-2)\left(\binom{k}{2}-1\right)+\sum_{i=1}^{t}\big(\beta(C_{i})-2+\gamma(C_{i})\big).\]

Now we show that for any clustering the value $\sum_{i=1}^{t}(\beta(C_{i})-2+\gamma(C_{i}))$ is at least $(k-2)$ , and it is equal to $(k-2)$ only in the $k$ -clique clustering. It suffices to prove the following lemma.

Lemma 13.

For any cluster $C$ such that $2\leq|C|\leq\binom{k}{2}$ , $\frac{\beta(C)-2+\gamma(C)}{|C|-1}\geq\kappa$ , where $\kappa=\frac{k-2}{\binom{k}{2}-1}=\frac{2}{k+1}$ , and the equality holds only when $C$ is a $k$ -clique.

The lemma implies

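\[\sum_{i=1}^{t}\big(\beta(C_{i})-2+\gamma(C_{i})\big)\geq\kappa\sum_{i=1}^{t}(|C_{i}|-1)=\kappa\left(\binom{k}{2}-1\right)=k-2,\]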

and also that the equality holds only when each term is equal to $\kappa$ , so each $C_{i}$ is a $k$ -clique, but then $t=1$ since $\sum_{i=1}^{t}(|C_{i}|-1)=\binom{k}{2}-1$ . So $G$ must contain a $k$ -clique if there is a clustering of cost at most $D$ , and the reduction is correct. Note that none of the $C_{i}$ could have size larger than $\binom{k}{2}$ since there are $n-\binom{k}{2}+1$ clusters in total.

Proof of Lemma 13.

First, we consider the case $\gamma(C)=0$ , so in each coordinate all vertex values are equal.

*Claim 5.1**.*

If $C$ is a cluster of vectors obtained by applying the reduction described in the proof of Theorem 2 to any graph $H$ , $\gamma(C)=0$ , and $\binom{l}{2}<|C|$ , then $\beta(C)\geq l+1$ .

Proof.

The proof is by induction on $l$ . The base is $l=1$ , and each non-empty cluster contains at least one vector and so at least 2 coordinates with vertices, we assume $\binom{1}{2}=0$ .

For the general case, if there are at least $l$ occurrences of a vertex $v$ in a coordinate $i$ , then there are at least $(l+1)$ coordinates with vertices. Each vector with $v$ in the $i$ -th coordinate has also some other vertex in some other coordinate. As in each coordinate all vertex values are equal, it could not be that two of the vectors with the value $v$ in the $i$ -th coordinate share the second vertex-valued coordinate, since then they would represent the same edge.

So each coordinate has at most $(l-1)$ vertex occurrences, otherwise the claim holds. Select a coordinate $j$ which contains some vertex value $u$ and remove the $j$ -th coordinate and all vectors which have the value $u$ in the $j$ -th coordinate. That corresponds to the natural restriction $C^{\prime}$ of the cluster $C$ to a subgraph $H-u$ . The size of $C^{\prime}$ is at least $\binom{l}{2}+1-(l-1)=\binom{l-1}{2}+1$ , and by induction there are at least $l$ coordinates which contain vertex values, so the original cluster $C$ has at least $l+1$ such coordinates, since there is also the $j$ -th coordinate with the vertex value $u$ . ∎

Now consider a cluster $C$ with $\gamma(C)=0$ . Let $l$ be the largest value with $\binom{l}{2}+1\leq|C|$ , so $|C|\leq\binom{l+1}{2}$ . Since $|C|\leq\binom{k}{2}$ , $l+1\leq k$ . By Claim 5.1, $\beta(C)\geq l+1$ , then

\[\frac{\beta(C)-2}{|C|-1}\geq\frac{l-1}{\binom{l+1}{2}-1}=\frac{2}{l+2}\geq\frac{2}{k+1}=\kappa,\]

and so if $l+1<k$ , the inequality is strict. It is also strict if $l+1=k$ and $|C|<\binom{k}{2}$ , as the denominator becomes larger in the first step. Thus the only possibility of getting exactly $\kappa$ is when $|C|=\binom{k}{2}$ .

But then we have exactly $k\cdot(k-1)$ vertex values across $k$ coordinates, and each coordinate has at most $(k-1)$ vertex values by the argument in Claim 5.1, so each coordinate must have exactly $(k-1)$ vertex values. Since $\gamma(C)=0$ , they must be all equal. Denote the common vertex value in the $i$ -th coordinate as $v_{i}$ . Since each occurrence of $v_{i}$ in the $i$ -th coordinate corresponds to an edge to a different $v_{j}$ , vertices $v_{1}$ , …, $v_{k}$ form a clique in $G$ .

In the case $\gamma(C)>0$ , consider a new cluster $C^{\prime}$ which is obtained from $C$ by removing all vectors which have a vertex-valued entry not equal to the centroid value. Assume for now that $|C^{\prime}|\geq 2$ . By the proof above, $\frac{\beta(C^{\prime})-2}{|C^{\prime}|-1}\geq\kappa$ , since $\gamma(C^{\prime})=0$ . The value $\frac{\beta(C)-2+\gamma(C)}{|C|-1}$ could be obtained from $\frac{\beta(C^{\prime})-2}{|C^{\prime}|-1}$ by adding $\gamma(C)+(\beta(C)-\beta(C^{\prime}))$ to the numerator and $|C|-|C^{\prime}|$ to the denominator. Removing vectors could not increase $\beta$ , so $\beta(C)-\beta(C^{\prime})\geq 0$ , and $\gamma(C)\geq|C|-|C^{\prime}|$ since each of the removed vectors has at least one vertex value not equal to the centroid value. If $\frac{\beta(C^{\prime})-2}{|C^{\prime}|-1}\geq 1$ , then the new fraction is also at least 1 and so strictly greater than $\kappa$ . If $|C^{\prime}|\leq 1$ , then $\frac{\beta(C)-2+\gamma(C)}{|C|-1}\geq 1$ since $\beta(C)\geq 2$ and $\gamma(C)\geq|C|-|C^{\prime}|$ . If $\frac{\beta(C^{\prime})-2}{|C^{\prime}|-1}<1$ , then the new fraction becomes strictly larger, and so is strictly larger than $\kappa$ . In all cases, the inequality is strict when $\gamma(C)>0$ .

∎

Now to Cluster Selection: the reduction is almost the same, only we start from Multicolored Clique, and for each pair of indices $\{i,j\}$ , $1\leq i<j\leq k$ we obtain the set of vectors $X_{i,j}$ from edges in $G$ starting in color $i$ and ending in color $j$ . The vectors are constructed in the same way as in the previous reduction. All weights are set to one. The value of $D$ is the same, $D=\binom{k}{2}(k-2)$ .

Since vectors are constructed in the same way, all statements about the cost of grouping them remain valid, in particular Lemma 13. Only now the statement of Cluster Selection already guarantees that we select exactly one cluster and exactly one vector from each $X_{i,j}$ , so exactly one edge between each pair of colors. And by Lemma 13 only the proper $k$ -clique has the optimal cost.

∎

Note that Cluster Selection with the $L_{0}$ distance is very similar to the known problem Consensus String With Outliers, studied e.g. in [7]. The only difference is that in Cluster Selection we have to select one point from each of the given sets, whereas in Consensus String With Outliers the goal is to select an arbitrary subset of size $(n-k)$ . The construction from Theorem 2 also shows W[1]-hardness of Consensus String With Outliers with respect to $(d+D+n-k)$ in the case of an unbounded alphabet.

6 The $L_{\infty}$ distance

In this section, we consider the case $p=\infty$ . We prove two hardness results for $k$ -Clustering: $\operatorname{{\sf W}}[1]$ -hardness when parameterized by $D$ and $\operatorname{{\sf NP}}$ -hardness in the case $k=2$ .

First, we prove some useful facts about the structure of optimal cluster centroids. One respect in which the $L_{\infty}$ distance is harder than all other distances we consider is that, even when the cluster is given, we cannot find the optimal cluster centroid by optimizing the value in each coordinate independently. So there seems to be no simple rule for finding the optimal cluster centroid of a given cluster. However, one can still do that in polynomial time by solving a linear program.

*Claim 6.1**.*

Given a multiset $C$ of vectors in $\mathbb{Z}^{d}$ , there is a polynomial time algorithm to find $c\in\mathbb{R}^{d}$ minimizing

\[\sum_{x\in C}\operatorname{dist}_{\infty}(x,c).\]

Proof.

We reduce to solving a linear program, which we define next. Denote $C=\{x_{1},\dots,x_{n}\}$ , introduce variables $c_{1}$ , …, $c_{d}$ corresponding to coordinates of the cluster centroid and variables $d_{1}$ , …, $d_{n}$ , where $d_{i}$ corresponds to the value $\operatorname{dist}_{\infty}(x_{i},c)$ . The following linear program solves to the minimum total distance.

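\[\begin{aligned}
\text{minimize}\quad&\sum_{i=1}^{n}d_{i}\\
\text{subject to}\quad&d_{i}\geq x_{i}[j]-c_{j},&&i\in\{1,\dots,n\},\ j\in\{1,\dots,d\},\\
&d_{i}\geq c_{j}-x_{i}[j],&&i\in\{1,\dots,n\},\ j\in\{1,\dots,d\}.
\end{aligned}\]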

∎

The next claim shows that we could only consider half-integral cluster centroids.

*Claim 6.2**.*

For any multiset $C$ of vectors in $\mathbb{Z}^{d}$ , the vector $c\in\mathbb{R}^{d}$ which minimizes

\[\sum_{x\in C}\operatorname{dist}_{\infty}(x,c)\]

could always be chosen from $\frac{1}{2}\mathbb{Z}^{d}$ (coordinates are either integer or half-integer).

Proof.

Assume that we have an optimal solution $c$ which has at least one coordinate not of the form $z/2$ , $z\in\mathbb{Z}$ . For $a\in\mathbb{R}$ we denote $\text{frac}(a)=a-\lfloor a\rfloor$ , and

\[\min\{\text{frac}(a),\,1-\text{frac}(a)\},\]

calling this value the remainder of $a$ .

We can partition all coordinates into equivalence classes by the remainder of $c$. One can also define a partition of all vectors by the remainder of the distance to $c$. These two partitions are related in the following sense: if $\operatorname{dist}_{\infty}(x,c)$ has remainder $\xi$ then each coordinate $j$ where $|x[j]-c[j]|=\operatorname{dist}_{\infty}(x,c)$ also has remainder $\xi$, and vice versa. Now we take one particular remainder and show that we can shift it without losing optimality.

There are two kinds of vectors with the particular remainder $\xi$: call bottom those vectors $x$ for which $\text{frac}(\operatorname{dist}_{\infty}(x,c))=\xi$, and call top those vectors $x$ for which $\text{frac}(\operatorname{dist}_{\infty}(x,c))=1-\xi$. Similarly, there are also two kinds of coordinates of $c$, which we also call bottom and top depending on the value of $\text{frac}(c[j])$.

Consider a bottom coordinate $j$. Increasing $c[j]$ increases $|x[j]-c[j]|$ for all bottom vectors $x$, and decreases $|x[j]-c[j]|$ for all top vectors $x$. Decreasing $c[j]$ does the opposite, as does increasing a top coordinate. So if we take some sufficiently small value $\beta$ and simultaneously increase all bottom coordinates and decrease all top coordinates by $\beta$, then the distance of every bottom vector becomes larger by $\beta$, and the distance of every top vector becomes smaller by $\beta$. And if we do the opposite, the bottom vectors cost less and the top vectors cost more. Thus we take the group which has more vectors (bottom or top) and choose the action which decreases the distance for these vectors. The larger group has at least as many vectors as the smaller group, so the total distance does not increase.

It remains to see which value of $\beta$ we could take. We could safely shift until we either reach a value in $\frac{1}{2}\mathbb{Z}$ or another remainder. In any case, we reduce the number of distinct remainders by one, and so we conclude the proof by doing this inductively over the number of distinct remainders.

∎
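To make Claims 6.1 and 6.2 concrete, here is a minimal Python sketch that finds an optimal $L_{\infty}$ centroid of a tiny cluster by brute force over half-integral points. It is not the linear program from Claim 6.1; the function names are illustrative, and the restriction to the bounding box of the cluster is an extra assumption of the sketch (clamping a centroid to the box does not increase any coordinate-wise distance).

```python
from itertools import product


def linf_dist(x, c):
    """L-infinity distance between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(x, c))


def cluster_cost(cluster, c):
    """Total L-infinity distance from every vector of the cluster to c."""
    return sum(linf_dist(x, c) for x in cluster)


def best_half_integral_centroid(cluster):
    """Brute-force search over half-integral points of the bounding box.

    Exponential in the dimension; meant only for tiny illustrative inputs.
    """
    d = len(cluster[0])
    axes = []
    for j in range(d):
        lo = min(x[j] for x in cluster)
        hi = max(x[j] for x in cluster)
        # Half-integral values lo, lo + 1/2, ..., hi in coordinate j.
        axes.append([lo + h / 2 for h in range(2 * (hi - lo) + 1)])
    return min(product(*axes), key=lambda c: cluster_cost(cluster, c))
```

On the cluster $\{(0,0),(2,0),(0,2)\}$ the coordinate-wise median $(0,0)$ has cost $4$, while the search finds a centroid of cost $3$, which illustrates why coordinates cannot be optimized independently.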

By Claim 6.2, the $\alpha$ -property holds with $\alpha=1/2$ , since at most one vector could be equal to the cluster centroid, and all others have distance at least $1/2$ due to half-integrality. We can also see that when the problem is parameterized by $d+D$ , it is FPT.

*Claim 6.3**.*

$k$ -Clustering with the $L_{\infty}$ distance is FPT when parameterized by $d+D$ .

Proof.

We use Theorem 9. We have the $\alpha$ -property, and for the set $\mathcal{D}$ of all possible cluster costs not exceeding $D$ we could take all half-integral values not exceeding $D$ by Claim 6.2. All that remains is to solve Cluster Selection in FPT time.

For that, we try all possible $x_{1}\in X_{1}$ , and then try each possible resulting cluster centroid $c$ . Since $\operatorname{dist}_{\infty}(x_{1},c)\leq D$ and $c$ is half-integral by Claim 6.2, we can try only vectors $c$ of this form, and that is done in time $(2D+1)^{d}$ . ∎

6.1 $\operatorname{{\sf W}}[1]$ -hardness when parameterized by $D$

Knowing that $k$ -Clustering with the $L_{\infty}$ distance is FPT when parameterized by $d+D$ , the next natural question is, is the problem FPT or $\operatorname{{\sf W}}[1]$ -hard when parameterized only by $D$ ? We show that $\operatorname{{\sf W}}[1]$ -hardness is the case, proving Theorem 3, which we recall here for convenience.

See Theorem 3.

Proof.

First, we show a reduction from Clique to $k$ -Clustering. Given a graph $G$ and a clique size $k$ , we construct the following instance of the clustering problem.

We set the dimension to $|V(G)|+\binom{|V(G)|}{2}-|E(G)|$. We take $|V(G)|$ vectors $\{x_{i}\}_{i=1}^{|V(G)|}$ corresponding to vertices. For the vector of a vertex $v$, the first $|V(G)|$ coordinates are set to zero, except the $v$-th coordinate, which is set to $2$.

The last $\binom{|V(G)|}{2}-|E(G)|$ coordinates correspond to non-edges, that is, vertex pairs which are not connected by an edge. For each vertex pair $\{u,v\}\notin E(G)$, in the coordinate $\{u,v\}$ we set $x_{u}$ to $2$ and $x_{v}$ to $-2$ (the order on $u$, $v$ is chosen arbitrarily), and all other vectors to zero.

Finally, we set the number of clusters to $|V(G)|-k+1$ and the total distance to $k$. We show an example of how the reduction works in Figure 9.
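The construction just described can be sketched in Python as follows. The function names, the vertex labelling $0,\dots,n-1$ and the mid-range centroid helper are assumptions made for illustration; the mid-range centroid is only a convenient upper-bound witness, not an optimal centroid in general.

```python
from itertools import combinations


def clique_to_linf_clustering(n, edges, k):
    """Build the clustering instance for a graph on vertices 0..n-1.

    Returns (vectors, number_of_clusters, D).
    """
    edge_set = {frozenset(e) for e in edges}
    non_edges = [pair for pair in combinations(range(n), 2)
                 if frozenset(pair) not in edge_set]
    dim = n + len(non_edges)
    vectors = [[0] * dim for _ in range(n)]
    for v in range(n):
        vectors[v][v] = 2                 # vertex coordinate
    for idx, (u, v) in enumerate(non_edges):
        vectors[u][n + idx] = 2           # orientation of the pair is arbitrary
        vectors[v][n + idx] = -2
    return vectors, n - k + 1, k


def midrange_cost(cluster):
    """L-infinity cost of a cluster w.r.t. the coordinate-wise mid-range point."""
    dim = len(cluster[0])
    c = [(min(x[j] for x in cluster) + max(x[j] for x in cluster)) / 2
         for j in range(dim)]
    return sum(max(abs(x[j] - c[j]) for j in range(dim)) for x in cluster)
```

For a triangle $\{0,1,2\}$ with a pendant vertex $3$ attached to $2$ and $k=3$, the cluster formed by the three triangle vectors has mid-range cost $3=k$, and the remaining vector forms a trivial cluster of cost zero.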

If there is a clique of size $k$ in $G$, then we have a solution of cost $k$: take the $k$ vectors corresponding to the clique vertices as one cluster, and make all other clusters trivial. For the only nontrivial cluster $C$, we can always choose $c$ so that $|x[j]-c[j]|\leq 1$ for any $x\in C$ and any coordinate $j$. Each vertex coordinate contains only $0$ and $2$, so setting $c$ to $1$ there suffices. Since there is an edge between any two vertices of $C$, in any non-edge coordinate $j$ the vectors of $C$ have either all zeroes, or zeroes and a $2$, or zeroes and a $-2$. In each of these cases there is a suitable value for $c[j]$: $0$, $1$ or $-1$, respectively.

Next, we prove that any solution has cost at least $k$, and any solution which is not a $k$-clique has strictly larger cost. For that, we prove the following claim.

*Claim 6.4**.*

In the instance above, the cost of any cluster $C$ containing at least two vectors is at least $|C|$ . If there is at least one non-edge in $C$ , then the cost is at least $|C|+1$ .

Proof.

Denote the cluster centroid as $c$ . If each vector $x$ in $C$ has $\operatorname{dist}_{\infty}(x,c)\geq 1$ , the first statement is trivial. So assume that there is a vector $x^{*}$ in $C$ such that $\operatorname{dist}_{\infty}(x^{*},c)=\xi<1$ . Consider the coordinate $j^{*}$ which corresponds to the same vertex as the vector $x^{*}$ , $x^{*}[j^{*}]=2$ , and all other vectors are zero in the coordinate $j^{*}$ . As $\operatorname{dist}_{\infty}(x^{*},c)=\xi$ , $c[j^{*}]\geq 2-\xi$ . Then, for any other $x\in C$ , $\operatorname{dist}_{\infty}(x,c)\geq 2-\xi>1$ . The total cost of the cluster is at least $\xi+(|C|-1)(2-\xi)=2+(|C|-2)(2-\xi)\geq|C|$ , as $2-\xi>1$ .

Now to the second part of the claim. First assume that there are only two vectors in $C$ and the corresponding vertices are not adjacent. Then there is a coordinate $j^{*}$ where one vector is $2$ and the other is $-2$. No matter what we choose for $c[j^{*}]$, the cost is at least $|2-c[j^{*}]|+|-2-c[j^{*}]|\geq 4$, and the statement follows. So assume that $|C|\geq 3$ and there is a coordinate $j^{*}$ corresponding to a non-edge in $C$. One vector from $C$ has $2$ in the coordinate $j^{*}$, another has $-2$, and all others have $0$. Then there is a vector in $C$ with distance to $c$ of at least $2$, as either $c[j^{*}]\geq 0$ and $|-2-c[j^{*}]|\geq 2$, or $c[j^{*}]<0$ and $|2-c[j^{*}]|>2$. Set this vector aside and consider all other vectors in $C$. There are $|C|-1\geq 2$ of them, and by the reasoning in the proof of the first statement, their cost is at least $|C|-1$. That proof considered only vertex coordinates, so neither the vector we set aside nor the $j^{*}$-th coordinate (which is a non-edge coordinate) affects it. So, the total cost is at least $|C|-1+2=|C|+1$. ∎

Assume that we have $l\geq 1$ nontrivial clusters of sizes $\{t_{i}\}_{i=1}^{l}$, where nontrivial means that the size is at least two, that is, $t_{i}\geq 2$ for $i\in\{1,\dots,l\}$. By Claim 6.4, the total cost is at least

\[\sum_{i=1}^{l}t_{i}=k+l-1\geq k,\]

as there are $|V(G)|-k+1$ clusters in total, $|V(G)|-k+1-l$ trivial clusters, and the total number of vectors is $|V(G)|=\sum_{i=1}^{l}t_{i}+|V(G)|-k+1-l$ , from which it follows that $\sum_{i=1}^{l}t_{i}=k+l-1$ . So no solution has cost less than $k$ .

Also, if there are at least two nontrivial clusters, then $k+l-1\geq k+1$ . So if a solution has cost $k$ , it must have only one nontrivial cluster, and its size must be $k$ .

Finally, assume that the solution indeed has only one nontrivial cluster, but there is a non-edge in it. Then, as the size is $k$ , by Claim 6.4 its cost is at least $k+1$ . So only a $k$ -clique has cost $k$ , which proves the correctness of the reduction.

Now, to Cluster Selection. We consider essentially the same reduction, only we start from Multicolored Clique. We obtain sets of vectors $X_{1}$ , …, $X_{k}$ in the same way as $X$ in the reduction above, only vectors obtained from vertices of color $j$ are put into $X_{j}$ . The total distance parameter is also set to $k$ . So parameters $t$ and $D$ of the obtained instance have the same value as the starting parameter $k$ .

Since vectors are constructed in the same way, Claim 6.4 still works. And now the statement of Cluster Selection enforces that exactly one cluster of $k$ vectors is selected. By Claim 6.4 it could be done with the cost $k$ if and only if there is a colorful $k$ -clique in the original graph.

∎

6.2 $\operatorname{{\sf NP}}$ -hardness when $k=2$

In this subsection we prove $\operatorname{{\sf NP}}$-hardness of $k$-Clustering with the $L_{\infty}$ distance when $k=2$. Intuitively, if we consider the previous reduction, partitioning the vectors optimally into two clusters loosely corresponds to partitioning the vertices into two sets such that as many vertices as possible have no edges inside their set. This, in turn, is Odd Cycle Transversal: the problem of removing the smallest number of vertices so that the remaining graph is bipartite. However, to make the reduction work, we need to consider a modified version of Odd Cycle Transversal which we call Half-Integral Odd Cycle Transversal.

Half-Integral Odd Cycle Transversal

Input: An undirected graph $G$, an integer $t$.

Task: Is there an assignment $\delta:V(G)\to\{0,1,2\}$ such that $\sum_{v\in V(G)}\delta(v)\leq t$ and $G-S$ is bipartite, where $S=\{\{u,v\}\in E(G):\delta(u)+\delta(v)\geq 2\}$?
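As a small illustration of the problem definition, the following Python sketch checks whether a given assignment $\delta$ is a solution: it verifies the budget, discards the edges of $S$, and tests bipartiteness of what remains by BFS 2-coloring. The function name and the input encoding (vertices $0,\dots,n-1$, `delta` as a list) are assumptions of the sketch.

```python
from collections import deque


def is_hioct_solution(n, edges, delta, t):
    """Check a candidate assignment delta (list with values in {0, 1, 2})."""
    if any(value not in (0, 1, 2) for value in delta) or sum(delta) > t:
        return False
    # Keep only the edges outside S, i.e. those with delta(u) + delta(v) <= 1.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        if delta[u] + delta[v] < 2:
            adj[u].append(v)
            adj[v].append(u)
    # BFS 2-coloring of every connected component of G - S.
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False          # odd cycle survives in G - S
    return True
```

On a triangle, both $\delta=(1,1,0)$ and $\delta=(2,0,0)$ are solutions with budget $t=2$: the first removes a single edge, the second behaves like deleting a vertex.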

First we show that Half-Integral Odd Cycle Transversal is also $\operatorname{{\sf NP}}$ -hard by constructing a reduction from 3-SAT.

Lemma 14.

There is a polynomial time reduction from 3-SAT to Half-Integral Odd Cycle Transversal.

Proof.

Given an instance of 3-SAT with $n$ variables and $m$ clauses, make a graph $G$ as follows. An example of the reduction is given in Figure 10. For each variable $x_{i}$, introduce two vertices $x_{i}$ and $x_{i}^{\prime}$, and connect them with an edge. Also introduce $2n+1$ vertices $y_{i,j}$ and connect each of them to both $x_{i}$ and $x_{i}^{\prime}$.

For each clause $C_{j}$ introduce four vertices $C_{j,1}$, …, $C_{j,4}$. Consider the following seven vertices: $C_{j,1}$, …, $C_{j,4}$, and the three variable vertices which are present in $C_{j}$: if $x_{i}\in C_{j}$ then we consider the vertex $x_{i}$, and if $\neg x_{i}\in C_{j}$ then we consider the vertex $x_{i}^{\prime}$. Connect these seven vertices in a cycle such that each variable vertex is adjacent to two clause vertices. Finally, set $t$ to $2n$.

First, assume there is a satisfying assignment. Consider the following $\delta:V(G)\to\{0,1,2\}$ : if $x_{i}$ is true, $\delta(x_{i})=2$ , otherwise $\delta(x_{i}^{\prime})=2$ , on all other vertices $\delta\equiv 0$ . Clearly, $\sum_{v\in V(G)}\delta(v)=2n$ .

Since $\delta$ does not take value $1$ , deleting edges $\{u,v\}$ with $\delta(u)+\delta(v)\geq 2$ is equivalent to deleting vertices on which $\delta$ is 2. From each vertex gadget we deleted either $x_{i}$ or $x_{i}^{\prime}$ , so the remaining part is a star with leaves $y_{i,j}$ and center $x_{i}$ or $x_{i}^{\prime}$ . Since the assignment we started from is satisfying, from each clause cycle we deleted at least one vertex. So each cycle present in $G$ lost at least one vertex, and what remains is bipartite.

Now assume there is a solution $\delta$ to the Half-Integral Odd Cycle Transversal instance. We claim that $\delta(x_{i})+\delta(x_{i}^{\prime})\geq 2$ for each variable $x_{i}$ . Consider a 2-coloring of $G-S$ : either $x_{i}$ and $x_{i}^{\prime}$ have the same color or not. In the former case, $\delta(x_{i})+\delta(x_{i}^{\prime})\geq 2$ since the edge $\{x_{i},x_{i}^{\prime}\}$ must be removed.

If $x_{i}$ and $x_{i}^{\prime}$ have different colors, assume that $\delta(x_{i})\leq 1$ and $\delta(x_{i}^{\prime})\leq 1$ . Then, each of the $2n+1$ vertices $y_{i,j}$ takes one of the two colors, and so has an incident edge to $x_{i}$ or $x_{i}^{\prime}$ which needs to be deleted. But then, $\delta(y_{i,j})\geq 1$ for each $j$ , and the total cost on these vertices is already $2n+1$ . Then either $\delta(x_{i})=2$ or $\delta(x_{i}^{\prime})=2$ .

So we have $n$ variables and $\delta$ is at least $2$ on each pair of variable vertices, and in total $\delta$ is at most $2n$ . Then $\delta$ has to be exactly $2$ on each variable pair, and zero on all other vertices. Now we claim that on each clause cycle there is a variable vertex $v$ with $\delta(v)=2$ . If not, then none of the cycle edges gets deleted, as $\delta$ is equal to zero on clause vertices. But then the remaining graph could not be bipartite, since it contains an odd cycle.

To get a satisfying assignment, set $x_{i}$ to true if $\delta(x_{i})=2$, and to false otherwise. In particular, if $\delta(x_{i}^{\prime})=2$, then $x_{i}$ is set to false, since $\delta(x_{i})+\delta(x_{i}^{\prime})=2$. Each clause is satisfied since each clause cycle contains a variable vertex on which $\delta$ is equal to $2$. ∎

Now we prove $\operatorname{{\sf NP}}$ -hardness of $k$ -Clustering with $p=\infty$ and $k=2$ by constructing a reduction from Half-Integral Odd Cycle Transversal.

Theorem 15.

$k$-Clustering with distance $\operatorname{dist}_{\infty}$ is $\operatorname{{\sf NP}}$-hard when $k=2$.

Proof.

Consider an instance $(G,t)$ of Half-Integral Odd Cycle Transversal. If $t\geq|V(G)|$, we have a yes-instance since $\delta\equiv 1$ deletes all edges from the graph, so we may assume $t<|V(G)|$. Remove all isolated vertices in $G$ and add $t+5$ isolated edges to $G$; this clearly does not change the type of the instance. The number of clusters $k$ is $2$, and the dimension $d$ is set to $|E(G)|$, so that each coordinate corresponds to an edge. For each vertex $v\in V(G)$ add a vector $x_{v}$ to $X$ with all coordinates set to zero. Then, for each edge $\{u,v\}\in E(G)$ set $x_{u}[u,v]$ to $2$ and $x_{v}[u,v]$ to $-2$, where the order on $u,v$ is chosen arbitrarily. Finally, set $D$ to $|V(G)|+t$. An example is given in Figure 11, where the additional isolated edges are omitted for clarity.
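The instance construction of this proof can be sketched in Python as below. The function name and the vertex relabelling are illustrative assumptions, and the sketch does not handle the shortcut case $t\geq|V(G)|$, which the proof treats separately.

```python
def hioct_to_two_clustering(edges, t):
    """Build (vectors, k, D) from the edge list of G and the budget t.

    Isolated vertices never appear in the edge list, so they are dropped
    implicitly; the remaining vertices are relabelled 0..m-1.
    """
    used = sorted({v for e in edges for v in e})
    index = {v: i for i, v in enumerate(used)}
    all_edges = [(index[u], index[v]) for u, v in edges]
    m = len(used)
    # Add t + 5 fresh isolated edges on new vertices.
    for i in range(t + 5):
        all_edges.append((m + 2 * i, m + 2 * i + 1))
    num_vertices = m + 2 * (t + 5)
    # One coordinate per edge: +2 at one endpoint, -2 at the other.
    vectors = [[0] * len(all_edges) for _ in range(num_vertices)]
    for j, (u, v) in enumerate(all_edges):
        vectors[u][j] = 2
        vectors[v][j] = -2
    return vectors, 2, num_vertices + t
```

Here $D$ is computed from the vertex count of the modified graph, which is how the lower-bound argument later in the proof counts the vectors.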

If $(G,t)$ is a yes-instance of Half-Integral Odd Cycle Transversal, consider the solution $\delta$ . Split vectors into clusters according to any proper 2-coloring of $G-S$ . Now we show the way to select cluster centroids so that each vertex $v$ has distance at most $1+\delta(v)$ to the corresponding centroid. We consider separately each of two clusters and each coordinate, indexed by an edge $\{u,v\}\in E(G)$ . For a cluster $C$ , there are three cases on how $u$ and $v$ are present in the cluster, for each of them we assign a particular value to the cluster centroid $c$ in the coordinate $\{u,v\}$ .

•

If $u$ and $v$ are both not in $C$ , for vectors in $C$ all entries in the coordinate $\{u,v\}$ are zero, and we set $c[u,v]$ also to zero. Each vector is at distance zero to the centroid in this coordinate.

•

If only one of $u$ and $v$ is in $C$, for vectors in $C$ all entries in the corresponding coordinate are zero, except one entry corresponding to the edge's endpoint belonging to $C$, which is either $2$ or $-2$. Set $c[u,v]$ to $1$ or $-1$, respectively; then each vector is at distance $1$ in this coordinate.

•

If both $u$ and $v$ are in $C$, w.l.o.g. $x_{u}[u,v]$ is $2$ and $x_{v}[u,v]$ is $-2$, and all other vectors are zero in this coordinate. It must hold that $\delta(u)+\delta(v)\geq 2$, so either $\delta(u)=\delta(v)=1$, or w.l.o.g. $\delta(u)=2$ and $\delta(v)=0$. In the former case, set $c[u,v]$ to zero; then $x_{u}$ and $x_{v}$ have distance $2$ in this coordinate, and all other vectors have distance zero. In the latter case, set $c[u,v]$ to $-1$; then $x_{u}$ is at distance $3$, and all other vectors, including $x_{v}$, are at distance $1$.

For any $v\in V(G)$ , since it holds for all coordinates that distance from $x_{v}$ to the corresponding cluster centroid is at most $1+\delta(v)$ , then the $L_{\infty}$ distance is also at most $1+\delta(v)$ , and the total cost of the clustering defined above is at most

\[\sum_{v\in V(G)}\bigl(1+\delta(v)\bigr)=|V(G)|+\sum_{v\in V(G)}\delta(v)\leq|V(G)|+t=D.\]

In the other direction, assume there is a clustering $C_{1}$, $C_{2}$ with centroids $c_{1}$, $c_{2}$ such that the total cost is at most $D$. All entries of the input vectors are even, so by Claim 6.2 applied to the vectors scaled by $1/2$ we may assume that the centroids are integral, and hence for any vector the distance to the nearest centroid is also an integer. We may also assume that the centroids are between $-2$ and $2$ in each coordinate, since all the input vectors have entries in this range, and so we can move the centroids to the same range without increasing distances.

So, each vector has distance in $\{0,1,2,3,4\}$ to the closest centroid. We claim that it could not be that a vector $x_{v}$ has distance zero: in this case w.l.o.g $x_{v}=c_{1}$ , and so $c_{1}$ is equal to $2$ or $-2$ in some coordinate, since each vertex has at least one incident edge. But then each vector in $C_{1}$ has distance at least $2$ to $c_{1}$ . And since at most two vectors could be equal to the centroids, each of the remaining $|V(G)|-2$ vectors has distance at least 1. Consider $t+5$ isolated edges, at least $t+3$ of them do not have any endpoint equal to one of $c_{1}$ and $c_{2}$ . For these edges, the total distance of their endpoints is at least $3$ : either their endpoints are in different clusters, and so the endpoint in $C_{1}$ costs at least $2$ , or both endpoints are in the same cluster, and in total they cost $4$ since there are simultaneously values $2$ and $-2$ in the coordinate corresponding to this edge. So each of the $t+3$ edges increases the cost by additional $1$ , and the total cost is at least $|V(G)|-2+t+3>|V(G)|+t$ .

Since each vector has distance at least $1$, we may assume that the centroids are in $\{-1,0,1\}^{d}$. If a centroid has $2$ (or $-2$) in some coordinate, we can change it to $1$ (or $-1$): the only vectors that become farther from the centroid in this coordinate are those with $2$ (respectively $-2$) there, and their distance in this coordinate is still at most $1$. We may also assume that distances are in $\{1,2,3\}$, since distance $4$ can only occur between $2$ and $-2$.

We claim that if we set $\delta(v):=\min_{i=1}^{2}\operatorname{dist}_{\infty}(x_{v},c_{i})$ , $\delta$ is a solution to Half-Integral Odd Cycle Transversal. Remove all edges $\{u,v\}$ with $\delta(u)+\delta(v)\geq 2$ , and consider 2-coloring of $G$ induced by the partition $\{C_{1},C_{2}\}$ . Assume that we have an edge $\{u,v\}$ such that $\delta(u)+\delta(v)\leq 1$ and $u$ and $v$ are in the same cluster (w.l.o.g $C_{1}$ ). Then we have a coordinate $\{u,v\}$ such that w.l.o.g $x_{u}[u,v]=2$ and $x_{v}[u,v]=-2$ , but $\operatorname{dist}_{\infty}(x_{u},c_{1})+\operatorname{dist}_{\infty}(x_{v},c_{1})\leq 3$ due to $\delta(u)+\delta(v)\leq 1$ and so $|x_{u}[u,v]-c_{1}[u,v]|+|x_{v}[u,v]-c_{1}[u,v]|\leq 3$ , which is a contradiction. So $(G,t)$ is also a yes-instance. ∎

Note that the reduction from Theorem 15 also works as a reduction from $k$-Coloring if we set $k$ to the number of colors and $D$ to $|V(G)|$, since with such a small budget we cannot allow any same-colored neighbors in the optimal clustering.

7 The case $p\in(1,\infty)$

In this section we consider the case $p\in(1,\infty)$ , with the particular emphasis on the most commonly used case $p=2$ . With the $L_{2}$ distance, the $k$ -Clustering problem is widely studied under the name $k$ -Means.

7.1 $\operatorname{{\sf FPT}}$ when parameterized by $d+D$ for $p=2$

When we consider both $d$ and $D$ as the parameters, Cluster Selection in the $L_{2}$ distance becomes $\operatorname{{\sf FPT}}$ , and so $k$ -Clustering is also $\operatorname{{\sf FPT}}$ by Theorem 9.

Note that in any composite cluster, each vector except at most one is at distance at least $1/4$ from the centroid, so the $\alpha$ -property holds with $\alpha=1/4$ . Consider two different vectors, they have different values in some coordinate, and in this coordinate at least one of them is at distance at least $(1/2)^{2}=1/4$ from the centroid.

Now we prove Theorem 4, which we restate here.

See Theorem 4.

Proof.

We start with the proof that Cluster Selection is $\operatorname{{\sf FPT}}$ . Distance $\operatorname{dist}_{2}$ enjoys the $\alpha$ -property. Hence if $t>4D+1$ then any composite cluster costs more than $D$ and the instance is clearly a no-instance. So we may assume that $t\leq 4D+1$ .

We claim that there are at most $4mtD$ possible total weights of the resulting composite cluster. First, in the resulting cluster there could be at most one vector with weight strictly larger than $4D$ . Otherwise, let us consider two such vectors and the coordinate in which they differ. No matter which value the centroid has there, it is at distance of at least $1/2$ from at least one of the vectors, so the total cost is larger than $4D(1/2)^{2}\geq D$ . So there are at most $m$ possibilities for the largest weight, and all of the other $(t-1)$ weights are at most $4D$ .

We fix the total resulting cluster weight $W$ , the vector in the resulting cluster with the largest weight $x_{j^{*}}\in X_{j^{*}}$ , and the coordinate $i$ . Since the centroid $c$ is the mean of the vectors in the resulting cluster, $c[i]$ is of form $\frac{y}{W}$ , where $y\in\mathbb{Z}$ . We claim that the distance from $y$ to $W\cdot x_{j^{*}}[i]$ is bounded by a function of $D$ , and so each possible $y$ could be enumerated in $\operatorname{{\sf FPT}}$ time. Moreover, all possible centroids could also be enumerated in $\operatorname{{\sf FPT}}$ time since $d$ is a parameter.

Let $\{x_{1},\dots,x_{t}\}$ be the resulting cluster, $x_{j}\in X_{j}$ for all $j\in\{1,\dots,t\}$ . The difference between $c[i]$ and $x_{j^{*}}[i]$ could be written as

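\[c[i]-x_{j^{*}}[i]=\frac{\sum_{j=1}^{t}w(x_{j})\,x_{j}[i]}{W}-x_{j^{*}}[i]=\frac{\sum_{j=1}^{t}w(x_{j})\,\bigl(x_{j}[i]-x_{j^{*}}[i]\bigr)}{W}.\]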

The absolute value of the numerator is $\mathcal{O}(D^{3})$ since $t=\mathcal{O}(D)$ , $w(x_{j^{*}})$ gets multiplied by zero, and all other weights are at most $4D$ . Also, for any $j\in\{1,\dots,t\}$ , $|x_{j^{*}}[i]-x_{j}[i]|\leq 4D$ , since

[TABLE]

The total running time is at most

[TABLE]

since we try all possible cluster weights, all possible $x_{j^{*}}$ out of the input vectors, then all possible centroids which differ from $x_{j^{*}}$ by $\mathcal{O}(D^{3})$ in each coordinate. And then for each centroid we check whether the optimal cluster for it has cost at most $D$ by selecting the best $x_{j}\in X_{j}$ for each $j\in\{1,\dots,t\}$ . This concludes the proof that Cluster Selection is $\operatorname{{\sf FPT}}$ when parameterized by $d+D$ .

Now we proceed with the proof that $k$ -Clustering is $\operatorname{{\sf FPT}}$ parameterized by $d+D$ . For that we employ Theorem 9. We already have the $\alpha$ -property and $\operatorname{{\sf FPT}}$ algorithm for Cluster Selection. Hence the only thing left is to enumerate the set $\mathcal{D}$ of all possible optimal cluster costs not exceeding $D$ .

Since there are $n$ vectors in total, each cluster contains from $1$ to $n$ vectors. For each possible cluster size $s$ the centroid is of the form $\frac{y}{s}$ , where $y\in\mathbb{Z}$ . Since input vectors have integer coordinates, the cost of any cluster of size $s$ is of form $\frac{z}{s^{2}}$ , where $z\in\mathbb{Z}$ . And since the cost is at most $D$ , $z\in\{0,\dots,Ds^{2}\}$ . We enumerate all possible cluster sizes in $\{1,\dots,n\}$ , and for each cluster size $s$ all possible cluster costs in $\{0/s^{2},\dots,Ds^{2}/s^{2}\}$ . In this way we obtain $\mathcal{D}$ , and $|\mathcal{D}|=\mathcal{O}(Dn^{3})$ . ∎
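The enumeration of $\mathcal{D}$ described in the last paragraph of the proof can be sketched with exact rational arithmetic. The function name is illustrative, and the sketch assumes that $D$ is a nonnegative integer.

```python
from fractions import Fraction


def candidate_costs(n, D):
    """All values z / s^2 with 1 <= s <= n and 0 <= z <= D * s^2, sorted."""
    costs = set()
    for s in range(1, n + 1):
        for z in range(D * s * s + 1):
            costs.add(Fraction(z, s * s))
    return sorted(costs)
```

For instance, the cluster $\{0,0,1\}$ in dimension one has mean $1/3$ and cost $2\cdot\frac{1}{9}+\frac{4}{9}=\frac{6}{9}$, which is of the form $z/s^{2}$ with $s=3$ and therefore appears in `candidate_costs(3, 1)`.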

7.2 $\operatorname{{\sf W}}[1]$ -hardness when parameterized by $t+D$

In our setting, $k$ -Clustering for $p=2$ seems to be harder than for $p=1$ , since we do not have the nice property that if many vectors have the same value in some coordinate then the centroid must also have this value. On the contrary, even if only one vector diverges from the rest, the optimal centroid also diverges. So the approach with enumerating nontrivial coordinate sets, which we successfully used in the $p\in(0,1]$ case, is not likely to work.

We are able to prove that Cluster Selection for $p\in(1,\infty)$ is W[1]-hard parameterized by $t+D$ . It remains open whether $k$ -Clustering for $p\in(1,\infty)$ or specifically for $p=2$ is W[1]-hard or not, but our result shows that at least the approach we used to obtain an $\operatorname{{\sf FPT}}$ algorithm in the $p\in(0,1]$ case would not yield an $\operatorname{{\sf FPT}}$ algorithm for $p\in(1,\infty)$ .

First we state and prove two technical claims about the geometrical properties of clustering zero-one valued vectors in the $p\in(1,\infty)$ case.

*Claim 7.1**.*

If we have a cluster of size $a+b$ where $a$ vectors have zero and $b$ vectors have one in the coordinate $i$ , then the optimal centroid value in this coordinate is equal to

\[\frac{b^{1/(p-1)}}{a^{1/(p-1)}+b^{1/(p-1)}},\]

and the coordinate $i$ contributes

\[\frac{ab}{\left(a^{1/(p-1)}+b^{1/(p-1)}\right)^{p-1}}\]

to the total cost.

Proof.

Assume that the centroid value in the coordinate $i$ is equal to $c$ , then the cost is

\[a\,|c|^{p}+b\,|1-c|^{p}.\]

It is easy to see that $c<0$ is worse than $c=0$ , and similarly $c>1$ is worse than $c=1$ , so we could restrict $c$ to $[0,1]$ . The derivative with respect to $c$ is

\[p\,a\,c^{p-1}-p\,b\,(1-c)^{p-1},\]

as $p>1$ , the derivative is zero if and only if

\[a\,c^{p-1}=b\,(1-c)^{p-1},\quad\text{that is,}\quad c=\frac{b^{1/(p-1)}}{a^{1/(p-1)}+b^{1/(p-1)}}.\]

The derivative increases monotonically: when we increase $c$ , $c^{p-1}$ increases and $(1-c)^{p-1}$ decreases as $p-1>0$ . So the optimal value must be at its unique root defined by the expression above. Thus, the optimal cost is equal to

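\[a\left(\frac{b^{1/(p-1)}}{a^{1/(p-1)}+b^{1/(p-1)}}\right)^{p}+b\left(\frac{a^{1/(p-1)}}{a^{1/(p-1)}+b^{1/(p-1)}}\right)^{p}=\frac{ab}{\left(a^{1/(p-1)}+b^{1/(p-1)}\right)^{p-1}}.\]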

∎
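As a numerical sanity check of Claim 7.1, the sketch below evaluates the root of the first-order condition $a\,c^{p-1}=b\,(1-c)^{p-1}$ in closed form and the resulting cost $a\,c^{p}+b\,(1-c)^{p}$. The function names are illustrative, and the closed form is the one obtained by solving that condition for $c$.

```python
def optimal_center(a, b, p):
    """Minimizer of a * c**p + b * (1 - c)**p over c in [0, 1], for p > 1."""
    e = 1.0 / (p - 1.0)
    return b ** e / (a ** e + b ** e)


def coordinate_cost(a, b, p):
    """Cost contributed by a coordinate with a zeroes and b ones."""
    c = optimal_center(a, b, p)
    return a * c ** p + b * (1.0 - c) ** p
```

For $p=2$ this reduces to the familiar mean $c=b/(a+b)$ with cost $ab/(a+b)$; for example, $a=3$ and $b=1$ give $c=1/4$ and cost $3/4$.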

Now we prove that it is optimal to have as many ones in the same coordinate as possible. For that, we calculate how much each one adds to the total cost depending on how many ones are there in a coordinate.

*Claim 7.2**.*

Consider a cluster of $s$ zero-one valued vectors, denote as $f(b)$ the contribution of a coordinate in which there are $b$ ones and $s-b$ zeroes. The function $f(b)/b$ is strictly decreasing for $0<b<s$ .

Proof.

Denote the number of zeroes in the coordinate as $a:=s-b$ . By Claim 7.1, the contribution of the coordinate per each one is

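\[\frac{f(b)}{b}=\frac{a}{\left(a^{1/(p-1)}+b^{1/(p-1)}\right)^{p-1}}=\frac{a/s}{\left((a/s)^{1/(p-1)}+(b/s)^{1/(p-1)}\right)^{p-1}}.\]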

Let us denote $x=a/s$ , $0<x<1$ , the derivative of the above with respect to $x$ is equal to

\[\frac{(1-x)^{\frac{1}{p-1}-1}}{\left(x^{1/(p-1)}+(1-x)^{1/(p-1)}\right)^{p}},\]

which is strictly positive for $0<x<1$ , hence proving the claim. ∎
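Claim 7.2 can also be checked numerically for small cluster sizes. The sketch below computes the per-one contribution $f(b)/b$ directly from the cost $a\,c^{p}+b\,(1-c)^{p}$ at the root of the first-order condition from Claim 7.1; the function names are illustrative.

```python
def per_one_cost(s, b, p):
    """f(b) / b for a coordinate with b ones and s - b zeroes, p > 1."""
    a = s - b
    e = 1.0 / (p - 1.0)
    c = b ** e / (a ** e + b ** e)        # optimal centroid value
    return (a * c ** p + b * (1.0 - c) ** p) / b


def is_strictly_decreasing(s, p):
    """Check that f(b) / b strictly decreases over b = 1, ..., s - 1."""
    values = [per_one_cost(s, b, p) for b in range(1, s)]
    return all(x > y for x, y in zip(values, values[1:]))
```

For $p=2$ the per-one contribution is simply $(s-b)/s$, which makes the monotonicity evident.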

Now we are ready to prove the hardness result, which was stated in the introduction as Theorem 5. We recall the statement here.

See Theorem 5.

Proof.

We construct a reduction from Multicolored Clique. Given a graph $G$ and a clique size $k$ , we construct the following instance of Cluster Selection.

We set $t$ to $\binom{k}{2}$ , each input set of vectors represents a choice of an edge of the clique between two particular colors, so we number them by unordered pairs of indices from 1 to $k$ . We set the dimension $d$ to $|V(G)|$ , coordinates are numbered by vertices.

The set $X_{i,j}$ consists of the following vectors: for each edge $\{u,v\}\in E(G)$ between a vertex $u$ of color $i$ and vertex $v$ of color $j$ , we add a vector with $1$ in the coordinate $u$ and $1$ in the coordinate $v$ , all other coordinates are set to zero. All vectors have weight one. Finally, we set

\[D=k\cdot\frac{(k-1)\binom{k-1}{2}}{\left((k-1)^{1/(p-1)}+\binom{k-1}{2}^{1/(p-1)}\right)^{p-1}}.\]

In Figure 12, we show the intuition behind the reduction by considering a simple example.
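A minimal Python sketch of this construction follows. The function name and the input encoding (vertices $0,\dots,n-1$, colors $0,\dots,k-1$ given as a list) are assumptions, and the budget $D$ is computed as the cost of $k$ coordinates with $k-1$ ones and $\binom{k-1}{2}$ zeroes each, evaluated via the closed form that follows from the first-order condition in Claim 7.1.

```python
from math import comb


def multicolored_clique_to_cluster_selection(n, edges, color, k, p):
    """Return (sets, t, D); sets maps a color pair (i, j), i < j, to vectors."""
    sets = {(i, j): [] for i in range(k) for j in range(i + 1, k)}
    for u, v in edges:
        i, j = sorted((color[u], color[v]))
        if i == j:
            continue                      # monochromatic edges are irrelevant
        vec = [0] * n                     # one coordinate per vertex
        vec[u] = vec[v] = 1
        sets[(i, j)].append(vec)
    a, b = comb(k - 1, 2), k - 1          # zeroes and ones in a clique coordinate
    e = 1.0 / (p - 1.0)
    D = k * a * b / (a ** e + b ** e) ** (p - 1.0)
    return sets, comb(k, 2), D
```

For a properly colored triangle with $k=3$ and $p=2$, the three selected vectors have mean $(2/3,2/3,2/3)$ and total cost $2$, which matches the computed $D$.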

If there is a colorful $k$-clique in $G$ then we construct a solution to our instance of Cluster Selection. Assume the clique is formed by vertices $v_{1}$, $v_{2}$, …, $v_{k}$, where for each $i\in\{1,\dots,k\}$ the vertex $v_{i}$ is of color $i$. From each $X_{i,j}$ choose the vector corresponding to the edge $\{v_{i},v_{j}\}\in E(G)$. Among the chosen vectors, in every coordinate of the form $v_{i}$ there are $(k-1)$ ones, coming from the clique edges incident to $v_{i}$, and $\binom{k}{2}-(k-1)=\binom{k-1}{2}$ zeroes. All other coordinates are zero in the chosen vectors, so they do not contribute anything to the total distance. By Claim 7.1, the total distance is

\[k\cdot\frac{(k-1)\binom{k-1}{2}}{\left((k-1)^{1/(p-1)}+\binom{k-1}{2}^{1/(p-1)}\right)^{p-1}}=D.\]

In the other direction, we prove that only the solution described above could have the cost $D$ , all others have strictly larger cost. First notice that in any resulting cluster there are at most $(k-1)$ ones in each coordinate, since for any vertex $v\in V(G)$ , if we denote its color by $i$ , only vectors from $(k-1)$ sets of the form $X_{i,j}$ ( $j\in\{1,\dots,k\}\setminus\{i\}$ ) have ones in the coordinate $v$ , and we take one vector from each set by the definition of Cluster Selection.

Each vector has exactly two ones, so in any resulting cluster there are $2\cdot\binom{k}{2}$ ones in total. By Claim 7.2, any resulting cluster which does not have $(k-1)$ ones in $k$ coordinates has strictly larger cost, since only coordinates with exactly $(k-1)$ ones have the optimal cost per each one.

So, if the resulting cluster has the cost $D$ , then there are $k$ coordinates such that in each of them exactly $(k-1)$ of the chosen vectors have one. We show that in this case the original instance of Clique has a $k$ -clique. For any color $i\in\{1,\dots,k\}$ there are at most $(k-1)$ ones in all coordinates indexed by vertices of color $i$ in the resulting cluster. So all of these ones are in the same coordinate $v_{i}$ for some $v_{i}$ . We claim that the vertices $v_{1}$ , …, $v_{k}$ form a clique. Consider vertices $v_{i}$ and $v_{j}$ , we have taken some vector from $X_{i,j}$ , and this vector must have added a one to the coordinates $v_{i}$ and $v_{j}$ , then by construction the edge $\{v_{i},v_{j}\}$ is in $E(G)$ .

∎

8 Conclusion and open problems

In this paper, we presented an $\operatorname{{\sf FPT}}$ algorithm for $k$ -Clustering with $p\in(0,1]$ parameterized by $D$ . However, for the case $p\in(1,\infty)$ we were able only to show the $\operatorname{{\sf W}}[1]$ -hardness of Cluster Selection. While intractability of Cluster Selection does not exclude that $k$ -Clustering could be $\operatorname{{\sf FPT}}$ with $p\in(1,\infty)$ , it indicates that the proof of this (if it is true at all) would require an approach completely different from ours. Thus an interesting and very concrete open question concerns the parameterized complexity of $k$ -Clustering with $p\in(1,\infty)$ and parameter $D$ .

Another open question is about the fine-grained complexity of $k$-Clustering when parameterized by $k+d$. For several distances, we know XP-algorithms: an $\mathcal{O}(n^{dk+1})$ algorithm by Inaba et al. [21] for $p=2$, as well as trivial algorithms for $p\in[0,1]$. For the case when the possible cluster centroids are given in the input, the matching lower bound is shown in [11]. However, we are not aware of a lower bound complementing the algorithmic results in the case when any point in Euclidean space can serve as a centroid.

Finally, let us note that our $\operatorname{{\sf W}}[1]$-hardness reductions can be easily adapted to obtain ETH-hardness results. Our reductions are from Clique and, assuming ETH, there is no $n^{o(k)}$ algorithm for Clique. In most of our results, the ETH lower bounds derived from our reductions can be complemented by matching upper bounds through a trivial algorithm for Cluster Selection in time $n^{\mathcal{O}(d)}$ or $n^{\mathcal{O}(t)}$ and, consequently, an algorithm for $k$-Clustering obtained by Theorem 9. However, the reduction in Theorem 5 excludes only an $(nd)^{o(t^{1/2}+D^{1/2})}$ algorithm for Cluster Selection with $p\in(1,\infty)$ under ETH. Both the trivial algorithm in time $n^{\mathcal{O}(t)}$ and the algorithm from Theorem 4 in time $D^{\mathcal{O}(d)}$ (which could also be turned into a $d^{\mathcal{O}(D)}$-time algorithm) fail to match this lower bound. So another open question is whether there exists a better reduction, or whether a subexponential algorithm can be obtained in this case.

Bibliography

[1] M. R. Ackermann, J. Blömer, and C. Sohler, Clustering for metric and nonmetric distance measures, ACM Trans. Algorithms, 6 (2010), pp. 59:1–59:26.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan, Approximating extent measures of points, J. ACM, 51 (2004), pp. 606–635.
[3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Machine Learning, 75 (2009), pp. 245–248.
[4] N. Alon, R. Yuster, and U. Zwick, Color-coding, J. ACM, 42 (1995), pp. 844–856.
[5] D. Angluin and L. Valiant, Fast probabilistic algorithms for hamiltonian circuits and matchings, J. Computer and System Sciences, 18 (1979), pp. 155–193.
[6] M. Badoiu, S. Har-Peled, and P. Indyk, Approximate clustering via core-sets, in Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), ACM, 2002, pp. 250–257.
[7] C. Boucher, C. Lo, and D. Lokshtanov, Outlier detection for DNA fragment assembly, CoRR, abs/1111.0376 (2011).
[8] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, Randomized dimensionality reduction for k-means clustering, IEEE Trans. Information Theory, 61 (2015), pp. 1045–1062.
