Tight FPT Approximations for $k$-Median and $k$-Means
Vincent Cohen-Addad, Anupam Gupta, Amit Kumar, Euiwoong Lee, Jason Li

TL;DR
This paper presents fixed-parameter tractable algorithms that achieve near-optimal approximation ratios for $k$-median and $k$-means clustering in metric spaces, and establishes hardness results indicating these ratios are essentially best possible under certain complexity assumptions.
Contribution
The authors develop FPT algorithms with improved approximation factors for $k$-median and $k$-means, and prove matching hardness bounds under complexity conjectures.
Findings
Achieved approximation ratios of (1+2/e+ε) for $k$-median and (1+8/e+ε) for $k$-means.
Established FPT hardness results showing no better ratios are possible under certain conjectures.
Provided insights into the complexity landscape of clustering problems in metric spaces.
Abstract
We investigate the fine-grained complexity of approximating the classical $k$-median / $k$-means clustering problems in general metric spaces. We show how to improve the approximation factors to $(1+2/e+\varepsilon)$ and $(1+8/e+\varepsilon)$ respectively, using algorithms that run in fixed-parameter time. Moreover, we show that we cannot do better in FPT time, modulo recent complexity-theoretic conjectures.
Affiliations (in author order): Université Pierre et Marie Curie, Paris; Carnegie Mellon University (supported in part by NSF awards CCF-1536002, CCF-1540541, and CCF-1617790); IIT Delhi; New York University (supported in part by the Simons Collaboration on Algorithms and Geometry); Carnegie Mellon University (supported in part by NSF awards CCF-1536002, CCF-1540541, and CCF-1617790).
Copyright: Vincent Cohen-Addad, Anupam Gupta, Amit Kumar, Euiwoong Lee, and Jason Li. Subject classification: Theory of computation, Facility location and clustering; Theory of computation, Fixed parameter tractability; Theory of computation, Submodular optimization and polymatroids.
Acknowledgements.
We thank Deeparnab Chakrabarty, Ola Svensson, and Pasin Manurangsi for useful discussions. This research was partially conducted when A. Kumar was visiting A. Gupta and Carnegie Mellon University as part of the Joint Indo-US Virtual Center for Algorithms under Uncertainty.
keywords:
approximation algorithms, fixed-parameter tractability, k-median, k-means, clustering, core-sets
1 Introduction
How well can we approximate the $k$-Median and $k$-Means clustering problems? This question has been intensively studied over the past two decades, and many interesting algorithmic techniques have been developed and refined in an attempt to understand these problems. Let us elaborate for the $k$-Median problem; the story for $k$-Means is much the same. Recall that in the $k$-Median problem, given a metric space with $n$ points and clients at some of the points, the goal is to open $k$ facilities such that the sum of distances from the clients to their closest facilities is minimized.
The first constant-factor approximation algorithm for $k$-Median was given by Charikar et al. [6]. After many interesting developments (e.g., primal-dual schemes, sophisticated LP rounding schemes, and pseudo-approximations), today the best approximation guarantee is 2.611 [3]. The best lower bound, however, is still the $(1+2/e)$-hardness from 1998, due to Guha and Khuller [16]. In this paper, we ask: can we do better if we give ourselves more resources? The problem can be solved exactly by brute-force enumeration in time $n^{O(k)}$, but what can we do, say, in FPT time $f(k)\cdot n^{O(1)}$?
We cannot hope to solve the problem exactly in FPT time: the reduction of Guha and Khuller also shows a $W[2]$-hardness for finding the optimal solution for $k$-Median/$k$-Means exactly. Naturally, we then ask what we can achieve by combining the two approaches, and whether good approximation algorithms can be given in FPT time.
Our Results. Our main algorithmic result is a positive result in this direction:
Theorem 1.1 (Algorithm for $k$-Median/$k$-Means).
For every $\varepsilon > 0$, there is a $(1+2/e+\varepsilon)$-approximation algorithm for the $k$-Median problem that runs in FPT time, i.e., in $f(k,\varepsilon)\cdot n^{O(1)}$ time. For the $k$-Means problem, we can achieve a $(1+8/e+\varepsilon)$-approximation in the same runtime.
The approximation guarantees in Theorem 1.1 match the NP-hardness results for the two problems implied by [16]. However, since we are allowing ourselves FPT time and not just $\mathrm{poly}(n)$ time, can we do even better and go past this NP-hardness barrier? Our second main result shows that this is not possible, at least under recent complexity-theoretic conjectures. We prove that the results in Theorem 1.1 are essentially tight, assuming the Gap-Exponential Time Hypothesis [12, 25, 5]:
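Theorem 1.2 (Hardness).
There exists a function $g:\mathbb{R}^+\to\mathbb{R}^+$ such that assuming the Gap-ETH, for any $\varepsilon>0$, any $(1+2/e-\varepsilon)$-approximation algorithm for $k$-Median, and any $(1+8/e-\varepsilon)$-approximation for $k$-Means, must run in time at least $n^{k^{g(\varepsilon)}}$.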
The basic component of the above hardness result is an FPT-hardness of a factor of $(1-1/e+\varepsilon)$ for the Max $k$-Coverage problem, again using the Gap-ETH (Theorem 3.1). Composing that hardness result with the reduction of Guha and Khuller [16] gives us Theorem 1.2 above.
Matroid Median. Finally, using our algorithmic techniques, we are also able to give an improved approximation for the Matroid Median problem, which is a generalization of the $k$-Median problem.
Theorem 1.3 (Algorithm for Matroid Median).
There is a $(2+\varepsilon)$-approximation algorithm for the Matroid Median problem that runs in FPT time, i.e., in $f(k,\varepsilon)\cdot n^{O(1)}$ time, where $k$ is the rank of the matroid.
Since the Matroid Median problem is a generalization of the $k$-Median problem, the $(1+2/e)$-hardness from Theorem 1.2 translates immediately to Matroid Median. It remains an open problem to close the gap between this lower bound and the $(2+\varepsilon)$-approximation in Theorem 1.3. We can also use our ideas to get an for the -Matroid Median problem.
Facility Location. Facility Location is a problem closely related to $k$-Median, where each facility has an opening cost and the goal is to open facilities to minimize the sum of distances from clients to their closest facilities plus the total opening cost. For this problem, the best known hardness ratio is $\gamma \approx 1.463$ [16], where $\gamma$ is defined to be $\max_{x\geq 0}\big(1+\frac{x}{1+x}\ln\frac{2}{x}\big)$. On the other hand, the best algorithm achieves a $1.488$-approximation [22]. When the parameter $k$ denotes the number of facilities open in the optimal solution, we prove that our techniques also give an FPT algorithm for Facility Location whose approximation ratio matches the hardness ratio of [16].
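Theorem 1.4 (Algorithm for Facility Location).
There is a $(\gamma+\varepsilon)$-approximation algorithm for the Facility Location problem, where $\gamma\approx 1.463$ is the hardness ratio of [16], that runs in FPT time, i.e., in $f(k,\varepsilon)\cdot n^{O(1)}$ time.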
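As a quick numerical sanity check of the constant above, the following sketch evaluates the maximization $\max_{x\ge 0}\big(1+\frac{x}{1+x}\ln\frac{2}{x}\big)$ on a grid. The function name, the grid resolution and the search interval are illustrative choices, not part of the paper.

```python
import math

def guha_khuller_objective(x):
    """The expression 1 + x/(1+x) * ln(2/x) whose maximum defines the ratio."""
    return 1 + x / (1 + x) * math.log(2 / x)

# Coarse grid search over x in (0, 2]. For x > 2 the logarithm is negative,
# so the expression drops below 1 and cannot be the maximum.
grid = [i / 100000 for i in range(1, 200001)]
x_star = max(grid, key=guha_khuller_objective)
gamma = guha_khuller_objective(x_star)

print(round(x_star, 3), round(gamma, 3))
```

Setting the derivative to zero gives $\ln(2/x) = 1+x$ at the maximizer, so the maximum value equals $1+x$ and satisfies the fixed-point equation $\gamma = 1 + 2e^{-\gamma}$, which the grid value can be checked against.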
Roadmap: In Section 2, we describe the approximation algorithms for these problems. We assume throughout that the aspect ratio is polynomially bounded. (We show in Section B.1 that this assumption is without loss of generality, in the case we consider where the clients have unit weights.) In Section 3, we then give the hardness results for FPT Max $k$-Coverage, $k$-Median, and $k$-Means.
1.1 Our Techniques
The algorithm is inspired by the hardness result from [16]: it relies on the result of Feige [13] that Max $k$-Coverage is hard to approximate better than $(1-1/e)$. Hence, if we build a “factor graph” with sets on one side and elements on another, with edges indicating inclusion, picking $k$ sets covers a $(1-1/e)$ fraction of the elements at distance $1$, and the remaining ones at distance at least $3$; hence the ratio $(1-1/e)\cdot 1 + (1/e)\cdot 3 = 1+2/e$. Now what if we have a general instance, with different distances? We show how to do limited enumeration (in FPT time) to restrict our choices to picking one facility each from $k$ disjoint sets. Moreover, via a surprisingly clean idea we can model the objective as submodular maximization (subject to a partition matroid constraint). And this problem can be approximated well: the factor again is $(1-1/e)$, hence giving the same factor up to additive $\varepsilon$ terms!
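The arithmetic behind the two target ratios can be checked directly. This is only a sketch of the calculation: it assumes a $(1-1/e)$ fraction of clients is served at distance 1 and the rest at distance exactly 3, and it uses squared distances for $k$-Means.

```python
import math

covered = 1 - 1 / math.e       # fraction of clients at distance 1
uncovered = 1 - covered        # remaining fraction, at distance 3

k_median_ratio = covered * 1 + uncovered * 3           # distances
k_means_ratio = covered * 1 ** 2 + uncovered * 3 ** 2  # squared distances

print(k_median_ratio, 1 + 2 / math.e)
print(k_means_ratio, 1 + 8 / math.e)
```

The two printed pairs agree, which matches the factors $1+2/e$ and $1+8/e$ stated for $k$-Median and $k$-Means.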
The matching hardness result is via showing an FPT hardness for Max $k$-Coverage assuming the Gap-ETH. Firstly, we show that assuming the Gap-ETH, there is no FPT approximation algorithm for the Label Cover problem parameterized by the number of vertices on one side of the bipartition. (Trying all labelings on one side takes time , and doing much better is hard.) To do this, we construct a variable-clause game from a 3-SAT instance, merge clause vertices into super-vertices, and then use rounds of parallel repetition. (The number of clause vertices becomes .) Then we compose this with the classical reduction from Label Cover to Max $k$-Coverage [13]. Due to some technical details (e.g., our Label Cover instance is not guaranteed to be regular) and for the sake of completeness, we provide a formal proof in Lemma 3.5. While our techniques are similar to recent FPT hardnesses for the related $k$-Dominating Set problem [5, 10], some technical details (e.g., the projection property of Label Cover instances) prevent us from directly using prior results to get $(1-1/e+\varepsilon)$-hardness for Max $k$-Coverage.
1.2 Related Work
We briefly survey the state-of-the-art for $k$-Median and $k$-Means; please see references below for more historical context. For general metric spaces, the best approximation ratio for $k$-Median is 2.611 [3] by Byrka et al., building on work of Li and Svensson [23]. Kanungo et al. [17] gave a -approximation algorithm for $k$-Means in general metric spaces, which was later improved to 6.357 by Ahmadian et al. [1]. The first constant factor approximation algorithm for Matroid Median was given by Krishnaswamy et al. [18], which was improved by Swamy [27] to 8.
For Euclidean spaces, the problems are better approximable, at least when either $k$ or the dimension $d$ is fixed; we restrict this discussion to parameterizing by $k$. Specifically, PTASs for both $k$-Median and $k$-Means with running time were given by Kumar et al. [19]. The running times were improved by Chen [7] to for any for $k$-Median, and by Feldman et al. [15] to for $k$-Means. Both these latter results were based on the notion of coresets. The $k$-Means problem is APX-hard even in Euclidean space, if both $k$ and $d$ are allowed to be arbitrary [2, 20].
A result of direct interest to this work is that of Czumaj and Sohler [11], for the min-sum clustering problem. They give a -approximation on general metrics in FPT time. They construct a small (strong) core-set for the related Balanced $k$-Median problem, and enumerate over all choices of centers inside this core-set. We show in §B.2 that their approach extends to give a -approximation for the non-bipartite case of $k$-Median: in this special case of $k$-Median a facility may be opened at any client location, and hence $C \subseteq F$. Theorem 1.1 above shows how to get a better guarantee for a more general case. (As an aside, the hardness for this special non-bipartite case is only ; closing this gap is another interesting open question.)
Hardness-of-approximation results for parameterized problems have been actively studied recently. Lin [24] proved $W[1]$-hardness of approximation for $k$-Biclique. Chen and Lin [9] proved $W[1]$-hardness of approximation for $k$-Dominating Set within any constant factor, which was later improved to any function $f(k)$ in [5, 10]. Chalermsook et al. [5] also proved that there is no FPT $o(k)$-approximation algorithm for $k$-Clique assuming the Gap-ETH.
1.3 Preliminaries
An instance of the $k$-Median problem is defined by a tuple $((V,d), C, F, k)$, where $(V,d)$ is a metric space over a set of points $V$ with $d(i,j)$ denoting the distance between two points $i,j$ in $V$. Further, $C$ and $F$ are subsets of $V$ and are referred to as “clients” and “facility locations”, and $k$ is a positive parameter. The goal is to find a subset $S$ of $k$ facilities in $F$ to minimize
$$\mathrm{cost}(C,S) := \sum_{c\in C} d(c,S), \qquad \text{where } d(c,S) := \min_{f\in S} d(c,f).$$
In the weighted version of $k$-Median, every client $c \in C$ has an associated weight $w_c$, and the goal is to find a subset $S$ of $F$ of size $k$ such that $\sum_{c\in C} w_c\, d(c,S)$ is minimized.
The $k$-Means problem is defined similarly except that the objective function gets modified to $\sum_{c\in C} d(c,S)^2$ (and analogously for the weighted version). The names of the two problems come from the fact that if the metric space is the real line and $k=1$, the optimal solution is the median and the mean respectively. In the Matroid Median problem, we are given a matroid on the set $F$, and the set of open facilities must be an independent set in the matroid. Again, the goal is to minimize the assignment cost of clients to the nearest open facility.
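In the Facility Location problem, an instance is not given $k$, but additionally has a function $\mathrm{open}: F \to \mathbb{R}^+$ that indicates the opening cost of each facility. The goal is to find a subset $S \subseteq F$ (without any restriction on $|S|$) that minimizes $\mathrm{cost}(C,S) + \mathrm{open}(S)$, where $\mathrm{open}(S) := \sum_{f\in S}\mathrm{open}(f)$.
Finally, the aspect ratio of a metric space $(V,d)$ is $\Delta := \max_{u,v\in V} d(u,v) \,/\, \min_{u\neq v\in V} d(u,v)$.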
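To make the objectives concrete, here is a minimal sketch of the $k$-Median/$k$-Means cost and of the brute-force enumeration over all $k$-subsets of facilities mentioned in the introduction. The function names and the toy instance on the real line are hypothetical and only serve as illustration.

```python
from itertools import combinations

def clustering_cost(clients, facilities, dist, power=1):
    """Sum over clients of (distance to nearest open facility) ** power.

    power=1 gives the k-Median objective, power=2 the k-Means objective.
    """
    return sum(min(dist(c, f) for f in facilities) ** power for c in clients)

def brute_force(clients, candidates, k, dist, power=1):
    """Try every k-subset of candidate facilities (exponential in k)."""
    return min(combinations(candidates, k),
               key=lambda S: clustering_cost(clients, S, dist, power))

# Toy instance on the real line (illustrative data only); facilities may be
# opened at client locations here.
line = lambda a, b: abs(a - b)
clients = [0, 1, 2, 10, 11]
best = brute_force(clients, clients, 2, line)
print(best, clustering_cost(clients, best, line))
```

The enumeration visits one subset per $k$-combination of candidates, which is why it is only practical for very small $k$ and motivates asking what FPT time can buy.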
2 The Approximation Algorithm
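We now give the $(1+2/e+\varepsilon)$-approximation algorithm for $k$-Median, where $\varepsilon>0$ is a fixed parameter throughout this section. The running time of the algorithm is $f(k,\varepsilon)\cdot n^{O(1)}$ for a function $f$ that depends only on $k$ and $\varepsilon$. We then indicate the alterations to get algorithms for $k$-Means and Matroid Median.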
2.1 The Intuition
We focus on $k$-Median for now; the ideas for the other problems are analogous. The first idea is to reduce the size of the client set to $O(k\log n/\varepsilon^2)$. This can be done by results on core-sets for $k$-Median, which consolidate the clients into a small number of distinct locations [8, 14]. The consolidated clients now have weights, but this extension to weighted $k$-Median does not pose a problem.
The next idea is to carefully enumerate over the structure of an optimal solution. Consider an optimal solution $F^* = \{f_1^*, \ldots, f_k^*\}$. For a facility $f_i^*$, let “cluster” $C_i$ be the clients assigned to $f_i^*$, i.e., the subset of clients for which $f_i^*$ is the closest open facility. Let $\ell_i$ be the client in $C_i$ closest to $f_i^*$; we call it the leader of cluster $C_i$. Let $R_i$ be the distance $d(\ell_i, f_i^*)$, suitably discretized. Our algorithm guesses the leaders $\ell_i$ and the distances $R_i$ for each $i$. Since the size of $C$ is $O(k\log n/\varepsilon^2)$, there are at most $|C|^k$ choices for leaders (our analysis will tighten this bound, but this improvement can be ignored for this intuition section) and a similar number of choices for the distances; moreover, this quantity can be shown to be $f(k,\varepsilon)\cdot n^{O(1)}$.
Assume now that we have correctly guessed the leaders $\ell_i$ and distances $R_i$. For each leader $\ell_i$, let $F_i$ be the facilities at distance about $R_i$ from $\ell_i$; this set contains $f_i^*$. By making copies, assume the sets $F_i$ are disjoint. Now our task is to select one facility from each set $F_i$ such that the total (weighted) assignment cost of the clients in $C$ is minimized. As such, this seems like a decreasing supermodular minimization problem with a (partition) matroid constraint. (Observe that choosing an arbitrary center in each $F_i$ gives us a $(3+\varepsilon)$-approximation in FPT time, but we want to do much better.)
The last idea is to convert this into a monotone submodular maximization problem, again with a partition matroid constraint. For each set $F_i$, we add a fictitious facility $f'_i$ such that (i) the assignment cost of clients to the fictitious facilities is at most roughly $3\cdot\mathrm{OPT}$, and (ii) for a subset $S$ of facilities, the “improvement” $\mathrm{cost}(C,F') - \mathrm{cost}(C,F'\cup S)$, where $F'$ is the set of fictitious facilities, is a monotone submodular function. We finally show that a $(1-1/e)$-approximation for this submodular maximization problem gives the desired approximation guarantee. The next two sections describe the algorithm for $k$-Median in detail. The extension to $k$-Means, Matroid Median and Facility Location then appears in §C.
2.2 Client Reduction via Coresets
Consider an instance $I = ((V,d),C,F,k)$ of the $k$-Median problem. Let $\varepsilon>0$ be a fixed constant. We now define the notion of core-sets and use known results to reduce $C$ to a (weighted) set of size $O(k\log n/\varepsilon^2)$.
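Definition 2.1 (Core-set).
A (strong) core-set for $I$ is a set of clients $C'$ along with weights $w_c$ for all $c\in C'$, such that
$$(1-\varepsilon)\sum_{c\in C} d(c,S) \;\le\; \sum_{c\in C'} w_c\, d(c,S) \;\le\; (1+\varepsilon)\sum_{c\in C} d(c,S)$$
for every $S\subseteq F$ with $|S| = k$.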
A similar definition holds for a strong core-set for the $k$-Means problem. Since we deal only with strong core-sets in this paper, we drop the modifier and refer to them only as core-sets. The first core-sets for metric $k$-Median were given by Chen [8]; the following result is the best current construction:
Theorem 2.2 ([14], Theorem 15.4).
For $\varepsilon,\delta>0$, there exists a Monte Carlo algorithm that, for each instance of $k$-Median on a general metric, outputs a core-set $C'$ with size
$$|C'| = O\Big(\frac{k\log n+\log(1/\delta)}{\varepsilon^{2}}\Big)$$
with probability at least $1-\delta$, where $n = |V|$. Moreover, the algorithm runs in polynomial time. For $k$-Means, the core-set is of size $|C'| = O\big(\frac{k\log n+\log(1/\delta)}{\varepsilon^{4}}\big)$, and the runtime remains the same.
The power of core-sets lies in the following fact.
Fact 1.
Consider a $k$-Median/$k$-Means instance $I = ((V,d),C,F,k)$, and let $C'$ be a (strong) core-set with weights $w$. Consider the weighted instance $I' = ((V,d),C',F,k,w)$, which is the instance $I$ with its clients replaced by the weighted clients in the core-set. Then, for any $\alpha\ge 1$, an $\alpha$-approximate solution to $I'$ is a $(1+O(\varepsilon))\alpha$-approximate solution to $I$.
Therefore, in order to find a $(1+2/e+O(\varepsilon))$-approximation to a $k$-Median instance $I$, it suffices to find a $(1+2/e+\varepsilon)$-approximation to $I'$, and analogously for $k$-Means. Henceforth, we restrict our attention to the core-set instance $I'$. In other words, we assume that our instances have only a small number of clients, but now the clients have associated weights. In the following sections, we show how to approximate such weighted $k$-Median/$k$-Means instances in FPT time.
2.3 Reduction to Submodular Maximization
Given Fact 1, we only consider instances of weighted -Median, where clients in have weights in the range and is bounded by . In this section we prove the following approximation guarantee for -Median; this, combined with Fact 1, proves the -Median statement in Theorem 1.1.
Theorem 2.3**.**
Let be a fixed parameter. Given a -Median instance with , there is a -approximation algorithm that runs in time.
By scaling, assume the minimum distance between points in $V$ is 1, so the aspect ratio $\Delta$ is the maximum distance between two points in $V$. For a positive number $x$, define $\lceil x\rceil_{1+\varepsilon}$ as the smallest power of $(1+\varepsilon)$ larger than or equal to $x$. Here, $\varepsilon$ is the same fixed parameter as the one used in the core-set.
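The discretization step can be sketched as follows. The helper names are hypothetical, and the code assumes distances have already been scaled so that every value is at least 1.

```python
import math

def round_up_to_power(x, eps):
    """Smallest power of (1 + eps) that is >= x, assuming x >= 1 and eps > 0."""
    if x < 1 or eps <= 0:
        raise ValueError("expects x >= 1 and eps > 0")
    j = max(0, math.ceil(math.log(x) / math.log1p(eps)))
    # Correct for floating-point error in the logarithm, in both directions.
    while j > 0 and (1 + eps) ** (j - 1) >= x:
        j -= 1
    while (1 + eps) ** j < x:
        j += 1
    return (1 + eps) ** j

def num_rounded_values(aspect_ratio, eps):
    """How many distinct rounded values a distance in [1, aspect_ratio] can take."""
    return math.ceil(math.log(aspect_ratio) / math.log1p(eps)) + 1

print(round_up_to_power(5, 0.5), num_rounded_values(1000, 0.5))
```

Rounding this way overestimates a distance by a factor of less than $(1+\varepsilon)$, and the number of candidate values per guessed distance is only logarithmic in the aspect ratio, which is what keeps the enumeration small.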
The formal algorithm follows the intuition in §2.1 and is described in Algorithm 2.1; let us step through it now. We iterate over all possible values for the $k$ leaders $\ell_1,\ldots,\ell_k$, and for the corresponding distances $R_1,\ldots,R_k$. The same vertex could appear several times in the subset $\{\ell_1,\ldots,\ell_k\}$, and so the latter should be thought of as a multi-set. In Step 7, we add new fictitious facilities: for each $i$, the new facility $f'_i$ is at distance $2R_i$ from all the facilities in $F_i$. The distance to all other points is determined by the triangle inequality in Step 8. Claim 2 shows that this forms a valid metric. In Step 9, we define the “improvement” function $\mathrm{improv}(S)$ as the reduction in cost due to adding in the facilities in $S$. Claim 3 shows this function is monotone submodular. This means we can use the $(1-1/e)$-approximation algorithm [4] for monotone submodular function maximization subject to a matroid constraint to find a set $S$ which contains exactly one facility from each of the sets $F_i$, since this is a partition matroid constraint. Observe that the function $\mathrm{improv}$ can be computed efficiently. This completes the description of the algorithm.
To prove correctness of the algorithm, we need to show two things: the distance function defined on $V\cup F'$ in Step 8 is a metric, and the function $\mathrm{improv}$ defined in Step 9 is monotone and submodular. We defer the simple proofs to §A.
Claim 2 (Metricity).
Consider the set $V\cup F'$ defined during an iteration of the algorithm. The distance function defined on $V\cup F'$ is a metric.
Claim 3 (Submodularity).
The function $\mathrm{improv}$ defined in Step 9 is monotone and submodular with $\mathrm{improv}(\emptyset)=0$.
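One iteration of this scheme, for a fixed guess of the candidate sets and radii, can be sketched as below. Two assumptions are worth flagging. First, each fictitious facility is taken to sit at distance twice the guessed radius from every facility in its set, with distances to clients obtained by routing through that set. Second, the maximization uses plain greedy, a simpler stand-in for the algorithm of [4]; greedy is only known to guarantee a factor $1/2$ under a matroid constraint, not $1-1/e$. All names and the toy data are hypothetical.

```python
def dist_to_fictitious(c, part, radius, dist):
    """Distance from client c to the fictitious facility attached to `part`."""
    return 2 * radius + min(dist(c, f) for f in part)

def make_improvement(clients, weights, parts, radii, dist):
    """Return (improvement function, cost of using only fictitious facilities)."""
    base = [min(dist_to_fictitious(c, P, R, dist) for P, R in zip(parts, radii))
            for c in clients]
    base_cost = sum(w * b for w, b in zip(weights, base))

    def improvement(S):
        # Cost when the real facilities in S are opened next to all fictitious ones.
        cost = sum(w * min([b] + [dist(c, f) for f in S])
                   for c, w, b in zip(clients, weights, base))
        return base_cost - cost

    return improvement, base_cost

def greedy_partition(parts, improvement):
    """Pick one facility per (non-empty) part, greedily by total improvement."""
    chosen = [None] * len(parts)
    for _ in parts:
        current = [f for f in chosen if f is not None]
        best = None
        for i, part in enumerate(parts):
            if chosen[i] is not None:
                continue
            for f in part:
                value = improvement(current + [f])
                if best is None or value > best[0]:
                    best = (value, i, f)
        chosen[best[1]] = best[2]
    return chosen

# Toy instance on the real line with unit weights (illustrative data only).
line = lambda a, b: abs(a - b)
clients, weights = [0, 2, 10, 12], [1, 1, 1, 1]
parts, radii = [[-1, 1], [9, 11]], [1, 1]   # guessed candidate sets and radii
improvement, base_cost = make_improvement(clients, weights, parts, radii, line)
solution = greedy_partition(parts, improvement)
print(solution, base_cost, base_cost - improvement(solution))
```

On this toy instance the fictitious facilities alone cost 12, three times the optimum of 4, and the selected pair recovers the optimum. The value oracle is just a cost evaluation, so each call takes time linear in the number of clients times the size of the queried set.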
Now to bound the runtime. Since , there are at most different multi-sets of size with elements in . In addition, there are many choices for for each . Therefore, the number of iterations in Step 1 of the algorithm can be bounded by
[TABLE]
As argued in §B.1, since we started with the unweighted $k$-Median problem, the aspect ratio $\Delta$ can be assumed to be polynomially bounded in $n$, and so the number of iterations can be bounded by , which is at most . Indeed, in case , . Else , and hence .
The algorithm for submodular maximization subject to a matroid constraint takes polynomial time, given a value oracle for the function $\mathrm{improv}$ [4, Theorem 1.1]; in fact, it can be sped up for the case of partition matroid constraints [4, §3.3]. The value oracle for $\mathrm{improv}$ can itself be implemented in polynomial time. Hence each iteration of the algorithm can be run in time polynomial in $n$.
The submodular maximization algorithm is a randomized Monte-Carlo algorithm that succeeds with only probability , but we can easily boost the success probability by repetition: by running it times for each input and returning the maximum value obtained, we can ensure that with high probability it succeeds in all the calls we make.
2.3.1 Approximation Ratio
We now argue about the approximation ratio of the algorithm. We fix an optimal solution to the instance $I$. Let $F^* = \{f_1^*,\ldots,f_k^*\}$ be the centers opened by this solution. Define $C_i$ as the clients for which the closest open center is $f_i^*$, i.e., $C_i := \{c\in C : d(c,f_i^*) = d(c,F^*)\}$. We define the notion of leaders with respect to this solution.
Definition 2.4 (Leader).
For each $i$, call a client $\ell_i\in C_i$ that minimizes $d(c,f_i^*)$ over all $c\in C_i$ the leader of center $f_i^*$. If there are multiple clients achieving the minimum, declare an arbitrary one to be the leader. Note that a client can be the leader of multiple centers $f_i^*$. The leaders w.r.t. the solution $F^*$ is the multi-set $\{\ell_1,\ldots,\ell_k\}$. For each leader $\ell_i$, the radius $R_i$ is defined as $d(\ell_i,f_i^*)$ rounded up to the nearest power of $(1+\varepsilon)$.
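Consider the iteration of Algorithm 2.1 where the guessed leaders $\ell_1,\ldots,\ell_k$ are equal to the leaders of $f_1^*,\ldots,f_k^*$ respectively, and the guessed distances $R_1,\ldots,R_k$ are equal to the corresponding radii respectively. Let $S$ be the set output in Step 10 of the algorithm. It suffices to show that $\mathrm{cost}(C,S)\le(1+2/e+\varepsilon)\cdot\mathrm{OPT}$. We proceed to show this in the rest of the section.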
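As in the algorithm, define
$$F_i := \{f\in F : d(f,\ell_i)\le R_i < (1+\varepsilon)\,d(f,\ell_i)\},$$
i.e., the facilities whose distance to $\ell_i$ rounds up to $R_i$, so that $f_i^*\in F_i$ for each $i$. (Recall that the sets $F_i$ are disjoint by duplicating facilities.) Let $F' = \{f'_1,\ldots,f'_k\}$ be the set of fictitious facilities defined in the algorithm.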
We are interested in the solutions that consist of one center from each $F_i$, since one such solution is the desired $F^*$. More formally, define a solution $S\subseteq F$ to be valid if the set $S$ can be listed as $\{f_1,\ldots,f_k\}$ so that $f_i\in F_i$ for each $i$.
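Claim 4.
For every valid $S$, $\mathrm{cost}(C,S\cup F') = \mathrm{cost}(C,S)$.
Proof 2.5.
List the set $S$ as $\{f_1,\ldots,f_k\}$, where $f_i\in F_i$ for each $i$. Informally, this claim amounts to showing that the fictitious facilities do not improve the solution $S$. To formalize this idea, fix a client $c$ and a fictitious facility $f'_i$, and let $f$ be a closest center to $c$ in $F_i$. Below, we show that in fact, client $c$ is closer to $f_i$ than to $f'_i$:
$$d(c,f_i)\;\le\; d(c,f)+d(f,f_i)\;\le\; d(c,f)+2R_i \;=\; d(c,f'_i),$$
where the second inequality holds because $f$ and $f_i$ are both within distance $R_i$ of $\ell_i$. Therefore, we have $d(c,S\cup F') = d(c,S)$ for all clients $c$, so
$$\mathrm{cost}(C,S\cup F') = \sum_{c\in C} w_c\, d(c,S\cup F') = \sum_{c\in C} w_c\, d(c,S) = \mathrm{cost}(C,S),$$
as desired.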
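We now bound the cost of the solution which opens facilities at $F'$.
Claim 5.
$\mathrm{cost}(C,F')\le(3+2\varepsilon)\cdot\mathrm{OPT}$.
Proof 2.6.
It suffices to show that $d(c,F')\le(3+2\varepsilon)\,d(c,F^*)$ for each client $c$. Fix a client $c$, and let $f_i^*$ be a center achieving $d(c,F^*)$. Since $\ell_i$ is the leader of center $f_i^*$, we have
$$d(\ell_i,f_i^*)\le d(c,f_i^*).$$
Recall that $R_i\le(1+\varepsilon)\,d(\ell_i,f_i^*)$. Therefore,
$$d(c,F')\le d(c,f'_i)\le d(c,f_i^*)+d(f_i^*,f'_i) = d(c,f_i^*)+2R_i\le(3+2\varepsilon)\,d(c,f_i^*),$$
as desired.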
Let $S$ be the set output in Step 11. Since the algorithm of [4] is a $(1-1/e)$-approximation,
$$\mathrm{improv}(S)\;\ge\;(1-1/e)\cdot\mathrm{improv}(F^*). \qquad (2.3)$$
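Lemma 2.7.
The solution $S$ in (2.3) satisfies $\mathrm{cost}(C,S)\le(1+2/e+\varepsilon)\cdot\mathrm{OPT}$.
Proof 2.8.
We bound the cost associated with this solution as follows, using Claim 4 for the valid solutions $S$ and $F^*$, then (2.3), and finally Claim 5:
$$\begin{aligned}\mathrm{cost}(C,S) &= \mathrm{cost}(C,S\cup F') = \mathrm{cost}(C,F')-\mathrm{improv}(S)\\ &\le \mathrm{cost}(C,F')-(1-1/e)\big(\mathrm{cost}(C,F')-\mathrm{cost}(C,F^*\cup F')\big)\\ &= (1/e)\cdot\mathrm{cost}(C,F')+(1-1/e)\cdot\mathrm{OPT}\\ &\le \big((3+2\varepsilon)/e+1-1/e\big)\cdot\mathrm{OPT}\;\le\;(1+2/e+\varepsilon)\cdot\mathrm{OPT}.\end{aligned}$$
Hence the proof.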
2.3.2 Putting it all together
Our algorithm is a Monte Carlo randomized algorithm: both our subroutines use randomness. The first is the core-set construction in §2.2, and the second is the submodular maximization procedure in Step 10 of the algorithm. For each, we can make the error probability small. Since each iteration of the algorithm can be implemented in $n^{O(1)}$ time, the runtime is dominated by the number of iterations, which is $f(k,\varepsilon)\cdot n^{O(1)}$. Moreover, combining the two steps of finding the core-set and the submodular maximization, the approximation ratio is $(1+O(\varepsilon))\cdot(1+2/e+\varepsilon) = 1+2/e+O(\varepsilon)$. This proves Theorem 1.1 for the $k$-Median problem.
3 Gap-ETH Hardness of Max $k$-Coverage
In this section, we show that assuming the Gap Exponential Time Hypothesis (Gap-ETH) [12, 25], for any $\varepsilon>0$, there is no FPT-approximation algorithm that approximates Max $k$-Coverage better than a factor $(1-1/e+\varepsilon)$.
Theorem 3.1** (Hardness for Max-Coverage).**
There exists a function such that assuming the Gap-ETH, for any , any -approximation algorithm for Max -Coverage with elements and sets must run in time at least .
Using the reduction of Guha and Khuller [16], this immediately implies Theorem 1.2. The rest of the section is devoted to the proof of Theorem 3.1. The proof has two main components: the first part shows under the Gap-ETH, it takes at least time to approximate the Label Cover problem even when one side of the bipartition has only vertices; here is some increasing function depending on the quality of approximation. This reduction is inspired by the recent progress on the hardness of parameterized problems [5, 10] and was communicated to us by Pasin Manurangsi. The second part is the classical reduction from Label Cover to Max -Coverage given by Feige [13].
3.1 Hardness of Label Cover from Gap-ETH
We begin with the standard definition of Label Cover.
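Definition 3.2 (Label Cover).
An instance $\mathcal{L}$ of Label Cover consists of a bipartite graph $G=(U\cup V,E)$ with possibly parallel edges, two label sets $\Sigma_U,\Sigma_V$, and a projection $\pi_e:\Sigma_U\to\Sigma_V$ for each $e\in E$. Given a labeling $\sigma$ that assigns a label in $\Sigma_U$ to each vertex of $U$ and a label in $\Sigma_V$ to each vertex of $V$, an edge $e=(u,v)$ is satisfied when $\pi_e(\sigma(u))=\sigma(v)$. The goal of Label Cover is to find a labeling that maximizes the number of satisfied edges. Let $\mathrm{val}(\mathcal{L})$ be the maximum fraction of edges simultaneously satisfied by any labeling.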
Note that we include the projection property in the definition; all Label Cover instances in the paper will have this property. For a vertex $v$, let $\deg(v)$ be the degree of $v$, and let $\deg_U$ (resp. $\deg_V$) be the maximum degree of $U$ (resp. $V$). We also call an instance $U$-regular (resp. $V$-regular) if all vertices in $U$ (resp. $V$) have the same degree. All subsequent Label Cover instances will be $U$-regular, though the lack of $V$-regularity will require us to do a little more work in §3.2.
Given a 3-SAT formula $\varphi$, let $\mathrm{val}(\varphi)$ be the maximum fraction of clauses that can be satisfied by any assignment. The Gap-ETH [12, 25] states that there exist some constants $\varepsilon,\delta>0$ for which no algorithm, given a 3-SAT formula $\varphi$ on $n$ variables and $m$ clauses, can distinguish whether $\mathrm{val}(\varphi)=1$ or $\mathrm{val}(\varphi)<1-\varepsilon$ in time $O(2^{\delta n})$. The main result of this subsection is the following lemma.
Lemma 3.3**.**
For every , there is a reduction that, given 3-SAT formula with variables and clauses, outputs a -regular Label Cover instance such that
- •
(Completeness) , and
- •
(Soundness) ,
where . The running time of this reduction is .
In particular, assuming Gap-ETH, for any , if we let so that
[TABLE]
no algorithm can take a Label Cover instance and can decide whether or in time .
Note that a brute-force algorithm that tries every assignment to $U$ and, for each, chooses the best assignment for $V$ runs in time $|\Sigma_U|^{|U|}$ times a polynomial. Lemma 3.3 shows that assuming the Gap-ETH, even approximately solving Label Cover requires significant time.
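The brute-force procedure just described can be sketched as follows. The instance encoding is a hypothetical choice: each edge is a triple `(u, v, proj)` where `proj` is a dictionary mapping a label of `u` to the label it forces on `v`. Parallel edges are simply repeated triples.

```python
from itertools import product

def best_value(U, V, edges, sigma_U):
    """Maximum fraction of satisfiable edges, by enumerating labelings of U."""
    best = 0.0
    for labels in product(sigma_U, repeat=len(U)):
        assign = dict(zip(U, labels))
        satisfied = 0
        for v in V:
            # With U fixed, each v independently takes the label that the
            # largest number of its incident edges project to.
            votes = {}
            for (u, w, proj) in edges:
                if w == v:
                    b = proj[assign[u]]
                    votes[b] = votes.get(b, 0) + 1
            satisfied += max(votes.values(), default=0)
        best = max(best, satisfied / len(edges))
    return best

# Two tiny instances (illustrative data only).
U, V, sigma_U = ["u1", "u2"], ["v"], [0, 1]
satisfiable = [("u1", "v", {0: "a", 1: "b"}), ("u2", "v", {0: "b", 1: "b"})]
conflicting = [("u1", "v", {0: "a", 1: "a"}), ("u2", "v", {0: "b", 1: "b"})]
print(best_value(U, V, satisfiable, sigma_U), best_value(U, V, conflicting, sigma_U))
```

The outer loop has $|\Sigma_U|^{|U|}$ iterations and everything inside it is polynomial, which is the running time the hardness result is measured against.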
Lemma 3.3 is proved by a series of well-known transformations between Label Cover instances. We start with the following basic hardness result for Label Cover assuming the Gap-ETH, which follows from essentially restating Gap-ETH as a clause-variable game:
Theorem 3.4** (Theorem 4.1 of [5]).**
There is a reduction that, given 3-SAT formula with variables and clauses, outputs a -regular Label Cover instance such that
- •
(Completeness) , and
- •
(Soundness) ,
where , and is -regular with . In particular, assuming the Gap-ETH, there exist constants , such that no algorithm can take a Label Cover instance and can decide whether or in time.
Let be a parameter that will be related to in Max -Coverage later. We can ensure divides by taking an arbitrary vertex in and making copies of it. This does not change any of the properties in Theorem 3.4 except to increase the soundness by ; however, the soundness still remains bounded away from .
Since we want few vertices on the left, we construct a new Label Cover instance by partitioning into groups and creating super-vertices for each one. Formally, index the vertices of as , and let the part be . The new instance is constructed as follows.
- •
and (the RHS remains unchanged),
- •
. (the LHS has one super-vertex for each group), and
- •
for each such that , add an edge to with the projection where the latter denotes the projection in . (Recall we allow parallel edges with different projections.)
Since the set of possible labelings and the set of edges remain the same except for syntactic changes, the completeness and the soundness do not change. The parameters become . It still maintains -regularity and .
The final transformation is the powerful parallel repetition step, which shows that the soundness decreases exponentially as we take the natural graph power. Fix . The instance is constructed as follows.
- •
and .
- •
and .
- •
. For each with and , .
The parameters become , and maintains -regularity. The completeness still remains , and by the parallel repetition theorem [26], the soundness drops , where the constant hiding in the depends on the original soundness. This proves Lemma 3.3.
3.2 Hardness of Max $k$-Coverage from Label Cover
Given the “nice” Label Cover instance from Lemma 3.3 we now show how to reduce this to Max $k$-Coverage. This reduction is standard and closely follows the classical one given by Feige [13], modulo some minor issues arising from it not being $V$-regular.
Recall that an instance of Max $k$-Coverage consists of an underlying universe $X$, a family $\mathcal{S}$ of subsets of $X$, and an integer $k$. The goal is to find a subfamily $\mathcal{S}'\subseteq\mathcal{S}$ with $|\mathcal{S}'|=k$ that covers the largest number of elements. For notational simplicity, we prove the hardness of the weighted version of Max $k$-Coverage where each element $x$ has weight $w(x)$ and we want to maximize the total weight of the covered elements. Note that weighted instances can be easily converted to unweighted instances by duplicating elements according to their weights. In our reduction, the ratio between the maximum and the minimum weight will be bounded by the number of elements. The proof appears in Section D.
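The conversion from weighted to unweighted instances by duplication can be sketched as follows, assuming positive integer weights. The function names and the toy instance are hypothetical; the exhaustive search is only there to compare the two objectives on a small example.

```python
from itertools import combinations

def covered_weight(chosen_sets, weight):
    """Total weight of the elements covered by the chosen sets."""
    covered = set().union(*chosen_sets) if chosen_sets else set()
    return sum(weight[x] for x in covered)

def to_unweighted(sets, weight):
    """Replace each element x by weight[x] copies (x, 0), ..., (x, weight[x]-1)."""
    return [{(x, j) for x in s for j in range(weight[x])} for s in sets]

def best_coverage(sets, k, value):
    """Exhaustively pick the k-subfamily maximizing `value` (small inputs only)."""
    return max(combinations(range(len(sets)), k),
               key=lambda idx: value([sets[i] for i in idx]))

sets = [{"a", "b"}, {"b", "c"}, {"c"}]
weight = {"a": 3, "b": 1, "c": 2}
copies = to_unweighted(sets, weight)

weighted_choice = best_coverage(sets, 2, lambda fam: covered_weight(fam, weight))
unweighted_choice = best_coverage(copies, 2, lambda fam: len(set().union(*fam)))
print(weighted_choice, unweighted_choice)
```

Every subfamily covers the same total in both instances, so optimal and approximate solutions carry over unchanged; the blow-up in the number of elements is the sum of the weights.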
Lemma 3.5** (Reduction #2).**
There exist functions and such that for any , there exists a polynomial-time reduction that takes a Label Cover instance that is -regular and has the maximum -degree , and produces a Max -Coverage instance such that
- •
(Completeness) .
- •
(Soundness) .
The reduction satisfies , , and .
We can now finish the proof of Theorem 3.1 based on Lemma 3.3 and Lemma 3.5.
Proof 3.6** (Proof of Theorem 3.1).**
Fix that determines and in Lemma 3.5. Let in Lemma 3.3 so that the soundness .
With still being a free parameter, Lemma 3.3 shows a reduction from an initial 3-SAT instance with variables and clauses to a Label Cover instance with , and . Lemma 3.5 with this Label Cover instance produces a Max -Coverage instance with
[TABLE]
An -approximation algorithm for Max -Coverage that runs in time will distinguish whether or for some in time
[TABLE]
which will contradict the Gap-ETH for large enough . Observe that ; if we set , we get the same implication from an algorithm that runs in time , which proves the theorem.
Appendix A Omitted Proofs
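Claim 2 (Metricity), restated.
Proof A.1.
Since the distances between points in the original metric space do not change, we only need to check triangle inequalities involving fictitious centers. We prove by induction on $i$ that the distances on $V\cup\{f'_1,\ldots,f'_i\}$ form a metric. The base case $i=0$ holds since $(V,d)$ is a metric.
For general $i$, let $f'_i$ be a fictitious center and $x,y\in V\cup\{f'_1,\ldots,f'_{i-1}\}$ be arbitrary points. Recall that $d(x,f'_i)=\min_{f\in F_i}\big(d(x,f)+2R_i\big)$. We consider the following two cases.
- For $d(x,f'_i)\le d(x,y)+d(y,f'_i)$,
$$d(x,y)+d(y,f'_i)=\min_{f\in F_i}\big(d(x,y)+d(y,f)+2R_i\big)\ge\min_{f\in F_i}\big(d(x,f)+2R_i\big)=d(x,f'_i),$$
where the inequality follows from the triangle inequality between $x$, $y$ and $f$.
- For $d(x,y)\le d(x,f'_i)+d(f'_i,y)$, first note that
$$d(x,f'_i)+d(f'_i,y)=\min_{f\in F_i}\big(d(x,f)+2R_i\big)+\min_{f\in F_i}\big(d(f,y)+2R_i\big).$$
Let $f_1$ and $f_2$ be the facilities achieving the minimum in the first and the second minimization respectively. Since $f_1$ and $f_2$ are both in $F_i$, $d(f_1,f_2)\le 2R_i$. Therefore,
$$d(x,f'_i)+d(f'_i,y)=d(x,f_1)+4R_i+d(f_2,y)\ge d(x,f_1)+d(f_1,f_2)+d(f_2,y)\ge d(x,y).$$
Therefore, all triangle inequalities are satisfied and the new distance on $V\cup\{f'_1,\ldots,f'_i\}$ is a metric.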
See 3
**Proof A.2.**
We have by definition. To show that is monotone, consider two subsets :
[TABLE]
as desired. Finally, to prove that is submodular, consider subsets and center . For each client , using the identity for all real numbers and , we get
[TABLE]
Therefore,
[TABLE]
proving the desired submodularity.
Appendix B Miscellaneous Proofs
B.1 Polynomial Aspect Ratio
Recall that the aspect ratio of a metric space is the ratio between its largest and its smallest (nonzero) pairwise distance. For the unweighted version of the problems we consider, we can assume that the aspect ratio is polynomially bounded, due to the following standard result.
**Proposition B.1 (folklore).**
Given an -approximation algorithm for (unweighted) $k$-Median on instances with polynomially bounded aspect ratio that runs in time , we can obtain an -approximation algorithm for (unweighted) $k$-Median on all instances running in time .
**Proof B.2.**
Given an instance with large aspect ratio, we first compute an estimate for the optimal $k$-Median cost on , say , by using an approximation algorithm for the $k$-Center problem that runs in time for general instances. View the metric space as a complete edge-weighted graph. For long edges of length more than , reduce their length to , and for short edges of length less than , increase their lengths to . Computing all-pairs shortest paths gives a new metric space , and let be the corresponding $k$-Median instance. Use algorithm on this instance to get an -approximate solution .
We claim is also an -approximate solution for the original instance . Firstly, if is an optimal solution to , then its cost in is greater by at most . Indeed, since , no client would use the long distances which we shortened; the increase in the short distances gives the term. Hence . Again since is an -approximation for , and the long edges had reduced length , none of the clients in will connect to it using the shortened long distances. Hence
[TABLE]
This completes the proof.
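The edge-clamping step of this proof can be sketched as follows. This is only an illustration: the exact thresholds used in the proof are not reproduced here, so the parameters `lo` and `hi` are hypothetical placeholders, and the function name `clamp_and_close` is an assumption of this sketch. The shortest-path closure uses the standard Floyd-Warshall recurrence.

```python
def clamp_and_close(dist, lo, hi):
    """Clamp every off-diagonal distance into [lo, hi], then take the
    all-pairs shortest-path closure so the result is again a metric."""
    n = len(dist)
    d = [[0.0 if i == j else min(max(dist[i][j], lo), hi) for j in range(n)]
         for i in range(n)]
    for m in range(n):              # Floyd-Warshall over intermediate points
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

# Three points on a line at 0, 0.001 and 1000 (aspect ratio 10^6).
dist = [[0.0, 0.001, 1000.0],
        [0.001, 0.0, 999.999],
        [1000.0, 999.999, 0.0]]
lo, hi = 1.0, 100.0                 # hypothetical thresholds
new = clamp_and_close(dist, lo, hi)

off_diag = [new[i][j] for i in range(3) for j in range(3) if i != j]
aspect_ratio = max(off_diag) / min(off_diag)
```

Every path consists of edges of length at least `lo`, and every direct edge has length at most `hi`, so the aspect ratio of the new metric is at most `hi / lo`.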
B.2 Bipartite vs. Non-Bipartite Instances
The $k$-Median/$k$-Means problems we defined have two different sets: clients and potential facilities . If and are allowed to be different subsets of , we call it the bipartite version of the problem. If , i.e., we can open facilities at any of the client locations (and potentially at other locations too), it is the non-bipartite case. We observe that only a -factor hardness is known for the non-bipartite case, whereas our algorithm still gives a factor- approximation for this case.
In fact, for the non-bipartite case, a simple -approximation can be obtained directly using core-sets, using a variant of the arguments of Czumaj and Sohler [11] as follows. Given a non-bipartite instance , the algorithm does the following.
1. Find a core-set with and .
2. Enumerate over all $k$-subsets of , and output the set with smallest cost .
The runtime of this algorithm is easily seen to be in FPT, so we now show the approximation guarantee. Let be the optimal solution for instance with cost . By the strong core-set property, . Now, for each facility let be the closest client among those served by . Observe that satisfies , has size , and ensures that
[TABLE]
The factor of comes from the fact that . Now, since we enumerate over all subsets of , the cost of the set is no greater than the LHS of (B.5). Again using the core-set property,
[TABLE]
This completes the proof of the approximation.
Observe that this algorithm crucially uses that , so we can open a facility at the closest client location . Hence this idea does not extend to the bipartite case where may not belong to .
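The enumeration step of the non-bipartite algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: the core-set (points with integer weights) is taken as given rather than constructed, candidate solutions are evaluated by their weighted cost on the core-set, and the names `weighted_cost` and `best_k_subset` are hypothetical.

```python
from itertools import combinations

def weighted_cost(centers, points, weight, dist):
    """Weighted k-Median cost: each point pays weight * distance to its nearest center."""
    return sum(weight[p] * min(dist(p, c) for c in centers) for p in points)

def best_k_subset(coreset, weight, k, dist):
    """Try every k-subset of the core-set as the set of open facilities."""
    return min(combinations(coreset, k),
               key=lambda F: weighted_cost(F, coreset, weight, dist))

# Toy core-set on the real line (assumption: weights are given).
coreset = [0.0, 1.0, 10.0, 11.0]
weight = {0.0: 5, 1.0: 1, 10.0: 1, 11.0: 5}
dist = lambda p, q: abs(p - q)

centers = best_k_subset(coreset, weight, 2, dist)
cost = weighted_cost(centers, coreset, weight, dist)
```

The number of candidate subsets is the number of $k$-subsets of the core-set, which is where the FPT running time comes from once the core-set size depends only on $k$, the accuracy parameter, and a logarithmic factor in the input size.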
Appendix C Extensions to Related Problems
C.1 The Algorithm for $k$-Means
The extension to $k$-Means is immediate. The first change is in the definition of cost: the $k$-Means cost is . However, the induced function is still monotone submodular. Now, by calculations identical to those in Claim 5, for client ; hence
[TABLE]
Plugging this into (2.4), we immediately get
[TABLE]
The runtime is , which is the same barring a worse dependence on because of the larger core-set. This proves the result for $k$-Means.
C.2 The Algorithm for Matroid Median
We follow the algorithm for $k$-Median, but now we place two matroid constraints: in addition to the partition matroid constraint, we add the matroid constraint coming from the Matroid Median problem itself. Maximizing a monotone submodular function subject to two matroid constraints has a -approximation [21]. Hence, instead of (2.4), we get
[TABLE]
If the rank of the matroid is $k$, then any valid base of the matroid is also a $k$-subset; hence a core-set for $k$-Median is also a core-set for Matroid Median. This means the rest of the argument remains unchanged.
C.3 Facility Location
In this subsection, we prove Theorem 1.4 for Facility Location. Given an instance for Facility Location, let be the number of facilities opened in the optimal solution. Our parameter will be this value . Let and be the total connection and opening cost of the optimal solution respectively. For the sake of simplicity, we assume that is the same for every , but our idea can be easily generalized to the case where facilities have nonuniform opening costs (by guessing the opening costs of the optimal facilities to within a -factor). This implies that .
The general structure of the algorithm resembles the algorithm for $k$-Median. We first construct a core-set that preserves the connection cost of every with , so that we can assume . The algorithm guesses (a) the leaders , (b) the distances from the leaders to their facilities as in Algorithm 2.1, and then (c) for each , computes the set of its possible facilities . These sets give us a partition matroid on the potential facility locations.
**Lemma C.1.**
Consider a monotone submodular function , subject to a partition matroid constraint (with rank ). There exists a polynomial-time algorithm that, given , returns a set with full rank and size , such that for any with , we have
[TABLE]
**Proof C.2.**
We first use the algorithm from [4] to find a set of size such that is a base of the matroid, and
[TABLE]
Let be the residual function defined as . Since is monotone, we get
[TABLE]
Now we choose a set by picking more elements that greedily maximize the residual function. The analysis of the greedy algorithm implies that
[TABLE]
so that the total cost is at least
[TABLE]
which completes the proof.
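The second stage of this proof, greedily extending a base with respect to the residual function, can be sketched as follows. The sketch assumes the usual definition of the residual of $f$ at a set $S_1$, namely $T \mapsto f(S_1 \cup T) - f(S_1)$, and uses a coverage function as a stand-in monotone submodular function. The names `coverage` and `greedy_extend` and the toy instance are hypothetical, and the first stage (the algorithm from [4]) is not implemented here.

```python
def coverage(sets, chosen):
    """A standard monotone submodular function: number of elements covered."""
    return len(set().union(*(sets[i] for i in chosen)))

def greedy_extend(f, ground, start, extra):
    """Starting from `start`, add `extra` elements one at a time, each time
    picking the element with the largest marginal gain f(S + e) - f(S)."""
    chosen = list(start)
    for _ in range(extra):
        candidates = [e for e in ground if e not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda e: f(chosen + [e]) - f(chosen))
        chosen.append(best)
    return chosen

# Toy instance: ground elements are indices into `sets`.
sets = {0: {1, 2}, 1: {2, 3}, 2: {4, 5, 6}, 3: {1}}
f = lambda chosen: coverage(sets, chosen)

result = greedy_extend(f, list(sets), start=[0], extra=1)
value = f(result)
```

Here the marginal gains over the starting set are 1, 3, and 0 for elements 1, 2, and 3, so the greedy step adds element 2.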
We use the algorithm from Lemma C.1 to pick a set of size , instead of size as in Algorithm 2.1. The opening cost of this solution is , since each facility costs . Moreover, arguing as in Lemma 2.7 (but using Lemma C.1 instead of the -approximation guaranteed by the algorithm from [4]), the connection cost is . Several cases arise:
- If : Trying gives an approximation ratio at most .
- If : Trying gives an approximation ratio . Recalling $\alpha_{\mathsf{FL}} := \max_{x \geq 0}\big(1 + \frac{x}{1+x}\ln\frac{2}{x}\big)$, by setting , we can see that it is upper bounded by exactly .
- If : Trying gives the total cost .
Trying every value of that makes an integer will achieve an approximation ratio of .
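The constant $\alpha_{\mathsf{FL}} = \max_{x \geq 0}\big(1 + \frac{x}{1+x}\ln\frac{2}{x}\big)$ recalled above can be evaluated numerically. The following is a rough sketch using a plain grid search; the grid resolution and the names `g`, `x_star`, and `alpha_fl` are choices made for this illustration. The search is restricted to $0 < x \leq 2$ because the logarithm is negative for $x > 2$, and the expression tends to 1 as $x \to 0$.

```python
import math

def g(x):
    """The expression inside the maximum defining alpha_FL."""
    return 1.0 + x / (1.0 + x) * math.log(2.0 / x)

# Grid search over (0, 2] with step 1e-5.
xs = [i / 100000 for i in range(1, 200001)]
x_star = max(xs, key=g)
alpha_fl = g(x_star)
```

Setting the derivative of the expression to zero gives $\ln(2/x) = 1 + x$, so at the maximizer the value equals $1 + x$; equivalently, the maximum $\gamma$ satisfies $\gamma = 1 + 2e^{-\gamma}$, which evaluates to roughly 1.463.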
Appendix D Reduction from Label Cover to Max $k$-Coverage
In this section, we give a reduction from Label Cover to Max $k$-Coverage, proving Lemma 3.5.
**Lemma D.1 (Restatement of Lemma 3.5).**
There exist functions and such that for any , there exists a polynomial-time reduction that takes a Label Cover instance that is -regular and has maximum -degree , and produces a Max $k$-Coverage instance such that
- (Completeness) .
- (Soundness) .
The reduction satisfies , , and .
**Proof D.2.**
The high-level idea of the proof is the following: we choose some value . Now the set of elements consists of many disjoint hypergrids with sides of size and with dimensions. Indeed, there is a copy of the hypergrid associated with each pair for —one for each right vertex and an -subset of its left neighbors.
Now the sets: they are associated with each and and each potential label . The sets associated with a pair have a nonempty intersection with the hypergrid of if and only if is the ’th neighbor of . Indeed, the set for contains the entire “slice” of each of these hypergrids, along the dimension. The idea is very clean: if there is a “good” labeling for , then all these slices will be chosen in a coordinated way along the same dimension, and we will cover all the hypergrids completely. If there are no good labelings for , then these slices will be chosen in an uncoordinated way along different dimensions, and then we will end up covering only a constant factor of the hypergrids. (As intuition, if and we did not manage to pick two slices of the hypercube along the same dimension, we cover only of the cube: the hypergrids allow us to get .)
For those familiar with the exposition from [13], we are considering an -prover system where the verifier first randomly chooses a variable question and each of provers gets a clause question independently sampled from ’s neighbors.
*Formal construction.* For each , fix an arbitrary ordering of its incident edges so that the edges incident on are represented as . Let be an integer that will be fixed later, and consider the hypergrid . Let be the slice in the coordinate. We can describe our set system as follows.
[TABLE]
*Completeness.* Suppose the labeling satisfies every edge of . Then the subsets
[TABLE]
cover every element in ; indeed, the element is covered by the set . This proves the first claim of the lemma, namely that we have perfect completeness.
*Soundness.* For the sake of contradiction, assume there exists a subfamily such that and covers elements of total weight at least . Recall that each hypergrid is indexed by a . To simplify notation, we identify a pair and its hypergrid. We also define ’s weight , which is the weight of each of the elements in that hypergrid. The sum of all hypergrids’ weights is , and let be the distribution of ’s according to their weights. For the rest of this section, an “average hypergrid” refers to a random sampled from , possibly conditioned on for some subset .
Recall that each set intersects with one hypergrid in exactly one slice or is disjoint from it. Since the Label Cover instance is -regular, the sum of the weights of the hypergrids that intersect is
[TABLE]
which means that each set intersects with the same weighted number of hypergrids. Let be the number of sets in that intersect with the hypergrid . By double counting,
[TABLE]
So each hypergrid intersects with sets from on average.
Call a hypergrid big when , and call good if it is not big and there exist and such that and both and are in . In other words, hypergrid is intersected in at least two different slices in the same coordinate. Call the remaining ’s pseudorandom.
Since the average of , the total weight of big ’s is at most . Hence elements of total weight at least must be covered in the good or pseudorandom hypergrids. The average value of for good and pseudorandom hypergrids is still at most .
We claim that the total weight of good ’s is at least . Suppose not. Then elements of total weight at least are covered in the pseudorandom hypergrids. Note that the average value of for the pseudorandom pairs is at most . For each of those hypergrids, since it is not good, the fraction of points covered by slices is exactly , which is monotone and concave in . Therefore, the fraction of points covered in the pseudorandom cubes is at most . Fix large enough so that this quantity becomes less than , leading to the desired contradiction. Therefore, the total weight of good ’s is at least .
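The contrast between coordinated and uncoordinated slices can be checked by brute force on a small hypergrid. This sketch assumes the hypergrid is $[B]^d$ and that a slice fixes one coordinate to one value; the symbols `B` and `d`, the function name `covered_fraction`, and the chosen parameters are assumptions of this illustration. Under these assumptions, $t$ slices along distinct coordinates cover a $1 - (1 - 1/B)^t$ fraction of the grid, which is monotone and concave in $t$.

```python
from itertools import product

def covered_fraction(B, d, slices):
    """Fraction of the hypergrid [B]^d covered by the given slices,
    where each slice is a (coordinate, value) pair."""
    points = list(product(range(B), repeat=d))
    hit = sum(any(p[j] == v for j, v in slices) for p in points)
    return hit / len(points)

B, d = 4, 3

# Uncoordinated: one slice in each of three different coordinates.
uncoordinated = covered_fraction(B, d, [(0, 1), (1, 2), (2, 0)])

# Coordinated: all B slices along the same coordinate cover everything.
coordinated = covered_fraction(B, d, [(0, v) for v in range(B)])
```

With these parameters the uncoordinated choice covers $1 - (3/4)^3 = 37/64$ of the grid, while the coordinated choice covers all of it.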
For and , let be the labels that correspond to . We now construct a random labeling for as follows.
- Randomly sample uniformly from among unordered pairs.
- For , let be a random label from chosen uniformly and independently (choose an arbitrary label if ).
- For , uniformly sample , and let . Let be a random label from chosen uniformly and independently. Let . (Choose an arbitrary label if .)
Fix a good pair . Given that is sampled in the above randomized strategy, with probability at least , are sampled such that , so with probability at least .
Fix a vertex and let be the fraction of such that is good. The expected fraction of the edges incident on satisfied by the above randomized labeling is
[TABLE]
where the first equality follows from the fact that for fixed , over the randomness of , and are sampled uniformly and independently over the neighbors of , so that in the first line can be replaced by in the second line.
Let be the distribution over , which is obtained as the marginal distribution of in . This implies that in , is sampled with probability , and
[TABLE]
Therefore, the total fraction of Label Cover edges satisfied by the above randomized strategy is at least
[TABLE]
Let . This choice establishes that , finishing the proof of the soundness claim.
References
- [1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for $k$-means and Euclidean $k$-median by primal-dual algorithms. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 61–72, 2017. doi:10.1109/FOCS.2017.15.
- [2] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean $k$-means. In 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, pages 754–767, 2015. doi:10.4230/LIPIcs.SOCG.2015.754.
- [3] Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for $k$-median, and positive correlation in budgeted optimization. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 737–756. SIAM, 2014.
- [4] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput., 40(6):1740–1766, 2011. doi:10.1137/080733991.
- [5] Parinya Chalermsook, Marek Cygan, Guy Kortsarz, Bundit Laekhanukit, Pasin Manurangsi, Danupon Nanongkai, and Luca Trevisan. From Gap-ETH to FPT-inapproximability: Clique, dominating set, and more. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 743–754. IEEE, 2017.
- [6] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the $k$-median problem. J. Comput. Syst. Sci., 65(1):129–149, 2002. doi:10.1006/jcss.2002.1882.
- [7] Ke Chen. On $k$-median clustering in high dimensions. In SODA, 2006.
- [8] Ke Chen. On coresets for $k$-median and $k$-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
