On Coresets for Clustering in Small Dimensional Euclidean Spaces
Lingxiao Huang, Ruiyuan Huang, Zengfeng Huang, Xuan Wu

TL;DR
This paper studies small coresets for k-Median clustering in low-dimensional Euclidean spaces, providing improved bounds, new lower bounds, and the first separation results between k=1 and k=2 in 1D.
Contribution
It offers improved coreset size bounds for small dimensions, establishes new lower bounds, and demonstrates a novel separation between 1-Median and 2-Median in 1D.
Findings
Improved coreset bounds for small dimensions.
New lower bounds for coreset sizes.
First known separation between 1-Median and 2-Median in 1D.
Lingxiao Huang (State Key Laboratory of Novel Software Technology, Nanjing University)
Ruiyuan Huang (Fudan University)
Zengfeng Huang (Fudan University)
Xuan Wu (Huawei TCS Lab)
Abstract
We consider the problem of constructing small coresets for $k$-Median in Euclidean spaces. Given a large set of data points $P \subset \mathbb{R}^d$, a coreset is a much smaller set $S \subset \mathbb{R}^d$, so that the $k$-Median costs of any $k$ centers w.r.t. $P$ and $S$ are close. Existing literature mainly focuses on the high-dimension case and there has been great success in obtaining dimension-independent bounds, whereas the case for small $d$ is largely unexplored. Considering that many applications of Euclidean clustering algorithms are in small dimensions and the lack of systematic studies in the current literature, this paper investigates coresets for $k$-Median in small dimensions. For small $d$, a natural question is whether existing near-optimal dimension-independent bounds can be significantly improved. We provide affirmative answers to this question for a range of parameters. Moreover, new lower bound results are also proved, which are the highest for small $d$. In particular, we completely settle the coreset size bound for $1$-d $k$-Median (up to log factors). Interestingly, our results imply a strong separation between $1$-d $1$-Median and $1$-d $2$-Median. As far as we know, this is the first such separation between $k=1$ and $k=2$ in any dimension.
1 Introduction
Processing huge datasets is always computationally challenging. In this paper, we consider the coreset paradigm, which is an effective data-reduction tool to alleviate the computation burden on big data. Roughly speaking, given a large dataset, the goal is to construct a much smaller dataset, called a coreset, so that vital properties of the original dataset are preserved. Coresets for various problems have been extensively studied (Har-Peled and Mazumdar, 2004; Feldman and Langberg, 2011; Feldman et al., 2013; Cohen-Addad et al., 2022; Braverman et al., 2022). In this paper, we investigate coreset construction for $k$-Median in Euclidean spaces.
Coreset construction for Euclidean $k$-Median has been studied for nearly two decades (Har-Peled and Mazumdar, 2004; Feldman and Langberg, 2011; Huang et al., 2018; Cohen-Addad et al., 2021, 2022). For this particular problem, an $\varepsilon$-coreset is a (weighted) point set in the same Euclidean space that satisfies: given any set of $k$ centers, the $k$-Median costs of the centers w.r.t. the original point set and the coreset are within a factor of $1+\varepsilon$. The most important task in theoretical research here is to characterize the minimum size of $\varepsilon$-coresets. Recently, there has been great progress in closing the gap between upper and lower bounds in high-dimensional spaces. However, research on the coreset size in small dimensional spaces is rare. There are still large gaps between upper and lower bounds even for $1$-d $1$-Median.
Clustering in small dimensional Euclidean spaces is of both theoretical and practical importance. In practice, many applications involve clustering points in small dimensional spaces. A typical example is clustering objects in $\mathbb{R}^2$ or $\mathbb{R}^3$ based on their spatial coordinates (Wheeler, 2007; Fonseca-Rodríguez et al., 2021). Another example is spectral clustering for graph and social network analysis (Von Luxburg, 2007; Kunegis et al., 2010; Zhang et al., 2014; Narantsatsralt and Kang, 2017). In spectral clustering, nodes are first embedded into a small dimensional Euclidean space using spectral methods and then Euclidean clustering algorithms are applied in the embedding space. Even the simplest $1$-d $k$-Median has numerous practical applications (Arnaboldi et al., 2012; Jeske et al., 2013; Pennacchioli et al., 2014).
On the theory side, existing techniques for coresets in high dimensions may not be sufficient to obtain optimal coresets in small dimensions. For example, much smaller size is achievable in by using geometric methods, while the sampling methods for strong coresets in high dimension (Langberg and Schulman, 2010; Cohen-Addad et al., 2021; Huang et al., 2022b) seem not viable to obtain such bounds in low dimensions. This suggests that optimal coreset construction in small dimensions may require new techniques, which provides a partial explanation of why $1$-d $1$-Median is still open after two decades of research. That being said, the coreset problem for clustering in small dimensional spaces is of great theoretical interest and practical value. Yet it is largely unexplored in the literature. This paper aims to fill the gap and study the following question:
Question 1.
What is the tight coreset size for the Euclidean $k$-Median problem in $\mathbb{R}^d$ for small $d$?
1.1 Problem Definitions and Previous Results
Euclidean $k$-Median.
In the Euclidean $k$-Median problem, we are given a dataset $P \subset \mathbb{R}^d$ ($d \ge 1$) of $n$ points and an integer $k \ge 1$; the goal is to find a $k$-center set $C \subset \mathbb{R}^d$ that minimizes the objective function
$$\mathrm{cost}(P, C) := \sum_{p \in P} d(p, C) = \sum_{p \in P} \min_{c \in C} d(p, c),$$
where $d(p, c)$ represents the Euclidean distance between $p$ and $c$. It has many application domains including approximation algorithms, unsupervised learning, and computational geometry (Lloyd, 1982; Tan et al., 2006; Arthur and Vassilvitskii, 2007; Coates and Ng, 2012).
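To make the objective concrete, the following minimal sketch evaluates the $k$-Median cost of a candidate center set by brute force. The function name `kmedian_cost` and the toy data are illustrative assumptions and do not come from the paper.

```python
import math

def kmedian_cost(points, centers):
    """Sum, over all points, of the Euclidean distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

# Toy instance in the plane (hypothetical data).
P = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
C = [(0.0, 0.0), (4.0, 0.0)]

# Nearest-center distances are 0, 1 and 3 respectively.
print(kmedian_cost(P, C))  # 4.0
```

The brute-force evaluation takes time proportional to $n \cdot k$ per center set, which is the cost a coreset is meant to reduce by shrinking $n$.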
Coresets.
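Let $\mathcal{C}$ denote the collection of all $k$-center sets, i.e., $\mathcal{C} := \{C \subset \mathbb{R}^d : |C| = k\}$.
Definition 1.1 ($\varepsilon$-Coreset for Euclidean $k$-Median (Har-Peled and Mazumdar, 2004)).
Given a dataset $P \subset \mathbb{R}^d$ of $n$ points, an integer $k \ge 1$ and $\varepsilon \in (0,1)$, an $\varepsilon$-coreset for Euclidean $k$-Median is a subset $S \subseteq P$ with weight $w: S \to \mathbb{R}_{\ge 0}$, such that
$$\forall C \in \mathcal{C}, \qquad \sum_{p \in S} w(p) \cdot d(p, C) \in (1 \pm \varepsilon) \cdot \mathrm{cost}(P, C).$$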
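The coreset condition quantifies over every $k$-center set, so it cannot be certified by testing finitely many centers. Still, a sampled check is a useful sanity test when experimenting with a candidate coreset. The sketch below is such a check under stated assumptions: the names `weighted_cost` and `max_observed_error`, the sampling box, and the toy data are all hypothetical.

```python
import math
import random

def weighted_cost(points, centers, weights=None):
    """Weighted k-Median cost: sum of w(p) times the distance from p to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def max_observed_error(P, S, w, k, trials=1000, seed=0):
    """Largest relative error of the weighted set (S, w) over randomly sampled k-center sets.

    Passing this check is necessary but not sufficient for (S, w) to be a coreset.
    """
    rng = random.Random(seed)
    dim = len(P[0])
    lo = [min(p[j] for p in P) for j in range(dim)]
    hi = [max(p[j] for p in P) for j in range(dim)]
    worst = 0.0
    for _ in range(trials):
        # Sample k centers from a box slightly larger than the bounding box of P.
        C = [tuple(rng.uniform(lo[j] - 1.0, hi[j] + 1.0) for j in range(dim))
             for _ in range(k)]
        full = weighted_cost(P, C)
        if full > 0:
            worst = max(worst, abs(weighted_cost(S, C, w) - full) / full)
    return worst

# Duplicated points can be merged into weighted representatives without any error.
P = [(0.0, 0.0), (0.0, 0.0), (2.0, 1.0), (2.0, 1.0), (2.0, 1.0)]
S = [(0.0, 0.0), (2.0, 1.0)]
w = [2.0, 3.0]
print(max_observed_error(P, S, w, k=2))
```

Replacing `S` by a single point carrying all the weight makes the observed error large, which illustrates why the choice of representatives matters.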
For Euclidean $k$-Median, the best known upper bound on $\varepsilon$-coreset size is (Huang et al., 2022b; Cohen-Addad et al., 2022) and is the best existing lower bound (Cohen-Addad et al., 2022). The upper bound is dimension-independent, since using dimensionality reduction techniques such as the Johnson–Lindenstrauss transform, the dimension can be reduced to . Thus, most previous work essentially only focuses on , whereas the case for is largely unexplored. The lower bound requires , as the hard instance for the lower bound is an orthonormal basis of size . For constant and large enough , the upper and lower bounds match up to a polylog factor.
On the contrary, for , tight coreset sizes for $k$-Median are far from well-understood, even when . Specifically, for constant , the current best upper bound is (Feldman and Langberg, 2011), and the best lower bound is (Baker et al., 2020). Thus, there is still a large gap between the upper and lower bounds for small . Perhaps surprisingly, this is the case even for : Har-Peled and Kushal (2005) present a coreset of size in while the best known lower bound is .
1.2 Our Results
We provide a complete characterization of the coreset size (up to a logarithmic factor) for $d=1$ and partially answer Question 1 for $d>1$. Our results are summarized in Table 1.
For $d=1$, we construct coresets with size for $1$-Median (Theorem 2.1) and prove that the coreset size lower bound is for $k \ge 2$ (Theorem 2.9). Previous work has shown that coresets with size exist for $k$-Median (Har-Peled and Kushal, 2005) in $1$-d, and thus our lower bound nearly matches this upper bound. On the other hand, it was proved that the coreset size of $1$-Median in $1$-d is (Baker et al., 2020), which shows our upper bound result for $1$-Median is nearly tight.
For , we provide a discrepancy-based method that constructs deterministic coresets of size for -Median (Theorem 3.2). Our result improves over the existing upper bound (Cohen-Addad et al., 2021) for and matches the lower bound (Cohen-Addad et al., 2022) for . We further prove a lower bound of for -Median in (Theorem 3.8). Combining with our -d lower bound , this improves over the existing lower bound (Baker et al., 2020; Cohen-Addad et al., 2022).
1.3 Technical Overview
We first discuss the $1$-d $k$-Median problem and show that the framework of (Har-Peled and Kushal, 2005) is optimal with significant improvement for $k=1$. Then we briefly summarize our approaches for $d>1$.
The Bucket-Partitioning Framework for $1$-d $k$-Median in (Har-Peled and Kushal, 2005).
Our main results in $1$-d are based on the classic bucket-partitioning framework, developed in (Har-Peled and Kushal, 2005), which we briefly review now. They greedily partition a dataset $P$ into consecutive buckets $B$ and collect the mean point $\mu(B)$ together with weight $|B|$ as their coreset $S$. Their construction requires that the cumulative error holds for every bucket $B$, where is the optimal $k$-Median cost of $P$. Their important geometric observation is that the induced error of every bucket $B$ is at most , and is even $0$ when all points in $B$ are assigned to the same center. Consequently, only buckets induce a non-zero error for every center set and the total induced error is at most , which concludes that $S$ is a coreset of size .
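The greedy bucket partition described above can be sketched as follows. This is an illustration only: the threshold is exposed as a free parameter `max_error` (a hypothetical name) rather than the specific value used in (Har-Peled and Kushal, 2005), and the cumulative error is taken to be the sum of distances from the bucket's points to the bucket mean.

```python
def cumulative_error(bucket):
    """Sum of distances from the points of a bucket to the bucket mean."""
    mu = sum(bucket) / len(bucket)
    return sum(abs(p - mu) for p in bucket)

def bucket_coreset(points, max_error):
    """Greedily grow consecutive buckets while their cumulative error stays <= max_error.

    Returns a list of (mean, weight) pairs, one per bucket.
    """
    coreset, bucket = [], []
    for p in sorted(points):
        # Close the current bucket if adding p would exceed the error budget.
        if bucket and cumulative_error(bucket + [p]) > max_error:
            coreset.append((sum(bucket) / len(bucket), len(bucket)))
            bucket = []
        bucket.append(p)
    if bucket:
        coreset.append((sum(bucket) / len(bucket), len(bucket)))
    return coreset

def one_median_cost(weighted_points, c):
    """Weighted 1-Median cost of center c in one dimension."""
    return sum(w * abs(x - c) for x, w in weighted_points)

P = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
S = bucket_coreset(P, max_error=2.0)
print(S)  # [(1.0, 3), (11.0, 3)]

# A center lying outside every bucket incurs no error at all.
print(one_median_cost([(p, 1) for p in P], 5.0), one_median_cost(S, 5.0))  # 30.0 30.0
```

Recomputing the cumulative error from scratch at each step keeps the sketch short at the price of quadratic time per bucket; maintaining running sums would avoid this.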
Reducing the Number of Buckets for $1$-d $1$-Median via Adaptive Cumulative Errors.
In the case of where there is only one center , we improve the result in (Har-Peled and Kushal, 2005) (Theorem 2.1) through the following observation: can be much larger than when center is close to either of the endpoints of , and consequently, can allow a larger induced error of coreset than . This observation motivates us to adaptively select cumulative errors for different buckets according to their locations. Inspired by this motivation, our algorithm (Algorithm 1) first partitions dataset into blocks according to clustering cost, i.e., for all , and then further partition each block into buckets with a carefully selected cumulative error bound . Intuitively, our selection of cumulative errors is proportional to the minimum clustering cost of buckets, which results in a coreset.
For the coreset size, we first observe that there are only non-empty blocks (Lemma 2.7) since we can “safely ignore” the leftmost and the rightmost points and the remaining points satisfy . The most technical part is that we show the number of buckets in each is at most (Lemma 2.8), which results in our improved coreset size . The basic idea is surprisingly simple: the clustering cost of a bucket is proportional to its distance to center , and hence, the clustering cost of consecutive buckets is proportional to instead of . According to this idea, we find that for every , which implies a desired bound by our selection of .
Hardness Result for $1$-d $2$-Median: Cumulative Error is Unavoidable.
We take $k=2$ as an example here and show the tightness of the bound by (Har-Peled and Mazumdar, 2004). The extension to $k>2$ is standard via an idea of (Baker et al., 2020).
We construct the following worst-case instance of size : We construct consecutive buckets such that the length of buckets exponentially increases while the number of points in buckets exponentially decreases. We fix a center at the leftmost point of (assumed to be $0$ w.l.o.g.) and move the other center along the axis. Such a dataset satisfies the following:
- the clustering cost is stable: for all , up to a constant factor;
- the cumulative error for every bucket is ;
- for every , is a quadratic function that first decreases and then increases as moves from left to right within , and the gap between the maximum and the minimum values is .
Suppose is of size . Then there must exist a bucket such that . We find that function is an affine linear function when is located within (Lemma 2.11). Consequently, the maximum induced error is at least since the estimation error of an affine linear function to a quadratic function is up to certain “cumulative curvature” of (Lemma 2.10), which is due to our construction. Hence, is not a coreset since always holds.
We remind the readers that the above cost function is actually a piecewise quadratic function with pieces instead of a quadratic one, which ensures the stability of . This is the main difference from , which leads to a gap of on the coreset size between and . As far as we know, this is the first such separation in any dimension.
Our Approaches when .
For -Median, our upper bound result (Theorem 3.2) combines a recent hierarchical decomposition coreset framework in (Braverman et al., 2022), that reduces the instance to a hierarchical ring structure (Theorem 3.4), and the discrepancy approaches (Theorem 3.6) developed by (Karnin and Liberty, 2019). The main idea is to extend the analytic analysis of (Karnin and Liberty, 2019) to handle multiplicative errors in a scalable way.
For $k$-Median, our lower bound result (Theorem 3.8) extends recently developed approaches in (Cohen-Addad et al., 2022). Their hard instance is an orthonormal basis in , the size of which is at most , and hence cannot obtain a lower bound higher than . We improve the results by embedding copies of their hard instance in , each of which lies in a different affine subspace. We argue that the errors from all subspaces add up. However, the error analysis from (Cohen-Addad et al., 2022) cannot be directly used; we need to overcome several technical challenges. For instance, points in the coreset are not necessarily in any affine subspace, so the error in each subspace is not a corollary of their result. Moreover, errors from different subspaces may cancel each other.
1.4 Other Related Work
Coresets for Clustering in Metric Spaces
Recent works (Cohen-Addad et al., 2022, 2022; Huang et al., 2023) show that Euclidean $(k,z)$-Clustering admits $\varepsilon$-coresets of size and a nearly tight bound is known when (Cohen-Addad et al., 2021). Apart from the Euclidean metric, the research community has also extensively studied coresets for clustering in general metric spaces. For example, Feldman and Langberg (2011) construct coresets of size for general discrete metrics. Baker et al. (2020) show that the previous factor is unavoidable. There are also works on other specific metric spaces: doubling metrics (Huang et al., 2018) and graphs with shortest path metrics (Baker et al., 2020; Braverman et al., 2021; Cohen-Addad et al., 2021), to name a few.
Coresets for Variants of Clustering
Coresets for variants of clustering problems are also of great interest. For example, Braverman et al. (2022) construct coresets of size for capacitated -Median, which is improved to by (Huang et al., 2023). Other important variants of clustering include ordered clustering (Braverman et al., 2019), robust clustering (Huang et al., 2022a), and time-series clustering (Huang et al., 2021).
2 Tight Coreset Sizes for $1$-d $k$-Median
2.1 Near Optimal Coreset for $1$-d $1$-Median
We have the following theorem.
Theorem 2.1 (Improved Coreset for one-dimensional $1$-Median).
There is a polynomial time algorithm, such that given an input data set $P \subset \mathbb{R}$, it outputs an $\varepsilon$-coreset of $P$ for $1$-Median with size .
Useful Notations and Facts.
Throughout this section, we use with . Let , we have the following simple observations for .
Observation 2.2.
* is a convex piecewise affine linear function of and is the optimal -Median cost on .*
The following notions, proposed by (Har-Peled and Mazumdar, 2004), are useful for our coreset construction.
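Definition 2.3 (Bucket).
A bucket $B$ is a contiguous subset $\{p_l, p_{l+1}, \ldots, p_r\}$ of $P$ for some $1 \le l \le r \le n$, where the points of $P$ are indexed in sorted order as $p_1, \ldots, p_n$.
Definition 2.4 (Mean and cumulative error (Har-Peled and Kushal, 2005)).
Given a bucket $B = \{p_l, \ldots, p_r\}$ for some $1 \le l \le r \le n$, denote $N(B) := r - l + 1$ to be the number of points within $B$ and $L(B) := p_r - p_l$ to be the length of $B$. We define the mean of $B$ to be $\mu(B) := \frac{1}{N(B)} \sum_{p \in B} p$ and define the cumulative error of $B$ to be $\delta(B) := \sum_{p \in B} |p - \mu(B)|$.
Note that $|p - \mu(B)| \le L(B)$ always holds for every $p \in B$, which implies the following fact.
Fact 2.5.
$\delta(B) \le N(B) \cdot L(B)$.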
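The following lemma shows that for each bucket $B$, the coreset error on $B$ is no more than $\delta(B)$.
Lemma 2.6 (Cumulative error controls coreset error (Har-Peled and Kushal, 2005)).
Let $B = \{p_l, \ldots, p_r\}$ for $1 \le l \le r \le n$ be a bucket and $c \in \mathbb{R}$ be a center. We have
1. if $c \in (p_l, p_r)$, $|\mathrm{cost}(B, c) - N(B) \cdot d(\mu(B), c)| \le \delta(B)$;
2. if $c \notin (p_l, p_r)$, $\mathrm{cost}(B, c) = N(B) \cdot d(\mu(B), c)$.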
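Both cases of the lemma can be checked numerically on a random bucket. The sketch below assumes the usual reading of the definitions (mean of the bucket, cumulative error as the sum of distances to the mean, induced error as the gap between the true cost and the cost of the weighted mean); the helper names and the random data are hypothetical.

```python
import random

def mean(bucket):
    return sum(bucket) / len(bucket)

def cumulative_error(bucket):
    """Sum of distances from the bucket's points to its mean."""
    mu = mean(bucket)
    return sum(abs(p - mu) for p in bucket)

def induced_error(bucket, c):
    """Gap between the true cost of center c and the cost of the weighted mean."""
    exact = sum(abs(p - c) for p in bucket)
    approx = len(bucket) * abs(mean(bucket) - c)
    return abs(exact - approx)

rng = random.Random(1)
bucket = sorted(rng.uniform(0.0, 10.0) for _ in range(50))
delta = cumulative_error(bucket)

# Centers strictly inside the bucket: the error is bounded by the cumulative error.
inside = [rng.uniform(bucket[0], bucket[-1]) for _ in range(200)]
worst_inside = max(induced_error(bucket, c) for c in inside)

# Centers outside the bucket: the error vanishes up to floating-point rounding.
outside = ([bucket[0] - rng.uniform(0.0, 5.0) for _ in range(100)]
           + [bucket[-1] + rng.uniform(0.0, 5.0) for _ in range(100)])
worst_outside = max(induced_error(bucket, c) for c in outside)

print(worst_inside <= delta, worst_outside < 1e-9)
```

The outside case is exact because all points lie on the same side of the center, so the sum of distances equals the number of points times the distance from the mean.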
Algorithm for Theorem 2.1.
Our algorithm is summarized in Algorithm 1. We improve the framework in (Har-Peled and Kushal, 2005), which partitions into multiple buckets so that the cumulative errors in different buckets are the same and collects their means as a coreset. Our main idea is to carefully select an adaptive cumulative error for different buckets. In Lines 2-3, we take the leftmost points and the rightmost points, and add their weighted means to our coreset . In Lines 4 (and 7), we divide the remaining points into disjoint blocks () such that for every , , and then greedily divide each into disjoint buckets with a cumulative error roughly in Line 5. We remind the readers that the cumulative error in (Har-Peled and Kushal, 2005) is always .
We define function such that for every and define such that for every . By Observation 2.2, is decreasing on and increasing on . As a result, each consists of consecutive points in . The following lemma shows that the number of blocks () is .
Lemma 2.7 (Number of blocks).
There are at most non-empty blocks or .
Proof:
We prove Algorithm 1 divides into at most non-empty blocks . Argument for is entirely symmetric.
If is non-empty for some , we must have for . We also have since . Since is convex, we have . If we show that then we have thus .
To prove , we use triangle inequality to obtain that
[TABLE]
Moreover, we note that by the choice of , . Thus we have,
[TABLE]
We next give a key lemma that we use to obtain an improved coreset size.
Lemma 2.8 (Number of buckets).
Each non-empty block or is divided into buckets.
Proof:
We prove that each block is divided into at most buckets . Argument for is entirely symmetric.
Suppose and we divide into buckets . Since each is the maximal bucket with , we have for . Denote by for , we have:
[TABLE]
Here (2.1) is from Cauchy-Schwarz inequality. So we have , which implies .
Now we are ready to prove Theorem 2.1.
Proof of Theorem 2.1:
We first verify that the set is an $\varepsilon$-coreset. Our goal is to prove that for every , . We prove this for any . The argument for is entirely symmetric.
For any , we have
[TABLE]
where takes over all buckets. We then separately analyze the case and the case.
When , we note that (Lemma 2.6). By elementary calculus, both and are within ; hence differ by at most a multiplicative factor of . Thus, .
When , there is at most one bucket such that since these buckets are disjoint. If such a bucket does not exist, we have . Now suppose such a bucket exists. Since , we have for some block . Thus, by Lemma 2.6 and the construction of buckets:
[TABLE]
We have and . Since is convex (thus decreasing on ) and , we also have . This implies .
It remains to show that the size of , which is the total number of buckets, is . By Lemma 2.7, there are blocks, and by Lemma 2.8, each block contains buckets. Thus, there are at most buckets.
2.2 Tight Lower Bound on Coreset Size for $1$-d $k$-Median when $k \ge 2$
In this subsection, we prove that the size lower bound of an $\varepsilon$-coreset for the $k$-Median problem in $\mathbb{R}$ is . This lower bound matches the upper bound in (Har-Peled and Kushal, 2005).
Theorem 2.9 (Coreset lower bound for $1$-d $k$-Median when $k \ge 2$).
For a given integer $k \ge 2$ and $\varepsilon \in (0,1)$, there exists a dataset $P \subset \mathbb{R}$ such that any $\varepsilon$-coreset must have size .
For ease of exposition, we only prove the lower bound for $2$-Median here. The generalization to $k$-Median is straightforward and can be found in Appendix A.
We first prove a technical lemma, which shows that a quadratic function cannot be approximated well by an affine linear function in a long enough interval. We note that similar technical lemmas appear in coreset lower bounds for other related clustering problems (Braverman et al., 2019; Baker et al., 2020). The lemma in (Braverman et al., 2019) shows that the function cannot be approximated well by an affine linear function while our lemma is about approximating a quadratic function. The lemma in (Baker et al., 2020) shows that a quadratic function cannot be approximated well by an affine linear function on a bounded interval, a situation slightly different from ours.
Lemma 2.10 (Quadratic function cannot be approximated well by affine linear functions).
Let be an interval, be a quadratic function on interval , and be two constants, and be a non-negative real number. If and for all , then there is no affine linear function such that for all .
Proof:
Assume there is an affine linear function that satisfies for all . We denote the error function by , which has two properties. First, its norm . Second, it is quadratic and satisfies , thus for all .
Define . By the mean value theorem, there is a point such that . Similarly there is a point such that . Since is a quadratic function, its derivative is monotonic and . Thus we have
[TABLE]
On the other hand . We have . Thus .
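The curvature argument behind this lemma can be illustrated numerically. The sketch below does not reproduce the lemma's exact hypotheses; it checks the elementary fact that a quadratic with leading coefficient $a > 0$ differs from every affine function by at least $a \cdot \ell^2 / 8$ somewhere on an interval of length $\ell$, and that this value is attained by a shifted chord. All names and the sample parameters are assumptions made for the illustration.

```python
import random

def three_point_error(f, g, left, right):
    """Largest |f - g| over the two endpoints and the midpoint of [left, right]."""
    mid = (left + right) / 2
    return max(abs(f(x) - g(x)) for x in (left, mid, right))

def curvature_bound(a, left, right):
    """Lower bound a * length^2 / 8 on the uniform error of any affine approximation."""
    return a * (right - left) ** 2 / 8

a, b, left, right = 2.0, -3.0, 1.0, 5.0

def f(x):
    return a * x * x + b * x

bound = curvature_bound(a, left, right)

# The chord through the endpoints, shifted down by the bound, attains the bound.
chord_slope = (f(right) - f(left)) / (right - left)

def best_line(x):
    return f(left) + chord_slope * (x - left) - bound

# No randomly drawn affine function does better than the bound.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    s, t = rng.uniform(-50.0, 50.0), rng.uniform(-50.0, 50.0)
    gaps.append(three_point_error(f, lambda x, s=s, t=t: s * x + t, left, right))

print(bound, three_point_error(f, best_line, left, right), min(gaps) >= bound)
```

The bound follows because the error $e = f - g$ is itself a quadratic with the same leading coefficient, so $e(\text{left}) - 2e(\text{mid}) + e(\text{right}) = a \ell^2 / 2$, and one of the three values must have magnitude at least $a \ell^2 / 8$.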
For any dataset , with a slight abuse of notation, we denote the cost function for $2$-Median with one query point fixed at $0$ by . The following lemma shows that is a piecewise affine linear function and all the transition points are .
Lemma 2.11 (The function is piecewise affine linear).
Let be a weighted dataset. The function is a piecewise affine linear function. All the transition points between two affine pieces are .
Proof:
We denote the weight of point by and denote the midpoint between any point and $0$ by . Now assume and both and are not in the dataset . The clustering cost of a single point is
[TABLE]
If changes to we have
[TABLE]
Assume is small enough, then there are no data points in and . We have
[TABLE]
thus
[TABLE]
Consider moves in from left to right, the derivative changes only when or pass a data point in . The same conclusion also holds for by a symmetric argument. This is exactly what we want.
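The piecewise affine structure can be observed directly on a small weighted dataset. The sketch below assumes the setting of the proof: one center is fixed at $0$, the other center $c$ is positive, and the slope can only change when $c$ or $c/2$ passes a data point, i.e. when $c$ equals $p$ or $2p$ for some data point $p$. The function names and the toy dataset are hypothetical.

```python
def cost_with_fixed_origin(S, c):
    """Weighted 1-d 2-Median cost with one center fixed at 0 and the other at c."""
    return sum(w * min(abs(p), abs(p - c)) for p, w in S)

def candidate_breakpoints(S):
    """Values of c > 0 at which c or c / 2 coincides with a positive data point."""
    return sorted({x for p, _ in S if p > 0 for x in (p, 2 * p)})

def second_difference(S, a, b):
    """f(a) - 2 f((a + b) / 2) + f(b); this is zero whenever f is affine on [a, b]."""
    mid = (a + b) / 2
    return (cost_with_fixed_origin(S, a)
            - 2 * cost_with_fixed_origin(S, mid)
            + cost_with_fixed_origin(S, b))

# Weighted dataset given as (point, weight) pairs.
S = [(1.0, 2.0), (3.0, 1.0), (4.0, 0.5)]
bps = candidate_breakpoints(S)
print(bps)  # [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]

# Between consecutive breakpoints the cost is affine in c.
print([second_difference(S, u, v) for u, v in zip(bps, bps[1:])])

# Across the breakpoint c = 1 the slope changes.
print(second_difference(S, 0.5, 1.5))  # 2.0
```

A coreset with few points therefore has few breakpoints, which is what forces long intervals on which its cost function is affine.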
Proof:
[-Median case of Theorem 2.9] We first construct the dataset . The dataset is a union of disjoint intervals . Denote the left endpoint and right endpoint of by and respectively. We recursively define for , for , and . Thus . The weight of points is specified by a measure on . The measure is absolutely continuous with respect to Lebesgue measure such that its density on the th interval is . We denote the density on the th interval by and the density at point by . Note that can be discretized in the following way. We only need to take a large enough constant , create a bucket of equally spaced points in each interval , and assign weight to every point.
The cost function has the following two features:
1. the function value for any ;
2. the function is quadratic on the interval and satisfies for each .
We show how to prove Theorem 2.9 from these features and defer the verification of these features to later. Note that feature 2 does not contradict Lemma 2.11 since the dataset contains infinitely many points.
Assume that is an $\varepsilon$-coreset of . We prove by contradiction. If , then there is an interval such that by the pigeonhole principle. Consider function on interval . When , we have . Thus both and do not pass points in when moves from to . By Lemma 2.11, function is affine linear on interval . Since is an $\varepsilon$-coreset of , we have on interval . However, by applying Lemma 2.10 to and on interval with and , we obtain that . This is a contradiction.
It remains to verify the two features of . We verify feature 1 by direct computations. For any point , the function satisfies
[TABLE]
To verify feature 2, we compute the first order derivative by computing the change of the function value up to the first order term when increases an infinitesimal number . The unweighted clustering cost of a single point is
[TABLE]
As increases to , the clustering cost of a single point changes
[TABLE]
The cumulative clustering cost changes
[TABLE]
Thus the first order derivative and the second order derivative
[TABLE]
For , the two points and both lie in interval . We have and . Thus the function is quadratic on and .
3 Improve Coreset Sizes when
In this section, we consider the case of constant , , and provide several improved coreset bounds for a generalization of Euclidean $k$-Median, called Euclidean $(k,z)$-Clustering. The only difference from $k$-Median is that the goal is to find a $k$-center set $C \subset \mathbb{R}^d$ that minimizes the objective function
$$\mathrm{cost}_z(P, C) := \sum_{p \in P} d^z(p, C) = \sum_{p \in P} \min_{c \in C} d^z(p, c),$$
where $d^z$ represents the $z$-th power of the Euclidean distance. The coreset notion is as follows.
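As a small illustration of how the exponent changes the objective, the sketch below evaluates the $(k,z)$-Clustering cost for $z=1$ (which is $k$-Median) and $z=2$ (which is $k$-Means). The function name `clustering_cost` and the toy data are assumptions made for the example.

```python
import math

def clustering_cost(points, centers, z=1.0):
    """Sum, over all points, of the z-th power of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)

P = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
C = [(0.0, 0.0), (4.0, 0.0)]

# Nearest-center distances are 0, 1 and 3.
print(clustering_cost(P, C, z=1))  # 4.0  (k-Median)
print(clustering_cost(P, C, z=2))  # 10.0 (k-Means)
```

Larger exponents weight far points more heavily, which is why error bounds in this section are stated for a constant $z$.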
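Definition 3.1 ($\varepsilon$-Coreset for Euclidean $(k,z)$-Clustering (Har-Peled and Mazumdar, 2004)).
Given a dataset $P \subset \mathbb{R}^d$ of $n$ points, an integer $k \ge 1$, constant $z \ge 1$ and $\varepsilon \in (0,1)$, an $\varepsilon$-coreset for Euclidean $(k,z)$-Clustering is a subset $S \subseteq P$ with weight $w: S \to \mathbb{R}_{\ge 0}$, such that
$$\forall C \in \mathcal{C}, \qquad \sum_{p \in S} w(p) \cdot d^z(p, C) \in (1 \pm \varepsilon) \cdot \mathrm{cost}_z(P, C).$$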
We first study the case of and provide a coreset upper bound (Theorem 3.2). Then we study the general case and provide a coreset lower bound (Theorem 3.8).
3.1 Improved Coreset Size in when
We prove the following main theorem for whose center is a point .
Theorem 3.2 (Coreset for Euclidean -Clustering).
Let integer , constant and . There exists a randomized polynomial time algorithm that given a dataset , outputs an -coreset for Euclidean -Clustering of size at most .
Proof sketch:
By (Braverman et al., 2022), we first reduce the problem to constructing a mixed coreset for Euclidean -Clustering for a dataset satisfying that ,
[TABLE]
The main idea to construct such is to prove that the class discrepancy of Euclidean -Clustering for is at most for (Lemma 3.7), which implies the existence of a mixed coreset of size by Fact 6 of (Karnin and Liberty, 2019). For the class discrepancy, we apply an analytic result of (Karnin and Liberty, 2019) (Theorem 3.6). The main difference is that (Karnin and Liberty, 2019) only considers an additive error that can handle instead of an arbitrary center . In our case, we allow a mixed error proportional to the scale of and extend the approach of (Karnin and Liberty, 2019) to handle arbitrary centers by increasing the discrepancy by a multiplicative factor .
The above theorem is powerful and leads to the following results for :
1. By dimension reduction as in (Huang and Vishnoi, 2020; Cohen-Addad et al., 2021, 2022), we can assume . Consequently, our coreset size is upper bounded by , which matches the nearly tight bound in (Cohen-Addad et al., 2022).
2. For , our coreset size is , which is the first known result in small dimensional space. Specifically, the prior known coreset size in is (Braverman et al., 2022), and our result improves it by a factor of .
We conjecture that our coreset size is almost tight, i.e., there exists a coreset lower bound for constant , which we leave as an interesting open problem.
3.1.1 Useful Notations and Facts
For preparation, we first propose a notion of mixed coreset (Definition 3.3), and then introduce some known discrepancy results.
Reduction to mixed coreset.
Let denote the -ball in that is centered at with radius . Specifically, is the unit ball centered at the origin.
Definition 3.3 (Mixed coreset for Euclidean -Clustering).
Given a dataset and , an -mixed-coreset for Euclidean -Clustering is a subset with weight , such that ,
[TABLE]
Actually, prior work (Cohen-Addad et al., 2021, 2022; Braverman et al., 2022) usually considers the following form: ,
[TABLE]
Compared to Definition 1.1, the above inequality allows both a multiplicative error and an additional additive error . Note that for a small , the additive error dominates the total error; while for a large , the multiplicative error dominates the total error. Hence, it is not hard to check that Inequality (5) is an equivalent form of Inequality (4) (up to an -scale). This is also the reason that we call Definition 3.3 mixed coreset. We have the following useful reduction.
Theorem 3.4 (Reduction from coreset to mixed coreset (Braverman et al., 2022)).
Let . Suppose there exists a polynomial time algorithm that constructs an -mixed coreset for Euclidean -Clustering of size . Then there exists a polynomial time algorithm that constructs an -coreset for Euclidean -Clustering of size .
Thus, it suffices to prove that an -mixed coreset is of size , which implies Theorem 3.2.
Class discrepancy.
For preparation, we recall the notion of class discrepancy introduced by (Karnin and Liberty, 2019). The idea of combining discrepancy and coreset construction has been studied in the literature, specifically for kernel density estimation (Phillips and Tai, 2018a, b; Karnin and Liberty, 2019; Tai, 2022). We use the following definition.
Definition 3.5 (Class discrepancy (Karnin and Liberty, 2019)).
Let be an integer. Let and with . The class discrepancy of of w.r.t. is
[TABLE]
Moreover, we define to be the class discrepancy w.r.t. .
Here, is the instance space and is the parameter space. Specifically, for Euclidean -Clustering, we let and be the Euclidean distance. The class discrepancy measures the capacity of . Intuitively, if the capacity of is large and leads to a complicated geometric structure of vector for , tends to be large.
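For intuition, the inner part of the class discrepancy can be computed by brute force on a tiny sample: choose signs for the sample points so that the signed sum is small for every parameter simultaneously. The sketch below is an illustration under assumptions that are not confirmed by the text above: the value is normalized by the sample size, the outer maximum over all samples is omitted, and the parameter space is replaced by a finite list of queries. The name `sample_discrepancy` is hypothetical.

```python
import itertools

def sample_discrepancy(xs, queries, f):
    """Brute-force min over sign vectors of the max over queries of |sum_i sign_i * f(x_i, q)| / m.

    Exponential in the sample size m, so only usable on toy inputs.
    """
    m = len(xs)
    best = float("inf")
    for signs in itertools.product((-1, 1), repeat=m):
        worst = max(abs(sum(s * f(x, q) for s, x in zip(signs, xs)))
                    for q in queries)
        best = min(best, worst)
    return best / m

def distance(x, q):
    """Euclidean distance on the real line."""
    return abs(x - q)

# Pairs of identical points can be signed +1 / -1 and cancel for every query.
print(sample_discrepancy([0.0, 0.0, 1.0, 1.0], [-1.0, 0.5, 2.0], distance))  # 0.0

# Two different points cannot cancel for a query placed on one of them.
print(sample_discrepancy([0.0, 1.0], [0.0], distance))  # 0.5
```

Richer parameter spaces make simultaneous cancellation harder, which matches the intuition that the discrepancy grows with the capacity of the function class.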
Useful discrepancy results.
For a vector and integer , let present the -dimensional tensor obtained from the outer product of with itself times. For a -dimensional tensor with entries, we consider the measure . Next, we provide some known results about the class discrepancy.
Theorem 3.6 (An upper bound for class discrepancy (restatement of Theorem 18 of (Karnin and Liberty, 2019))).
Let in . Let be analytic satisfying that for any integer , for some constant . Let and be an integer. The class discrepancy w.r.t. is at most for some constant .
Moreover, for any dataset of size , there exists a randomized polynomial time algorithm that constructs satisfying that for any integer , we have
[TABLE]
This satisfies .
Note that the above theorem is a constructive result rather than the existential result in Theorem 18 of (Karnin and Liberty, 2019). This is because Theorem 18 of (Karnin and Liberty, 2019) applies the existential version of Banaszczyk's theorem (Banaszczyk, 1998), which has been proven to be constructive recently (Bansal et al., 2019). Also, note that the construction of only depends on and does not depend on the selection of . This observation is important for the construction of mixed coresets via discrepancy.
3.1.2 Proof of Theorem 3.2
We are ready to prove Theorem 3.2. The main lemma is as follows.
Lemma 3.7 (Class discrepancy for Euclidean -Clustering).
Let be an integer. Let and . For a given dataset of size , there exists a vector such that for any ,
[TABLE]
The above lemma indicates that the class discrepancy for Euclidean -Clustering linearly depends on the radius of the parameter space . Note that the lemma finds a vector that satisfies all levels of parameter spaces simultaneously. This requirement is slightly different from Definition 3.5 that considers a fixed parameter space. Observe that the term is similar to in Definition 3.3, which is the key of reduction from Lemma 3.7 to Theorem 3.2. The proof idea is similar to that of Fact 6 of (Karnin and Liberty, 2019).
Proof (of Theorem 3.2):
Let be a dataset of size and . By the same argument as in Fact 6 of (Karnin and Liberty, 2019), we can iteratively apply Lemma 3.7 to construct a subset of size together with weights for and a vector , such that for any ,
[TABLE]
This implies that is an -mixed coreset for Euclidean -Clustering of size at most , which completes the proof of Theorem 3.2.
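The iterative halving idea behind this argument can be sketched as follows. This is a rough illustration under stated assumptions, not the construction of the proof: the balanced random signs stand in for the low-discrepancy vector of Lemma 3.7, the offset part of a mixed coreset is not modeled, and all function names are hypothetical.

```python
import numpy as np


def balanced_random_signs(n, rng):
    """Placeholder coloring: exactly n // 2 entries are -1, the rest +1.

    ASSUMPTION: a real construction would use the low-discrepancy vector
    guaranteed by Lemma 3.7 instead of random signs.
    """
    signs = np.ones(n)
    signs[rng.permutation(n)[: n // 2]] = -1.0
    return signs


def halve_once(points, weights, signs):
    """Keep the points colored +1 and double their weights."""
    keep = signs > 0
    return points[keep], 2.0 * weights[keep]


def halving_summary(points, rounds, seed=0):
    """Apply `rounds` halving steps, starting from unit weights."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    w = np.ones(len(pts))
    for _ in range(rounds):
        signs = balanced_random_signs(len(pts), rng)
        pts, w = halve_once(pts, w, signs)
    return pts, w


# Hypothetical dataset of 64 points in the plane.
P = np.random.default_rng(1).normal(size=(64, 2))
S, w = halving_summary(P, rounds=3)
print(len(S), w.sum())
```

Each round halves the number of points and doubles the weights, so the total weight is preserved. The quality of the resulting summary depends entirely on the discrepancy of the coloring used in each round.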
It remains to prove Lemma 3.7.
Proof (of Lemma 3.7):
Let be a dataset of size . We first construct a vector as follows:
1. For each , construct a point .
2. By Theorem 3.6, construct such that for any integer ,
[TABLE]
Let be the collection of all s. Note that by construction, which implies that . In the following, we show that satisfies Lemma 3.7.
Fix and let . We construct another dataset . For any , we let . By definition, we have for any and ,
[TABLE]
which implies that
[TABLE]
Thus, it suffices to prove that
[TABLE]
which implies the lemma. The proof idea of Inequality (6) is similar to that of Theorem 22 of (Karnin and Liberty, 2019). (Footnote 5: The proof of Theorem 22 of (Karnin and Liberty, 2019) is actually incorrect. Applying Theorem 18 of (Karnin and Liberty, 2019) may lead to an upper bound , which makes in Theorem 22 of (Karnin and Liberty, 2019) not exist.) For each and , let and we can rewrite as follows:
[TABLE]
We note that and since . Construct another function as follows: for each and ,
1. If for any , , let ;
2. Otherwise, let .
We have for any integer . By the construction of and Theorem 3.6, we have that
[TABLE]
which implies Inequality (6) since due to the fact that .
Overall, we complete the proof.
3.2 Improved Coreset Lower Bound in when
We present a lower bound for the coreset size in small dimensional spaces.
Theorem 3.8 (Coreset lower bound in small dimensional spaces).
Given an integer , constant and a real number , for any integer , there is a dataset such that its -coreset for -Clustering must contain at least points.
When , Theorem 3.8 gives the well-known lower bound . When , the theorem is non-trivial. In the following, we prove Theorem 3.8 for and show how to extend the proof to general in Appendix B.
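To make the coreset requirement concrete, the following minimal sketch evaluates the clustering cost (sum of z-th powers of distances to the nearest center, with z = 1 for k-Median) and the relative error of a weighted candidate summary for a single center set. The function names, dataset, and candidate summary are assumptions chosen for illustration. A coreset must have small relative error for every center set, which this one-point summary does not.

```python
import numpy as np


def clustering_cost(points, centers, z=1, weights=None):
    """Sum over points of weight * (distance to nearest center) ** z."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    if weights is None:
        weights = np.ones(len(points))
    return float(np.sum(weights * nearest ** z))


def relative_error(P, S, w, C, z=1):
    """Relative cost error of the weighted set (S, w) against P for centers C."""
    full = clustering_cost(P, C, z)
    return abs(clustering_cost(S, C, z, w) - full) / full


# Hypothetical data: four corners of a square, summarized by its centroid.
P = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
S = np.array([[1.0, 1.0]])
w = np.array([4.0])

err_far = relative_error(P, S, w, np.array([[10.0, 1.0]]))   # far-away center
err_near = relative_error(P, S, w, np.array([[1.0, 1.0]]))   # center at the centroid
print(err_far, err_near)
```

The summary is accurate for the far-away center but has relative error 1 for the center placed at the centroid, so it is not a coreset for any error parameter below 1. Lower-bound arguments such as the one below exhibit center sets that force this kind of failure for every small summary.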
3.2.1 Preparation
Notations
Let be the standard basis vectors of , and be -dimensional affine subspaces, where for a sufficiently large constant . For any , we use to denote the -dimensional vector (i.e., discard the [math]-th coordinate of ).
Hard instance
We construct the hard instance as follows. Take for and take to be the union of all . The hard instance is . Note that for each and .
In our proof, we always put two centers in each . Thus for large enough , all must be assigned to centers in .
We will use the following two technical lemmas from (Cohen-Addad et al., 2022).
Lemma 3.9.
For any , let be arbitrary unit vectors in , we have
[TABLE]
Lemma 3.10.
Let be a set of points in of size and be their weights. There exist unit vectors , such that
[TABLE]
3.2.2 Proof of Theorem 3.8 when
Now we are ready to prove Theorem 3.8 when .
Proof:
Note that points in might not be in any . We first map each point to an index such that is the nearest subspace of . The mapping is quite simple:
[TABLE]
where is the [math]-th coordinate of . Let , which is the distance of to the closest affine subspace. Let be the set of points in , whose closest affine subspace is . Define . Consider any -center set such that . Then for sufficiently large . On the other hand, . Since is a coreset, for all . (Footnote 6: Here we do not allow offsets to simplify the proof, but our technique can be extended to handle offsets.) Therefore each must be very close to its closest affine subspace; in particular, we can assume that must be assigned to some center in (if there exists one).
In what follows, we consider three different sets of centers and and compare the costs and for . In each , there are two centers in each . As we have discussed above, for large enough , the total cost for both and can be decomposed into the sum of costs over all affine subspaces.
For each , the corresponding centers in are the same across . Let be any point in such that has unit norm and is orthogonal to ; in other words, and the first coordinates of are all zero. Specifically, we set and the two centers in are two copies of for .
We first consider the following centers denoted by . As we have specified the centers for , we only describe the centers for each . Since by definition, , we can find a vector in such that has unit norm and is orthogonal to and all vectors in . Let be the set of points with each point in copied twice. We evaluate the cost of with respect to and .
Lemma 3.11.
For constructed above, we have and
[TABLE]
Proof:
Since is orthogonal to and has unit norm for all , it follows that
[TABLE]
On the other hand, the cost of w.r.t. is
[TABLE]
Recall is . For , the inner product is [math], and thus the total cost w.r.t. is
[TABLE]
which finishes the proof.
For notational convenience, we define . Since is an -coreset of , we have
[TABLE]
Next we consider a different set of centers denoted by . By Lemma 3.10, there exist unit vectors such that
[TABLE]
Applying this to all , we get the corresponding for all . Let be a set of centers in defined as follows: if , is with an additional [math]th coordinate with value , making them lie in ; for , we use the same centers as in , i.e., .
Lemma 3.12.
For constructed above, we have
[TABLE]
[TABLE]
Proof:
By (10),
[TABLE]
By Lemma 3.9 (with ), we have
[TABLE]
It follows that
[TABLE]
where in the inequality, we also used the orthogonality between and .
Since is an -coreset of , we have
[TABLE]
which implies
[TABLE]
By definition, , so
[TABLE]
and it follows that
[TABLE]
Finally we consider a third set of centers . Similarly, there are two centers per group. We set to be a power of in . Let be the -dimensional Hadamard basis vectors. So all ’s are vectors and . We slightly abuse notation and treat each as a -dimensional vector by concatenating zeros at the end. For each construct a set of centers as follows. For each , we still use two copies of . For , the [math]th coordinate of the two centers is , then we concatenate and respectively to the first and the second centers.
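For readers unfamiliar with Hadamard basis vectors, the sketch below builds them with the standard Sylvester construction and pads them with zeros to a larger dimension. The function name, the size 8, and the target dimension 11 are assumptions for illustration, and the rows are left unnormalized with entries +1 and -1. Any normalization used in the proof is not reproduced here.

```python
import numpy as np


def sylvester_hadamard(h):
    """Return the h x h Hadamard matrix (h a power of 2) via Sylvester's construction."""
    if h < 1 or (h & (h - 1)) != 0:
        raise ValueError("h must be a positive power of 2")
    H = np.array([[1]])
    while H.shape[0] < h:
        # Doubling step: [[H, H], [H, -H]] keeps the rows mutually orthogonal.
        H = np.block([[H, H], [H, -H]])
    return H


h = 8    # hypothetical size, a power of 2
d = 11   # hypothetical ambient dimension with d >= h
H = sylvester_hadamard(h)

# Rows are pairwise orthogonal with squared norm h.
gram = H @ H.T

# Treat each row as a d-dimensional vector by appending zeros.
H_padded = np.pad(H, ((0, 0), (0, d - h)))
print(H_padded.shape)
```

Zero padding changes neither the norms nor the pairwise inner products of the rows, so orthogonality carries over to the padded vectors.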
Lemma 3.13.
Suppose is constructed based on . Then for all , we have
[TABLE]
[TABLE]
Proof:
For , the cost of the two centers w.r.t. is
[TABLE]
For , the cost w.r.t. is by (7). Thus, the total cost over all subspaces is
[TABLE]
On the other hand, for , the cost w.r.t. is
[TABLE]
Here , where . For , the cost w.r.t. is by (8). Thus, the total cost w.r.t. is
[TABLE]
This finishes the proof.
Corollary 3.14.
Let be a -coreset of , and . Then
[TABLE]
Proof:
Since is an -coreset, we have by Lemma 3.13
[TABLE]
Note that the above inequality holds for all , then
[TABLE]
By the Cauchy-Schwarz inequality,
[TABLE]
Therefore, we have
[TABLE]
Combining the above corollary with (11), we have
[TABLE]
By the assumption , it holds that or . Moreover, since for each , we have .
4 Conclusion
This work studies coresets for the -Median problem in small dimensional Euclidean spaces. We give tight size bounds for -Median in and show that the framework in (Har-Peled and Kushal, 2005), with significant improvement, is optimal. For , we improve existing coreset upper bounds for -Median and prove new lower bounds.
Our work leaves several interesting problems for future research. One is to close the gap between the upper and lower bounds for . Another is to generalize our results to -Clustering for general . Note that the generalization is non-trivial even for , since the cost function is piecewise linear for -Median but piecewise polynomial of order for general -Clustering.
Appendix A Coreset Lower Bound for General -Median in
We prove the general case of Theorem 2.9 here.
Proof (of the general case of Theorem 2.9):
We first construct the hard instance . Let denote the hard instance we have constructed in the proof of Theorem 2.9. We take a large enough constant , take , and take . Here means .
The dataset is a union of copies of . These copies are far from each other. Thus the -Median problem on can be decomposed into a -Median problem on each copy. We prove the -Median lower bound by applying the argument for the -Median lower bound to every single copy and combining the results.
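The decomposition over far-apart copies can be checked numerically on a toy instance. In the sketch below, the base point set, the two centers per copy, the gap of 1000, and the number of copies are all hypothetical choices, not the hard instance of the proof. Because each copy contains its own centers and the copies are widely separated, every point is served by a center in its own copy and the total cost is the sum of the per-copy costs.

```python
import numpy as np


def kmedian_cost_1d(points, centers):
    """1D k-Median cost: sum of distances to the nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return float(np.abs(points[:, None] - centers[None, :]).min(axis=1).sum())


# Hypothetical base instance on [0, 1] with two centers.
base = np.array([0.0, 0.25, 0.5, 1.0])
base_centers = np.array([0.1, 0.8])

gap = 1000.0   # separation between consecutive copies (assumed large enough)
copies = 3

P = np.concatenate([base + i * gap for i in range(copies)])
C = np.concatenate([base_centers + i * gap for i in range(copies)])

total = kmedian_cost_1d(P, C)
per_copy = kmedian_cost_1d(base, base_centers)
print(total, copies * per_copy)
```

If some copy contained no center, its points would be served from a neighboring copy at a cost of order `gap`, which is why the argument only needs to consider center sets that cover every copy.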
We denote , where is the -th interval we constructed in the proof of the -Median case of Theorem 2.9. We denote , and denote the left and right endpoints of by and respectively. We have .
Now, assume that is an -coreset of such that . We show that this leads to a contradiction. Since , for at least half of the indices we have for some . Without loss of generality, we assume that these indices are . We define a parametrized query family as , where and
[TABLE]
Consider , a function of . Since is large enough, we have . The computation we have done in the proof of the -Median case of Theorem 2.9 implies that for each and
[TABLE]
Thus we have and .
It is easy to see that is affine linear since for . Since is an -coreset, we have . By Lemma 2.10, we must have , which leads to a contradiction.
Appendix B Proof of Theorem 3.8 for General
Using similar ideas from (Cohen-Addad et al., 2022), our proof of the lower bound for can be extended to arbitrary . First, we provide two lemmas analogous to Lemma 3.9 and Lemma 3.10 for general . Their proofs can be found in Appendix A of (Cohen-Addad et al., 2022).
Lemma B.1.
For any even number , let be arbitrary unit vectors in such that for each there exist some satisfying . We have
[TABLE]
Lemma B.2.
Let be a set of points in of size and be their weights. For arbitrary for each , there exist unit vectors satisfying , such that
[TABLE]
In this proof, the original point set and three sets of -centers, namely , are the same as for the case . The difference is that now and when constructing , we use Lemma B.2 in place of Lemma 3.10. Again, we compare the cost of and w.r.t. and get the following lemmas.
Lemma B.3.
For constructed above, we have and
[TABLE]
Proof:
Since is orthogonal to and has unit norm for all , it follows that
[TABLE]
On the other hand, the cost of w.r.t. is
[TABLE]
For , the inner product is [math], and thus the total cost w.r.t. is
[TABLE]
which finishes the proof.
For notational convenience, we define . Since is an -coreset of , we have
[TABLE]
Next we consider a different set of centers denoted by . By Lemma B.2, there exist unit vectors satisfying such that
[TABLE]
Applying this to all , we get the corresponding for all . Let be a set of centers in defined as follows: if , is with an additional [math]th coordinate with value , making them lie in ; for , we use the same centers as in , i.e., .
Lemma B.4.
For constructed above, we have
[TABLE]
[TABLE]
Proof:
By (15),
[TABLE]
By Lemma B.1 (with ), we have
[TABLE]
It follows that
[TABLE]
where in the inequality, we also used the orthogonality between and .
Since is an -coreset of , we have
[TABLE]
which implies
[TABLE]
By definition, , so
[TABLE]
and it follows that
[TABLE]
Finally we consider a third set of centers . Similarly, there are two centers per group. We set to be a power of in . Let be the -dimensional Hadamard basis vectors. So all ’s are vectors and . We slightly abuse notation and treat each as a -dimensional vector by concatenating zeros at the end. For each construct a set of centers as follows. For each , we still use two copies of . For , the [math]th coordinate of the two centers is , then we concatenate and respectively to the first and the second centers.
Lemma B.5.
Suppose is constructed based on . Then for all , we have
[TABLE]
[TABLE]
Proof:
For , the cost of the two centers w.r.t. is
[TABLE]
For , the cost w.r.t. is by (12). Thus, the total cost over all subspaces is
[TABLE]
On the other hand, for , the cost w.r.t. is
[TABLE]
Here , where . For , the total cost w.r.t. is . Thus, the total cost w.r.t. is
[TABLE]
This finishes the proof.
Corollary B.6.
Let be a -coreset of , and . Then
[TABLE]
Proof:
Since is an -coreset, we have by Lemma B.5
[TABLE]
Note that the above inequality holds for all , then
[TABLE]
By the Cauchy-Schwarz inequality,
[TABLE]
Therefore, we have
[TABLE]
Combining the above corollary with (16), we have
[TABLE]
which implies that
[TABLE]
So if we set , then
[TABLE]
By the assumption , it holds that or . Moreover, since for each , we have .
References
- Arnaboldi et al. [2012] Valerio Arnaboldi, Marco Conti, Andrea Passarella, and Fabio Pezzoni. Analysis of ego network structure in online social networks. 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 31–40, 2012.
- Arthur and Vassilvitskii [2007] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027–1035, 2007.
- Baker et al. [2020] Daniel N. Baker, Vladimir Braverman, Lingxiao Huang, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in graphs of bounded treewidth. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 569–579. PMLR, 2020.
- Banaszczyk [1998] Wojciech Banaszczyk. Balancing vectors and Gaussian measures of n-dimensional convex bodies. Random Struct. Algorithms, 12(4):351–360, 1998.
- Bansal et al. [2019] Nikhil Bansal, Daniel Dadush, Shashwat Garg, and Shachar Lovett. The Gram-Schmidt walk: A cure for the Banaszczyk blues. Theory Comput., 15:1–27, 2019.
- Braverman et al. [2019] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for ordered weighted clustering. In International Conference on Machine Learning, 2019.
- Braverman et al. [2021] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In SODA, pages 2679–2696. SIAM, 2021.
- Braverman et al. [2022] Vladimir Braverman, Vincent Cohen-Addad, Shaofeng Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022. IEEE Computer Society, 2022.
