# Clustering Small Samples with Quality Guarantees: Adaptivity with One2all pps

**Authors:** Edith Cohen, Shiri Chechik, Haim Kaplan

arXiv: 1706.03607 · 2017-10-31

## TL;DR

This paper introduces adaptive designs for building small samples of large data sets in relaxed metric spaces. The samples come with statistical quality guarantees for cost estimation and clustering, and can be much smaller than those required by worst-case constructions.

## Contribution

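It presents one2all, a construction of multi-objective probability-proportional-to-size (pps) samples, and an adaptive clustering wrapper that applies a base clustering algorithm to the smallest sample whose clustering quality is statistically guaranteed to carry over to the full data set.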

## Key findings

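- Given centroids $M$ and $\alpha \geq 1$, a one2all sample of size $O(\alpha |M|\epsilon^{-2})$ suffices to estimate the cost of each $Q$ with $V(Q) \geq V(M)/\alpha$; adaptively balancing $|M|$ and $\alpha$ reduces the sample size further than the worst-case $O(k\epsilon^{-2})$.
- The clustering wrapper uses the smallest sample that provides statistical guarantees that clustering quality on the sample carries over to the full data set.
- Experiments show large gains from the adaptive methods over worst-case methods.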

## Abstract

Clustering of data points is a fundamental tool in data analysis. We consider points $X$ in a relaxed metric space, where the triangle inequality holds within a constant factor. The *cost* of clustering $X$ by $Q$ is $V(Q)=\sum_{x\in X} d_{xQ}$. Two basic tasks, parametrized by $k \geq 1$, are *cost estimation*, which returns (approximate) $V(Q)$ for queries $Q$ such that $|Q|=k$ and *clustering*, which returns an (approximate) minimizer of $V(Q)$ of size $|Q|=k$. With very large data sets $X$, we seek efficient constructions of small samples that act as surrogates to the full data for performing these tasks. Existing constructions that provide quality guarantees are either worst-case, and unable to benefit from structure of real data sets, or make explicit strong assumptions on the structure. We show here how to avoid both these pitfalls using adaptive designs.

At the core of our design is the *one2all* construction of multi-objective probability-proportional-to-size (pps) samples: Given a set $M$ of centroids and $\alpha \geq 1$, one2all efficiently assigns probabilities to points so that the clustering cost of *each* $Q$ with cost $V(Q) \geq V(M)/\alpha$ can be estimated well from a sample of size $O(\alpha |M|\epsilon^{-2})$. For cost queries, we can obtain worst-case sample size $O(k\epsilon^{-2})$ by applying one2all to a bicriteria approximation $M$, but we adaptively balance $|M|$ and $\alpha$ to further reduce sample size. For clustering, we design an adaptive wrapper that applies a base clustering algorithm to a sample $S$. Our wrapper uses the smallest sample that provides statistical guarantees that the quality of the clustering on the sample carries over to the full data set. We demonstrate experimentally the huge gains of using our adaptive instead of worst-case methods.
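To make the cost $V(Q)$ and the idea of estimating it from a weighted sample concrete, here is a minimal Python sketch. It is not the paper's one2all construction, whose probability assignment is not given in this summary. As assumptions for illustration only, it uses squared Euclidean distance for $d_{xQ}$ (a relaxed metric, since it satisfies the triangle inequality up to a factor of 2), and inclusion probabilities that mix a pps term based on $d_{xM}$ with a uniform term so that every point has a positive probability. All function names are hypothetical.

```python
import random


def dist_to_set(x, Q):
    """d_{xQ}: squared Euclidean distance from x to its nearest point in Q.

    Squared Euclidean is an assumption made for this sketch.
    """
    return min(sum((a - b) ** 2 for a, b in zip(x, q)) for q in Q)


def cost(X, Q):
    """Exact clustering cost V(Q) = sum over x in X of d_{xQ}."""
    return sum(dist_to_set(x, Q) for x in X)


def inclusion_probabilities(X, M, sample_size):
    """Illustrative probabilities: half pps on d_{xM}, half uniform, capped at 1.

    This mixture is an assumption of the sketch, not the one2all rule.
    """
    n = len(X)
    weights = [dist_to_set(x, M) for x in X]
    total = sum(weights)
    probs = []
    for w in weights:
        pps_share = w / total if total > 0 else 0.0
        probs.append(min(1.0, sample_size * (0.5 * pps_share + 0.5 / n)))
    return probs


def draw_sample(X, probs, rng):
    """Poisson sampling: include each point independently with its probability."""
    return [(x, p) for x, p in zip(X, probs) if rng.random() < p]


def estimate_cost(sample, Q):
    """Inverse-probability estimate of V(Q) from (point, probability) pairs."""
    return sum(dist_to_set(x, Q) / p for x, p in sample)
```

Because every point has a positive inclusion probability, the inverse-probability estimate is unbiased for any query $Q$. How tightly it concentrates around $V(Q)$ depends on how the probabilities are assigned, which is the part the one2all construction addresses.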

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1706.03607/full.md

## Figures

36 figures with captions in the complete paper: https://tomesphere.com/paper/1706.03607/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1706.03607/full.md

---
Source: https://tomesphere.com/paper/1706.03607