# On the k-Means/Median Cost Function

**Authors:** Anup Bhattacharya, Yoav Freund, Ragesh Jaiswal

arXiv: 1704.05232 · 2021-09-10

## TL;DR

This paper investigates the behavior of the $k$-means cost function, providing bounds on how the optimal cost decreases as the number of centers increases, and extends these insights to metric $k$-median problems.

## Contribution

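The paper gives upper and lower bounds on the smallest number of centers whose optimal $k$-means cost is at most an $\varepsilon$ fraction of the optimal cost with $k$ centers. It generalizes these bounds to the metric $k$-median problem in terms of the doubling dimension of the metric, and observes that $D^2$-sampling can construct center sets of comparable size.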

## Key findings

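- Upper and lower bounds on $L(X, k, \varepsilon)$, the smallest number of centers whose optimal cost is at most $\varepsilon \cdot \Delta(X,k)$.
- Extension of these bounds to the metric $k$-median problem in arbitrary metric spaces, stated in terms of the doubling dimension of the metric.
- $D^2$-sampling computes a set $S$ of size $O \left(L(X, k, \varepsilon/c) \right)$ with $\Phi(S,X) \leq \varepsilon \cdot \Delta(X,k)$, for some fixed constant $c$.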
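
As a rough illustration of the $D^2$-sampling step mentioned in the last finding, the sketch below picks centers one at a time, each with probability proportional to its squared distance from the nearest center chosen so far (the seeding rule known from k-means++). It is not the paper's procedure or analysis: the function names `sq_dist`, `cost`, and `d2_sample` are hypothetical, the candidate centers are assumed to be drawn from the data points themselves, and the code says nothing about how many samples are needed to reach the stated guarantee.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(centers, points):
    """Phi(C, X): sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def d2_sample(points, m, rng):
    """Pick up to m centers from `points` by D^2-sampling (illustrative sketch).

    Assumption: centers are chosen among the data points; `rng` is a
    random.Random instance so that runs are reproducible.
    """
    centers = [rng.choice(points)]  # first center: uniform at random
    # d2[i] holds the squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < m:
        if sum(d2) == 0:
            break  # every point already coincides with a chosen center
        # draw the next center with probability proportional to d2
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(x, nxt)) for old, x in zip(d2, points)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    chosen = d2_sample(data, 2, random.Random(0))
    print(chosen, cost(chosen, data))
```

Because an already-chosen point has weight zero, it is never drawn again, so asking for as many centers as there are distinct points drives the cost to zero.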

## Abstract

In this work, we study the $k$-means cost function. Given a dataset $X \subseteq \mathbb{R}^d$ and an integer $k$, the goal of the Euclidean $k$-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} ||x - c||^2$ is minimized. Let $\Delta(X,k) \equiv \min_{C \subseteq \mathbb{R}^d,\, |C| = k} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. For any dataset $X$, $\Delta(X,k)$ decreases as $k$ increases. In this work, we try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and a precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X,k)$. We show upper and lower bounds on this quantity. Our techniques generalize to the metric $k$-median problem in arbitrary metric spaces and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O \left(L(X, k, \varepsilon/c) \right)$ using $D^2$-sampling such that $\Phi(S,X) \leq \varepsilon \cdot \Delta(X,k)$ for some fixed constant $c$. We also discuss some applications of our bounds.
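
To make the definitions of $\Delta(X,k)$ and $L(X, k, \varepsilon)$ concrete, here is a small sketch restricted to one-dimensional data, where an optimal clustering can be taken to consist of contiguous runs of the sorted points, so a simple dynamic program computes $\Delta(X,k)$ exactly. This restriction, the dynamic program, and the names `opt_cost_1d` and `smallest_L` are assumptions made for illustration and do not come from the paper, whose bounds concern general $\mathbb{R}^d$ and metric spaces.

```python
def opt_cost_1d(xs, k):
    """Exact Delta(X, k) for 1-D data via dynamic programming (illustrative)."""
    xs = sorted(xs)
    n = len(xs)
    k = min(k, n)  # more centers than points cannot help further
    # prefix sums of x and x^2 give each segment's cost in O(1)
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def seg(i, j):
        """Cost of xs[i:j] as one cluster centered at its mean."""
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / (j - i))

    inf = float("inf")
    # best[c][j] = optimal cost of the first j points split into c clusters
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            best[c][j] = min(best[c - 1][i] + seg(i, j) for i in range(c - 1, j))
    return best[k][n]


def smallest_L(xs, k, eps):
    """Smallest m with Delta(X, m) <= eps * Delta(X, k), by direct search."""
    target = eps * opt_cost_1d(xs, k)
    for m in range(1, len(xs) + 1):
        if opt_cost_1d(xs, m) <= target:
            return m
    return len(xs)  # one center per point always gives cost 0


if __name__ == "__main__":
    X = [0.0, 1.0, 10.0, 11.0]
    print([opt_cost_1d(X, k) for k in range(1, 5)])
    print(smallest_L(X, 1, 0.01))
```

For the toy dataset $X = \{0, 1, 10, 11\}$, the optimal costs for $k = 1, 2, 3, 4$ are $101$, $1$, $0.5$, and $0$, so $L(X, 1, 0.01) = 2$: two centers already bring the cost below one percent of the single-center cost.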

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.05232/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/1704.05232/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/1704.05232/full.md

---
Source: https://tomesphere.com/paper/1704.05232