# Scalable K-Medoids via True Error Bound and Familywise Bandits

**Authors:** Aravindakshan Babu, Saurabh Agarwal, Sudarshan Babu, Hariharan Chandrasekaran

arXiv: 1905.10979 · 2019-10-31

## TL;DR

This paper introduces a new theoretical framework for true K-Medoid error, proposes a scalable distributed algorithm MCPAM, and demonstrates its efficiency and accuracy on large semi-metric datasets, including one with 1 billion points.

## Contribution

The paper formalizes the true K-Medoid error, provides a convergence analysis for minimum mean estimation (MME), and develops MCPAM, a scalable distributed algorithm for large-scale clustering.

## Key findings

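- MCPAM has expected runtime O(km) for k medoids and m samples, giving large computational savings for a small tradeoff in accuracy.
- The MME error decreases no slower than Θ(1/n^{2/3}), where n is a measure of sample size.
- MCPAM computes the K-Medoids of 1 billion points on semi-metric spaces, which the authors describe as previously unachieved.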
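
To make the stated rate concrete, here is a small worked calculation. It assumes, purely for illustration, that the error behaves exactly like $C\,n^{-2/3}$ for some constant $C$. The source only says the error decreases no slower than this rate, and $C$ is a hypothetical constant, not a quantity from the paper.

$$
\frac{C\,(8n)^{-2/3}}{C\,n^{-2/3}} = 8^{-2/3} = \frac{1}{4}
$$

Under that assumption, an eightfold increase in $n$ cuts the error by a factor of four, and halving the error takes about $2^{3/2} \approx 2.83$ times as many samples.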

## Abstract

K-Medoids (KM) is a standard clustering method, used extensively on semi-metric data. Error analyses of KM have traditionally used an in-sample notion of error, which can be far from the true error and suffer from a generalization gap. We formalize the true K-Medoid error based on the underlying data distribution. We decompose the true error into two fundamental statistical problems: minimum estimation (ME) and minimum mean estimation (MME). We provide a convergence result for MME. We show that the MME error $\mathrm{err}_{\mathrm{MME}}$ decreases no slower than $\Theta(\frac{1}{n^{\frac{2}{3}}})$, where $n$ is a measure of sample size. Inspired by this bound, we propose a computationally efficient, distributed KM algorithm, namely MCPAM. MCPAM has expected runtime $\mathcal{O}(km)$, where $k$ is the number of medoids and $m$ is the number of samples. MCPAM provides massive computational savings for a small tradeoff in accuracy. We verify the quality and scaling properties of MCPAM on various datasets, and achieve the hitherto unachieved feat of calculating the KM of 1 billion points on semi-metric spaces.
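
The abstract frames medoid selection as a minimum mean estimation problem: pick the candidate whose mean distance to the data is smallest. The sketch below illustrates that idea for a single medoid, comparing an exact search with one that estimates each mean from a random reference sample. It is not MCPAM and uses no bandit logic. The function names (`mean_distance`, `exact_medoid`, `sampled_medoid`), the `n_ref` parameter, and the absolute-difference distance are all assumptions made for illustration.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Dist = Callable[[T, T], float]


def mean_distance(candidate: T, points: Sequence[T], dist: Dist) -> float:
    """Average distance from one candidate to every point in `points`."""
    return sum(dist(candidate, p) for p in points) / len(points)


def exact_medoid(points: Sequence[T], dist: Dist) -> T:
    """Point with the smallest mean distance to all points (m^2 distance calls)."""
    return min(points, key=lambda c: mean_distance(c, points, dist))


def sampled_medoid(points: Sequence[T], dist: Dist, n_ref: int,
                   rng: random.Random) -> T:
    """Estimate each candidate's mean distance from a random reference
    sample of size n_ref (hypothetical parameter), then take the minimum.
    Uses m * n_ref distance calls instead of m^2."""
    reference = rng.sample(list(points), n_ref)
    return min(points, key=lambda c: mean_distance(c, reference, dist))


def abs_diff(a: float, b: float) -> float:
    """Illustrative distance on the real line; any semi-metric could be used."""
    return abs(a - b)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(500)]
    print("exact medoid:  ", exact_medoid(data, abs_diff))
    print("sampled medoid:", sampled_medoid(data, abs_diff, 50, rng))
```

With `n_ref` equal to the number of points, the reference sample is just a reordering of the data, so the sampled search considers the same mean distances as the exact one. Smaller values of `n_ref` trade accuracy for fewer distance evaluations, which is the kind of tradeoff the abstract describes.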

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.10979/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1905.10979/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/1905.10979/full.md

---
Source: https://tomesphere.com/paper/1905.10979