Explainable $k$-Means and $k$-Medians Clustering

Sanjoy Dasgupta; Nave Frost; Michal Moshkovitz; Cyrus Rashtchian

arXiv:2002.12538 · cs.LG · September 23, 2020 · 20 cites

TL;DR

This paper explores the use of decision trees to create interpretable clustering algorithms for geometric data, providing theoretical insights and algorithms with provable approximation guarantees for $k$-means and $k$-medians objectives.

Contribution

It introduces a novel approach to explainable clustering using decision trees, analyzes limitations of existing methods, and proposes algorithms with provable approximation bounds.

Findings

01

Popular decision tree algorithms may produce high-cost clusterings.

02

Any tree-induced clustering must incur an $\Omega(\log k)$ approximation factor in the worst case, relative to the best unconstrained clustering.

03

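For $k = 2$, a single threshold cut achieves a constant-factor approximation; for general $k$, the proposed algorithm is an $O(k)$ approximation for $k$-medians and an $O(k^2)$ approximation for $k$-means.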

Abstract

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in…
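To make the idea of a tree-induced clustering concrete, here is a minimal Python sketch for the simplest case, $k = 2$, where the tree is a single axis-aligned threshold cut. It is an illustration only, not the authors' algorithm: it brute-forces every candidate cut and keeps the one with the lowest $k$-means cost. The function names `kmeans_cost` and `best_threshold_cut` and the toy data set are assumptions made for this example.

```python
def kmeans_cost(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        cost += sum((p[j] - mean[j]) ** 2 for p in members for j in range(dim))
    return cost


def best_threshold_cut(points):
    """Brute-force search over cuts of the form x[feature] <= threshold.

    Hypothetical helper for illustration; it is NOT the algorithm from the
    paper. Returns (cost, feature, threshold, labels) for the cheapest cut.
    """
    best = None
    dim = len(points[0])
    for feature in range(dim):
        # Candidate thresholds: midpoints between consecutive distinct values.
        values = sorted({p[feature] for p in points})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            labels = [0 if p[feature] <= threshold else 1 for p in points]
            cost = kmeans_cost(points, labels)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, labels)
    return best


# Toy data (made up for this example): two well-separated groups along feature 0.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
cost, feature, threshold, labels = best_threshold_cut(points)
print(f"cluster 0 is 'x[{feature}] <= {threshold}', 2-means cost = {cost:.4f}")
```

Each resulting cluster is described by one inequality on one feature, which is the sense in which the clustering is explainable. On data where the natural clusters are not separable along a single axis, the same cut can cost much more than the unconstrained optimum, which is the gap the paper's approximation bounds quantify.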

Code & Models

Videos

Explainable k-Means and k-Medians Clustering · slideslive

Taxonomy

Topics: Rough Sets and Fuzzy Logic · Statistical Methods and Inference · Bayesian Modeling and Causal Inference