Degrees of Freedom and Model Selection for k-means Clustering

David P. Hofmeyr

arXiv:1806.02034·stat.ML·February 24, 2020

TL;DR

This paper develops a new way to measure the effective degrees of freedom in k-means clustering, enabling better model selection through BIC, validated on simulated and real datasets.

Contribution

It introduces an extension of Stein's lemma to approximate the degrees of freedom in k-means, improving model selection accuracy.
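For context, the classical result being extended can be written as follows. This is the standard textbook form, stated under the assumption of Gaussian observations; the symbols ($y$, $\hat{y}$, $\mu$, $\sigma^2$) are our own notation and are not taken from the paper. The paper's extension, which is not reproduced here, is presumably needed because the k-means fitted values jump whenever a point changes cluster, so the smoothness condition below fails.

```latex
% Assumption: y ~ N(\mu, \sigma^2 I_n), with fitted values \hat{y} = \hat{y}(y).
% Effective degrees of freedom (covariance definition):
\mathrm{df}(\hat{y}) \;=\; \frac{1}{\sigma^{2}} \sum_{i=1}^{n} \operatorname{Cov}\!\left(\hat{y}_i,\, y_i\right)

% Stein's lemma: if \hat{y} is almost (weakly) differentiable in y, then
\mathrm{df}(\hat{y}) \;=\; \mathbb{E}\!\left[\, \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i} \,\right]
```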

Findings

01 The proposed degrees of freedom approximation agrees well with empirical studies.

02 In BIC-based model selection, the method is highly competitive with popular existing techniques for selecting high-quality clustering solutions.

03 Code implementing the approach is available as an R package.

Abstract

This paper investigates the model degrees of freedom in k-means clustering. An extension of Stein's lemma provides an expression for the effective degrees of freedom in the k-means model. Approximating the degrees of freedom in practice requires simplifications of this expression; however, empirical studies support the appropriateness of the proposed approach. The practical relevance of this new degrees of freedom formulation for k-means is demonstrated through model selection using the Bayesian Information Criterion (BIC). The reliability of this method is validated through experiments on simulated data as well as on a large collection of publicly available benchmark data sets from diverse application areas. Comparisons with popular existing techniques indicate that this approach is extremely competitive for selecting high-quality clustering solutions. Code to implement the proposed approach…
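The BIC-based selection described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's method and not the API of the edfkmeans package: the BIC form (spherical Gaussian with a common variance, additive constants dropped, penalty log(n) times df), the deterministic farthest-point initialisation, and every function name are our own. In particular, the degrees of freedom are supplied by a hypothetical `df_fn` callback, and the naive count k*d used in the example is only a placeholder for the paper's effective degrees of freedom, which is not reproduced here.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic initialisation (an assumption of this sketch):
    start at the first point, then repeatedly add the point farthest
    from the centres chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (centers, labels, sse)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


def gaussian_bic(sse, n, d, df):
    """BIC up to additive constants, assuming a spherical Gaussian model
    with common variance estimated by sse / (n * d). Requires sse > 0."""
    return n * d * math.log(sse / (n * d)) + math.log(n) * df


def select_k(points, k_values, df_fn):
    """Fit k-means for each k and return (k with lowest BIC, all scores).
    df_fn(k, d) is a hypothetical hook where a degrees of freedom
    estimate would be plugged in."""
    n, d = len(points), len(points[0])
    scores = {}
    for k in k_values:
        _, _, sse = kmeans(points, k)
        scores[k] = gaussian_bic(sse, n, d, df_fn(k, d))
    return min(scores, key=scores.get), scores


# Toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]


def naive_df(k, d):
    """Placeholder count of centroid coordinates; NOT the paper's measure."""
    return k * d


best_k, scores = select_k(points, [1, 2], naive_df)
print(best_k, scores)
```

On this toy data the sketch prefers two clusters over one. A count such as k*d ignores the data-dependent cluster assignments, which is presumably the kind of shortfall the paper's effective degrees of freedom is meant to address; swapping in a different estimate only requires passing another `df_fn`.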

Code & Models

Repositories

DavidHofmeyr/edfkmeans (Official)

Taxonomy

Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications