An Efficient $k$-modes Algorithm for Clustering Categorical Datasets

Karin S. Dorman; Ranjan Maitra

arXiv:2006.03936·stat.ME·August 24, 2021

TL;DR

This paper introduces OTQT, a novel and efficient implementation of the $k$-modes algorithm for clustering categorical data, which improves accuracy and often reduces overall computation time compared to existing methods.

Contribution

The paper presents OTQT, a new $k$-modes algorithm that provably finds objective-improving updates that existing $k$-modes algorithms cannot detect, which improves accuracy per iteration and, on most datasets, the time needed to reach the final optimum.

Findings

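1. OTQT provably finds updates that improve the objective function and that existing $k$-modes algorithms cannot detect.

2. OTQT is always more accurate per iteration, although each iteration is slightly slower.

3. OTQT almost always reaches the final optimum faster, and is only barely slower on the remaining datasets.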

Abstract

Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes…
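
To make the objective described in the abstract concrete, the following is a minimal Python sketch of the textbook alternating (Lloyd-style) $k$-modes iteration: assign each record to the nearest mode under Hamming distance, then recompute each cluster's coordinate-wise mode. This is not OTQT, whose update rule is not described in this summary. The function names (`hamming`, `mode_of`, `objective`, `lloyd_k_modes`) and the tie-breaking rule are illustrative assumptions.

```python
from collections import Counter


def hamming(x, y):
    """Count the coordinates at which two equal-length categorical records differ."""
    return sum(a != b for a, b in zip(x, y))


def mode_of(records):
    """Coordinate-wise mode of a non-empty list of records.

    Assumption: ties are broken in favor of the value seen first.
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def objective(data, modes, labels):
    """k-modes objective: total Hamming distance from each record to its cluster mode."""
    return sum(hamming(x, modes[c]) for x, c in zip(data, labels))


def lloyd_k_modes(data, initial_modes, max_iter=100):
    """Alternate assignment and mode updates until the labels stop changing.

    Illustrative baseline only; this is NOT the OTQT algorithm.
    """
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under Hamming distance.
        new_labels = [
            min(range(len(modes)), key=lambda c: hamming(x, modes[c]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute the mode of each non-empty cluster.
        for c in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                modes[c] = mode_of(members)
    return modes, labels, objective(data, modes, labels)


data = [("a", "x"), ("a", "x"), ("a", "y"),
        ("b", "z"), ("b", "z"), ("c", "z")]
modes, labels, cost = lloyd_k_modes(data, [("a", "x"), ("b", "z")])
print(modes, labels, cost)
```

On this toy dataset the two clusters are recovered with an objective value of 2: one mismatch from `("a", "y")` and one from `("c", "z")`.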
