An Efficient $k$-modes Algorithm for Clustering Categorical Datasets
Karin S. Dorman, Ranjan Maitra

TL;DR
This paper introduces OTQT, a novel and efficient implementation of the $k$-modes algorithm for clustering categorical data, which improves accuracy and often reduces overall computation time compared to existing methods.
Contribution
The paper presents OTQT, a new $k$-modes algorithm that provably finds objective-improving updates that existing $k$-modes algorithms cannot detect, which improves per-iteration accuracy and, on most datasets, shortens the time to the final optimum.
Findings
OTQT consistently finds better objective function updates.
OTQT achieves higher accuracy per iteration.
OTQT is often faster to reach the final optimum.
Abstract
Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes…
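To make the objective described in the abstract concrete, here is a minimal Python sketch of the generic $k$-modes cost: the total Hamming distance from each observation to the coordinate-wise mode of its assigned cluster. It is not the OTQT algorithm, whose update rules are not given on this page. The function names (`hamming`, `cluster_modes`, `kmodes_objective`) and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def hamming(a, b):
    """Count the coordinates at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_modes(data, labels, k):
    """Return the coordinate-wise mode of each cluster.

    Illustrative helper (not from the paper). Assumes every cluster
    0..k-1 has at least one member; ties between equally frequent
    categories are broken arbitrarily.
    """
    modes = []
    for c in range(k):
        members = [row for row, lab in zip(data, labels) if lab == c]
        # zip(*members) iterates over columns; take the most frequent value in each.
        modes.append(tuple(Counter(col).most_common(1)[0][0] for col in zip(*members)))
    return modes


def kmodes_objective(data, labels, modes):
    """Sum of Hamming distances from each record to its cluster's mode."""
    return sum(hamming(row, modes[lab]) for row, lab in zip(data, labels))


# Hypothetical toy dataset: five records with three categorical coordinates.
data = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "q"),
    ("b", "y", "p"),
    ("b", "x", "q"),
]
labels = [0, 0, 1, 1, 1]

modes = cluster_modes(data, labels, k=2)
cost = kmodes_objective(data, labels, modes)
print(modes, cost)
```

A $k$-modes algorithm searches over `labels` (and the induced modes) to lower this cost; per the abstract, OTQT differs from existing algorithms in which cost-lowering updates it is able to find.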
