Fast Clustering of Categorical Big Data

Bipana Thapaliya; Yu Zhuang

arXiv:2502.07081·cs.LG·February 18, 2025

TL;DR

This paper introduces BK-Modes, a bisecting approach to improve initial cluster centers for K-Modes, resulting in better clustering quality and efficiency for large categorical datasets.

Contribution

The paper proposes BK-Modes, a novel bisecting method for selecting initial centers in K-Modes, enhancing clustering performance on big data.

Findings

01. BK-Modes outperforms existing methods in clustering quality.

02. BK-Modes is more efficient for large datasets.

03. Experimental results show improved performance in both quality and speed.

Abstract

The K-Modes algorithm, developed for clustering categorical data, is algorithmically simple but suffers from unreliable clustering quality and efficiency, both of which are heavily influenced by the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process for finding clusters, and examine how well the cluster centers it produces serve as initial centers for K-Modes. BK-Modes splits a dataset into multiple clusters iteratively, choosing one cluster and bisecting it into two in each iteration. The cluster to bisect is chosen using the sum of distances from the data points to their cluster centers as the selection metric. The process stops when K clusters have been produced. The centers of these K clusters are then used as the…
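
The bisecting procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which distance is used or how each bisection is seeded, so the sketch assumes the simple matching (Hamming) dissimilarity commonly used with K-Modes, seeds each split with a random point and the point farthest from it, and uses hypothetical helper names (`hamming_to_center`, `column_modes`, `two_modes`, `bisecting_k_modes_centers`).

```python
import numpy as np


def hamming_to_center(X, center):
    """Simple matching dissimilarity: count of attributes differing from center."""
    return (X != center).sum(axis=1)


def column_modes(X):
    """Most frequent value in each column (ties go to the smallest value)."""
    modes = np.empty(X.shape[1], dtype=X.dtype)
    for j in range(X.shape[1]):
        values, counts = np.unique(X[:, j], return_counts=True)
        modes[j] = values[np.argmax(counts)]
    return modes


def two_modes(X, rng, max_iter=20):
    """Bisect X with a plain 2-Modes loop; returns a 0/1 label per row."""
    # Assumed seeding: a random row and the row farthest from it.
    first = X[rng.integers(len(X))]
    second = X[np.argmax(hamming_to_center(X, first))]
    centers = np.stack([first, second])
    labels = None
    for _ in range(max_iter):
        dists = np.stack([hamming_to_center(X, c) for c in centers], axis=1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for side in (0, 1):
            if np.any(labels == side):
                centers[side] = column_modes(X[labels == side])
    return labels


def bisecting_k_modes_centers(X, k, seed=0):
    """Return k cluster centers (modes) found by successive bisection."""
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of rows")
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]  # each cluster is an array of row indices
    while len(clusters) < k:
        # Selection metric from the abstract: sum of distances to the center.
        # Singleton clusters cannot be split, so they get a cost of -1.
        costs = [
            hamming_to_center(X[idx], column_modes(X[idx])).sum()
            if len(idx) > 1 else -1
            for idx in clusters
        ]
        idx = clusters.pop(int(np.argmax(costs)))
        labels = two_modes(X[idx], rng)
        left, right = idx[labels == 0], idx[labels == 1]
        if len(left) == 0 or len(right) == 0:
            # Assumed fallback for clusters of identical rows: peel off one row.
            left, right = idx[:1], idx[1:]
        clusters.extend([left, right])
    return np.stack([column_modes(X[idx]) for idx in clusters])
```

The returned array has one row per cluster and would then be handed to a K-Modes routine as its initial centers, which is the use the abstract describes.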

Taxonomy

Topics: Advanced Clustering Algorithms Research