Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Duy-Tai Dinh, Tsutomu Fujinami, and Van-Nam Huynh

arXiv:2501.15542·cs.LG·January 28, 2025

TL;DR

This paper introduces k-SCC, an algorithm that estimates the optimal number of clusters in categorical data by combining kernel density estimation for cluster centers, an information-theoretic dissimilarity measure, and silhouette analysis. In experiments on synthetic and real datasets, it outperforms the three algorithms it was compared against.

Contribution

The paper presents a novel clustering algorithm, k-SCC, that accurately determines the optimal number of clusters in categorical data using a combination of density estimation and silhouette-based evaluation.

Findings

01. k-SCC outperforms other algorithms in cluster number estimation

02. Experimental results on synthetic and real datasets validate effectiveness

03. The method improves clustering quality in categorical data contexts

Abstract

Estimating the number of clusters (say k) is one of the major challenges in partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. For the clustering step, the algorithm uses kernel density estimation to define cluster centers. In addition, it uses an information-theoretic dissimilarity to measure the distance between centers and objects in each cluster. Silhouette analysis is then used to evaluate the quality of the clusterings obtained in the former step for different values of k and to choose the best k. Comparative experiments were conducted on both synthetic and real datasets to compare the performance of k-SCC with three other algorithms. Experimental results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
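The silhouette-based selection of k described in the abstract can be illustrated with a minimal sketch. This is not the authors' k-SCC implementation: the kernel-density cluster centers are replaced by plain per-attribute modes (a k-modes-style stand-in), and the information-theoretic dissimilarity is replaced by simple Hamming distance. The function names (`hamming`, `farthest_first`, `k_modes`, `mean_silhouette`, `estimate_k`), the deterministic seeding, and the toy dataset are all assumptions made for illustration.

```python
from collections import Counter


def hamming(x, y):
    """Fraction of attributes on which two categorical objects differ.
    Assumption: a stand-in for the paper's information-theoretic dissimilarity."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def farthest_first(data, k):
    """Deterministic seeding: start at data[0], then repeatedly add the
    object that is farthest from all centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda x: min(hamming(x, c) for c in centers)))
    return centers


def k_modes(data, k, n_iter=20):
    """k-modes-style clustering. Assumption: modes replace the
    kernel-density-based centers used by k-SCC."""
    centers = farthest_first(data, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: hamming(x, centers[c])) for x in data]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
    return labels


def mean_silhouette(data, labels):
    """Mean silhouette coefficient; objects in singleton clusters score 0."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    total = 0.0
    for i, x in enumerate(data):
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, y in enumerate(data):
            if i != j:
                sums[labels[j]] += hamming(x, y)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            continue  # singleton cluster contributes 0
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(data)


def estimate_k(data, k_max):
    """Cluster for each k in 2..k_max and keep the k with the best mean silhouette."""
    scores = {k: mean_silhouette(data, k_modes(data, k)) for k in range(2, k_max + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (hypothetical): three groups of four-attribute objects with a little noise.
data = (
    [("a", "a", "a", "a")] * 4 + [("a", "a", "a", "b"), ("a", "a", "c", "a")]
    + [("b", "b", "b", "b")] * 4 + [("b", "b", "b", "c"), ("b", "b", "a", "b")]
    + [("c", "c", "c", "c")] * 4 + [("c", "c", "c", "a"), ("c", "c", "b", "c")]
)

best_k, scores = estimate_k(data, k_max=6)
print(best_k, scores)
```

The structure mirrors the two stages named in the abstract: a clustering step run once per candidate k, followed by a silhouette-based evaluation that picks the k with the highest score. Swapping `hamming` or the mode update for other definitions would not change the selection loop in `estimate_k`.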
