Imbalanced Data Clustering using Equilibrium K-Means

Yudong He

arXiv:2402.14490 · cs.LG · June 7, 2024 · 5 cites

TL;DR

This paper introduces equilibrium K-means (EKM), a novel clustering algorithm designed to effectively handle imbalanced data by incorporating a centroid repulsion mechanism, outperforming existing methods on real-world datasets.

Contribution

The paper proposes a new objective function and algorithm, EKM, that mitigates large cluster bias in centroid-based clustering, with theoretical reformulation and extensive empirical evaluation.

Findings

01

EKM outperforms benchmark algorithms on imbalanced datasets.

02

EKM is scalable and computationally efficient.

03

Reformulation of K-means algorithms facilitates unified analysis.

Abstract

Centroid-based clustering algorithms, such as hard K-means (HKM) and fuzzy K-means (FKM), suffer from a learning bias towards large clusters. Their centroids tend to crowd into large clusters, which compromises performance when the true underlying data groups vary in size (i.e., when the data are imbalanced). To address this, we propose a new clustering objective function based on the Boltzmann operator. It introduces a novel centroid repulsion mechanism in which the data points surrounding a centroid repel the other centroids. Larger clusters repel more strongly, which mitigates the large-cluster learning bias. The proposed algorithm, called equilibrium K-means (EKM), is simple, alternating between two steps; resource-saving, with the same time and space complexity as FKM; and scalable to large datasets via batch learning. We extensively evaluate the performance of EKM on synthetic and…
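
The abstract names the Boltzmann operator and a two-step alternation but does not spell out the objective, so the following is only a minimal sketch under stated assumptions, not the paper's reference implementation. It assumes the objective sums, over data points, the Boltzmann operator applied to half the squared distances to all centroids, and that each centroid is updated to a weighted mean whose weights are the operator's partial derivatives. The function names `boltzmann`, `ekm_weights`, and `ekm_step`, and the parameter `alpha`, are hypothetical.

```python
import math

def boltzmann(values, alpha):
    """Boltzmann operator: average of `values` weighted by softmax(-alpha * values).

    For alpha > 0 it acts as a smooth minimum; alpha = 0 gives the plain mean.
    Returns the operator value and the softmax weights.
    """
    m = min(values)  # shift by the minimum for numerical stability
    exps = [math.exp(-alpha * (v - m)) for v in values]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * v for p, v in zip(probs, values)), probs

def ekm_weights(dists, alpha):
    """Partial derivatives of the Boltzmann operator w.r.t. each distance.

    ASSUMPTION: these derivatives serve as the per-centroid weights of a point.
    A weight turns negative when alpha * (d_k - operator value) exceeds 1.
    """
    b, probs = boltzmann(dists, alpha)
    return [p * (1.0 - alpha * (d - b)) for p, d in zip(probs, dists)]

def ekm_step(points, centroids, alpha):
    """One alternation: compute weights, then move centroids to weighted means."""
    dim = len(points[0])
    k = len(centroids)
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for x in points:
        # Half squared Euclidean distance from x to every centroid.
        dists = [0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                 for c in centroids]
        w = ekm_weights(dists, alpha)
        for j in range(k):
            den[j] += w[j]
            for t in range(dim):
                num[j][t] += w[j] * x[t]
    # Keep a centroid in place if its total weight is exactly zero.
    return [
        [num[j][t] / den[j] for t in range(dim)] if den[j] != 0 else list(centroids[j])
        for j in range(k)
    ]
```

Under these assumptions, a point's weights over all centroids always sum to one, but the weight for a sufficiently distant centroid is negative, which pushes that centroid away from the point. This is one way to read the "repulsion" described in the abstract: the more points a cluster holds, the more negative weight it contributes to other centroids.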

Taxonomy

Topics: Imbalanced Data Classification Techniques · Spam and Phishing Detection · Text and Document Classification Technologies