Non-parametric Power-law Data Clustering
Xuhui Fan, Yiling Zeng, Longbing Cao

TL;DR
This paper introduces pyp-means, a novel clustering algorithm based on the Pitman-Yor Process, designed to automatically determine the number of clusters in power-law distributed data, improving over previous Bayesian nonparametric methods.
Contribution
The paper proposes pyp-means, which adaptively adjusts thresholds and merges clusters to effectively handle power-law data and determine cluster numbers automatically.
Findings
Pyp-means effectively models power-law data distributions.
It dynamically adjusts thresholds for cluster formation.
Experimental results validate its advantages over traditional methods.
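The threshold and merging behavior listed above can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the threshold schedule `lam - theta * log(k)`, and the distance-based merge rule are all assumptions chosen for illustration, since this summary does not state the actual formulas. It only shows the general shape of a k-means-style loop in which the threshold for opening a new cluster depends on the current cluster count and nearby centers are merged.

```python
import numpy as np


def adaptive_threshold(lam, theta, k):
    # Assumed schedule (hypothetical, not taken from the paper): the squared
    # distance needed to open a new cluster shrinks as the cluster count k grows.
    return max(lam - theta * np.log(k), 1e-12)


def threshold_kmeans(X, lam, theta, merge_dist, n_iter=20):
    """k-means-style clustering with a cluster-count-dependent threshold."""
    X = np.asarray(X, dtype=float)
    centers = [X.mean(axis=0)]          # start from one global center
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment: join the nearest center, or open a new cluster when the
        # nearest center is farther than the current threshold.
        for i, x in enumerate(X):
            d2 = np.array([np.sum((x - c) ** 2) for c in centers])
            j = int(np.argmin(d2))
            if d2[j] > adaptive_threshold(lam, theta, len(centers)):
                centers.append(x.copy())
                j = len(centers) - 1
            labels[i] = j
        # Update: recompute the means and drop clusters that ended up empty.
        used = np.unique(labels)
        centers = [X[labels == j].mean(axis=0) for j in used]
        labels = np.searchsorted(used, labels)
        # Merge (assumed rule): fuse any two centers closer than merge_dist.
        merged = True
        while merged and len(centers) > 1:
            merged = False
            for a in range(len(centers)):
                for b in range(a + 1, len(centers)):
                    if np.linalg.norm(centers[a] - centers[b]) < merge_dist:
                        labels[labels == b] = a
                        labels[labels > b] -= 1
                        centers.pop(b)
                        centers[a] = X[labels == a].mean(axis=0)
                        merged = True
                        break
                if merged:
                    break
    return np.array(centers), labels
```

Calling `threshold_kmeans(X, lam=25.0, theta=1.0, merge_dist=1.0)` on a small array of points returns the final centers and one integer label per point. The parameters `lam`, `theta`, and `merge_dist` are illustrative knobs and do not correspond to named quantities in the paper.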
Abstract
It has always been a great challenge for clustering algorithms to automatically determine the number of clusters from the distribution of a dataset. Several approaches have been proposed to address this issue, including recent promising work that incorporates Bayesian nonparametrics into the k-means clustering procedure. This approach is simple to implement and theoretically sound, and it also provides a feasible way to perform inference on large-scale datasets. However, several problems remain unsolved in this pioneering work, including applicability to power-law data, a mechanism for merging centers to avoid over-fitting, and the clustering order problem, among others. To address these issues, this paper proposes the Pitman-Yor Process based k-means (namely pyp-means). Taking advantage of the Pitman-Yor Process, pyp-means treats clusters differently by…
Taxonomy
Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Data Management and Algorithms
