Non-parametric Power-law Data Clustering

Xuhui Fan; Yiling Zeng; Longbing Cao

arXiv:1306.3003 · cs.LG · June 14, 2013

TL;DR

This paper introduces pyp-means, a k-means-style clustering algorithm based on the Pitman-Yor Process that automatically determines the number of clusters in power-law distributed data and addresses limitations of earlier Bayesian nonparametric k-means methods.
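To illustrate why the Pitman-Yor Process is associated with power-law cluster sizes, here is a minimal sketch that samples a partition from the standard Pitman-Yor Chinese restaurant process. It is not code from the paper; the function name `sample_pitman_yor_partition` and the parameter names `discount` and `concentration` are assumptions chosen for this example.

```python
import random


def sample_pitman_yor_partition(n, discount, concentration, seed=0):
    """Sample cluster sizes for n items from the Pitman-Yor Chinese
    restaurant process (hypothetical helper, not from the paper)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of items currently in cluster k
    for i in range(n):  # i items have already been placed
        # Unnormalised weight of opening a new cluster.
        new_weight = concentration + discount * len(sizes)
        # All weights sum to i + concentration.
        u = rng.random() * (i + concentration)
        if not sizes or u < new_weight:
            sizes.append(1)
            continue
        u -= new_weight
        for k, size in enumerate(sizes):
            weight = size - discount  # weight of joining cluster k
            if u < weight:
                sizes[k] += 1
                break
            u -= weight
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


# Compare a Dirichlet-process-like setting with a heavier-tailed one.
flat = sample_pitman_yor_partition(2000, discount=0.0, concentration=1.0)
heavy = sample_pitman_yor_partition(2000, discount=0.8, concentration=1.0)
print(len(flat), len(heavy))
```

With `discount=0.0` the process reduces to the Dirichlet-process Chinese restaurant process, where the number of clusters grows roughly logarithmically in `n`. A positive discount makes new clusters more likely as more clusters already exist, which yields many small clusters alongside a few large ones.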

Contribution

The paper proposes pyp-means, which adaptively adjusts thresholds and merges clusters to effectively handle power-law data and determine cluster numbers automatically.
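The idea of a threshold that controls cluster creation can be sketched as a DP-means-style loop in which a point opens a new cluster when its squared distance to every existing center exceeds a threshold. This is an illustrative sketch under stated assumptions, not the paper's algorithm: `threshold_kmeans`, `threshold_fn`, and `shrinking_threshold` are hypothetical names, the schedule in `shrinking_threshold` is invented purely to show the hook and is not the paper's formula, and the cluster-merging step is omitted.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def threshold_kmeans(points, threshold_fn, max_iter=50):
    """DP-means-style clustering with a pluggable threshold.

    threshold_fn(k) returns the squared-distance threshold to use when
    k centers currently exist (hypothetical interface).
    """
    dim = len(points[0])
    # Start from a single center at the global mean.
    centers = [tuple(sum(p[j] for p in points) / len(points) for j in range(dim))]
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > threshold_fn(len(centers)):
                centers.append(tuple(p))  # the point seeds a new cluster
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Recompute centers as cluster means, dropping empty clusters.
        groups = {}
        for lab, p in zip(labels, points):
            groups.setdefault(lab, []).append(p)
        order = sorted(groups)
        remap = {old: new for new, old in enumerate(order)}
        centers = [
            tuple(sum(p[j] for p in groups[old]) / len(groups[old]) for j in range(dim))
            for old in order
        ]
        labels = [remap[lab] for lab in labels]
        if not changed:
            break
    return centers, labels


def shrinking_threshold(k, base=9.0):
    # Hypothetical schedule: the threshold falls as clusters accumulate.
    return base / (1.0 + math.log(k))


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = threshold_kmeans(data, lambda k: 9.0)  # constant threshold
centers2, labels2 = threshold_kmeans(data, shrinking_threshold)
```

Passing a constant function gives a fixed-threshold baseline, while a function of the current cluster count lets the threshold adapt as clusters are created. Note that the result depends on the order in which points are visited.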

Findings

01. Pyp-means effectively models power-law data distributions.

02. It dynamically adjusts thresholds for cluster formation.

03. Experimental results validate its advantages over traditional methods.

Abstract

Automatically determining the number of clusters from the distribution of a dataset has long been a major challenge for clustering algorithms. Several approaches have been proposed to address this issue, including recent promising work that incorporates Bayesian nonparametrics into the $k$-means clustering procedure. This approach is simple to implement and theoretically sound, and it also provides a feasible way to perform inference on large-scale datasets. However, several problems remain unsolved in this pioneering work, including applicability to power-law data, a mechanism for merging centers to avoid over-fitting, and the clustering order problem, among others. To address these issues, this paper proposes Pitman-Yor Process based $k$-means (pyp-means). Taking advantage of the Pitman-Yor Process, pyp-means treats clusters differently by…

Taxonomy

Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Data Management and Algorithms