Pruning nearest neighbor cluster trees

Samory Kpotufe; Ulrike von Luxburg

arXiv:1105.0540 · stat.ML · May 6, 2011 · 30 cites

Open Access

TL;DR

This paper analyzes how k-NN graphs can reliably estimate the true cluster structure of data and introduces a pruning method that removes spurious clusters while preserving meaningful ones, with finite sample guarantees.

Contribution

The paper provides the first finite sample guarantee for pruning k-NN cluster trees: spurious structures are eliminated while salient clusters are still recovered.

Findings

01

Subgraphs of k-NN graphs can consistently estimate the cluster tree.

02

A pruning method guarantees removal of all spurious clusters at all levels.

03

The same finite sample guarantee ensures that salient clusters are still recovered after pruning.

Abstract

Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures at all levels of the tree while at the same time guaranteeing the recovery of salient clusters. This is the first such finite…
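
The idea in the abstract can be illustrated with a small sketch. This is not the authors' procedure or their tuning: it assumes that a density level corresponds to a threshold on each point's k-NN radius (small radius meaning high density), that clusters at a level are the connected components of the k-NN graph restricted to the points passing that threshold, and that pruning merges components which become connected at a slightly lower density level. The function names and the parameters `r_max` and `slack` are hypothetical, chosen for illustration only.

```python
import math

def knn_radii_and_graph(points, k):
    """Return (radii, edges): each point's k-NN radius and the undirected k-NN edges."""
    n = len(points)
    radii, edges = [], set()
    for i in range(n):
        # All other points, sorted by distance to point i.
        nbrs = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        radii.append(nbrs[k - 1][0])  # distance to the k-th nearest neighbor
        for _, j in nbrs[:k]:
            edges.add((min(i, j), max(i, j)))
    return radii, edges

def components(nodes, edges):
    """Connected components of the subgraph induced on `nodes` (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=min)  # deterministic order

def level_components(radii, edges, r_max):
    """Components among points whose k-NN radius is at most r_max (assumed high density)."""
    nodes = [i for i, r in enumerate(radii) if r <= r_max]
    return components(nodes, edges)

def pruned_components(radii, edges, r_max, slack):
    """Merge components at r_max that are connected at the lower-density level r_max + slack."""
    fine = level_components(radii, edges, r_max)
    coarse = level_components(radii, edges, r_max + slack)
    merged = {}
    for comp in fine:
        # Every fine component lies inside exactly one coarse component.
        rep = next(idx for idx, c in enumerate(coarse) if comp <= c)
        merged.setdefault(rep, set()).update(comp)
    return sorted(merged.values(), key=min)

# Toy 1-D sample: two nearby groups and one distant group.
points = [(0.0,), (1.0,), (2.1,), (4.0,), (5.0,), (6.1,), (20.0,), (21.0,), (22.1,)]
radii, edges = knn_radii_and_graph(points, k=2)
unpruned = level_components(radii, edges, r_max=1.5)
pruned = pruned_components(radii, edges, r_max=1.5, slack=0.5)
print(len(unpruned), len(pruned))
```

On this toy sample the unpruned level has three single-point components (indices 1, 4, and 7), while pruning merges the two that belong to the nearby groups and leaves the distant one separate. A larger `slack` merges more aggressively and can eventually join components that should stay apart, which is the aggressive-versus-conservative tradeoff the abstract refers to.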

Taxonomy

Topics: Data Management and Algorithms · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models