The Exploitation of Distance Distributions for Clustering
Michael C. Thrun

TL;DR
This paper investigates how analyzing the distribution of pairwise distances can guide the choice of distance measure for clustering. It treats multimodality in that distribution as the relevant pattern, proposes Gaussian mixture modeling as a pre-evaluation step, and supports the approach with experiments on artificial and real datasets.
Contribution
It introduces the systematic use of distribution analysis with mirrored-density plots and Gaussian mixtures for better distance measure selection in clustering.
Findings
Multimodal distance distributions are preferable for clustering.
Gaussian mixture modeling enhances distance evaluation.
Experiments confirm improved clustering performance.
Abstract
Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited in the sense that prior knowledge is used. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods based on error probabilities, implicitly setting the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties for distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically…
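The idea that intrapartition distances should be smaller than interpartition distances suggests that a useful distance measure yields a distance distribution with more than one mode. The following is a minimal sketch of that check, not the paper's implementation: it computes all pairwise Euclidean distances for a small synthetic dataset, fits one-component and two-component Gaussian mixtures with a plain EM loop, and compares them by BIC. The function names (`pairwise_distances`, `fit_gmm_1d`, `bic`), the synthetic data, and the use of BIC as the comparison criterion are assumptions made for illustration. Pairwise distances are not independent samples, so the BIC values should be read as a rough heuristic only.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances over all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Density of a 1D Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, variances))


def fit_gmm_1d(values, k, iters=100):
    """Fit a k-component 1D Gaussian mixture by EM (hypothetical helper).

    Returns (log_likelihood, weights, means, variances).
    """
    n = len(values)
    ordered = sorted(values)
    # Deterministic start: one component per quantile chunk of the sorted data.
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    floor = 1e-6 * overall_var  # keeps a component from collapsing to zero variance

    for _ in range(iters):
        # E-step: responsibility of each component for each distance.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, values)) / nj
            var_j = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, values)) / nj
            variances[j] = max(var_j, floor)

    loglik = sum(
        math.log(mixture_density(v, weights, means, variances) or 1e-300) for v in values
    )
    return loglik, weights, means, variances


def bic(loglik, k, n):
    """Bayesian information criterion; a k-component 1D mixture has 3k - 1 free parameters."""
    return (3 * k - 1) * math.log(n) - 2 * loglik


# Synthetic data (an assumption for illustration): two well-separated 2D clusters.
rng = random.Random(0)
cluster_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
cluster_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(30)]
distances = pairwise_distances(cluster_a + cluster_b)

loglik_one, _, _, _ = fit_gmm_1d(distances, k=1)
loglik_two, weights_two, means_two, _ = fit_gmm_1d(distances, k=2)
bic_one = bic(loglik_one, 1, len(distances))
bic_two = bic(loglik_two, 2, len(distances))

print("BIC, 1 component:", round(bic_one, 1))
print("BIC, 2 components:", round(bic_two, 1))
print("Two-component means:", [round(m, 2) for m in sorted(means_two)])
```

In this sketch, a lower BIC for the two-component fit is taken as a hint that the distance distribution is multimodal, with the smaller component mean corresponding roughly to intrapartition distances and the larger one to interpartition distances. Swapping `math.dist` for another distance function would let the same comparison be repeated across candidate distance measures.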
