The Exploitation of Distance Distributions for Clustering

Michael C. Thrun

arXiv:2108.09649·cs.LG·August 24, 2021

TL;DR

This paper investigates how analyzing the distribution of distances can guide the choice of a distance measure for clustering. It identifies multimodal distance distributions as the useful pattern and proposes Gaussian mixture modeling as a pre-evaluation step, with experiments on artificial and real datasets.

Contribution

The paper introduces a systematic analysis of distance distributions, using mirrored-density plots and Gaussian mixture models, to support the selection of a distance measure for clustering.
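To make the idea of modeling a distance distribution with a Gaussian mixture concrete, here is a minimal sketch in pure Python. It is not the author's implementation: the toy data, the plain EM routine `fit_gmm_1d`, and the use of BIC to compare one against two components are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, means, stds):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


def fit_gmm_1d(values, k, iterations=50, min_std=1e-3):
    """Fit a k-component 1D Gaussian mixture with plain EM (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    # Initialise each component from one of k equal-sized quantile chunks.
    chunks = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(c) / n for c in chunks]
    means = [sum(c) / len(c) for c in chunks]
    stds = [max(min_std, math.sqrt(sum((x - m) ** 2 for x in c) / len(c)))
            for c, m in zip(chunks, means)]
    for _ in range(iterations):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and spread of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(min_std, math.sqrt(var))
    loglik = sum(math.log(mixture_density(x, weights, means, stds) or 1e-300)
                 for x in data)
    return weights, means, stds, loglik


def bic(loglik, k, n):
    # A k-component 1D mixture has 3k - 1 free parameters.
    return (3 * k - 1) * math.log(n) - 2.0 * loglik


# Toy data (an assumption): two well-separated 2D clusters of 30 points each.
rng = random.Random(0)
points = [(rng.gauss(cx, 0.5), rng.gauss(0.0, 0.5))
          for cx in (0.0, 10.0) for _ in range(30)]
distances = pairwise_distances(points)

_, _, _, loglik_one = fit_gmm_1d(distances, k=1)
_, means_two, _, loglik_two = fit_gmm_1d(distances, k=2)
bic_one = bic(loglik_one, 1, len(distances))
bic_two = bic(loglik_two, 2, len(distances))
print("BIC, one component: ", round(bic_one, 1))
print("BIC, two components:", round(bic_two, 1))
```

In this toy case a lower BIC for the two-component fit suggests that the distances fall into a short-distance mode and a long-distance mode. The same comparison could in principle be repeated for each candidate distance measure, although the paper's own evaluation procedure may differ.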

Findings

01 Multimodal distance distributions are preferable for clustering.

02 Gaussian mixture modeling enhances distance evaluation.

03 Experiments confirm improved clustering performance.

Abstract

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited in the sense that prior knowledge is used. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods based on error probabilities, implicitly setting the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties for distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically…
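The condition that intrapartition distances should be smaller than interpartition distances can be illustrated with a short sketch. The points, the labels, and the helper `split_distances` are hypothetical and serve only to show the two groups of distances; in an actual cluster analysis the partition is not known beforehand.

```python
import math
from itertools import combinations


def split_distances(points, labels):
    """Separate pairwise Euclidean distances into within- and between-partition lists."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        target = intra if label_p == label_q else inter
        target.append(math.dist(p, q))
    return intra, inter


# Toy data (an assumption): two tight groups far apart in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(points, labels)
print("largest intrapartition distance: ", max(intra))
print("smallest interpartition distance:", min(inter))
```

When every intrapartition distance is below every interpartition distance, as in this toy example, the pooled distances separate into two distinct groups of values.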
