An Investigation into Distance Measures in Cluster Analysis
Zoe Shapcott

TL;DR
This paper explores various distance measures for the K-means clustering algorithm, comparing their effectiveness on simulated and real datasets, including an analysis of the Mahalanobis distance versus traditional metrics.
Contribution
It provides a comparative analysis of distance measures in K-means clustering, including the application of Mahalanobis distance and evaluation of their performance on different datasets.
Findings
Mahalanobis distance can offer benefits over traditional measures in certain cases
Different distance measures impact cluster quality and interpretability
Analysis includes the use of ChatGPT for supplementary insights
Abstract
This report explores different distance measures that can be used with the K-means algorithm for cluster analysis. Specifically, we investigate the Mahalanobis distance and critically assess any benefits it may have over the more traditional Euclidean, Manhattan and Maximum distances. We first define the metrics, then consider their advantages and drawbacks as discussed in the literature. We then apply these distances, first to simulated data and then to subsets of the Dry Bean dataset [1], to explore whether any one metric yields detectably better clustering quality than the others in these cases. One section is devoted to analysing the information obtained from ChatGPT in response to prompts relating to this topic.
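For readers unfamiliar with the four measures named in the abstract, the following is a minimal sketch of their standard textbook definitions in plain Python. It is not taken from the report itself: the function names, the example points `x` and `y`, and the inverse covariance matrix `inv_cov` are all illustrative assumptions, and the Mahalanobis function expects the inverse covariance to be supplied by the caller rather than estimated from data.

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    # Largest absolute coordinate difference (Chebyshev distance).
    return max(abs(a - b) for a, b in zip(x, y))

def mahalanobis(x, y, inv_cov):
    # sqrt(d^T S^{-1} d), where d = x - y and inv_cov = S^{-1} is the
    # inverse covariance matrix, given as a list of rows.
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

# Hypothetical example values, chosen only for illustration.
x = [1.0, 2.0]
y = [4.0, 6.0]
# Inverse of an assumed covariance matrix [[4, 0], [0, 1]].
inv_cov = [[0.25, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(euclidean(x, y))
print(manhattan(x, y))
print(maximum(x, y))
print(mahalanobis(x, y, inv_cov))
print(mahalanobis(x, y, identity))
```

With the identity matrix as the inverse covariance, the Mahalanobis distance reduces to the Euclidean distance. Otherwise it rescales each direction by the variance and correlation structure of the data, which is the property the report weighs against the traditional measures.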
Taxonomy
Topics: Advanced Clustering Algorithms Research
