Benchmarking distance-based partitioning methods for mixed-type data

Efthymios Costa; Ioanna Papatsouma; Angelos Markos

arXiv:2203.16287·stat.ME·August 31, 2022·Adv. Data Anal. Classif.·1 cites

Open Access

TL;DR

This paper benchmarks eight distance-based clustering methods for mixed-type data, analyzing their performance across various data scenarios to guide practitioners in selecting the most effective approach.

Contribution

It compares eight distance-based partitioning methods for mixed-type data in a full factorial simulation study, reporting their relative cluster recovery performance and the factors that most affect it.

Findings

01

KAMILA, K-Prototypes, and sequential Factor Analysis with K-Means perform best in many scenarios.

02

Cluster overlap and variable composition significantly affect clustering success.

03

The study offers practical insights for choosing appropriate clustering methods.

Abstract

Clustering mixed-type data, that is, observation-by-variable data that consist of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations carried out under a full factorial design examined the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters and the number of observations had the largest effects on cluster recovery, and in most of the tested scenarios KAMILA, K-Prototypes and sequential Factor Analysis and K-Means clustering typically performed better than other methods.…
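As a rough illustration of what a full factorial design means here, the sketch below enumerates every combination of levels for the four factors the abstract names. The factor names follow the abstract; the specific levels, the `factors` dictionary and the `full_factorial` helper are hypothetical placeholders, not the settings or code used in the paper.

```python
from itertools import product

# Factor names follow the abstract; the levels are hypothetical placeholders,
# not the values used in the study.
factors = {
    "overlap": ["low", "moderate", "high"],
    "pct_categorical": [20, 50, 80],
    "n_clusters": [3, 5],
    "n_observations": [100, 600],
}


def full_factorial(factors):
    """Return one dict per combination of factor levels (hypothetical helper)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = full_factorial(factors)

# A full factorial design crosses every level of every factor,
# so the number of scenarios is the product of the level counts.
print(len(design))  # 3 * 3 * 2 * 2 = 36
```

In a benchmark of this kind, each scenario in `design` would typically be simulated several times, with every clustering method run on each simulated data set and scored on how well it recovers the true clusters.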

Taxonomy

Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Customer churn and segmentation