Can an unsupervised clustering algorithm reproduce a categorization system?
Nathalia Castellanos, Dhruv Desai, Sebastian Frank, Stefano Pasquali, Dhagash Mehta

TL;DR
This paper examines whether unsupervised clustering algorithms can replicate expert-defined categorization systems, highlighting the importance of feature selection, distance metrics, and the limitations of standard evaluation methods.
Contribution
It demonstrates that with suitable features and distance metrics, unsupervised clustering can reproduce ground truth classes, addressing a key challenge in categorization system validation.
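As a rough illustration of why the distance metric matters, the sketch below runs a plain 2-means on a hand-made toy dataset in which the class depends only on `x`, while `y` is a large-scale feature unrelated to the class. The data, the helper names (`two_means`, `purity`, `sq_dist`), and the hand-set feature weights are all assumptions made for this example. They are not taken from the paper, and the weights stand in for a metric that would in practice have to be learned or chosen.

```python
from collections import Counter

# Hypothetical toy data: the class is determined by x alone;
# y is a large-scale feature that carries no class information.
POINTS = [(float(x), float(y)) for x in (0, 1) for y in (-10, -9, -8, 8, 9, 10)]
LABELS = [int(x) for x, _ in POINTS]


def sq_dist(p, q, weights):
    """Weighted squared Euclidean distance between two points."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q))


def two_means(points, weights, max_iter=100):
    """Plain 2-means with a deterministic farthest-point initialization."""
    first = points[0]
    second = max(points, key=lambda p: sq_dist(p, first, weights))
    centers = [first, second]
    assign = None
    for _ in range(max_iter):
        # Assign every point to its nearest center under the given metric.
        new_assign = [
            min((0, 1), key=lambda k: sq_dist(p, centers[k], weights))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign


def purity(assign, labels):
    """Fraction of points whose cluster's majority label matches their own."""
    total = 0
    for k in set(assign):
        counts = Counter(l for a, l in zip(assign, labels) if a == k)
        total += counts.most_common(1)[0][1]
    return total / len(labels)


raw = two_means(POINTS, weights=(1.0, 1.0))            # unweighted Euclidean
reweighted = two_means(POINTS, weights=(1.0, 0.0001))  # y heavily down-weighted
print(purity(raw, LABELS), purity(reweighted, LABELS))  # 0.5 1.0
```

With unweighted Euclidean distance the clusters split along `y` and match the classes no better than chance; once `y` is down-weighted, the same algorithm recovers the two classes exactly.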
Findings
Success depends on feature selection and distance metric choice.
Standard clustering metrics may not reliably identify the true number of classes.
Supervised metric learning can improve clustering alignment with ground truth.
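The second finding can be illustrated with the silhouette coefficient, one commonly used internal clustering metric (this summary does not say which metrics the paper evaluates). The sketch below is a toy construction, not the paper's experiment: the dataset, the two hand-written labelings, and the helper `mean_silhouette` are assumptions for illustration. The ground truth has two classes defined by `x`, but the points form three tight groups along `y`.

```python
import math

# Hypothetical toy data: two true classes (x = 0 vs x = 1),
# but three tight groups along the class-irrelevant feature y.
POINTS = [(float(x), float(y)) for x in (0, 1) for y in (-10, -9, 0, 1, 10, 11)]
TRUE_LABELS = [int(x) for x, _ in POINTS]  # 2 ground-truth classes
Y_GROUPS = [0, 0, 1, 1, 2, 2] * 2          # 3 clusters following y


def mean_silhouette(points, labels):
    """Mean silhouette coefficient under plain Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is excluded by the divisor).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


sil_true = mean_silhouette(POINTS, TRUE_LABELS)
sil_groups = mean_silhouette(POINTS, Y_GROUPS)
print(f"ground truth (2 classes): {sil_true:.2f}")
print(f"y-groups (3 clusters):    {sil_groups:.2f}")
```

On this data the three-cluster partition scores far higher than the two-class ground truth, whose mean silhouette is negative, so choosing the number of clusters by silhouette alone would favor three over the true two.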
Abstract
Peer analysis is a critical component of investment management, often relying on expert-provided categorization systems. These systems' consistency is questioned when they do not align with cohorts from unsupervised clustering algorithms optimized for various metrics. We investigate whether unsupervised clustering can reproduce ground truth classes in a labeled dataset, showing that success depends on feature selection and the chosen distance metric. Using toy datasets and fund categorization as real-world examples, we demonstrate that accurately reproducing ground truth classes is challenging. We also highlight the limitations of standard clustering evaluation metrics in identifying the optimal number of clusters relative to the ground truth classes. We then show that if appropriate features are available in the dataset, and a proper distance metric is known (e.g., using a supervised…
Taxonomy
Topics: Advanced Clustering Algorithms Research · Fuzzy Logic and Control Systems
Methods: Feature Selection · ALIGN
