Cube Sampled K-Prototype Clustering for Featured Data
Seemandhar Jain, Aditya A. Shastri, Kapil Ahuja, Yann Busnel, and Navneet Pratap Singh

TL;DR
This paper introduces a probabilistic cube sampling method combined with K-Prototype clustering, utilizing PCA for inclusion probabilities, resulting in improved accuracy and reduced computation on large datasets.
Contribution
It presents a novel cube sampling technique with PCA-based inclusion probabilities integrated with K-Prototype clustering for better large-scale data clustering.
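As a rough illustration of how PCA could drive inclusion probabilities, the sketch below scores each record by the magnitude of its projection onto the first principal component and rescales those scores so they sum to the desired sample size. The scoring rule, the helper name `pca_inclusion_probabilities`, and the parameter `n_sample` are assumptions made for illustration; the summary above does not state the paper's exact formula. The balancing step of cube sampling itself is not implemented here.

```python
import numpy as np

def pca_inclusion_probabilities(X, n_sample):
    """Hypothetical helper: inclusion probabilities from first-PC scores.

    Assumes numeric features only and n_sample < number of rows.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center the data before PCA
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Size measure (assumption): magnitude of the first-PC score.
    size = np.abs(Xc @ Vt[0]) + 1e-12
    n_rows = size.shape[0]
    pi = np.zeros(n_rows)
    capped = np.zeros(n_rows, dtype=bool)
    # Probabilities proportional to size, capped at 1: any unit whose
    # scaled value exceeds 1 is fixed at 1 and the rest are rescaled.
    while True:
        free = ~capped
        remaining = n_sample - capped.sum()
        pi[free] = remaining * size[free] / size[free].sum()
        over = free & (pi > 1.0)
        if not over.any():
            break
        capped |= over
    pi[capped] = 1.0
    return pi

# Usage: probabilities for drawing about 10 of 50 synthetic records.
rng = np.random.default_rng(0)
pi = pca_inclusion_probabilities(rng.normal(size=(50, 4)), n_sample=10)
```

The returned vector sums to `n_sample`, which is the property a fixed-size unequal-probability sampler such as the cube method expects from its inclusion probabilities.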
Findings
Cube sampled K-Prototype achieves higher accuracy than other sampled clustering methods.
The approach reduces computational complexity while maintaining high clustering accuracy.
Experiments on UCI datasets validate the effectiveness of the proposed method.
Abstract
Clustering large amounts of data is becoming increasingly important. Because of the large size of the data, clustering algorithms often take too much time. Sampling the data before clustering is commonly used to reduce this time. In this work, we propose a probabilistic sampling technique called cube sampling along with K-Prototype clustering. Cube sampling is used because of its accurate sample selection. K-Prototype is the most frequently used clustering algorithm when the data is numerical as well as categorical (which is very common today). The novelty of this work is in obtaining the crucial inclusion probabilities for cube sampling using Principal Component Analysis (PCA). Experiments on multiple datasets from the UCI repository demonstrate that the cube sampled K-Prototype algorithm gives the best clustering accuracy among other similarly sampled popular clustering…
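To make concrete why K-Prototype suits data that is both numerical and categorical, here is a minimal sketch of the standard K-Prototype dissimilarity: squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical ones. The function name, argument names, and the default `gamma` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kprototype_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Mixed-type distance between one record and one cluster prototype."""
    # Numeric part: squared Euclidean distance.
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    num_part = float(np.sum(diff ** 2))
    # Categorical part: number of attributes whose values differ.
    cat_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    # gamma balances the influence of the two attribute types.
    return num_part + gamma * cat_part

# Usage: one numeric difference of 1.0 and one categorical mismatch.
d = kprototype_dissimilarity([1.0, 2.0], ["red", "s"],
                             [0.0, 2.0], ["blue", "s"], gamma=0.5)
```

In this example the numeric part is 1.0 and the single categorical mismatch contributes 0.5, so `d` is 1.5. A larger `gamma` makes the categorical attributes count more heavily when records are assigned to prototypes.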
Taxonomy
Methods: Spectral Clustering
