Clustering of Big Data with Mixed Features
Joshua Tobin, Mimi Zhang

TL;DR
This paper introduces a clustering algorithm for large datasets of mixed type. It extends the peak-finding technique to mixed data, detects outliers and lower-density clusters, determines the number of clusters automatically, and reduces computational cost by using a fast k-nearest neighbors method and by scaling down to component sets.
Contribution
The proposed algorithm extends density peak clustering to mixed data, detects outliers, and automatically identifies the number of clusters, with improved scalability.
Findings
Effective in clustering large mixed datasets
Capable of detecting outliers and low-density clusters
Reduces computational complexity by applying a fast k-nearest neighbors method and by scaling down to component sets
Abstract
Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We here develop a new clustering algorithm for large data of mixed type, aiming at improving the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively lower density values; (3) the algorithm is competent at deciding the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice. Keywords:…
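To make the peak-finding idea concrete, here is a minimal sketch of density peak clustering on mixed data. It is not the authors' algorithm. It assumes a Gower-style distance (range-scaled absolute difference on numeric features, simple mismatch on categorical ones) and a density estimate equal to the inverse distance to the k-th nearest neighbour. It also takes the number of clusters as an input and does no outlier detection, whereas the paper determines the number of clusters automatically. The names `mixed_distance` and `density_peaks` and the parameters `k` and `n_clusters` are hypothetical, and the brute-force pairwise distance matrix is only suitable for small examples.

```python
import numpy as np


def mixed_distance(num, cat):
    """Gower-style pairwise distance (an assumption, not the paper's metric):
    range-scaled absolute difference on numeric columns plus simple mismatch
    on categorical columns, averaged over all features."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    span = num.max(axis=0) - num.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero on constant columns
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / span
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    n_features = num.shape[1] + cat.shape[1]
    return (d_num.sum(axis=2) + d_cat.sum(axis=2)) / n_features


def density_peaks(dist, k=5, n_clusters=2):
    """Basic density peak clustering on a precomputed distance matrix."""
    n = dist.shape[0]

    # Density: inverse distance to the k-th nearest neighbour.
    # Column 0 of each sorted row is the point's zero distance to itself.
    kth = np.sort(dist, axis=1)[:, k]
    rho = 1.0 / (kth + 1e-12)

    # Visit points from highest to lowest density.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(n)
    parent = np.full(n, -1)

    # The densest point has no denser neighbour; give it the largest distance.
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]  # distance to the nearest denser point
        parent[i] = j

    # Centres are the points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest point is always a centre
    centres = np.argsort(-gamma)[:n_clusters]

    labels = np.full(n, -1)
    labels[centres] = np.arange(n_clusters)
    # Each remaining point inherits the label of its nearest denser point,
    # which has already been labelled because we go in density order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta
```

Calling `density_peaks(mixed_distance(num, cat), k=5, n_clusters=2)` returns one integer label per row along with the density and separation values. For large data, the full distance matrix would need to be replaced by a fast k-nearest neighbors search, which is the kind of speed-up the abstract describes.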
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications · Complex Network Analysis Techniques
