Clustering of Big Data with Mixed Features

Joshua Tobin; Mimi Zhang

arXiv:2011.06043 · stat.ML · November 13, 2020 · 1 citation

TL;DR

This paper introduces a clustering algorithm for large datasets of mixed type. It extends the peak-finding technique to mixed data, detects outliers and lower-density clusters, determines the number of clusters automatically, and reduces computational cost by using a fast k-nearest neighbors method and by scaling down to component sets.

Contribution

The proposed algorithm extends density peak clustering to mixed data, detects outliers, and automatically identifies the number of clusters, with improved scalability.

Findings

01 Effective in clustering large mixed datasets

02 Capable of detecting outliers and low-density clusters

03 Reduces computational complexity through a fast k-nearest neighbors method and by scaling down to component sets

Abstract

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We here develop a new clustering algorithm for large data of mixed type, aiming at improving the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively lower density values; (3) the algorithm is competent at deciding the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice. Keywords:…
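The peak-finding idea named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes purely numeric data, uses a brute-force distance matrix in place of a fast k-nearest neighbors method, takes a fixed number of centres instead of choosing it automatically, and skips outlier detection and component sets. The function name `knn_density_peaks` and its parameters `k` and `n_centers` are hypothetical, chosen only for this example.

```python
import numpy as np


def knn_density_peaks(X, k=5, n_centers=2):
    """Toy density-peak clustering on numeric data (illustrative only)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Brute-force pairwise Euclidean distances (assumption: small, numeric data).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: inverse distance to the k-th nearest neighbour.
    # Column 0 of each sorted row is the point itself, so index k is the k-th neighbour.
    kth = np.sort(d, axis=1)[:, k]
    rho = 1.0 / (kth + 1e-12)
    # Visit points from densest to sparsest.
    order = np.argsort(-rho, kind="stable")
    # delta[i]: distance to the nearest point visited earlier (higher density).
    # The densest point keeps delta = inf, so it is always selected as a centre.
    delta = np.full(n, np.inf)
    parent = np.full(n, -1)
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]
        j = higher[np.argmin(d[i, higher])]
        delta[i] = d[i, j]
        parent[i] = j
    # Peaks combine high density with a large distance to any denser point.
    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_centers]
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    # Every other point inherits the label of its nearest denser neighbour,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Usage: two well-separated synthetic blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
               rng.normal(10.0, 0.5, size=(30, 2))])
labels, centers = knn_density_peaks(X, k=5, n_centers=2)
```

Because the sketch builds the full distance matrix, its memory and time grow quadratically with the number of points. That cost is the kind of bottleneck the abstract says the paper addresses with a fast k-nearest neighbors method.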

Code & Models

Repositories

tobinjo96/CPFcluster (Official)

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications · Complex Network Analysis Techniques