Clustering Mixed Datasets Using Homogeneity Analysis with Applications to Big Data

Rajiv Sambasivan; Sourish Das

arXiv:1608.04961·stat.ML·October 31, 2017

TL;DR

This paper explores using homogeneity analysis to cluster datasets with mixed numerical and categorical data, enabling the application of Euclidean-based tools for big data analysis.

Contribution

It introduces a method to represent mixed datasets in Euclidean space via homogeneity analysis, facilitating clustering and analysis.

Findings

01

Experiments suggest the approach is useful for clustering mixed datasets

02

Applicable to the analysis and exploration of big datasets

03

Enables use of Euclidean tools for mixed data analysis

Abstract

Datasets with a mixture of numerical and categorical attributes are routinely encountered in many application domains. In this work we examine an approach to clustering such datasets using homogeneity analysis. Homogeneity analysis determines a Euclidean representation of the data, which can then be analyzed with the large body of tools and techniques developed for data with a Euclidean representation. Experiments conducted as part of this study suggest that this approach can be useful in the analysis and exploration of big datasets with a mixture of numerical and categorical attributes.
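
To make the idea of a Euclidean representation concrete, here is a minimal sketch of homogeneity analysis in its textbook form (multiple correspondence analysis on an indicator matrix), written with NumPy. It is not the authors' implementation. The abstract does not say how numerical attributes are handled, so the equal-frequency binning in `discretize` is an assumption, and the function names and the toy data are hypothetical.

```python
import numpy as np

def discretize(values, n_bins=3):
    """Bin a numeric column at equal-frequency cut points.
    Assumption: numeric attributes are binned before the analysis."""
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, cuts)

def indicator_matrix(columns):
    """One-hot encode each categorical column and stack the blocks side by side."""
    blocks = []
    for col in columns:
        _, codes = np.unique(np.asarray(col), return_inverse=True)
        codes = codes.ravel()
        blocks.append(np.eye(codes.max() + 1)[codes])
    return np.hstack(blocks)

def object_scores(G, dim=2):
    """Return centered object scores X (one row per record) with X'X = n * I."""
    n = G.shape[0]
    Z = G / np.sqrt(G.sum(axis=0))   # scale each category by its frequency
    Z = Z - Z.mean(axis=0)           # centering removes the trivial constant solution
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return np.sqrt(n) * U[:, :dim]

# Hypothetical toy data: two categorical attributes and one numerical attribute.
color = ["red", "red", "red", "blue", "blue", "blue"]
shape = ["disc", "disc", "cube", "cube", "cube", "disc"]
weight = [1.0, 1.2, 0.9, 5.0, 5.5, 4.8]

G = indicator_matrix([color, shape, discretize(weight, n_bins=2)])
X = object_scores(G, dim=2)
print(X)
```

Each row of `X` is a point in the plane, so any clustering method that relies on Euclidean distance (k-means, for example) can be run on it directly. In this toy example the first coordinate already separates the three light red objects from the three heavy blue ones.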
