Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data
Curtis Storlie, Scott Myers, S Katusic, Amy Weaver, Robert Voigt, Robert Colligan, Paul Croarkin, Ruth Stoeckel, John Port

TL;DR
This paper introduces a model-based clustering method for mixed continuous and discrete variables with missing data, incorporating variable selection to identify key features influencing cluster formation, demonstrated on autism spectrum disorder data.
Contribution
It presents a novel approach combining Dirichlet process mixture models with variable selection for mixed data with missing values, applicable to health sciences and beyond.
Findings
Successfully identified three clusters in ASD data
Selected four test scores as most informative for clustering
Outperformed existing methods in simulation studies
Abstract
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (~480) in the data set along with many (~100) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (i) cluster these patients into similar groups to help identify those with…
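The two modeling ingredients named in the abstract can be illustrated with a minimal generative sketch. This is not the authors' implementation. It assumes a truncated stick-breaking representation of the Dirichlet process, and it assumes an ordinal variable is obtained by thresholding a latent normal variable at fixed cutpoints, which is one common form of the latent continuous variable approach. The function names, the cutpoints, and the independence between the two variables are all hypothetical simplifications. Missing data and variable selection are not covered.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking weights for a DP with concentration alpha."""
    weights = []
    remaining = 1.0
    for _ in range(n_components - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component absorbs the leftover mass
    return weights


def discretize(z, cutpoints):
    """Map a latent continuous value to an ordinal category (assumed scheme)."""
    return sum(z > c for c in cutpoints)


def simulate(n, alpha=1.0, n_components=20, seed=0):
    """Draw n rows of (cluster label, continuous value, ordinal value)."""
    rng = random.Random(seed)
    weights = stick_breaking_weights(alpha, n_components, rng)
    # Hypothetical cluster-specific means for the two latent dimensions.
    means = [(rng.gauss(0, 3), rng.gauss(0, 3)) for _ in range(n_components)]
    cutpoints = [-1.0, 1.0]  # assumed thresholds giving three categories
    rows = []
    for _ in range(n):
        k = rng.choices(range(n_components), weights=weights)[0]
        x_cont = rng.gauss(means[k][0], 1.0)  # observed continuous variable
        z = rng.gauss(means[k][1], 1.0)       # latent variable, never observed
        rows.append((k, x_cont, discretize(z, cutpoints)))
    return weights, rows


if __name__ == "__main__":
    weights, rows = simulate(10)
    print(len({k for k, _, _ in rows}), "clusters occupied among 10 draws")
```

Because the weights decay stochastically, only a few of the 20 components typically receive observations, which is how a Dirichlet process mixture accommodates an unknown number of clusters.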
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Gene expression and cancer classification
