Handling missing data in model-based clustering
Alessio Serafini, Thomas Brendan Murphy, Luca Scrucca

TL;DR
This paper introduces two novel MCEM-based methods for fitting Gaussian Mixture Models with missing data, improving clustering and density estimation accuracy over traditional imputation techniques.
Contribution
The paper proposes two new MCEM algorithms for GMMs that directly handle missing data, enhancing clustering and density estimation performance.
Findings
Proposed methods outperform multiple imputation in clustering accuracy.
New algorithms improve density estimation with missing data.
Methods demonstrate robustness across different missing data scenarios.
Abstract
Gaussian mixture models (GMMs) are a powerful tool for clustering, classification and density estimation when the data contain cluster structure. Missing values can substantially affect GMM estimation, so handling them is crucial in all three tasks. Several techniques impute the missing values before model estimation; among these, multiple imputation is a simple and general approach. In this paper we propose two methods for fitting Gaussian mixtures in the presence of missing data. Both use a variant of the Monte Carlo Expectation-Maximisation (MCEM) algorithm for data augmentation: multiple imputations are performed during the E-step, followed by the standard M-step for a given eigen-decomposed component-covariance…
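To make the idea of imputing inside the E-step concrete, here is a minimal Python sketch of a Monte Carlo EM loop for a Gaussian mixture with missing entries. It is an illustration under stated assumptions, not either of the paper's two methods: it uses unconstrained full covariances rather than the eigen-decomposed parameterisation, a fixed number of iterations with no convergence check, and it assumes every row has at least one observed value and the data have at least two columns. The function name `mcem_gmm` and all of its parameters (`n_draws`, `reg`, `init_means`, and so on) are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate normal evaluated at one point x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)

def mcem_gmm(X, n_components, n_draws=20, n_iter=30, reg=1e-6,
             init_means=None, seed=0):
    """Illustrative MCEM for a full-covariance GMM; NaN marks a missing entry."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = n_components
    miss = np.isnan(X)
    # Crude start (an assumption): mean-impute once to get initial parameters.
    X0 = np.where(miss, np.nanmean(X, axis=0), X)
    if init_means is None:
        means = X0[rng.choice(n, K, replace=False)].copy()
    else:
        means = np.array(init_means, dtype=float)
    covs = np.stack([np.cov(X0.T) + reg * np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        pts = [[] for _ in range(K)]
        wts = [[] for _ in range(K)]
        # Monte Carlo E-step: responsibilities from the observed coordinates,
        # then draws of the missing coordinates from each component's conditional.
        for i in range(n):
            o, m = ~miss[i], miss[i]
            logp = np.array([
                np.log(weights[k])
                + log_gauss(X[i, o], means[k][o], covs[k][np.ix_(o, o)])
                for k in range(K)
            ])
            z = np.exp(logp - logp.max())
            z /= z.sum()
            for k in range(K):
                if m.any():
                    S_oo = covs[k][np.ix_(o, o)]
                    S_mo = covs[k][np.ix_(m, o)]
                    c_mean = means[k][m] + S_mo @ np.linalg.solve(
                        S_oo, X[i, o] - means[k][o])
                    c_cov = covs[k][np.ix_(m, m)] - S_mo @ np.linalg.solve(
                        S_oo, S_mo.T)
                    c_cov = (c_cov + c_cov.T) / 2 + reg * np.eye(int(m.sum()))
                    draws = np.tile(X[i], (n_draws, 1))
                    draws[:, m] = rng.multivariate_normal(
                        c_mean, c_cov, size=n_draws)
                    w = np.full(n_draws, z[k] / n_draws)
                else:
                    # Fully observed row: no imputation needed.
                    draws = X[i][None, :]
                    w = np.array([z[k]])
                pts[k].append(draws)
                wts[k].append(w)
        # M-step: weighted maximum-likelihood updates on the completed draws.
        for k in range(K):
            P = np.vstack(pts[k])
            w = np.concatenate(wts[k])
            nk = w.sum()
            weights[k] = nk / n
            means[k] = (w[:, None] * P).sum(axis=0) / nk
            diff = P - means[k]
            covs[k] = (w[:, None] * diff).T @ diff / nk + reg * np.eye(d)
    return weights, means, covs
```

Each imputed copy of a row carries weight `z[k] / n_draws`, so every observation contributes a total weight of one to the M-step regardless of how many of its entries are missing. Imposing a constrained covariance structure would change only the covariance update in the M-step.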
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Gene expression and cancer classification
