Clustering with missing data: which imputation model for which cluster analysis method?
Vincent Audigier, Ndèye Niang, Matthieu Resche-Rigon

TL;DR
This paper examines how well different multiple imputation methods match clustering techniques for continuous data, emphasizing the importance of congeniality between the imputation model and the analysis model, and proposes a new FCS method for complex distributions.
Contribution
It introduces a new FCS multiple imputation method with theoretical properties similar to JM-GL and extends it for complex distributions, improving clustering accuracy.
Findings
Imputation models that consider data clustering improve partition accuracy.
JM-GL and JM-DP are suitable for Gaussian mixture distributed data.
FCS methods outperform JM methods on complex data distributions.
Abstract
Multiple imputation (MI) is a popular method for dealing with missing values. One main advantage of MI is that it separates the imputation phase from the analysis phase. However, the two phases are related, since both rest on distributional assumptions that have to be consistent. This property is known as congeniality. In this paper, we discuss congeniality for clustering on continuous data. First, we theoretically highlight how two joint modeling (JM) MI methods (JM-GL and JM-DP) are congenial with various clustering methods. Then, we propose a new fully conditional specification (FCS) MI method with the same theoretical properties as JM-GL. Finally, we extend this FCS MI method to account for more complex distributions. Based on an extensive simulation study, all MI methods are compared for various cluster analysis methods (k-means, k-medoids, mixture model, hierarchical clustering). This study…
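To make the separation between the imputation phase and the analysis phase concrete, here is a minimal sketch of an impute-then-cluster workflow in plain Python. Everything in it is an assumption made for illustration: the function names (`impute_once`, `kmeans`, `co_association`, `consensus`, `mi_cluster`), the toy data, and the pooling rule are hypothetical and not taken from the paper. In particular, the imputation step draws each missing value from a normal distribution fitted to the observed values of its column, which ignores any cluster structure; it is a deliberately naive stand-in, not JM-GL, JM-DP, or the FCS method the paper proposes.

```python
import math
import random


def impute_once(data, rng):
    """Fill each None with a draw from a normal fitted to that column's observed values.

    Naive stand-in for an imputation model (assumption): it ignores cluster structure.
    """
    n_cols = len(data[0])
    params = []
    for j in range(n_cols):
        obs = [row[j] for row in data if row[j] is not None]
        mean = sum(obs) / len(obs)
        var = sum((x - mean) ** 2 for x in obs) / max(len(obs) - 1, 1)
        params.append((mean, math.sqrt(var)))
    return [
        [rng.gauss(*params[j]) if v is None else v for j, v in enumerate(row)]
        for row in data
    ]


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def co_association(partitions):
    """Fraction of partitions in which each pair of observations shares a cluster."""
    n, m = len(partitions[0]), len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]


def consensus(partitions, threshold=0.5):
    """Hypothetical pooling rule: link pairs that co-cluster in more than
    `threshold` of the partitions, then return the connected components."""
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def mi_cluster(data, k, m=20, seed=0):
    """Impute m times, cluster each completed data set, pool the m partitions."""
    rng = random.Random(seed)
    partitions = [kmeans(impute_once(data, rng), k, rng) for _ in range(m)]
    return consensus(partitions)


# Toy data (hypothetical): two groups, with None marking missing values.
toy = [
    [0.0, 0.1], [0.2, None], [0.1, -0.1], [-0.1, 0.0],
    [5.0, 5.1], [5.2, 4.9], [None, 5.0], [4.9, 5.2],
]
labels = mi_cluster(toy, k=2)
```

Pooling through a co-association matrix avoids having to match cluster labels across imputed data sets, since k-means label numbering is arbitrary from one run to the next. Because the imputation draw here ignores the groups, incomplete rows can land between clusters, which is the kind of mismatch between imputation model and clustering method that the abstract describes as a lack of congeniality.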
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Advanced Clustering Algorithms Research
