Clustering data with values missing at random using scale mixtures of multivariate skew-normal distributions

Jason Pillay; Cristina Tortora; Antonio Punzo; Andriette Bekker

arXiv:2507.20329·stat.ME·July 29, 2025


TL;DR

This paper develops a flexible clustering method using scale mixtures of multivariate skew-normal distributions that effectively handles missing data under a missing at random mechanism, capturing skewness and heavy tails.

Contribution

The paper extends the finite mixture of scale mixtures of multivariate skew-normal (FMSMSN) family to incomplete data. It derives the required distributional properties and an EM algorithm, enabling robust clustering of skewed, heavy-tailed data with missing values.

Findings

01. Demonstrates improved clustering performance with missing data

02. Provides closed-form expressions for missing data imputation

03. Shows applicability to real-world CO2 emissions data

Abstract

Handling missing data is a major challenge in model-based clustering, especially when the data exhibit skewness and heavy tails. We address this by extending the finite mixture of scale mixtures of multivariate skew-normal (FMSMSN) family to accommodate incomplete data under a missing at random (MAR) mechanism. Unlike previous work that is limited to one of the special cases of the FMSMSN family, our method offers a cluster analysis methodology for the entire family that accounts for skewness and excess kurtosis amidst data with missing values. The multivariate skew-normal distribution, as parameterised by \cite{azzalini1996} and \cite{arnoldbeaver}, includes the normal distribution as a special case, which ensures that our method is flexible toward existing symmetric model-based clustering techniques under a normality assumption. We derive the distributional properties of the missing…
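
The claim that the skew-normal contains the normal as a special case can be illustrated with a minimal sketch. It assumes the commonly used bivariate form 2 φ(z; Ω) Φ(αᵀz), which may differ in notation from the exact parameterisation used in the paper. The function names, the matrix `omega`, and the vector `alpha` are hypothetical and chosen only for this illustration.

```python
import math

def std_normal_cdf(t):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def bivariate_normal_pdf(z, omega):
    """Density of a zero-mean bivariate normal with 2x2 covariance omega."""
    (a, b), (c, d) = omega
    det = a * d - b * c
    # Quadratic form z^T omega^{-1} z, using the explicit 2x2 inverse.
    q = (d * z[0] ** 2 - (b + c) * z[0] * z[1] + a * z[1] ** 2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def bivariate_skew_normal_pdf(z, omega, alpha):
    """Assumed skew-normal form: 2 * phi(z; omega) * Phi(alpha^T z)."""
    lin = alpha[0] * z[0] + alpha[1] * z[1]
    return 2.0 * bivariate_normal_pdf(z, omega) * std_normal_cdf(lin)

# Illustrative values (assumptions, not taken from the paper).
omega = [[1.0, 0.5], [0.5, 1.0]]
point = (0.3, -1.2)

# With alpha = 0, Phi(0) = 1/2, so the skewing factor cancels the leading 2.
symmetric = bivariate_skew_normal_pdf(point, omega, (0.0, 0.0))
normal = bivariate_normal_pdf(point, omega)
print(abs(symmetric - normal) < 1e-12)
```

Setting `alpha` to a non-zero vector tilts the density towards the half-plane where αᵀz is positive while keeping it a valid density, which is the kind of asymmetry the mixture components are meant to capture.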
