Skewed Distributions or Transformations? Modelling Skewness for a Cluster Analysis
Michael P.B. Gallaugher, Paul D. McNicholas, Volodymyr Melnykov, and Xuwen Zhu

TL;DR
This paper compares two approaches to modelling skewed data for clustering: mixtures of flexible skewed distributions, and mixtures that incorporate a transformation to near normality. The comparison is carried out on many benchmark datasets.
Contribution
It provides a comprehensive comparison of existing methods and introduces a new approach to assess cluster separation in skewed data.
Findings
Transformations often improve Gaussian mixture model performance on skewed data.
Flexible skewed distributions can better capture asymmetry in certain datasets.
The novel cluster separation assessment aids in choosing appropriate modeling strategies.
Abstract
Because of its mathematical tractability, the Gaussian mixture model holds a special place in the literature for clustering and classification. For all its benefits, however, the Gaussian mixture model poses problems when the data is skewed or contains outliers. Because of this, methods have been developed over the years for handling skewed data, and these fall into two general categories. The first is to consider a mixture of more flexible skewed distributions, and the second is based on incorporating a transformation to near normality. Although these methods have been compared in their respective papers, there has yet to be a detailed comparison to determine when one method might be more suitable than the other. Herein, we provide a detailed comparison on many benchmarking datasets, and describe a novel method to assess cluster separation.
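To make the idea of a "transformation to near normality" concrete, here is a minimal sketch using only the Python standard library. It is not the authors' method: the log transform is a stand-in chosen for simplicity, the log-normal sample is synthetic, and the helper name `sample_skewness` is hypothetical. The sketch only shows that a suitable transformation can bring the skewness of a sample close to zero, which is the condition under which a Gaussian component becomes a more reasonable fit.

```python
import math
import random


def sample_skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5 (hypothetical helper)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5


# Assumption: a synthetic right-skewed sample (log-normal draws), seeded
# so that the run is reproducible. This is not data from the paper.
rng = random.Random(0)
raw = [rng.lognormvariate(0.0, 0.75) for _ in range(5000)]

# Stand-in transformation: the log maps log-normal data to normal data.
transformed = [math.log(x) for x in raw]

skew_raw = sample_skewness(raw)
skew_transformed = sample_skewness(transformed)

print(f"skewness before transformation: {skew_raw:.3f}")
print(f"skewness after transformation:  {skew_transformed:.3f}")
```

In the transformation-based approach the abstract describes, the transformation is incorporated into the mixture model rather than applied as a fixed preprocessing step as it is here. The alternative approach leaves the data untransformed and replaces the Gaussian components with skewed distributions.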
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Distribution Estimation and Applications · Census and Population Estimation
