Model-Based Clustering with Sequential Outlier Identification using the Distribution of Mahalanobis Distances
Ultán P. Doherty, Paul D. McNicholas, Arthur White

TL;DR
This paper introduces outlierMBC, a novel model-based sequential outlier detection method that uses Mahalanobis distances and Gaussian mixture models to accurately identify and remove outliers without prior knowledge of their number or distribution.
Contribution
outlierMBC is a new approach that removes outliers one at a time, refitting a Gaussian mixture model after each removal, and uses the distribution of Mahalanobis distances to choose the number of outliers automatically. Only a maximum number of outliers is set in advance.
Findings
Performs well on simulated data
Effective in real-data applications
Identifies outliers automatically, without prior information on their number or distribution
Abstract
The presence of outliers can prevent clustering algorithms from accurately determining an appropriate group structure within a data set. We present outlierMBC, a model-based approach for sequentially removing outliers and clustering the remaining observations. Our method identifies outliers one at a time while fitting a multivariate Gaussian mixture model to data. Since it can be difficult to classify observations as outliers without knowing what the correct cluster structure is a priori, and the presence of outliers interferes with the process of modelling clusters correctly, we use an iterative method to identify outliers one by one. At each iteration, outlierMBC removes the observation with the lowest density and fits a Gaussian mixture model to the remaining data. The method continues to remove potential outliers until a pre-set maximum number of outliers is reached, then…
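The removal loop described in the abstract can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn's `GaussianMixture` as the mixture model, the helper name `sequential_trim` and the synthetic data are hypothetical, and it covers only the "remove the lowest-density point, then refit" step. The rule that picks the final number of outliers is cut off in the abstract above and is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def sequential_trim(X, n_components, max_outliers, seed=0):
    """Hypothetical helper: refit a GMM, drop the lowest-density point, repeat.

    Returns the original row indices in order of removal, plus a model
    fitted to the observations that remain after the last removal.
    """
    X = np.asarray(X, dtype=float)
    keep = np.arange(len(X))  # original indices still in the data
    removed = []
    for _ in range(max_outliers):
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        gmm.fit(X[keep])
        log_dens = gmm.score_samples(X[keep])  # log mixture density per point
        worst = int(np.argmin(log_dens))       # position of the least likely point
        removed.append(int(keep[worst]))
        keep = np.delete(keep, worst)
    final = GaussianMixture(n_components=n_components, random_state=seed)
    final.fit(X[keep])
    return removed, final


# Synthetic example (assumed data): two Gaussian clusters plus two distant points.
rng = np.random.default_rng(0)
clusters = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(10.0, 10.0), scale=1.0, size=(100, 2)),
])
far_points = np.array([[50.0, -50.0], [-40.0, 60.0]])
X = np.vstack([clusters, far_points])

removed, model = sequential_trim(X, n_components=2, max_outliers=5)
```

Here `removed` lists candidate outliers in the order they were dropped. Which points are dropped first depends on how each mixture fit is initialised, so the sketch fixes `seed` for reproducibility.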
Taxonomy
Topics: Bayesian Methods and Mixture Models · Anomaly Detection Techniques and Applications · Advanced Statistical Methods and Models
