# Finding Outliers in Gaussian Model-Based Clustering

**Authors:** Katharine M. Clark, Paul D. McNicholas

arXiv: 1907.01136 · 2024-05-31

## TL;DR

This paper introduces OCLUST, an outlier trimming algorithm for Gaussian mixture models. It estimates the number of outliers automatically by comparing subset log-likelihoods against an approximate reference distribution.

## Contribution

The paper proposes a new outlier detection method, OCLUST, based on the distribution of subset log-likelihoods, which automatically estimates the number of outliers in Gaussian clustering.

## Key findings

- OCLUST effectively identifies outliers without pre-specifying their number.
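- The method uses the fact that the sample squared Mahalanobis distance is beta-distributed to derive an approximate distribution for the subset log-likelihoods.
- OCLUST trims the least plausible points one at a time until the subset log-likelihoods adhere to that reference distribution, so the number of outliers is estimated as part of the fit.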
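
The beta-distribution fact mentioned above can be checked numerically. The sketch below is not taken from the paper; it assumes the standard textbook form of the result, namely that for $n$ observations from a $p$-variate Gaussian, with the sample mean and the unbiased sample covariance plugged in, $\frac{n}{(n-1)^2} D_i^2 \sim \mathrm{Beta}\left(\frac{p}{2}, \frac{n-p-1}{2}\right)$. The variable names and the simulated data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 3                      # illustrative sample size and dimension
X = rng.standard_normal((n, p))     # simulated Gaussian data (assumption)

# Squared Mahalanobis distance of each point to the sample mean,
# using the unbiased sample covariance (divisor n - 1).
centered = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)
d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(S), centered)

# Scaling under which the distances follow Beta(p/2, (n - p - 1)/2).
scaled = n / (n - 1) ** 2 * d2

# Moments of the reference beta distribution.
a, b = p / 2, (n - p - 1) / 2
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))

print("empirical mean:", scaled.mean(), "beta mean:", beta_mean)
print("empirical var: ", scaled.var(), "beta var: ", beta_var)
```

The empirical mean matches the beta mean up to rounding because the squared distances always sum to $(n-1)p$; the variance agrees only approximately, as expected from a finite sample.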

## Abstract

Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the first two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, deeming them outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
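
To make the idea of subset log-likelihoods concrete, here is a minimal sketch under strong simplifying assumptions. It is not the authors' implementation: it uses a single Gaussian component rather than a mixture, it performs only one trimming step, and it omits the stopping rule based on the reference distribution. It also assumes that the "least plausible" point is the one whose removal yields the largest subset log-likelihood. The helper names `gaussian_max_loglik` and `subset_logliks` are hypothetical.

```python
import numpy as np

def gaussian_max_loglik(X):
    """Maximised log-likelihood of one Gaussian fitted to X by maximum likelihood."""
    m, p = X.shape
    sigma_hat = np.cov(X, rowvar=False, bias=True)   # ML covariance (divisor m)
    _, logdet = np.linalg.slogdet(sigma_hat)
    # At the ML estimates the quadratic forms sum to m * p, giving this closed form.
    return -0.5 * m * (p * np.log(2 * np.pi) + logdet + p)

def subset_logliks(X):
    """Log-likelihood of the model refitted with each point left out in turn."""
    n = X.shape[0]
    return np.array([gaussian_max_loglik(np.delete(X, j, axis=0)) for j in range(n)])

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))    # simulated clean data (assumption)
X[0] = [8.0, -8.0]                   # one planted outlier at index 0

ell = subset_logliks(X)
candidate = int(np.argmax(ell))      # point whose removal helps the fit most
print("candidate outlier index:", candidate)
```

In a full trimming procedure this step would be repeated on the reduced data set, with the mixture refitted each time, until the subset log-likelihoods are consistent with the reference distribution described in the abstract.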

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.01136/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/1907.01136/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/1907.01136/full.md

---
Source: https://tomesphere.com/paper/1907.01136