Detecting Outliers in High-dimensional Data with Mixed Variable Types using Conditional Gaussian Regression Models

Mads Lindskou; Torben Tvedebrink; Poul Svante Eriksen; Niels Morling

arXiv:2103.02366 · math.ST · May 20, 2021

TL;DR

This paper introduces an outlier detection method for possibly high-dimensional datasets that contain both discrete and continuous variables. It uses decomposable graphical models to build an exact likelihood ratio test, and it outperforms the Isolation Forest algorithm on a real data example.

Contribution

The authors propose a novel outlier detection approach using conditional Gaussian regression models tailored for mixed variable types in high-dimensional data.

Findings

01 Outperforms the Isolation Forest algorithm on a real data example.

02 Models the relationships between discrete and continuous variables with decomposable graphical models.

03 Provides an exact likelihood ratio test of whether an observation is an outlier.

Abstract

Outlier detection has gained increasing interest in recent years, due to newly emerging technologies and the huge amount of high-dimensional data that are now available. Outlier detection can help practitioners to identify unwanted noise and/or locate interesting abnormal observations. To address this, we developed a novel method for outlier detection for use in, possibly high-dimensional, datasets with both discrete and continuous variables. We exploit the family of decomposable graphical models in order to model the relationship between the variables and use this to form an exact likelihood ratio test for an observation that is considered an outlier. We show that our method outperforms the state-of-the-art Isolation Forest algorithm on a real data example.
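
The idea of testing an observation against a fitted model for mixed data can be illustrated with a deliberately small sketch. It is not the authors' method: it assumes a toy model with one discrete label and one continuous value whose mean depends on the label (shared variance), ignores the graph structure entirely, and replaces the exact test with a Monte Carlo reference distribution. All function names (fit_cg, neg2_loglik, simulate, outlier_p_value) are hypothetical.

```python
import math
import random
from collections import defaultdict


def fit_cg(data):
    """Fit a toy conditional Gaussian model to (label, value) pairs.

    Assumption for illustration: P(label) is multinomial and
    value | label ~ N(mean[label], var) with one shared variance.
    """
    groups = defaultdict(list)
    for label, y in data:
        groups[label].append(y)
    n = len(data)
    probs = {k: len(v) / n for k, v in groups.items()}
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    # Maximum likelihood estimate of the shared variance.
    var = sum((y - means[label]) ** 2 for label, y in data) / n
    return probs, means, var


def neg2_loglik(obs, model):
    """Return -2 * log density of one observation under the fitted model."""
    probs, means, var = model
    label, y = obs
    if label not in probs:
        return math.inf  # a label never seen in training has zero fitted probability
    log_density = (
        math.log(probs[label])
        - 0.5 * math.log(2 * math.pi * var)
        - (y - means[label]) ** 2 / (2 * var)
    )
    return -2.0 * log_density


def simulate(model, rng):
    """Draw one (label, value) pair from the fitted model."""
    probs, means, var = model
    labels = list(probs)
    label = rng.choices(labels, weights=[probs[k] for k in labels])[0]
    return label, rng.gauss(means[label], math.sqrt(var))


def outlier_p_value(obs, model, n_sim=5000, seed=0):
    """Monte Carlo p-value: share of simulated scores at least as extreme as obs."""
    rng = random.Random(seed)
    score = neg2_loglik(obs, model)
    ref = [neg2_loglik(simulate(model, rng), model) for _ in range(n_sim)]
    return sum(r >= score for r in ref) / n_sim


# Usage on synthetic data (made up for this sketch, not from the paper).
rng = random.Random(1)
train = [("a", rng.gauss(0.0, 1.0)) for _ in range(200)]
train += [("b", rng.gauss(5.0, 1.0)) for _ in range(200)]
model = fit_cg(train)

p_typical = outlier_p_value(("a", 0.1), model)  # value close to the mean of group "a"
p_extreme = outlier_p_value(("a", 5.0), model)  # plausible for "b", not for "a"
print(p_typical, p_extreme)
```

The second test point shows why mixed data needs a joint model: the value 5.0 is ordinary for group "b" but far from anything seen in group "a", so only a score that conditions on the discrete label flags it. An observation would be declared an outlier when its p-value falls below a chosen significance level.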

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Advanced Statistical Methods and Models · Data-Driven Disease Surveillance