Are Outlier Detection Methods Resilient to Sampling?
Laure Berti-Equille, Ji Meng Loh, Saravanan Thirumuruganathan

TL;DR
This paper introduces the concept of resilience to sampling in outlier detection, proposing methods to estimate it and analyzing how different outlier detection techniques perform when trained on samples versus full datasets.
Contribution
It defines resilience to sampling for outlier detection, proposes a novel estimation approach, and provides an extensive experimental evaluation across multiple methods and datasets.
Findings
Methods vary in resilience to sampling.
Careful selection of sampling scheme and method is crucial.
Resilience estimation can guide method selection for large datasets.
Abstract
Outlier detection is a fundamental task in data mining and has many applications including detecting errors in databases. While there has been extensive prior work on methods for outlier detection, modern datasets often have sizes that are beyond the ability of commonly used methods to process the data within a reasonable time. To overcome this issue, outlier detection methods can be trained over samples of the full-sized dataset. However, it is not clear how a model trained on a sample compares with one trained on the entire dataset. In this paper, we introduce the notion of resilience to sampling for outlier detection methods. Orthogonal to traditional performance metrics such as precision/recall, resilience represents the extent to which the outliers detected by a method applied to samples from a sampling scheme match those detected when the method is applied to the whole dataset. We propose a novel…
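To make the idea of resilience concrete, the sketch below compares the outliers flagged by a detector fitted on the full dataset with those flagged when the same detector is fitted on random samples. It is only an illustration under stated assumptions, not the estimation approach proposed in the paper (the abstract is truncated before describing it). The z-score detector, the Jaccard overlap used as the agreement score, uniform sampling without replacement, and the names zscore_outliers, jaccard, and resilience_estimate are all hypothetical choices made for this example.

```python
import random
import statistics


def zscore_outliers(train, data, threshold=3.0):
    """Fit mean/stdev on `train`, return indices of `data` flagged as outliers.

    Assumption: a simple z-score rule stands in for any outlier detector.
    """
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    if sigma == 0:
        return set()
    return {i for i, x in enumerate(data) if abs(x - mu) / sigma > threshold}


def jaccard(a, b):
    """Overlap between two sets of flagged indices (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resilience_estimate(data, sample_fraction=0.1, n_samples=20, seed=0):
    """Average agreement between sample-fitted and full-data detections.

    Assumption: agreement is measured as Jaccard overlap, averaged over
    `n_samples` uniform random samples drawn without replacement.
    """
    rng = random.Random(seed)
    full = zscore_outliers(data, data)  # reference: fitted on everything
    k = max(2, int(sample_fraction * len(data)))
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(data, k)             # fit on a sample only
        flagged = zscore_outliers(sample, data)  # but score the whole dataset
        scores.append(jaccard(full, flagged))
    return statistics.fmean(scores)


if __name__ == "__main__":
    rng = random.Random(42)
    values = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [12.0, -15.0, 20.0]
    for frac in (0.01, 0.1, 0.5):
        print(frac, round(resilience_estimate(values, sample_fraction=frac), 3))
```

A score near 1 would suggest that fitting on samples reproduces the full-data detections under this particular sampling scheme; a low score would suggest the method is sensitive to sampling. Note that this says nothing about precision or recall against ground truth, which the abstract describes as an orthogonal concern.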
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Imbalanced Data Classification Techniques · Water Systems and Optimization
