Rare anomalies require large datasets: About proving the existence of anomalies
Simon Klüttermann, Emmanuel Müller

TL;DR
This study establishes a lower bound on the dataset size needed to reliably prove that anomalies exist, showing that extremely rare anomalies can only be confirmed in very large datasets.
Contribution
It introduces a theoretical bound linking dataset size, contamination rate, and algorithm-specific constants for anomaly existence proof.
Findings
Derived a lower bound formula for dataset size based on contamination rate and algorithm constants.
Demonstrated that extremely rare anomalies require prohibitively large datasets for confirmation.
Provided empirical validation based on over three million statistical tests across various anomaly detection tasks and algorithms.
Abstract
Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in the anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $\alpha_{\text{algo}}$. Our results demonstrate that, for an unlabeled dataset of size $N$ and contamination rate $\nu$, the condition $N \geq \frac{\alpha_{\text{algo}}}{\nu^2}$ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be…
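To make the scaling concrete, the following is a minimal sketch of how a bound of this kind could be evaluated. It assumes the bound takes the inverse-square form N >= alpha / nu^2, where nu is the contamination rate and alpha is the algorithm-dependent constant. The function names and the value of alpha used in the usage note are hypothetical and are not taken from the paper.

```python
import math


def min_samples(contamination: float, alpha: float) -> int:
    """Smallest integer N satisfying N >= alpha / contamination**2.

    Assumption: the bound has the inverse-square form described above.
    `alpha` stands in for the algorithm-dependent constant; its value
    must be supplied by the caller and is not provided here.
    """
    if not 0.0 < contamination <= 1.0:
        raise ValueError("contamination must be in (0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return math.ceil(alpha / contamination ** 2)


def min_contamination(n_samples: int, alpha: float) -> float:
    """Rarest contamination rate still compatible with the bound for a given N.

    Obtained by solving N >= alpha / nu**2 for nu, i.e. nu >= sqrt(alpha / N).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return math.sqrt(alpha / n_samples)
```

Under this assumed form, halving the contamination rate quadruples the required dataset size: with an illustrative alpha of 1.0, `min_samples(0.25, 1.0)` and `min_samples(0.125, 1.0)` differ by a factor of four.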
