Rare anomalies require large datasets: About proving the existence of anomalies

Simon Klüttermann; Emmanuel Müller

arXiv:2508.09894·cs.LG·August 14, 2025

TL;DR

This study establishes a lower bound on the dataset size needed to reliably prove that anomalies exist, showing that extremely rare anomalies can only be confirmed in very large datasets.

Contribution

It introduces a theoretical bound linking dataset size, contamination rate, and algorithm-specific constants for anomaly existence proof.

Findings

01

Derived a lower bound formula for dataset size based on contamination rate and algorithm constants.

02

Demonstrated that extremely rare anomalies require prohibitively large datasets for confirmation.

03

Provided extensive empirical validation across various anomaly detection tasks.

Abstract

Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in the anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $\alpha_{\text{algo}}$. Our results demonstrate that, for an unlabeled dataset of size $N$ and contamination rate $\nu$, the condition $N \geq \frac{\alpha_{\text{algo}}}{\nu^{2}}$ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be…
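
To make the scaling in the bound concrete, the following is a minimal Python sketch that evaluates $N \geq \alpha_{\text{algo}} / \nu^{2}$ in both directions. The function names `min_dataset_size` and `rarest_confirmable_rate` are hypothetical helpers introduced here for illustration, and any value passed for `alpha_algo` is a placeholder: the abstract does not report the constant for any specific algorithm.

```python
import math


def min_dataset_size(alpha_algo: float, nu: float) -> int:
    """Smallest integer N satisfying N >= alpha_algo / nu**2.

    alpha_algo: algorithm-dependent constant (placeholder value, not taken
                from the paper).
    nu:         contamination rate, i.e. the fraction of anomalous samples.
    """
    if alpha_algo <= 0:
        raise ValueError("alpha_algo must be positive")
    if not 0 < nu <= 1:
        raise ValueError("nu must lie in (0, 1]")
    return math.ceil(alpha_algo / nu**2)


def rarest_confirmable_rate(alpha_algo: float, n: int) -> float:
    """Invert the bound: the smallest nu satisfying n >= alpha_algo / nu**2.

    A result above 1 means that no contamination rate meets the bound
    for this dataset size.
    """
    if alpha_algo <= 0 or n <= 0:
        raise ValueError("alpha_algo and n must be positive")
    return math.sqrt(alpha_algo / n)


if __name__ == "__main__":
    alpha = 1.0  # assumed placeholder for alpha_algo
    for nu in (0.5, 0.1, 0.01, 0.001):
        print(f"nu={nu}: N >= {min_dataset_size(alpha, nu)}")
```

Because the bound depends on $1/\nu^{2}$, a tenfold decrease in the contamination rate raises the required dataset size roughly a hundredfold. Read in the other direction, the rarest contamination rate that satisfies the bound shrinks only with the square root of $N$.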
