Rare Yet Popular: Evidence and Implications from Labeled Datasets for Network Anomaly Detection
Jose Manuel Navarro, Alexis Huet, Dario Rossi

TL;DR
This paper systematically analyzes the quality of ground-truth datasets for network anomaly detection. It shows that some types of anomalies are far more common than others, and that clustering anomalies can reduce labeling effort by 2x-10x.
Contribution
It provides what the authors describe as the first quantitative analysis of ground-truth quality in network anomaly detection datasets, with a focus on the spatial properties of the data and on how popular each type of anomaly is.
Findings
Anomalies vary significantly in popularity within datasets.
Clustering reduces labeling effort by 2x-10x.
First quantitative analysis of ground truth in real-world network data.
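The labeling-effort finding can be illustrated with a small sketch. This is not the paper's procedure, which this summary does not describe. It assumes that anomalies sharing a signature are grouped into one cluster and that an expert labels one representative per cluster instead of every anomaly. The function name, the `key` parameter, and the toy device and signature names are all hypothetical.

```python
from collections import defaultdict

def labeling_effort_reduction(anomalies, key):
    """Group anomalies by a signature and compare labeling costs.

    Assumption: one label per cluster is enough, so the reduction
    factor is (number of anomalies) / (number of clusters).
    """
    clusters = defaultdict(list)
    for anomaly in anomalies:
        clusters[key(anomaly)].append(anomaly)
    if not clusters:
        return 1.0, {}
    return len(anomalies) / len(clusters), dict(clusters)

# Hypothetical toy data: (device, anomaly signature) pairs.
anomalies = [
    ("router-1", "cpu_spike"),
    ("router-2", "cpu_spike"),
    ("router-3", "cpu_spike"),
    ("router-4", "link_flap"),
    ("router-5", "cpu_spike"),
    ("router-6", "link_flap"),
]

# Cluster on the signature only, ignoring which device raised it.
factor, clusters = labeling_effort_reduction(anomalies, key=lambda a: a[1])
print(f"{len(anomalies)} anomalies, {len(clusters)} clusters, "
      f"{factor:.1f}x fewer labels")
```

In this toy case a few popular signatures account for most anomalies, so labeling per cluster rather than per anomaly needs far fewer labels. The achievable factor depends entirely on how concentrated the anomaly types are in a given dataset.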
Abstract
Research on anomaly detection generally proposes algorithms or end-to-end systems designed to automatically discover outliers in a dataset or a stream. While the literature abounds with algorithms and with definitions of metrics for better evaluation, the quality of the ground truth against which they are evaluated is seldom questioned. In this paper, we present a systematic analysis of available public (and additionally our private) ground truth for anomaly detection in the context of network environments, where data is intrinsically temporal, multivariate and, in particular, exhibits spatial properties, which, to the best of our knowledge, we are the first to explore. Our analysis reveals that, while anomalies are, by definition, temporally rare events, their spatial characterization clearly shows that some types of anomalies are significantly more popular than others. We find…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection · Data-Driven Disease Surveillance
