Datasets are not Enough: Challenges in Labeling Network Traffic
Jorge Guerra, Carlos Catania, Eduardo Veas

TL;DR
This paper critically analyzes current methodologies for labeling network traffic datasets, highlighting their limitations in quality, volume, and speed, and emphasizes the need for a standardized, continuous labeling approach to improve network security research.
Contribution
It provides an in-depth evaluation of existing network traffic labeling methods, identifying fundamental drawbacks and advocating for a consistent, validated labeling methodology.
Findings
Current labeling methods are often outdated and inconsistent.
Synthetic data generation hides key aspects of real network behavior.
Manual labeling with non-experts faces quality and scalability issues.
Abstract
In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. Many of the available public labeled datasets represent network behavior only for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary…
Taxonomy
Topics: Network Security and Intrusion Detection · Internet Traffic Analysis and Secure E-voting · Anomaly Detection Techniques and Applications
