On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware

Molly Buchanan; Jeffrey W. Collyer; Jack W. Davidson; Saikat Dey; Mark Gardner; Jason D. Hiser; Jeffry Lang; Alastair Nottingham; Alina Oprea

arXiv:2104.10034 · cs.CR · May 30, 2022

Open Access

TL;DR

This paper introduces a method for generating realistic, labeled network traffic data by embedding defanged malware into real network environments, creating a valuable dataset for cybersecurity ML research.

Contribution

The authors present a novel approach to produce large-scale, realistic, labeled network traffic data by safely injecting malware into production networks and anonymizing the data for research use.

Findings

01. Generated a dataset with over 1.5 trillion connections and a petabyte of data.
02. Demonstrated the dataset's utility in AI/ML cybersecurity research.
03. Maintained high realism and security in the data collection process.

Abstract

Research and development of techniques which detect or remediate malicious network activity require access to diverse, realistic, contemporary data sets containing labeled malicious connections. In the absence of such data, said techniques cannot be meaningfully trained, tested, and evaluated. Synthetically produced data containing fabricated or merged network traffic is of limited value as it is easily distinguishable from real traffic by even simple machine learning (ML) algorithms. Real network data is preferable, but while ubiquitous is broadly both sensitive and lacking in ground truth labels, limiting its utility for ML research. This paper presents a multi-faceted approach to generating a data set of labeled malicious connections embedded within anonymized network traffic collected from large production networks. Real-world malware is defanged and introduced to simulated,…

Taxonomy

Topics: Network Security and Intrusion Detection · Internet Traffic Analysis and Secure E-voting · Advanced Malware Detection Techniques