SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection
Richard Harang, Ethan M. Rudd

TL;DR
The paper introduces SOREL-20M, a dataset of nearly 20 million files with pre-extracted features, metadata, and labels for malicious PE detection research, released together with baseline models and tooling.
Contribution
It provides one of the largest labeled datasets for PE malware detection, including disarmed samples, and offers baseline models and code for further research.
Findings
Baseline neural network and gradient boosted decision tree models are released with their results and full training and evaluation code.
Disarmed malware samples enable exploration of detection strategies.
The dataset facilitates large-scale empirical research in malware detection.
Abstract
In this paper we describe the SOREL-20M (Sophos/ReversingLabs-20 Million) dataset: a large-scale dataset consisting of nearly 20 million files with pre-extracted features and metadata, high-quality labels derived from multiple sources, information about vendor detections of the malware samples at the time of collection, and additional "tags" related to each malware sample to serve as additional targets. In addition to features and metadata, we also provide approximately 10 million "disarmed" malware samples (samples with both the optional_headers.subsystem and file_header.machine flags set to zero) that may be used for further exploration of features and detection strategies. We also provide Python code to interact with the data and features, as well as baseline neural network and gradient boosted decision tree models and their results, with full training and evaluation code,…
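The "disarming" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' tooling: the function names (`read_flags`, `disarm`, `is_disarmed`, `make_toy_header`) are hypothetical, and the byte offsets are assumptions taken from the standard PE layout (Machine is the first field of the COFF file header, and Subsystem sits 68 bytes into the optional header for both PE32 and PE32+). The toy buffer is a synthetic header stub, not a real executable.

```python
import struct

E_LFANEW_OFFSET = 0x3C        # DOS header field holding the PE signature offset
MACHINE_OFFSET = 4            # Machine: right after the 4-byte "PE\0\0" signature
SUBSYSTEM_OFFSET = 4 + 20 + 68  # signature + COFF header + offset inside optional header


def _pe_offset(data):
    """Return the offset of the PE signature, validating the two magic values."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ magic")
    (pe,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if data[pe:pe + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return pe


def read_flags(data):
    """Return (machine, subsystem) as stored in the headers."""
    pe = _pe_offset(data)
    (machine,) = struct.unpack_from("<H", data, pe + MACHINE_OFFSET)
    (subsystem,) = struct.unpack_from("<H", data, pe + SUBSYSTEM_OFFSET)
    return machine, subsystem


def disarm(data):
    """Return a copy of the file with both header fields set to zero."""
    buf = bytearray(data)
    pe = _pe_offset(buf)
    struct.pack_into("<H", buf, pe + MACHINE_OFFSET, 0)
    struct.pack_into("<H", buf, pe + SUBSYSTEM_OFFSET, 0)
    return bytes(buf)


def is_disarmed(data):
    """True when both fields are zero, matching the abstract's definition."""
    return read_flags(data) == (0, 0)


def make_toy_header(machine=0x8664, subsystem=3):
    """Build a synthetic header stub (hypothetical helper, for illustration only)."""
    buf = bytearray(0x40 + 96)
    buf[:2] = b"MZ"
    struct.pack_into("<I", buf, E_LFANEW_OFFSET, 0x40)
    buf[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", buf, 0x40 + MACHINE_OFFSET, machine)
    struct.pack_into("<H", buf, 0x40 + SUBSYSTEM_OFFSET, subsystem)
    return bytes(buf)


toy = make_toy_header()
disarmed = disarm(toy)
```

A check such as `is_disarmed(sample_bytes)` could serve as a sanity test before working with the released binaries, assuming they follow the layout above. Only the two 16-bit fields change, so the rest of the file content is preserved for feature extraction.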
Taxonomy
Topics: Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning
