MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels
Robert J. Joyce, Dev Amlani, Charles Nicholas, Edward Raff

TL;DR
The MOTIF dataset provides the largest, most diverse, expert-labeled malware family dataset to date, enabling more accurate evaluation and research in malware classification.
Contribution
We introduce MOTIF, the largest publicly available malware dataset with ground truth labels and threat report mappings, facilitating improved malware family classification research.
Findings
Existing malware classification tools achieve less than 63% accuracy when evaluated on MOTIF.
The dataset reveals significant labeling noise and open-set challenges.
Evaluation highlights the need for improved classification methods.
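As a rough illustration of what a family-level accuracy figure like the one above measures, the sketch below scores a tool's predicted family names against ground truth labels. The function name, the case-insensitive comparison, the rule that a missing prediction counts as wrong, and the toy data are all assumptions made for this example; they are not taken from the paper or from MOTIF itself.

```python
def family_accuracy(truth, predicted):
    """Fraction of samples whose predicted family matches the ground truth.

    `truth` and `predicted` map a sample identifier to a family name.
    Assumption: a sample with no prediction (missing or None) is scored
    as incorrect, and names are compared case-insensitively.
    """
    if not truth:
        raise ValueError("truth must contain at least one sample")
    correct = 0
    for sample, family in truth.items():
        guess = predicted.get(sample)
        if guess is not None and guess.lower() == family.lower():
            correct += 1
    return correct / len(truth)


# Toy data invented for illustration only (not drawn from MOTIF).
truth = {"s1": "alpha", "s2": "beta", "s3": "gamma", "s4": "alpha"}
predicted = {"s1": "Alpha", "s2": "delta", "s3": None, "s4": "alpha"}

accuracy = family_accuracy(truth, predicted)
print(f"accuracy = {accuracy:.2%}")
```

An exact string comparison like this one would undercount correct answers whenever a tool uses a different name for the same family, so a real evaluation would need some way of reconciling family names before scoring.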
Abstract
Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in…
Taxonomy
Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Cybercrime and Law Enforcement Studies
