Virus-MNIST: A Benchmark Malware Dataset
David Noever, Samantha E. Miller Noever

TL;DR
This paper introduces Virus-MNIST, a malware dataset formatted as images for deep learning, demonstrating high accuracy in virus family classification and showcasing the transferability of image-based algorithms to malware detection.
Contribution
It provides a novel image-based malware dataset and benchmarks deep learning methods, enabling transfer of image classification techniques to malware analysis.
Findings
80% overall accuracy in virus family classification with benign samples included (MobileNetV2)
87% accuracy in virus type classification after detection
Dataset available on Kaggle and GitHub
Abstract
This short note presents an image classification dataset consisting of 10 executable code varieties and approximately 50,000 virus examples. The classes comprise 9 families of computer viruses and one benign set. The image formatting for the first 1024 bytes of the Portable Executable (PE) mirrors the familiar MNIST handwriting dataset, so most previously explored algorithmic methods transfer with minor modifications. The 9 virus family labels derive from unsupervised learning: we discover the families with KMeans clustering that excludes the non-malicious examples. As a benchmark using deep learning methods (MobileNetV2), we find an overall 80% accuracy for virus identification by families when benign software ("beneware") is included. We also find that once a positive malware detection occurs (by signature or heuristics), the projection of…
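The byte-to-image formatting described in the abstract can be illustrated with a minimal sketch. It assumes a 32 x 32 grayscale layout (since 1024 = 32 x 32) filled row by row, with zero-padding for short files; the abstract does not state these details, and the function name `bytes_to_image` and the sample bytes are hypothetical, not the authors' code.

```python
def bytes_to_image(data: bytes, side: int = 32) -> list:
    """Map the first side*side bytes to a side x side grid of 0-255 pixel values."""
    n = side * side
    # Assumption: files shorter than n bytes are zero-padded on the right.
    header = data[:n].ljust(n, b"\x00")
    # Assumption: bytes fill the grid row by row.
    return [list(header[r * side:(r + 1) * side]) for r in range(side)]


# Hypothetical stand-in for a real executable: the "MZ" magic plus filler bytes.
sample = b"MZ" + bytes(range(256)) * 4
image = bytes_to_image(sample)
print(len(image), len(image[0]), image[0][:2])  # 32 32 [77, 90]
```

Each pixel is simply the integer value of one byte, so the resulting grid could be fed to an image classifier in the same way as an MNIST digit.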
Taxonomy
Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications
