Virus-MNIST: A Benchmark Malware Dataset

David Noever; Samantha E. Miller Noever

arXiv:2103.00602·cs.CR·March 2, 2021·22 cites

TL;DR

This paper introduces Virus-MNIST, a malware dataset formatted as images for deep learning, demonstrating high accuracy in virus family classification and showcasing the transferability of image-based algorithms to malware detection.

Contribution

It provides a novel image-based malware dataset and benchmarks deep learning methods, enabling transfer of image classification techniques to malware analysis.

Findings

01 80% overall accuracy in classifying samples by virus family when benign examples are included (MobileNetV2 benchmark)

02 87% accuracy in classifying virus type once malware has already been detected

03 Dataset available on Kaggle and GitHub

Abstract

This short note presents an image classification dataset consisting of 10 executable code varieties and approximately 50,000 virus examples. The classes comprise 9 families of computer viruses and one benign set. The image formatting of the first 1024 bytes of the Portable Executable (PE) mirrors the familiar MNIST handwriting dataset, so that most previously explored algorithmic methods transfer with minor modifications. The 9 malware family labels derive from unsupervised learning: we discover the families with KMeans clustering that excludes the non-malicious examples. As a benchmark using deep learning methods (MobileNetV2), we find an overall 80% accuracy for virus identification by family when benign software ("beneware") is included. We also find that once a positive malware detection occurs (by signature or heuristics), the projection of…
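To make the image formatting concrete, the following is a minimal sketch of how the first 1024 bytes of a file could be laid out as a grayscale thumbnail. It is not the authors' code. The 32 x 32 layout (one byte per pixel, row-major), the zero-padding of short files, and the names `bytes_to_thumbnail` and `file_to_thumbnail` are all assumptions made for illustration.

```python
from typing import List

SIDE = 32               # assumption: 32 x 32 = 1024 pixels, one pixel per byte
N_BYTES = SIDE * SIDE   # 1024, the header length named in the abstract


def bytes_to_thumbnail(data: bytes, side: int = SIDE) -> List[List[int]]:
    """Lay out the first side*side bytes as a row-major grid of 0-255 values.

    Inputs shorter than side*side bytes are zero-padded (an assumption;
    the abstract does not say how short files are handled).
    """
    n = side * side
    head = data[:n].ljust(n, b"\x00")
    return [list(head[r * side:(r + 1) * side]) for r in range(side)]


def file_to_thumbnail(path: str) -> List[List[int]]:
    """Read only the leading bytes of a file and convert them to a thumbnail."""
    with open(path, "rb") as f:
        return bytes_to_thumbnail(f.read(N_BYTES))


if __name__ == "__main__":
    # Synthetic input: the "MZ" magic that opens a PE file, then filler bytes.
    sample = b"MZ" + bytes(range(256)) * 2
    image = bytes_to_thumbnail(sample)
    print(len(image), len(image[0]), image[0][:2])
```

Each pixel value is simply the byte value, so the grid can be passed to an image classifier in the same way as an MNIST digit after any resizing or normalization the chosen model requires.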

Taxonomy

Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications