Virus-MNIST: Machine Learning Baseline Calculations for Image Classification

Erik Larsen, Korey MacVittie, and John Lilly

arXiv:2111.02375·cs.LG·November 4, 2021

TL;DR

This paper introduces Virus-MNIST, a dataset of thumbnail images cast from possible malware code (nine malware classes and one benign class) for benchmarking virus classification models, and evaluates several machine learning algorithms on it.

Contribution

It provides a new malware image dataset and baseline classification results using several machine learning models for future research.

Findings

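01. Light Gradient Boosting Machine, Gradient Boosting Classifier, and Random Forest produced the highest accuracy scores in the model comparison

02. Identifying strongly correlated features can reduce the number of features

03. Model comparison highlights promising algorithms for malware classification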
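
The correlation-based feature reduction mentioned in the findings can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `drop_correlated_features`, the greedy keep-first strategy, and the 0.95 threshold are all assumptions chosen for the example.

```python
import numpy as np

def drop_correlated_features(X, threshold=0.95):
    """Return indices of columns kept after a greedy correlation filter.

    Hypothetical helper: the threshold and the keep-first rule are
    assumptions, not values taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    # Constant columns have undefined correlation, so drop them first.
    varying = np.flatnonzero(X.std(axis=0) > 0)
    if varying.size == 0:
        return varying
    # Absolute Pearson correlation between the remaining columns.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, varying], rowvar=False)))
    keep = []
    for j in range(varying.size):
        # Keep column j only if no already-kept column is strongly correlated with it.
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return varying[keep]

# Toy data: column 1 is a linear copy of column 0, column 3 is constant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 1, b, np.zeros(200)])
kept = drop_correlated_features(X)
```

On flattened image data, each column would be one pixel position, so the returned indices identify the pixels retained as features.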

Abstract

The Virus-MNIST data set is a collection of thumbnail images that is similar in style to the ubiquitous MNIST hand-written digits. These, however, are cast by reshaping possible malware code into an image array. Naturally, it is poised to take on a role in benchmarking progress of virus classifier model training. Ten types are present: nine classified as malware and one benign. Cursory examination reveals unequal class populations and other key aspects that must be considered when selecting classification and pre-processing methods. Exploratory analyses show possible identifiable characteristics from aggregate metrics (e.g., the pixel median values), and ways to reduce the number of features by identifying strong correlations. A model comparison shows that Light Gradient Boosting Machine, Gradient Boosting Classifier, and Random Forest algorithms produced the highest accuracy scores,…
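
To make the abstract's description concrete, the sketch below shows one way raw bytes can be reshaped into a square thumbnail, and how an aggregate metric such as the per-class pixel median can be computed. It is an illustration under stated assumptions: the helper names `bytes_to_thumbnail` and `median_by_class`, the default side length of 32, and the zero-padding of short inputs are not taken from the abstract.

```python
import numpy as np

def bytes_to_thumbnail(raw: bytes, side: int = 32) -> np.ndarray:
    """Map the first side*side bytes to a (side, side) uint8 image.

    Assumption: inputs shorter than side*side bytes are zero-padded.
    """
    n = side * side
    buf = np.frombuffer(raw[:n], dtype=np.uint8)
    padded = np.zeros(n, dtype=np.uint8)
    padded[: buf.size] = buf
    return padded.reshape(side, side)

def median_by_class(images, labels):
    """Return {class label: median pixel value over all images of that class}."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    return {int(c): float(np.median(images[labels == c])) for c in np.unique(labels)}

# Tiny example: three bytes padded into a 2x2 image.
img = bytes_to_thumbnail(b"\x10\x20\x30", side=2)

# Two synthetic single-image "classes" with uniform pixel values.
stack = np.stack([
    np.full((2, 2), 10, dtype=np.uint8),
    np.full((2, 2), 200, dtype=np.uint8),
])
medians = median_by_class(stack, [0, 1])
```

Because the class populations are unequal, the median for a sparsely populated class rests on far fewer pixels than the median for a large class, which is worth keeping in mind when comparing them.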

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Advanced Malware Detection Techniques · Digital Media Forensic Detection