Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

Savino Dambra; Yufei Han; Simone Aonzo; Platon Kotzias; Antonino Vitale; Juan Caballero; Davide Balzarotti; Leyla Bilge

arXiv:2307.14657·cs.CR·July 28, 2023

TL;DR

This study investigates the factors affecting machine learning-based malware classification, analyzing dataset composition, feature types, and model performance, and introduces a large balanced malware dataset for comprehensive evaluation.

Contribution

It provides the largest balanced malware dataset to date and systematically examines how dataset characteristics and feature types influence classification performance.

Findings

01

Static features outperform dynamic features in malware classification.

02

Combining static and dynamic features yields marginal improvements.

03

More families make classification harder; more samples per family improve accuracy.
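
The third finding depends on two dataset knobs: how many families are present and how many samples each family contributes. A minimal sketch of how one might draw balanced subsets to vary those knobs is shown here. It is not the authors' pipeline; the function name `balanced_subset`, the `labels` input format (sample id mapped to family name), and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_subset(labels, n_families, per_family, seed=0):
    """Pick n_families families and per_family samples from each.

    labels: dict mapping a sample id to its family name
            (hypothetical input format, not taken from the paper).
    Returns a dict of the same shape restricted to the chosen samples.
    """
    rng = random.Random(seed)

    # Group sample ids by family.
    by_family = defaultdict(list)
    for sample_id, family in labels.items():
        by_family[family].append(sample_id)

    # Only families with enough samples can keep the subset balanced.
    eligible = sorted(f for f, ids in by_family.items() if len(ids) >= per_family)
    if len(eligible) < n_families:
        raise ValueError("not enough families with the requested number of samples")

    subset = {}
    for family in rng.sample(eligible, n_families):
        # Sort first so the draw is reproducible for a given seed.
        for sample_id in rng.sample(sorted(by_family[family]), per_family):
            subset[sample_id] = family
    return subset


# Toy data (assumed): 5 families with 10 samples each.
toy = {f"s{f}_{i}": f"fam{f}" for f in range(5) for i in range(10)}
subset = balanced_subset(toy, n_families=3, per_family=4)
```

Calling the helper repeatedly with different `n_families` and `per_family` values yields a grid of balanced training sets, on which any classifier could then be trained and compared.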

Abstract

Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting almost-perfect performance. However, they assemble ground truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influences performance, and how well static and dynamic features complement each other. This work sheds light on those open questions by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67K samples from 670…

Code & Models

Repositories

eurecom-s3/decodingmlsecretsofwindowsmalwareclassification (Official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning