Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, Boian S. Alexandrov

TL;DR
This paper introduces a hierarchical semi-supervised malware classification method that effectively handles extreme class imbalance, detects new malware families, and requires minimal labeled data, demonstrated on a large real-world dataset.
Contribution
The proposed HNMFk Classifier combines hierarchical non-negative matrix factorization with automatic model selection to perform semi-supervised malware family classification under extreme class imbalance.
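To make the idea of hierarchical factorization with an automatically chosen rank more concrete, here is a minimal Python sketch on toy data. It is not the authors' implementation. The function names (`nmf`, `select_rank`, `split`), the parameters (`tol`, `k_max`, `max_depth`, `min_size`), and the synthetic "families" are all assumptions made for illustration. In particular, the rank is chosen here with a simple reconstruction-error threshold, which is a stand-in and not the paper's model selection procedure, and the semi-supervised label propagation step is omitted.

```python
import numpy as np

def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Factor X (samples x features) as W @ H with W, H >= 0,
    using standard multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    err = np.linalg.norm(X - W @ H) / (np.linalg.norm(X) + eps)
    return W, H, err

def select_rank(X, k_max=4, tol=0.05, restarts=3):
    """Hypothetical stand-in for automatic model selection: return the
    smallest k whose best relative error over a few restarts is <= tol,
    falling back to the largest k tried."""
    best = None
    for k in range(1, min(k_max, min(X.shape)) + 1):
        best = min((nmf(X, k, seed=s) for s in range(restarts)),
                   key=lambda r: r[2])
        if best[2] <= tol:
            break
    return best[0].shape[1], best[0]

def split(X, idx, depth=0, max_depth=3, min_size=4):
    """Recursively cluster the rows listed in idx and return the leaves
    as a list of index arrays. A node becomes a leaf when the selected
    rank is 1, the node is small, or the depth limit is reached."""
    if depth >= max_depth or len(idx) < min_size:
        return [idx]
    k, W = select_rank(X[idx])
    if k == 1:
        return [idx]
    labels = W.argmax(axis=1)  # assign each sample to its strongest component
    leaves = []
    for c in range(k):
        members = idx[labels == c]
        if len(members):
            leaves += split(X, members, depth + 1, max_depth, min_size)
    return leaves

# Toy, imbalanced data (assumption): three "families" of very different
# sizes, each occupying its own block of four non-negative features.
rng = np.random.default_rng(42)
sizes = [40, 10, 3]
X = np.zeros((sum(sizes), 12))
row = 0
for c, n in enumerate(sizes):
    pattern = rng.uniform(0.5, 1.5, size=4)
    scale = rng.uniform(0.5, 2.0, size=(n, 1))
    X[row:row + n, 4 * c:4 * c + 4] = scale * pattern
    row += n

leaves = split(X, np.arange(X.shape[0]))
print([len(leaf) for leaf in leaves])  # sizes of the leaf clusters
```

In a semi-supervised setting, one plausible next step would be to assign each leaf the label of its known members and to abstain on leaves that contain no labeled samples; that step is left out here because the summary above does not spell it out.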
Findings
Achieved an F1 score of 0.80 on the EMBER-2018 dataset.
Outperformed baseline supervised and semi-supervised models.
Effectively identified rare and novel malware families.
Abstract
Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic…
