Label-efficient Training Updates for Malware Detection over Time
Luca Minnei, Cristian Manca, Giorgio Piras, Angelo Sotgiu, Maura Pintor, Daniele Ghiani, Davide Maiorca, Giorgio Giacinto, and Battista Biggio

TL;DR
This paper proposes a model-agnostic framework combining active and semi-supervised learning to reduce labeling costs and improve malware detection over time, with a new methodology for analyzing feature drift.
Contribution
It introduces a comprehensive, domain-agnostic evaluation of active learning (AL) and semi-supervised learning (SSL) techniques for malware detection, along with a feature-level drift analysis methodology.
Findings
Combining AL and SSL reduces manual labeling by up to 90%.
The combined techniques achieve detection performance comparable to full retraining.
Feature stability correlates with detector performance over time.
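To make the AL and SSL combination concrete, here is a minimal sketch of one update step. It is an illustration under stated assumptions, not the paper's actual pipeline: it assumes uncertainty sampling as the AL strategy and confidence-thresholded pseudo-labeling as the SSL strategy, both common choices. The function name `split_update_batch`, the `label_budget` and `confidence` parameters, and the example scores are all hypothetical.

```python
from typing import List, Tuple


def split_update_batch(
    scores: List[float],
    label_budget: int,
    confidence: float = 0.9,
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Split new unlabeled samples into analyst queries and pseudo-labels.

    scores[i] is the detector's estimated probability that sample i is malware.
    Returns (indices to send to an analyst, list of (index, pseudo_label)).
    """
    # Active learning step (assumed: uncertainty sampling). Query the samples
    # whose score is closest to 0.5, up to the manual labeling budget.
    by_uncertainty = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    to_label = by_uncertainty[:label_budget]
    queried = set(to_label)

    # Semi-supervised step (assumed: pseudo-labeling). Among the remaining
    # samples, keep only confident predictions and use the predicted class
    # (1 = malware, 0 = benign) as the training label.
    pseudo = []
    for i, p in enumerate(scores):
        if i in queried:
            continue
        if p >= confidence:
            pseudo.append((i, 1))
        elif p <= 1.0 - confidence:
            pseudo.append((i, 0))
    return to_label, pseudo


# Hypothetical detector scores for six newly collected samples.
scores = [0.02, 0.48, 0.97, 0.55, 0.70, 0.05]
to_label, pseudo = split_update_batch(scores, label_budget=2)
print(to_label)  # indices sent for manual analysis
print(pseudo)    # (index, pseudo_label) pairs added without manual labeling
```

In this sketch only the two most ambiguous samples cost analyst time, three confident ones are labeled automatically, and the remaining sample is left out of the update. The detector would then be retrained on the union of both sets.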
Abstract
Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling newly acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent…
