Fast & Furious: Modelling Malware Detection as Evolving Data Streams

Fabrício Ceschin; Marcus Botacin; Heitor Murilo Gomes; Felipe Pinagé; Luiz S. Oliveira; André Grégio

arXiv:2205.12311 · cs.CR · August 23, 2022

TL;DR

This paper evaluates how concept drift affects malware detection models over time and proposes a novel adaptive data stream pipeline to improve detection accuracy in evolving malware landscapes.

Contribution

The paper analyzes concept drift in malware detection, compares drift detection methods, and proposes an adaptive pipeline that outperforms existing approaches.

Findings

01 Concept drift significantly impacts malware classifier performance over nine years.

02 The proposed adaptive pipeline maintains higher detection accuracy amidst evolving malware.

03 Certain drift detectors and feature extractors outperform others in real-world scenarios.
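
To make the idea of a drift detector concrete, here is a minimal sketch of a DDM-style rule (after Gama et al.'s Drift Detection Method): it tracks a classifier's running error rate and flags drift when that rate rises well above its best observed level. This is an illustration only, not the paper's pipeline or the code in its repository. The names SimpleDDM and first_drift_index, the thresholds, and the synthetic error stream are all assumptions made for the example.

```python
import math


class SimpleDDM:
    """Simplified DDM-style detector (illustrative, not the paper's code)."""

    def __init__(self, min_samples=30, warn_level=2.0, drift_level=3.0):
        self.min_samples = min_samples  # samples to see before testing
        self.warn_level = warn_level    # std-devs above the best level: warning
        self.drift_level = drift_level  # std-devs above the best level: drift
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # its standard deviation

    def update(self, error):
        """Feed 1 for a misclassification, 0 for a correct prediction."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + self.drift_level * self.s_min:
            self.reset()  # a real pipeline would also retrain the classifier
            return "drift"
        if self.p + s >= self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"


def first_drift_index(errors):
    """Return the index of the first detected drift, or None."""
    detector = SimpleDDM()
    for i, e in enumerate(errors):
        if detector.update(e) == "drift":
            return i
    return None


# Synthetic stream (an assumption): 10% errors, then 50% after "drift".
stable = [1 if i % 10 == 0 else 0 for i in range(1000)]
drifted = [1 if i % 2 == 0 else 0 for i in range(1000)]
stream = stable + drifted

print("first drift flagged at index:", first_drift_index(stream))
```

In an adaptive pipeline of the kind summarized above, a "drift" signal would typically trigger retraining of the classifier, and possibly the feature extractor, on recent samples. A stream with no errors at all in the warm-up window is a degenerate case for this simplified rule and would need extra handling.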

Abstract

Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drifts) that directly affect ML model detection rates, something not considered in the majority of the literature work. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 285K apps). We used these datasets to train an…

Code & Models

Repositories

fabriciojoc/fast-furious-malware-data-streams (Official)
