Machine Learning Transferability for Malware Detection

César Vieira; João Vitorino; Eva Maia; Isabel Praça

arXiv:2603.26632·cs.CR·March 30, 2026

TL;DR

This paper assesses how different data preprocessing methods affect machine learning models' ability to detect malware across various datasets, highlighting challenges in transferability due to feature incompatibility.

Contribution

It evaluates the effectiveness of specific preprocessing pipelines and training setups in improving ML malware detection transferability across multiple datasets.

Findings

01 Preprocessing impacts model generalization across datasets.

02 Training with combined datasets improves transferability.

03 Models trained on unified features perform better on unseen datasets.

Abstract

Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite ongoing efforts to develop Machine Learning (ML) detection approaches, public datasets still lack feature compatibility. This limits generalization under distribution shift, as well as transferability to different datasets. This study evaluates the suitability of different data preprocessing approaches for the detection of Portable Executable (PE) files with ML models. The preprocessing pipeline unifies datasets under EMBERv2 (2,381-dim) features and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. For evaluation, models from both setups are tested against TRITIUM, INFERNO and SOREL-20M. ERMDS is also used for testing for the…
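The evaluation protocol described in the abstract (train on combined datasets that share one feature layout, then test on a dataset not seen during training) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data is synthetic, the helper names (`make_synthetic`, `combine`, `fit_centroids`, `accuracy`) are hypothetical, and a nearest-centroid classifier stands in for whatever models the paper actually trains. Only the 2,381 feature dimensionality comes from the abstract.

```python
import random

EMBER_V2_DIM = 2381  # feature dimensionality stated in the abstract


def make_synthetic(n, dim, shift, seed):
    """Hypothetical stand-in for a dataset: (X, y) with a per-dataset shift."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 0 = benign, 1 = malicious
        center = (1.0 if label else -1.0) + shift
        X.append([rng.gauss(center, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


def combine(*datasets, dim):
    """Concatenate datasets, rejecting any whose feature width differs."""
    X_all, y_all = [], []
    for X, y in datasets:
        if any(len(row) != dim for row in X):
            raise ValueError("incompatible feature dimensionality")
        X_all.extend(X)
        y_all.extend(y)
    return X_all, y_all


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Return the label whose centroid is closest in squared distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


# Small width keeps the demo fast; real EMBERv2 vectors have EMBER_V2_DIM entries.
DIM = 32
train_a = make_synthetic(200, DIM, shift=0.0, seed=1)  # plays the role of a first training set
train_b = make_synthetic(200, DIM, shift=0.2, seed=2)  # second training set
train_c = make_synthetic(200, DIM, shift=0.4, seed=3)  # third training set
unseen = make_synthetic(200, DIM, shift=0.3, seed=4)   # held-out test set

setups = {
    "A+B": combine(train_a, train_b, dim=DIM),
    "A+B+C": combine(train_a, train_b, train_c, dim=DIM),
}
results = {
    name: accuracy(fit_centroids(X, y), *unseen)
    for name, (X, y) in setups.items()
}
for name, acc in results.items():
    print(f"{name}: accuracy on unseen data = {acc:.3f}")
```

The dimension check in `combine` mirrors the feature-compatibility concern raised in the abstract: datasets can only be pooled once they share one feature layout. The synthetic numbers say nothing about the paper's results.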
