Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020
Diego Ferreira Duarte, Andre Augusto Bortoli

TL;DR
This study empirically evaluates the impact of SMOTE on machine learning models for Android malware detection using CICMalDroid2020 data, revealing that SMOTE often degrades performance and that tree-based models like XGBoost excel without it.
Contribution
It provides the first comprehensive empirical analysis of SMOTE's effectiveness in Android malware detection with dynamic features, highlighting the robustness of tree-based algorithms.
Findings
SMOTE often degrades model performance in this context
Tree-based models outperform the other classifiers, reaching recall above 94%
SMOTE may not be suitable for complex, sparse dynamic malware data
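
To make the last finding more concrete, the sketch below shows the interpolation step at the core of SMOTE in plain Python. It is an illustration only, not the authors' pipeline: the function names (`smote_sample`, `nonzero_count`), the parameters, and the six-dimensional toy vectors are all hypothetical and do not come from CICMalDroid2020. SMOTE creates each synthetic minority sample on the line segment between a real minority sample and one of its nearest minority neighbours. When the inputs are sparse count vectors, the result tends to have fractional values and a support that merges both parents, which is one plausible reason (an assumption here, not a claim taken from the paper) why synthetic samples may not resemble real behaviour traces.

```python
import random


def smote_sample(minority, k=2, n_new=3, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [x for x in minority if x is not base]
        others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def nonzero_count(vector):
    """Number of non-zero entries, a simple proxy for sparsity."""
    return sum(1 for value in vector if value != 0)


# Hypothetical sparse count vectors standing in for dynamic features.
minority = [
    [3, 0, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 1],
]

new_points = smote_sample(minority)
for point in new_points:
    print([round(v, 2) for v in point], "non-zeros:", nonzero_count(point))
```

In this toy setup the parents have disjoint supports, so a synthetic point usually has more non-zero entries than either parent. In an actual evaluation, oversampling would normally be applied to the training split only, so that test metrics such as recall are computed on real samples.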
Abstract
Malware, malicious software designed to damage computer systems and perpetrate scams, is proliferating at an alarming rate, with thousands of new threats emerging daily. Android devices, prevalent in smartphones, smartwatches, tablets, and IoT devices, represent a vast attack surface, making malware detection crucial. Although advanced analysis techniques exist, Machine Learning (ML) emerges as a promising tool to automate and accelerate the discovery of these threats. This work tests ML algorithms in detecting malicious code from dynamic execution characteristics. For this purpose, the CICMalDroid2020 dataset, composed of dynamically obtained Android malware behavior samples, was used with the algorithms XGBoost, Naïve Bayes (NB), Support Vector Classifier (SVC), and Random Forest (RF). The study focused on empirically evaluating the impact of the SMOTE technique, used to mitigate class…
Taxonomy
Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Digital and Cyber Forensics
