Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset
Md Min-Ha-Zul Abedin, Tazqia Mehrub

TL;DR
This paper benchmarks eight machine learning models, mainly ensemble and deep learning methods, for static malware detection on the EMBER dataset and analyzes how dimensionality reduction affects their performance.
Contribution
It compares eight classifiers under three preprocessing strategies (original features, PCA, and LDA), highlighting the robustness of ensemble methods and showing how feature reduction affects each model family.
Findings
Ensemble models, especially LightGBM and XGBoost, perform best across all preprocessing configurations.
LDA improves KNN performance but significantly reduces the accuracy of boosting models.
TabNet underperforms when feature reduction is applied, suggesting that its architecture is sensitive to the reduced feature space.
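One property that may help in reading the LDA findings above: as a supervised projection, LDA yields at most (number of classes - 1) components, so a binary benign/malicious task collapses to a single feature, whereas PCA keeps as many components as requested. The sketch below illustrates this with scikit-learn on synthetic data. The data, sample counts, and the choice of 10 PCA components are illustrative assumptions, not the paper's setup, and the paper's summary does not state that this is the cause of the observed effect.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary dataset standing in for a benign/malicious feature matrix.
# This is NOT EMBER; sizes are arbitrary and chosen only for illustration.
X, y = make_classification(
    n_samples=500, n_features=40, n_informative=10, random_state=0
)

# PCA is unsupervised: the output width is whatever n_components requests.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

# LDA is supervised: with two classes it can produce at most one component.
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)

pca_shape = X_pca.shape
lda_shape = X_lda.shape
print("PCA output shape:", pca_shape)
print("LDA output shape:", lda_shape)
```

A distance-based model such as KNN can work well on a single strongly discriminative axis, while tree ensembles lose the many original features they would otherwise split on; this is one plausible reading of the findings, offered here as an assumption.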
Abstract
This study investigates the effectiveness of several machine learning algorithms for static malware detection using the EMBER dataset, which contains feature representations of Portable Executable (PE) files. We evaluate eight classification models: LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, HistGradientBoosting, k-Nearest Neighbors (KNN), and TabNet, under three preprocessing settings: original feature space, Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). The models are assessed on accuracy, precision, recall, F1 score, and AUC to examine both predictive performance and robustness. Ensemble methods, especially LightGBM and XGBoost, show the best overall performance across all configurations, with minimal sensitivity to PCA and consistent generalization. LDA improves KNN performance but significantly reduces accuracy for boosting models.…
Taxonomy
Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Network Security and Intrusion Detection
