Efficient Malware Detection with Optimized Learning on High-Dimensional Features
Aditya Choudhary, Sarthak Pawar, Yashodhara Haribhakta

TL;DR
This paper introduces a scalable malware detection method that combines feature selection and PCA to reduce high-dimensional features, achieving high accuracy with efficient computation and strong generalization to unseen datasets.
Contribution
The paper demonstrates that combining XGBoost-based feature selection and PCA for dimensionality reduction improves the efficiency of malware detection without sacrificing accuracy.
Findings
LightGBM on 384-dimensional features achieves 97.52% accuracy
Method generalizes well to unseen datasets, maintaining over 93% accuracy
Reduced features significantly lower computational requirements
Abstract
Malware detection using machine learning requires feature extraction from binary files, as models cannot process raw binaries directly. A common approach involves using LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. However, the high dimensionality of these features introduces significant computational challenges. This study addresses these challenges by applying two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). We evaluate three reduced feature dimensions (128, 256, and 384), which correspond to approximately 5.4%, 10.8%, and 16.1% of the original 2381 features, across four models (XGBoost, LightGBM, Extra Trees, and Random Forest) using a unified training, validation, and testing split formed from the EMBER-2018, ERMDS, and BODMAS datasets. This approach ensures…
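The two reductions named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline. The data are synthetic, and the `importances` array is a hypothetical stand-in for the feature importances a trained XGBoost model would supply. The helper names (`retained_fraction`, `select_top_k`, `pca_reduce`) are invented for illustration. The sketch applies each reduction separately to the same target dimension, which is one reading of the abstract; the truncated text does not say whether the paper chains them.

```python
import numpy as np

ORIGINAL_DIM = 2381             # EMBER vector size stated in the abstract
TARGET_DIMS = (128, 256, 384)   # reduced dimensions evaluated in the paper


def retained_fraction(k, original=ORIGINAL_DIM):
    """Share of the original feature count kept at reduced dimension k."""
    return k / original


def select_top_k(X, importances, k):
    """Keep the k columns with the largest importance scores."""
    top = np.argsort(importances)[::-1][:k]
    top = np.sort(top)  # preserve the original column order
    return X[:, top], top


def pca_reduce(X, k):
    """Project X onto its first k principal components via an SVD."""
    Xc = X - X.mean(axis=0)  # PCA requires mean-centred columns
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T


rng = np.random.default_rng(0)
# Synthetic stand-in for a matrix of EMBER feature vectors (assumption).
X = rng.normal(size=(500, ORIGINAL_DIM))
# Hypothetical importance scores; in the paper's setting these would
# come from a trained XGBoost model rather than a random generator.
importances = rng.random(ORIGINAL_DIM)

for k in TARGET_DIMS:
    X_sel, kept = select_top_k(X, importances, k)
    X_pca = pca_reduce(X, k)
    print(k, f"{retained_fraction(k):.1%}", X_sel.shape, X_pca.shape)
```

Selection keeps a subset of the original columns, so the retained features stay interpretable. PCA produces new axes that mix all 2381 inputs. Either output can then be passed to a downstream classifier.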
Taxonomy
Methods: Feature Selection · Sparse Evolutionary Training
