Efficient Malware Detection with Optimized Learning on High-Dimensional Features

Aditya Choudhary; Sarthak Pawar; Yashodhara Haribhakta

arXiv:2506.17309 · cs.CR · June 24, 2025

TL;DR

This paper introduces a scalable malware detection method that combines feature selection and PCA to reduce high-dimensional features, achieving high accuracy with efficient computation and strong generalization to unseen datasets.

Contribution

It demonstrates the effectiveness of combining XGBoost feature selection and PCA for dimensionality reduction in malware detection, improving efficiency without sacrificing accuracy.
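The two reduction techniques named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data is synthetic, the importance scores are a random placeholder for what a fitted XGBoost model would report, and the helper names `select_top_k` and `pca_reduce` are hypothetical. The sketch applies each technique on its own; whether and how the paper chains them is not specified in this summary.

```python
import numpy as np


def select_top_k(X, importances, k):
    """Keep the k columns of X with the largest importance scores."""
    top = np.argsort(importances)[::-1][:k]
    top = np.sort(top)  # preserve the original column order
    return X[:, top], top


def pca_reduce(X, k):
    """Project X onto its first k principal components, computed via SVD."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    components = vt[:k]
    return X_centered @ components.T, components, mean


rng = np.random.default_rng(0)
# Synthetic stand-in for 2381-dimensional feature vectors (assumption).
X = rng.normal(size=(500, 2381))
# Placeholder scores; in practice these would come from a trained
# tree model's per-feature importances (assumption).
importances = rng.random(2381)

reduced_shapes = {}
for k in (128, 256, 384):
    X_sel, kept_columns = select_top_k(X, importances, k)
    X_pca, components, mean = pca_reduce(X, k)
    reduced_shapes[k] = (X_sel.shape, X_pca.shape)
```

Feature selection keeps a subset of the original columns, so the retained features stay interpretable. PCA produces new axes that mix all original features, so the full vector must still be computed before projection.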

Findings

1. LightGBM on 384-dimensional features achieves 97.52% accuracy.

2. The method generalizes well to unseen datasets, maintaining over 93% accuracy.

3. Reduced features significantly lower computational requirements.

Abstract

Malware detection using machine learning requires feature extraction from binary files, as models cannot process raw binaries directly. A common approach involves using LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. However, the high dimensionality of these features introduces significant computational challenges. This study addresses these challenges by applying two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). We evaluate three reduced feature dimensions (128, 256, and 384), which correspond to approximately 5.4%, 10.8%, and 16.1% of the original 2381 features, across four models (XGBoost, LightGBM, Extra Trees, and Random Forest) using a unified training, validation, and testing split formed from the EMBER-2018, ERMDS, and BODMAS datasets. This approach ensures…
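The percentages quoted in the abstract follow directly from dividing each reduced dimension by the original 2381. A short check, with variable names chosen here for illustration:

```python
ORIGINAL_DIM = 2381  # size of the full feature vector, per the abstract

# Share of the original dimensionality retained, in percent (one decimal).
shares = {k: round(100 * k / ORIGINAL_DIM, 1) for k in (128, 256, 384)}
print(shares)  # {128: 5.4, 256: 10.8, 384: 16.1}
```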

Taxonomy

Methods: Feature Selection · Sparse Evolutionary Training