Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, Durga Prasad Mohapatra

TL;DR
This paper presents a novel malware classification method using NLP-based n-gram analysis combined with machine learning, achieving high accuracy and efficient feature selection on real-world samples.
Contribution
It introduces an NLP-driven n-gram approach for malware classification, improving accuracy and reducing feature dimensionality with hybrid feature selection.
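As a rough illustration of what an n-gram over API calls looks like, here is a minimal sketch. The function name `api_ngrams` and the example trace are hypothetical and not taken from the paper; the authors' actual extraction pipeline may differ.

```python
from collections import Counter


def api_ngrams(calls, n):
    """Return the contiguous n-grams of an API call sequence as tuples."""
    # A sequence shorter than n yields an empty list.
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


# Hypothetical API call trace, for illustration only.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]

# Count how often each 2-gram (bigram) of consecutive calls occurs.
bigram_counts = Counter(api_ngrams(trace, 2))
```

Each distinct n-gram becomes one candidate feature, which is why the feature space grows quickly with n and with the number of distinct API calls.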
Findings
Achieved 99.02% classification accuracy.
Reduced feature set to 1.6% of original features.
Outperformed traditional malware classification methods.
Abstract
This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach effectively captures distinctive linguistic patterns among malware and benign families, enabling finer-grained classification. We delve into n-gram size selection, feature representation, and classification algorithms. Evaluating our proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. With our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. Hybrid…
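The abstract does not spell out the hybrid feature selection technique, so the following is only a sketch of the general idea of shrinking an n-gram feature set in two stages. The stages used here (a document-frequency filter followed by a class-separation ranking), the function `select_features`, its parameters, and the toy samples are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def ngrams(seq, n):
    """Contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def select_features(samples, labels, n=2, min_df=2, top_k=2):
    """Two-stage selection (assumed stand-in for the paper's hybrid method).

    Returns the selected n-grams and the size of the original feature set.
    Labels are 1 for malware and 0 for benign.
    """
    n_mal = sum(1 for y in labels if y == 1)
    n_ben = len(labels) - n_mal
    if n_mal == 0 or n_ben == 0:
        raise ValueError("both classes must be present")

    # Which n-grams occur in each sample (presence, not counts).
    present = [set(ngrams(s, n)) for s in samples]
    # Document frequency: number of samples containing each n-gram.
    df = Counter(g for p in present for g in p)

    # Stage 1 (filter): drop n-grams seen in fewer than min_df samples.
    candidates = [g for g in df if df[g] >= min_df]

    def score(g):
        # Absolute difference between the fraction of malware samples and
        # the fraction of benign samples that contain the n-gram.
        mal = sum(1 for p, y in zip(present, labels) if y == 1 and g in p)
        ben = df[g] - mal
        return abs(mal / n_mal - ben / n_ben)

    # Stage 2 (ranking): keep the top_k most class-separating n-grams;
    # ties are broken by the n-gram itself so the result is deterministic.
    ranked = sorted(candidates, key=lambda g: (-score(g), g))
    return ranked[:top_k], len(df)


# Toy API call traces, for illustration only.
samples = [
    ["OpenProcess", "WriteProcessMemory", "CreateRemoteThread"],
    ["OpenProcess", "WriteProcessMemory", "CreateRemoteThread", "CloseHandle"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
]
labels = [1, 1, 0, 0]

selected, total = select_features(samples, labels)
```

The ratio `len(selected) / total` is the fraction of the original features that survives selection; the selected n-grams would then be used to build the feature vectors passed to a classifier.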
Taxonomy
Topics: Advanced Malware Detection Techniques · Spam and Phishing Detection · Software Engineering Research
Methods: Feature Selection · Sparse Evolutionary Training
