Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

Bishwajit Prasad Gond; Rajneekant; Pushkar Kishore; Durga Prasad Mohapatra

arXiv:2506.16224·cs.CR·February 24, 2026

TL;DR

This paper presents a novel malware classification method using NLP-based n-gram analysis combined with machine learning, achieving high accuracy and efficient feature selection on real-world samples.

Contribution

It introduces an NLP-driven n-gram approach for malware classification, improving accuracy and reducing feature dimensionality with hybrid feature selection.
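
The summary does not say which methods make up the hybrid feature selection, so the following is only a minimal sketch of the general idea of pruning an n-gram feature matrix in stages. It assumes a two-stage filter as a stand-in: a document-frequency cut followed by a ranking on the gap between class means. The function names, the 0/1 labels, and the toy matrix are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def document_frequency_filter(rows: Sequence[Sequence[float]], min_df: int) -> List[int]:
    """Stage 1: keep feature indices that are non-zero in at least min_df samples."""
    n_features = len(rows[0])
    return [
        j for j in range(n_features)
        if sum(1 for row in rows if row[j] > 0) >= min_df
    ]


def mean_gap_score(rows: Sequence[Sequence[float]], labels: Sequence[int], j: int) -> float:
    """Absolute difference between the class-wise means of feature j (labels are 0 or 1)."""
    pos = [row[j] for row, y in zip(rows, labels) if y == 1]
    neg = [row[j] for row, y in zip(rows, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def two_stage_select(rows, labels, min_df: int, top_k: int) -> List[int]:
    """Stage 2: rank the surviving features by mean gap and keep the top_k indices."""
    candidates = document_frequency_filter(rows, min_df)
    ranked = sorted(
        candidates,
        key=lambda j: mean_gap_score(rows, labels, j),
        reverse=True,
    )
    return sorted(ranked[:top_k])


# Hypothetical n-gram count matrix: 4 samples x 4 features, label 1 = malware.
rows = [
    [3, 0, 1, 0],
    [2, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
]
labels = [1, 1, 0, 0]
selected = two_stage_select(rows, labels, min_df=2, top_k=1)
```

In this toy matrix, features 1 and 3 each appear in only one sample and are dropped in the first stage; feature 2 is identical across classes and scores zero, so only feature 0 is kept.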

Findings

01. Achieved 99.02% classification accuracy.

02. Reduced feature set to 1.6% of original features.

03. Outperformed traditional malware classification methods.

Abstract

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach effectively captures distinctive linguistic patterns across malware and benign families, enabling finer-grained classification. We examine n-gram size selection, feature representation, and classification algorithms. When evaluating the proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. With our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. Hybrid…
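
To make the n-gram idea in the abstract concrete, here is a minimal sketch of turning an API call sequence into n-gram counts. The helper names and the sample trace are hypothetical and chosen for illustration; the abstract does not specify the paper's extraction pipeline or its choice of n.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def extract_ngrams(calls: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n items from an API call sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def ngram_counts(calls: Sequence[str], n: int) -> Counter:
    """Count how often each n-gram occurs in one sample's call sequence."""
    return Counter(extract_ngrams(calls, n))


# Hypothetical API call trace from a single sample.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
bigrams = ngram_counts(trace, 2)
```

Each distinct n-gram becomes one feature, so the feature space grows quickly with n and with the number of samples, which is why a feature selection step is needed afterwards.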

Code & Models

Repositories

bishwajitprasadgond/malwareclassification (tf, Official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Spam and Phishing Detection · Software Engineering Research

Methods: Feature Selection · Sparse Evolutionary Training