A Natural Language Processing Approach for Instruction Set Architecture Identification
Dinuka Sahabandu, Sukarno Mertoguno, Radha Poovendran

TL;DR
This paper introduces an NLP-inspired binary feature extraction method for machine learning-based identification of instruction set architectures, significantly improving accuracy and efficiency in binary analysis tasks.
Contribution
The paper proposes a novel character-level binary feature extraction model that improves ISA identification accuracy and reduces the feature set size without requiring domain knowledge.
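To make the idea of character-level features concrete, here is a minimal sketch. It rests on an assumption not confirmed by this summary: that "character-level" means treating the hex encoding of the object code as text and counting n-grams over its 16 possible characters. The function name `char_ngram_features` and the frequency normalization are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product

# Assumption: the "characters" are the 16 hex digits of the encoded binary.
HEX_ALPHABET = "0123456789abcdef"


def char_ngram_features(code: bytes, n: int = 1) -> list[float]:
    """Return relative frequencies of all 16**n hex-character n-grams.

    Hypothetical helper for illustration only.
    """
    text = code.hex()  # two lowercase hex characters per byte
    # Slide a window of width n over the hex string and count each n-gram.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    # Fixed vocabulary order so every binary maps to a vector of equal length.
    vocab = ("".join(p) for p in product(HEX_ALPHABET, repeat=n))
    return [counts[gram] / total for gram in vocab]


# Usage: two bytes of object code become a 16-dimensional unigram vector.
features = char_ngram_features(b"\x90\x90", n=1)
print(len(features))
```

Under this assumption a unigram vector has 16 entries, compared with 256 for byte-level unigrams. The resulting vectors could then be passed to any standard classifier.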
Findings
8% higher accuracy than state-of-the-art methods
Character-level features reduce feature set size by up to 16x
Accuracy remains above 97% with reduced features
Abstract
Binary analysis of software is a critical step in cyber forensics applications such as program vulnerability assessment and malware detection. This involves interpreting instructions executed by software and often necessitates converting the software's binary file data to assembly language. The conversion process requires information about the binary file's target instruction set architecture (ISA). However, ISA information might not be included in binary files due to compilation errors, partial downloads, or adversarial corruption of file metadata. Machine learning (ML) is a promising methodology that can be used to identify the target ISA using binary data in the object code section of binary files. In this paper we propose a binary code feature extraction model to improve the accuracy and scalability of ML-based ISA identification methods. Our feature extraction model can be used in…
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Digital and Cyber Forensics
