Automatic Classification of Object Code Using Machine Learning

John Clemens

arXiv:1805.02146 · stat.ML · May 8, 2018

TL;DR

This paper shows that machine learning can classify unlabeled object code by target architecture and endianness, using byte-value histograms and heuristic features, and reports high accuracy on a dataset of over 16,000 code samples from 20 architectures.

Contribution

The paper applies machine-learning file classification techniques to compiled object code: byte-value histograms capture enough opcode information to identify the target architecture, and heuristic features derived from operands are used to determine endianness.

Findings

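01 High classification accuracy for target architecture using simple byte-value histograms

02 Endianness detection using heuristic features that exploit information in the operands

03 Experiments on a dataset of over 16,000 code samples from 20 architectures, with high accuracy at relatively small sample sizes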

Abstract

Recent research has repeatedly shown that machine learning techniques can be applied to either whole files or file fragments to classify them for analysis. We build upon these techniques to show that, for samples of unlabeled compiled object code, one can apply the same type of analysis to classify important aspects of the code, such as its target architecture and endianness. We show that simple byte-value histograms retain enough information about the opcodes within a sample to classify the target architecture with high accuracy, and then discuss heuristic-based features that exploit information within the operands to determine endianness. We introduce a dataset with over 16,000 code samples from 20 architectures and show experimentally that, using our features, classifiers can achieve very high accuracy with relatively small sample sizes.
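The two feature families described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and the specific byte pairs counted for endianness (0x0001 vs. 0x0100 and 0xfffe vs. 0xfeff) are one plausible operand heuristic that the text above does not spell out. The idea is that small immediates such as 1 or -2 leave different adjacent byte pairs depending on byte order.

```python
from typing import List


def byte_histogram(sample: bytes) -> List[float]:
    """Return the normalized frequency of each of the 256 byte values."""
    counts = [0] * 256
    for b in sample:
        counts[b] += 1
    total = len(sample) or 1  # avoid division by zero on an empty sample
    return [c / total for c in counts]


def endianness_features(sample: bytes) -> List[int]:
    """Count adjacent byte pairs that hint at byte order (assumed heuristic).

    Order of the returned counts: 00 01, 01 00, ff fe, fe ff.
    A big-endian 32-bit 1 (00 00 00 01) contains 00 01; the little-endian
    form (01 00 00 00) contains 01 00. The same holds for -2 with
    ff fe (big-endian) and fe ff (little-endian).
    """
    pairs = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(sample) - 1):
        pair = sample[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] for p in pairs]


def feature_vector(sample: bytes) -> List[float]:
    """Concatenate both feature sets into one 260-element vector."""
    return byte_histogram(sample) + [float(c) for c in endianness_features(sample)]
```

A vector like this could then be passed to any off-the-shelf classifier. Which classifiers the paper evaluates, and how it scales or normalizes the pair counts, is not stated in the abstract.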
