TL;DR
This paper introduces a new open dataset, toolset, and evaluation framework for accurately identifying CPU architecture and endianness in binary files, addressing current gaps in research and enabling better comparison of methods.
Contribution
It develops comprehensive open datasets, tools, and APIs for automated architecture and endianness detection, facilitating evaluation and advancement of existing techniques.
Findings
Classifiers achieved over 98% accuracy in identifying architecture and endianness.
The new datasets and tools enable effective benchmarking of detection methods.
Results support the validity of current algorithms in real-world binary analysis.
| Type | Approx. # of files in dataset | Approx. total size |
|---|---|---|
| .iso files | ~1600 | ~1843 GB |
| .deb files | ~79000 | ~36 GB |
| ELF files | ~105000 | ~29 GB |
| ELF code sections | ~105000 | ~17 GB |
| Architecture | Sample-set size in experiment | Wordsize | Endianness | File info | # of ELF files in dataset | Size of ELF files in dataset (GB) |
|---|---|---|---|---|---|---|
| alpha | 3000 | 64 | Little | ELF 64-bit LSB executable, Alpha (unofficial) | 4042 | 1.62 |
| amd64 | 2994 | 64 | Little | ELF 64-bit LSB executable, x86-64 | 6221 | 1.19 |
| arm64 | 2997 | 64 | Little | ELF 64-bit LSB executable, ARM aarch64 | 4255 | 0.84 |
| armel | 2994 | 32 | Little | ELF 32-bit LSB executable, ARM | 4621 | 0.86 |
| armhf | 2994 | 32 | Little | ELF 32-bit LSB executable, ARM | 4618 | 0.70 |
| hppa | 3000 | 32 | Big | ELF 32-bit MSB executable, PA-RISC (LP64) | 4909 | 1.48 |
| i386 | 2994 | 32 | Little | ELF 32-bit LSB executable, Intel 80386 | 6742 | 1.15 |
| ia64 | 3000 | 64 | Little | ELF 64-bit LSB executable, IA-64 | 5046 | 2.75 |
| m68k | 3000 | 32 | Big | ELF 32-bit MSB executable, Motorola 68020 | 4440 | 1.17 |
| mips | 2997 | 32 | Big | ELF 32-bit MSB executable, MIPS, MIPS-II (upsampled) | 418 | 1.00 |
| mips64el | 2998 | 64 | Little | ELF 64-bit LSB executable, MIPS, MIPS64 rel2 (new) | 6430 | 2.61 |
| mipsel | 2994 | 32 | Little | ELF 32-bit LSB executable, MIPS, MIPS-II | 4396 | 1.01 |
| powerpc | 2110 | 32 | Big | ELF 32-bit MSB executable, PowerPC or cisco 4500 | 3672 | 1.29 |
| powerpcspe | 3000 | 32 | Big | ELF 32-bit MSB executable, PowerPC or cisco 4500 (new) | 3976 | 1.63 |
| ppc64 | 2552 | 64 | Big | ELF 64-bit MSB executable, 64-bit PowerPC or cisco 7500 | 2900 | 1.75 |
| ppc64el | 2997 | 64 | Little | ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500 (new) | 4370 | 1.03 |
| riscv64 | 3000 | 64 | Little | ELF 64-bit LSB executable (new) | 4513 | 1.18 |
| s390 | 2997 | 32 | Big | ELF 32-bit MSB executable, IBM S/390 | 5562 | 0.61 |
| s390x | 2994 | 64 | Big | ELF 64-bit MSB executable, IBM S/390 | 4169 | 1.03 |
| sh4 | 3000 | 32 | Little | ELF 32-bit LSB executable, Renesas SH | 6003 | 1.30 |
| sparc | 2997 | 32 | Big | ELF 32-bit MSB executable, SPARC32PLUS, V8+ Required | 6111 | 0.62 |
| sparc64 | 2676 | 64 | Big | ELF 64-bit MSB executable, SPARC V9, relaxed memory ordering | 3338 | 1.37 |
| x32 | 3000 | 32 | Little | ELF 32-bit LSB executable, x86-64 (new) | 4261 | 1.56 |
| Total | 67285 | – | – | – | 105013 | 29.74 |
| Model | Weka name | Parameters |
|---|---|---|
| 1 nearest neighbor (1-NN) | IBk | -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\"" |
| 3 nearest neighbors (3-NN) | IBk | -K 3 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\"" |
| Decision tree | J48 | -C 0.25 -M 2 |
| Random tree | RandomTree | -K 0 -M 1.0 -V 0.001 -S 1 |
| Random forest | RandomForest | -I 100 -K 0 -S 1 -num-slots 1 |
| Naive Bayes | NaiveBayes | N/A |
| BayesNet | BayesNet | |
| SVM (SMO) | SMO | |
| Logistic regression | SimpleLogistic | -I 0 -M 500 -H 50 -W 0.0 |
| Neural net | MultilayerPerceptron | -L 0.3 -M 0.2 -N 100 -V 0 -S 0 -E 20 -H 66 |
| Neural net (this paper) | MultilayerPerceptron | -L 0.3 -M 0.2 -N 100 -V 20 -S 0 -E 20 -H 66 -C -I -num-decimal-places 10 |
| Type | Role | Configuration |
|---|---|---|
| Software | Dataset acquisition | |
| Software | ML experiments | |
| Software | OpenAPI web-services | |
| Software | Radare2 plugin | |
| Classifier | Precision | Recall | AUC | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 1-NN | 0.983 | 0.983 | 0.991 | 0.983 | 0.983 | 0.911 | 0.927 | 0.895 | 0.893 |
| 3-NN | 0.994 | 0.994 | 0.999 | 0.994 | 0.993 | 0.957 | 0.949 | 0.902 | 0.898 |
| Decision tree | 0.992 | 0.992 | 0.998 | 0.992 | 0.992 | 0.993 | 0.980 | 0.936 | 0.932 |
| Random tree | 0.966 | 0.966 | 0.982 | 0.966 | 0.965 | 0.953 | 0.929 | 0.899 | 0.878 |
| Random forest | 0.996 | 0.996 | 1.000 | 0.996 | 0.996 | 0.992 | 0.964 | 0.904 | 0.904 |
| Naive Bayes | 0.991 | 0.991 | 0.999 | 0.991 | 0.990 | 0.990 | 0.958 | 0.932 | 0.925 |
| BayesNet | 0.992 | 0.992 | 1.000 | 0.992 | 0.991 | 0.994 | 0.922 | 0.917 | 0.895 |
| SVM (SMO) | 0.997 | 0.997 | 1.000 | 0.997 | 0.997 | 0.997 | 0.983 | 0.931 | 0.927 |
| Logistic regression | 0.989 | 0.988 | 0.998 | 0.989 | 0.988 | 0.997 | 0.979 | 0.939 | 0.930 |
| Neural net | 0.995* | 0.994* | 1.000* | 0.994* | 0.994* | 0.919 | 0.979 | 0.940 | 0.940 |
| Classifier | Precision | Recall | AUC | F1 measure | Accuracy |
|---|---|---|---|---|---|
| Weka | 0.989 | 0.988 | 0.998 | 0.989 | 0.988 |
| scikit-learn | 0.998 | 0.998 | 0.998 | 0.998 | 0.996 |
| Keras | 0.998 | 0.998 | 0.998 | 0.998 | 0.997 |
| Classifier | Precision | Recall | AUC | F1 measure | Accuracy |
|---|---|---|---|---|---|
| 1-NN | 0.871 | 0.742 | 0.867 | 0.772 | 0.741 |
| 3-NN | 0.876 | 0.749 | 0.892 | 0.773 | 0.749 |
| Decision tree | 0.845 | 0.717 | 0.865 | 0.733 | 0.716 |
| Random tree | 0.679 | 0.613 | 0.798 | 0.619 | 0.613 |
| Random forest | 0.912 | 0.902 | 0.995 | 0.892 | 0.901 |
| Naive Bayes | 0.807 | 0.420 | 0.727 | 0.419 | 0.420 |
| Bayes net | 0.886 | 0.844 | 0.987 | 0.840 | 0.844 |
| SVM (SMO) | 0.883 | 0.733 | 0.971 | 0.766 | 0.732 |
| Logistic regression (Weka) | 0.875 | 0.718 | 0.978 | 0.728 | 0.718 |
| Logistic regression (scikit-learn) | 0.913 | 0.780 | 0.780 | 0.794 | 0.579 |
| Logistic regression (Keras) | 0.921 | 0.831 | 0.831 | 0.839 | 0.676 |
| Neural net | 0.841 | 0.452 | 0.875 | 0.515 | 0.451 |
| Average (De Nicolao et al. (De Nicolao et al., 2018)) | 0.996 | 0.996 | 0.998 | 0.996 | - |
| Classifier | Recall | Precision | Accuracy |
|---|---|---|---|
| Decision jungle (Azure) | 0.964 | 0.964 | 0.964 |
| Random forest (Azure) | 0.974 | 0.974 | 0.974 |
| | 0.902 | 0.912 | 0.901 |
| Classifier | Precision | Recall | AUC | F1 measure | Accuracy | |
|---|---|---|---|---|---|---|
| Random forest (Weka) | 0.987 | 0.987 | 1.000 | 0.987 | 0.986 | 0.901 |
| Random forest (Azure) | 0.992 | 0.992 | - | - | 0.992 | 0.974 |
| Decision jungle (Azure) | 0.989 | 0.989 | - | - | 0.989 | 0.964 |
| Logistic regression (Weka) | 0.983 | 0.983 | 0.999 | 0.983 | 0.982 | 0.718 |
| Logistic regression (scikit) | 0.985 | 0.984 | 0.984 | 0.984 | 0.971 | 0.579 |
| Logistic regression (Keras) | 0.990 | 0.989 | 0.989 | 0.989 | 0.980 | 0.676 |
| SVM (SMO) | 0.975 | 0.975 | 0.997 | 0.975 | 0.974 | 0.732 |
| Architecture | F1 score without signature | F1 score with signature | Improvement |
|---|---|---|---|
| PPC | 0.868 | 0.894 | 2.99% |
| PPCspe | 0.888 | 0.906 | 2.02% |
| | | | | |
|---|---|---|---|---|
| | 802 | 200 | 843 | 160 |
| | 39 | 963 | 36 | 967 |
Towards usable automated detection of CPU architecture and endianness for arbitrary binary files and object code sequences
Sami Kairajärvi, University of Jyväskylä, Jyväskylä, Finland
Andrei Costin, University of Jyväskylä, Jyväskylä, Finland
Timo Hämäläinen, University of Jyväskylä, Jyväskylä, Finland
Abstract.
Static and dynamic binary analysis techniques are actively used to reverse engineer software's behavior and to detect its vulnerabilities, even when only the binary code is available for analysis. To avoid analysis errors caused by decoding op-codes for the wrong CPU architecture, these analysis tools must precisely identify the Instruction Set Architecture (ISA) of the object code under analysis. The variety of CPU architectures that modern security and reverse engineering tools must support is ever increasing due to the massive proliferation of IoT devices and the diversity of firmware and malware targeting those devices. Recent studies concluded that falsely identifying the binary code's ISA alone caused about 10% of failures of IoT firmware analysis. The state of the art approaches to detect the ISA of arbitrary object code look promising: their results demonstrate effectiveness and high performance. However, they lack the support of publicly available datasets and toolsets, which makes the evaluation, comparison, and improvement of those techniques, datasets, and machine learning models quite challenging (if not impossible). This paper bridges multiple gaps in the field of automated and precise identification of architecture and endianness of binary files and object code. We develop from scratch the toolset and datasets that are lacking in this research space. As such, we contribute a comprehensive collection of open data, open source, and open API web-services. We also attempt experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state of the art methods. When training and testing classifiers using solely code sections from executable binary files, all our classifiers performed equally well, achieving over 98% accuracy. The results are consistent and comparable with the current state of the art, and hence support the general validity of the algorithms, features, and approaches suggested in those works.
Complementing the field, we propose using complete binaries in either testing or training-and-testing mode; experiments show that the ISA of complete binary files is identified with 99.2% accuracy using a Random Forest classifier.
Keywords: Binary code analysis, Firmware analysis, Instruction Set Architecture (ISA), Supervised Machine Learning, Reverse engineering, Malware analysis, Digital forensics
Venue: arXiv.org; pre-print; 15.8.2019
CCS concepts: Security and privacy (Software reverse engineering); Computing methodologies (Machine learning; Artificial intelligence); Computer systems organization (Embedded and cyber-physical systems); Software and its engineering (Software creation and management)
1. Introduction
Reverse engineering and analysis of binary code has a wide spectrum of applications (Sutherland et al., 2006; Liu et al., 2013; Shoshitaishvili et al., 2016b), ranging from vulnerability research (Wang et al., 2009; Cha et al., 2012; Costin et al., 2014; Shoshitaishvili et al., 2015; Costin et al., 2016) to binary patching and translation (Sites et al., 1993), and from digital forensics (Clemens, 2015) to anti-malware and Intrusion Detection Systems (IDS) (Van Den Berg and Chinchani, 2009; Manni et al., 2014). For such applications, various static and dynamic analysis techniques and tools are constantly being researched, developed, and improved (Song et al., 2008; Chipounov et al., 2011; Brumley et al., 2011; Liu et al., 2013; Wang and Shoshitaishvili, 2017; ”pancake” Alvarez and core contributors, [n. d.]; Eagle, 2011).
Regardless of their end goal, one of the important steps in these techniques is to correctly identify the Instruction Set Architecture (ISA) of the op-codes within the binary code. Some techniques can perform the analysis using architecture-independent or cross-architecture methods (Pewny et al., 2015; Eschweiler et al., 2016; Feng et al., 2016). However, many of those techniques still require the exact knowledge of the binary code’s ISA. For example, recent studies concluded that falsely identifying the binary code’s ISA caused about 10% of failures of IoT/embedded firmware analysis (Costin et al., 2014; Costin et al., 2016).
Sometimes the CPU architecture is available in the executable format's header sections, for example in the ELF file format (noh, [n. d.]). However, this information is not guaranteed to be universally available for analysis. There are multiple reasons for this, and we detail a few of them. The advances in Internet of Things (IoT) technologies bring an extreme variety of hardware and software, in particular new CPU architectures and new OSs or OS-like systems (Muench et al., 2018). Many of those devices are resource-constrained, and their binary code comes without sophisticated headers and OS abstraction layers. At the same time, digital forensics tools and IDSs may sometimes have access only to a fraction of the code, e.g., from an exploit (shell-code), malware trace, or a memory dump. For example, many shell-codes are exactly this: a quite short, formatless and headerless sequence of CPU op-codes for a target system (i.e., a combination of hardware, operating system, and abstraction layers) that performs a presumably malicious action on behalf of the attacker (Foster, 2005; Polychronakis et al., 2010). In such cases, though possible in theory, it is quite unlikely that the full code including the headers specifying the CPU ISA will be available for analysis. Finally, even in the traditional computing world (e.g., x86/x86_64), there are situations where the hosts contain object code for CPU architectures other than that of the host itself. Examples include firmware for network cards (Delugré, 2010; Duflot et al., 2011; Blanco and Eissler, 2012), various management co-processors (Miller, 2011), and device drivers (Chipounov and Candea, 2010; Kadav and Swift, 2012) for USB (Nohl and Lell, 2014; Tian et al., 2015) and other embedded devices that contain their own specialized processors and provide specific services (e.g., encryption, data codecs) that are implemented as peripheral firmware (Li et al., 2011a; Davidson et al., 2013).
Even worse, more often than not the object code for such peripheral firmware is stored using less traditional or non-standard file formats and headers (if any), or is embedded inside the device drivers themselves, resulting in mixed-architecture code streams.
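To illustrate the header-based case mentioned above, the following is a minimal sketch of reading the architecture, word size, and endianness from an ELF header when one is present. The function name `read_elf_isa` and the small `E_MACHINE_NAMES` table are illustrative assumptions and not part of the released toolset; the field offsets follow the ELF specification. For headerless input such as shell-code the function simply returns `None`, which is exactly the situation that motivates content-based classification.

```python
import struct

# Small, illustrative subset of e_machine values from the ELF specification.
E_MACHINE_NAMES = {
    0x02: "SPARC",
    0x03: "Intel 80386",
    0x08: "MIPS",
    0x14: "PowerPC",
    0x15: "PowerPC64",
    0x16: "S390",
    0x28: "ARM",
    0x3E: "x86-64",
    0xB7: "AArch64",
    0xF3: "RISC-V",
}


def read_elf_isa(header: bytes):
    """Return (machine, wordsize, endianness) from an ELF header, or None.

    Hypothetical helper: it only inspects the first 20 bytes of the file.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None  # no ELF magic: headerless or non-ELF input
    ei_class, ei_data = header[4], header[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        return None  # malformed identification bytes
    wordsize = 32 if ei_class == 1 else 64
    endianness = "little" if ei_data == 1 else "big"
    # e_machine is a 16-bit field at offset 18, stored in the file's byte order.
    fmt = "<H" if ei_data == 1 else ">H"
    (e_machine,) = struct.unpack_from(fmt, header, 18)
    return E_MACHINE_NAMES.get(e_machine, hex(e_machine)), wordsize, endianness
```

A caller would typically pass the first bytes of a file, e.g. `read_elf_isa(open(path, "rb").read(20))`, and fall back to a statistical classifier when the result is `None`.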
Currently, several state of the art works try to address the challenge of accurately identifying the CPU ISA for arbitrary object code sequences (Clemens, 2015; De Nicolao et al., 2018). Their approaches look promising, as the results demonstrate effectiveness and high performance. However, they lack the support of publicly available datasets and toolsets, which makes the evaluation, comparison, and improvement of those techniques, datasets, and machine learning models quite challenging (if not impossible).
With this paper, we bridge multiple gaps in the field of automated and precise identification of architecture and endianness of binary files and object code. We develop from scratch the toolset and datasets that are lacking in this research space. To this end, we release a comprehensive collection of open data, open source, and open API web-services. We attempt experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state of the art methods (Clemens, 2015; De Nicolao et al., 2018), and we propose and experimentally validate new approaches to the main classification challenge, where we obtain consistently comparable, and in some scenarios better, results. The results we obtain in our extensive set of experiments are consistent and comparable with prior art, and hence support the general validity and soundness of both existing and newly proposed algorithms, features, and approaches.
1.1. Contributions
In this paper, we present the following contributions:
- a) First and foremost, we implement and release as open source the code and toolset necessary to reconstruct and re-run the experiments from this paper as well as from the state of the art works of Clemens (Clemens, 2015) and De Nicolao et al. (De Nicolao et al., 2018). To the best of our knowledge, this is the first such toolset to be publicly released.
- b) Second and equally important, we release as open data the machine learning models and data necessary to both validate our results and further expand the datasets and the research field. To the best of our knowledge, this is both the first and the largest dataset of this type to be publicly released.
- c) Third, we propose, evaluate and validate the use of "complete binaries" when training the classifiers (as opposed to the current "code-sections"-only approaches (Clemens, 2015; De Nicolao et al., 2018)). We achieve a 99.2% classification accuracy when testing with Random Forest classifiers implemented in the Azure Machine Learning platform.
- d) Last but not least, we perform and present the first independent study that attempts experiment reconstruction and cross-validation of the results, effectiveness and efficiency of state of the art methods (Clemens (Clemens, 2015), De Nicolao et al. (De Nicolao et al., 2018)).
- e) We release both the dataset and the toolset as open source and open data; they are accessible at: https://github.com/kairis/
1.2. Organization
The rest of this paper is organized as follows. We review the related work in Section 2. Then we detail our methodology, experimental setups and datasets in Section 3. We provide a detailed analysis of results in Section 4. Finally, we discuss future work and conclude with Section 5.
2. Related work
McDaniel and Heydari (McDaniel and Heydari, 2003) introduced the idea of using file contents to identify the file type. Previous methods utilized metadata such as fixed file extensions, fixed "magic numbers" and proprietary descriptive file wrappers. The authors showed that many file types have characteristic patterns that can be used to differentiate them from other file formats. They used byte frequency analysis, byte frequency cross-correlation analysis and file header/trailer analysis, and reported accuracies ranging from 23% to 96% depending on the algorithm used. A lot of research has utilized the difference in byte frequencies when classifying file types (Fitzgerald et al., 2012; Li et al., 2011b, 2005; Xie et al., 2013; Beebe et al., 2013; Sportiello and Zanero, 2012; Penrose et al., 2013).
Clemens (Clemens, 2015) applied the techniques introduced by McDaniel and Heydari (McDaniel and Heydari, 2003) and proposed methods for classifying object code by its target architecture and endianness. The author used byte-value histograms combined with signatures to train different classifiers using supervised machine learning. This approach produced promising results and showed that machine learning can be an effective tool for classifying the architecture of object code. Classification accuracy varied from 92.2% to 98.4% depending on the classifier used. As future research, the author proposed using a larger data set that would include more architectures and code samples compiled with different compilers. De Nicolao et al. (De Nicolao et al., 2018) extend the work of Clemens (Clemens, 2015) as part of their research by adding more signatures to use as input for the classifier. They obtain a global accuracy of 99.8% with logistic regression, which is higher than previously demonstrated by Clemens (Clemens, 2015). For further research, they propose to incorporate instruction-level features, such as trying to group bytes corresponding to code into valid instructions.
In a slightly different direction, Costin et al. (Costin et al., 2016) go through every executable inside the firmware and use ELF headers to identify the architecture if the header is present. The authors determine the architecture of the firmware by counting the number of architecture-specific binaries the firmware contains. This information is used to launch an emulator for that specific architecture. cpu_rec (Granboulan, [n. d.]) uses the fact that the probability distribution of one byte depends on the value of the previous one, and can detect 72 code architectures (Granboulan, 2017). The author uses Markov chains and Kullback-Leibler divergence for classification. A sliding window is used to handle files that contain code for multiple architectures. The author does not publish any performance measures other than that the analysis of 1 MB takes 60 seconds on 1 GB of RAM. The angr static analysis framework (Shoshitaishvili et al., 2016b) includes the Boyscout tool, which leverages static signatures to identify the CPU architecture of executable files. Boyscout tries to match the file to a set of signatures containing the byte patterns of function prologues and epilogues for known and surveyed architectures. One of the limitations of such an approach is that the signatures require maintenance. The performance of the classification is highly dependent on keeping those signatures up-to-date, complete and of high quality, which can be challenging. Also, this technique can be less effective for heavily optimized code, which is often the case with resource-constrained IoT/embedded devices. Binwalk (bin, [n. d.]) uses architecture-specific features (totalling 33 signatures for 9 CPU architectures) along with the Capstone disassembler (Quynh, 2014) (having 9 configurations for 4 CPU architectures) to identify the object code's architecture (last checked on 20.4.2019). Using only signatures has some limitations; for example, they can lead to false positives if the signatures are not unique compared to other architectures in the dataset (Clemens, 2015). In addition, the disassembly method requires complete support from a disassembler framework, which might not always be available or working perfectly.
Costin et al. (Costin et al., 2017) were the first to apply machine learning in the context of firmware classification. They try to automate finding the brand and the model of the device the firmware is made for, and propose different firmware features and machine learning algorithms to obtain this information. The researchers achieve the best results with a random forest classifier, with 90% classification accuracy. With the help of a statistical confidence interval, the authors estimate that in a data set of 172 000 firmware images, the classifier could correctly classify the firmware in 93.5% ± 4.3% of the cases.
3. Datasets and experimental setup
3.1. Datasets
We started with the dataset acquisition challenge. Despite the existence of several state of the art works, unfortunately neither their datasets nor their toolsets are publicly available. (A post-processed dataset from Clemens (Clemens, 2015) was generously provided by the author privately on request; that dataset is not publicly available.) To overcome this limitation, we had to develop a complete toolset and pipeline able to efficiently download a dataset that is both large and representative enough. We release our data acquisition toolset as open source. The pipeline used to acquire our dataset is depicted in Figure 1, while sample code snippets are listed in Appendix 6 (due to space limitations, only a part of the code is listed in this submission).
We chose the Debian Linux repositories for several reasons. First, it is a long established and trusted project, therefore a good source of software packages and related data. Second, it is bootstrapped from the same tools and sources to build the Linux kernel and userspace software packages for a very diverse set of CPU and ABI architectures.
The downloaded and pre-processed dataset can be summarized as follows: about 1600 ISO/Jigdo files taking up around 1843 GB; approximately 79000 DEB package files taking up about 36 GB; around 105000 ELF files taking up about 29 GB; about 105000 ELF code sections taking up approximately 17 GB. A detailed breakdown of our dataset and sample-sets is in Table 2.
Our dataset covers 23 distinct architectures, which is in line with and comparable to Clemens (Clemens, 2015) (20 architectures) and De Nicolao et al. (De Nicolao et al., 2018) (21 architectures). Most of the CPU architectures overlap with existing works, but there are a few new ones (as marked in Table 2). At the same time, the sample-sets used in our experiments have some significant differences compared to the state of the art. First, the total number of 67285 samples in our experiments is several times larger than those used by both Clemens (Clemens, 2015) (16785 samples) and De Nicolao et al. (De Nicolao et al., 2018) (15290 samples). Second, compared to existing works, our sample-set size per architecture is both larger and more balanced. Using more balanced sample-sets should give more accurate results when evaluating and comparing classifier performance, as imbalance of classes in the dataset can cause sub-optimal classification performance (Japkowicz, 2003). When creating our sample-sets for each architecture, we set several constraints. On the one hand, we decided that the minimum code section size in the ELF file should be 4000 bytes (4K), as this is the code size at which all classifiers were shown to converge and provide high accuracy at the same time (see Fig. 2 in Clemens (Clemens, 2015)). On the other hand, we wanted to have the sample-sets as balanced as possible between all the architectures. Given these parameters, from the initial 105013 ELF files in the downloaded dataset, our toolset filtered 67285 samples, with an approximate average of 3000 samples per architecture sample-set (Table 2). Importantly, our toolset can be parametrized to download more files, and to filter the sample-sets based on different criteria as dictated by various use cases.
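The two constraints above (a minimum code-section size and roughly balanced sample-sets) can be sketched as a small filtering step. This is only an illustration under stated assumptions: the function `build_sample_sets`, its input format, and the random down-sampling strategy are hypothetical and are not taken from the released toolset; only the 4000-byte threshold and the target of about 3000 samples per architecture come from the text. The sketch does not up-sample architectures that have too few files.

```python
import random
from collections import defaultdict

MIN_CODE_SIZE = 4000     # bytes; minimum code-section size stated in the text
TARGET_PER_ARCH = 3000   # approximate per-architecture sample-set size


def build_sample_sets(candidates, min_size=MIN_CODE_SIZE,
                      per_arch=TARGET_PER_ARCH, seed=0):
    """Filter and balance candidate binaries (hypothetical helper).

    candidates: iterable of (architecture, path, code_section_size) tuples.
    Returns a dict mapping architecture -> sorted list of selected paths.
    """
    by_arch = defaultdict(list)
    for arch, path, size in candidates:
        if size >= min_size:          # drop binaries with too little code
            by_arch[arch].append(path)

    rng = random.Random(seed)         # fixed seed keeps the selection repeatable
    sample_sets = {}
    for arch, paths in by_arch.items():
        if len(paths) > per_arch:     # cap over-represented architectures
            paths = rng.sample(paths, per_arch)
        sample_sets[arch] = sorted(paths)
    return sample_sets
```

Changing `min_size` or `per_arch` corresponds to the kind of parametrization described in the paragraph above.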
3.1.1. Discussion
An important point to clarify at this stage is the reasoning behind, and the implications of, selecting the Debian repositories as the source of code binaries, and the subsequent experimentation limited to ELF binaries. To the best of our knowledge, the Debian package repositories are the only ones at this time that provide a years-long list of compiled binaries for such an extensive list of CPU and hardware architectures. Certainly, other repositories (e.g., Fedora, Ubuntu, Raspbian) can also provide a good source of compiled binaries, but in our experience and opinion they would only marginally improve the quality and the quantity of the ones provided by the Debian repository. This may also be the reason why other state of the art works relied on Debian repositories as well (Clemens, 2015; De Nicolao et al., 2018). Using the Debian packages as the source of code binaries inherently limited the datasets and the experiments (both ours and the ones in the state of the art (Clemens, 2015; De Nicolao et al., 2018)) to ELF format binaries. This, however, does not limit or impact the experiments or the applicability of the methods, since we mainly work with the raw machine code (i.e., op-codes) extracted solely from the code sections. Equivalent raw machine code can be extracted in similarly easy ways from DOS MZ, COFF, PE32(+), MACH-O, BFLT, and virtually any other executable binary format. The only important condition is that, regardless of the binary file format those op-codes come from, they must be extracted from code sections, which guarantees that those op-code byte sequences represent valid instructions for the CPU architecture they target (and which is specified in the binary format headers). Unfortunately, we are unaware of any substantial collection of non-ELF binaries that would cover an extensive list of CPU architectures.
For example, although the PE32 format supports x86, x86-64, and x64 well, and recently added support for ARM thanks to Windows IoT for Raspberry Pi (which has an ARM CPU), there is not much support for other CPUs beyond that. To summarize, it is important to emphasize that the methods evaluated or presented in this paper are not limited to ELF files, though we performed our experiments on code sections extracted solely from ELF files.
3.2. Machine Learning
We then continued with the experiment reconstruction and cross-validation of the state of the art. For training and testing our machine learning classifiers, we used the following complete feature-set, which consists of 293 features. The first 256 features are mapped to the Byte Frequency Distribution (BFD) (used by Clemens (Clemens, 2015)). The next 4 features map to "4 endianness signatures" (used by Clemens (Clemens, 2015)). The following 31 features are mapped to function epilog and function prolog signatures for amd64, arm, armel, mips32, powerpc, powerpc64, s390x, x86 (developed by the angr framework (Shoshitaishvili et al., 2016a) and also used by De Nicolao et al. (De Nicolao et al., 2018)). The final 2 features map to "powerpcspe signatures" that were developed specifically for this paper (these signatures are in the process of being contributed back to open-source projects such as angr and binwalk). To this end, we extract the mentioned features from the code sections of the ELF binaries in the sample-sets (column Sample-sets size in experiment in Table 2). We then save the extracted features into a CSV file ready for input to and processing by machine learning frameworks.
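As a rough illustration of the first 260 features described above, the sketch below computes a byte frequency distribution and four endianness signature features from a code section. Several details are assumptions rather than facts from this section: the function names are hypothetical, the counts are normalized by the code length, and the four two-byte patterns (`0x0001`, `0x0100`, `0xfffe`, `0xfeff`) are assumed to be the endianness signatures in question. The 31 prolog/epilog signatures and the 2 powerpcspe signatures are not reproduced here.

```python
# ASSUMPTION: these four two-byte patterns stand in for the endianness
# signatures; the exact patterns are not given in this section.
ENDIANNESS_PATTERNS = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")


def byte_frequency_distribution(code: bytes):
    """Return 256 relative frequencies, one per possible byte value."""
    counts = [0] * 256
    for b in code:
        counts[b] += 1
    total = len(code) or 1   # avoid division by zero on empty input
    return [c / total for c in counts]


def endianness_signatures(code: bytes):
    """Return the relative frequency of each assumed endianness pattern."""
    total = len(code) or 1
    return [code.count(p) / total for p in ENDIANNESS_PATTERNS]


def extract_features(code: bytes):
    """Concatenate BFD (256 values) and endianness features (4 values)."""
    return byte_frequency_distribution(code) + endianness_signatures(code)
```

Each call to `extract_features` yields one row of 260 values; writing such rows with the standard `csv` module, together with the architecture label, would produce a file of the kind described above.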
In order to replicate and validate the approach of Clemens (Clemens, 2015), we used the Weka framework along with the exact list and settings of classifiers used by the author (Table 3). We used non-default parameters (acquired by manual tuning) only for the neural net when training the classifier on our complete dataset, because the parameters used by Clemens (Clemens, 2015) were specific to the author's dataset. We also used only the list of architectures and features used by the author.
In order to replicate and validate the approach of De Nicolao et al. (De Nicolao et al., 2018), we used scikit-learn (Pedregosa et al., 2011). The authors used only logistic regression classifier to which they add L1 regularization as compared to Clemens (Clemens, 2015). In this paper, we implemented the logistic regression classifier both in scikit-learn and Keras (Chollet et al., 2015) (a high-level API for neural networks) in order to see if the framework used has any effect on the classification accuracy. We also used only the list of architectures and features used by the authors.
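For reference, a generic textbook form of multiclass logistic regression with L1 regularization is given below. This is a sketch under assumptions: the exact objective, the regularization strength $\lambda$, and the optimizer used in the cited work are not stated in this section, and the symbols ($x_i$ for the feature vector of sample $i$, $y_i$ for its architecture class, $w_c$ and $b_c$ for the per-class weights and bias, $N$ samples, $C$ classes) are introduced here only for illustration.

```latex
\min_{\{w_c, b_c\}_{c=1}^{C}} \;
  -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\left(w_{y_i}^{\top} x_i + b_{y_i}\right)}
            {\sum_{c=1}^{C} \exp\left(w_c^{\top} x_i + b_c\right)}
  \;+\; \lambda \sum_{c=1}^{C} \lVert w_c \rVert_1
```

The L1 term tends to drive the weights of uninformative features to exactly zero, which is one common reason to prefer it when many signature features are added to the byte frequency features.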
3.3. Hardware and Software
To perform the work from this paper we used multiple combinations of software, as summarized in Table 4. In terms of hardware we used two main machines. The server used for dataset collection and for data pre- and post-processing has a 4-core Intel(R) Xeon(R) E7-8837 2.67GHz CPU with 16 GB of DDR3, and was running CentOS 7.4 with the kernel 3.10.0-957.5.1.el7.x86_64. The host used for machine learning tasks is a standard PC with a 6-core Intel(R) Core(TM) i7-8700k 5.00GHz CPU with 12 threads, a GTX 970 graphics card, and 16 GB of DDR4, and was running Windows 10.
4. Results and analysis
4.1. Classifier performance when training and testing with code-only sections
First, we compare the performance of multiple classifiers trained on code-only sections, when classifying code-only input. For this we use 10-fold cross validation and the features extracted from code-only sections of the test binaries. Also, we evaluate the effect of various feature-sets on classification performance by calculating performance measures with the “all features” set, BFD-only features-set, and BFD+endianness features-set. We then cross-validate the results by Clemens (Clemens, 2015) as well as compare them to our results.
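The 10-fold protocol can be sketched in a few lines. The version below is a minimal, framework-independent illustration that keeps class proportions roughly equal across folds; stratification, the seed, and the function names are assumptions of this sketch and are not taken from the experimental setup.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Toy labels standing in for the architecture class of each sample.
labels = ["arm"] * 20 + ["mips"] * 30
splits = list(train_test_splits(labels))
```

Each sample is used for testing exactly once and for training in the other nine rounds, and the reported measure is the average over the ten rounds.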
Using the Weka framework and the settings presented in Table 3, we trained and tested multiple different classifiers using different feature-sets. BFD corresponds to using only byte frequency distribution, while BFD+endianness adds the architecture endianness signatures introduced by Clemens (Clemens, 2015). The complete data set includes the new architectures as well as the new signatures for powerpcspe (see also Section 4.4). The performance metrics are weighted averages, i.e., the sum of the metric over all the classes, with each class weighted by its share of instances. The results are also compared to the results presented by (Clemens, 2015), and can be observed in Table 5. We marked with an asterisk (*) the results that we obtained using different parameters than those in (Clemens, 2015).
As can be seen in Table 5, the results are in line with the ones presented by Clemens (Clemens, 2015), even though we constructed and used our own datasets. In our experiments, the complete data set (with added architectures and all features considered) increased the accuracy of all classifiers (in some cases by up to 7%) when compared to the results of Clemens (Clemens, 2015). This could be due to a combination of a larger overall dataset, more balanced sets for each CPU architecture class, and the use of only binaries that have code sections larger than 4000 bytes.
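To make the weighted-average convention concrete, the sketch below computes a per-class F1 score and then weights each class by its share of true instances. It is a plain-Python illustration with hypothetical function names and toy labels, not the code used by any of the frameworks.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1}, computed one-vs-rest for every class in y_true."""
    scores = {}
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denominator if denominator else 0.0
    return scores


def weighted_average(scores, y_true):
    """Weight each class score by its fraction of the true instances."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(scores[cls] * support[cls] / total for cls in support)


# Toy example: three arm samples (one misread as mips) and one mips sample.
y_true = ["arm", "arm", "arm", "mips"]
y_pred = ["arm", "arm", "mips", "mips"]
f1 = per_class_f1(y_true, y_pred)
weighted_f1 = weighted_average(f1, y_true)
```

With balanced classes the weighted average equals the plain mean of the per-class scores; the two diverge only when some architectures have more samples than others.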
4.1.1. Effect of test sample code size on classification performance
Next, we study if the sample size has an effect on the classification performance. For this, we test the classifiers against a test set of code sections with increasingly varying size, as also performed by both Clemens (Clemens, 2015) and De Nicolao et al. (De Nicolao et al., 2018). If the performance of such classifiers is good enough with only small fragments of the binary code, those classifiers could be used in environments where only a part of the executable is present. For example, small (128 bytes or less) code size fragments could be encountered in digital forensics when only a portion of malware or exploit code is successfully extracted from an exploited smartphone or IoT device. For this test, the code fragments were taken from code-only sections using random sampling in order to avoid any bias that could come from using only code from the beginning of code sections (Clemens, 2015). We present the results of this test in Figure 2.
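The random sampling step can be sketched as follows: each fragment is a contiguous run of bytes starting at a uniformly random offset inside the code section, so no position is favored. The fragment sizes, the seed, and the function names are illustrative assumptions.

```python
import random


def sample_fragment(code: bytes, size: int, rng: random.Random) -> bytes:
    """Return `size` contiguous bytes starting at a uniformly random offset."""
    if size <= 0 or size > len(code):
        raise ValueError("fragment size must be between 1 and len(code)")
    offset = rng.randrange(len(code) - size + 1)
    return code[offset:offset + size]


def fragments_by_size(code: bytes, sizes, seed: int = 0) -> dict:
    """Draw one random fragment per requested size that fits in the section."""
    rng = random.Random(seed)
    return {size: sample_fragment(code, size, rng)
            for size in sizes if size <= len(code)}


# Toy 5120-byte buffer standing in for a real code section.
code_section = bytes(range(256)) * 20
fragments = fragments_by_size(code_section, [8, 128, 4000, 8192])
```

Sizes larger than the section are skipped rather than padded, so every fragment consists only of bytes that actually occur in the code.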
When testing with varying sizes of test input, SVM performed the best, with almost 50% accuracy even at the smallest sample size of 8 bytes. SVM and 3 nearest neighbors both achieved 90% accuracy at 128 bytes. The logistic regression implementations in scikit-learn and Keras performed very similarly, both achieving 90% accuracy at 256 bytes. Surprisingly, logistic regression implemented in Weka under-performed and required 2048 bytes to reach 90% accuracy. Cross-validating and comparing the results, the classifiers that performed the best in our varying sample size experiments also performed well in the experiments by Clemens (Clemens, 2015). On the other hand, in our experiments not all the classifiers achieved 90% accuracy at 4000 bytes, as reported by Clemens (Clemens, 2015). For example, at 4000 bytes the Decision Tree and Random Tree classifiers in our case achieved accuracy of only 85% and 75% respectively.
4.1.2. Effect of different frameworks on performance of logistic regression
Among all the available classifiers, De Nicolao et al. (De Nicolao et al., 2018) used only logistic regression, with scikit-learn as their machine learning framework of choice. Logistic regression has a couple of parameters that affect the classification performance. The authors used grid search to identify the best value for C, the inverse of the regularization strength, and found that the value of 10000 gave the best results in their case. Since the dataset itself affects the result, we ran grid search for the scikit-learn model on our dataset. We tested the C values of 10000, 1000, 100, 10, 1, and 0.1, and found that for our case the value of 1000 gave the best results. Similarly, for the Keras model, we found that the C value of 0.0000001 provided the best accuracy. Table 6 presents our results for logistic regression using 10-fold cross-validation on code-only sections when tested in all the different frameworks.
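A grid search of this kind can be sketched with scikit-learn as shown below. The feature matrix is synthetic and only stands in for the real data, and the solver, iteration limit, fold shuffling, and random seeds are assumptions of this example rather than the settings used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the feature matrix; NOT the dataset from the paper.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Candidate values for C, the inverse of the regularization strength.
param_grid = {"C": [10000, 1000, 100, 10, 1, 0.1]}

# L1-regularized logistic regression; "saga" is one solver that supports L1.
model = LogisticRegression(penalty="l1", solver="saga", max_iter=500)

search = GridSearchCV(
    model,
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
best_c = search.best_params_["C"]
```

Because the best C depends on the data, rerunning the search on a different feature matrix can select a different value, which is consistent with the two datasets preferring different settings.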
With this experiment, for example, we found that for the same dataset, the logistic regression implemented in scikit-learn and Keras provided better results (F1 measures of 0.998) when compared to Weka (F1 measures of 0.989).
4.2. Classifier performance when training with code-only sections and testing with complete binaries
We also explored how well the classifiers perform when given the task of classifying a complete binary (i.e., containing headers, and code and data sections). In fact, De Nicolao et al. (De Nicolao et al., 2018) tested their classifier performance on complete binaries (i.e., full executables). Therefore, in this work we test all the different classifiers used by Clemens (Clemens, 2015) and De Nicolao et al. (De Nicolao et al., 2018) against complete binaries using a separate test set consisting of 500 binaries for each architecture (which is about 1.5 times more than in (De Nicolao et al., 2018)). The classifiers we used in this test are the ones previously trained using code-only sections. The results of this experiment can be seen in Table 7.
Our analysis shows that Random Forest performed the best, with the highest performance measures of 0.901 for accuracy and 0.995 for AUC. The logistic regression implemented in scikit-learn did not perform as well as reported by De Nicolao et al. (De Nicolao et al., 2018). Classifying all the binaries in this test set took only a couple of seconds with all algorithms except the Nearest Neighbor algorithms, which took approximately 15 minutes. One of the reasons for this is that the Nearest Neighbor algorithm is a lazy classifier, and the model is only built when data needs to be classified.
We also present in Figure 3 the confusion matrix of the best performing classifier in this test, i.e., the Random Forest classifier. The columns represent the class frequencies predicted by the model, while the rows represent the true class frequencies. Everything off the diagonal is a misclassification. The letters represent the 23 architecture classes in alphabetical order, as presented in Table 2. For example, the confusion matrix shows that the i386 (g) and the m68k (i) architectures caused over 70% of the misclassifications. Therefore, one direction for future work is to find the root cause of this and to develop better discriminating signatures for these architectures.
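The share of misclassifications attributable to each class can be read off a confusion matrix programmatically. The sketch below attributes an error to the true class of the sample (the row), which is one possible reading of the statement above; the matrix values and labels are toy data, not the values from Figure 3.

```python
def misclassification_share(matrix, labels):
    """matrix[i][j] = number of samples of true class i predicted as class j.

    Returns, for each true class, its fraction of all off-diagonal counts.
    """
    errors = {
        label: sum(row) - row[i]
        for i, (label, row) in enumerate(zip(labels, matrix))
    }
    total_errors = sum(errors.values())
    if total_errors == 0:
        return {label: 0.0 for label in labels}
    return {label: count / total_errors for label, count in errors.items()}


# Toy 3-class matrix: rows are true classes, columns are predicted classes.
toy_labels = ["a", "b", "c"]
toy_matrix = [
    [8, 2, 0],
    [0, 10, 0],
    [5, 1, 4],
]
shares = misclassification_share(toy_matrix, toy_labels)
```

Summing over columns instead of rows would answer the complementary question of which classes the classifier wrongly predicts most often.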
4.2.1. Testing Machine Learning as cloud implementations
In addition to the existing approaches and tools detailed above, we also employed the Azure Machine Learning platform to test code-only trained classifiers with complete binaries as input. An example of our setup and workflow in the Azure platform is presented in Figure 4, and its (comparative) performance is presented in Table 8. In summary, the Random Forest classifier implemented in Azure performed noticeably better than the one implemented in Weka. This once again highlights the long-standing research challenge that, even when using the same methods or algorithms, the implementations do matter and can affect the research results, whether in computer science (Kriegel et al., 2017) or other scientific fields (Durlak and DuPre, 2008).
4.3. Classifier performance when training and testing with complete binaries
We then verify how training the classifiers on complete binaries (i.e., not on code-only sections as in previous works) affects their performance when given complete binaries as test input. For this, we selected some of the best performing classifiers from the previous experiments. We then trained those classifiers with a training set consisting of 1000 complete binaries for each of the 23 architectures. Finally, we tested them against a test set of 1000 complete binaries for each architecture. The results of this experiment can be seen in Table 9. For comparison, along with the performance of classifiers trained and tested on complete binaries, we also present the performance of the exact same classifiers in the code-only experiments above.
Previous work did not propose or use complete binaries for training the classifiers, and used complete binaries only as classification test inputs (Clemens, 2015; De Nicolao et al., 2018). As can be seen in Table 9, all the classifiers we tested achieved over 97% accuracy, while the Random Forest implemented in Azure platform performed the best with 99.2% accuracy. The proposal of using complete binaries for both training and testing the classifiers, as well as the experimental confirmation that the accuracy of classification is comparable to existing approaches and is very high (e.g., up to 99.2%), is another incremental but novel contribution of this paper to the field.
4.3.1. Discussion
Up to this point, we evaluated our “complete binary” training and recall using only ELF format binaries, and we presented the reasons for experimenting only with ELF format binaries in Section 3.1. We are aware that executable binary file formats such as PE32, COFF, Mach-O, and ELF, even when compiled for the same CPU architecture, may have major structural differences between each other, despite the fact that their code sections may contain similar, equivalent, or highly comparable op-code byte sequences. This is due to many factors, including but not limited to file format header variations, the compiler and its options, and the way the data (e.g., strings, constants, initial values for variables) is stored in the binary file and subsequently referenced within code sections. Such structural differences could in turn influence both the training and the recall of the “complete binary” method we introduced. Therefore, we plan, as future work, to further evaluate and improve the “complete binary” approach for non-ELF-only cases, such as when dealing with other homogeneous binary format datasets (e.g., only PE32) or with heterogeneous binary format datasets (e.g., a combination of ELF, PE32, and Mach-O).
4.4. The special case of signatures and features for powerpcspe
During the testing of the classifiers' performance, we observed that some architectures were the root cause of most of the false matches produced by the classifiers. For example, the binary code for powerpc (PPC) and powerpcspe (PPCspe) is essentially the same, the only difference being the presence of SPE instructions in the powerpcspe ISA (SPE stands for Signal Processing Engine; SPE instructions perform floating-point operations on the integer registers). Therefore, we had to create custom signatures to be able to more accurately distinguish between the powerpc and powerpcspe code sequences. We created the signatures by comparing the instructions between the two architectures and finding unique ones that only appear in one of the architectures. For each analyzed architecture, the dataset for this experiment consisted of 1000 complete binaries for training, and the same amount for testing. To run this test, we employed the Random Forest classifier as implemented in the Weka framework, and the results are presented in Table 10.
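The idea of finding instructions that appear in only one of two architectures can be illustrated with a set difference over instruction keys. The sketch below assumes fixed-width 32-bit big-endian instruction words and uses the top 6 bits of each word (the PowerPC primary opcode field) as a deliberately coarse key. The procedure actually used to derive the two powerpcspe signatures is not described at this level of detail, so the key choice, the function names, and the sample words are all illustrative assumptions.

```python
import struct


def primary_opcodes(code: bytes) -> set:
    """Collect the top 6 bits of every aligned 32-bit big-endian word."""
    usable = len(code) - len(code) % 4  # ignore trailing partial word
    return {word >> 26 for (word,) in struct.iter_unpack(">I", code[:usable])}


def unique_keys(samples_a, samples_b):
    """Return (keys seen only in A, keys seen only in B)."""
    keys_a = set().union(*(primary_opcodes(s) for s in samples_a))
    keys_b = set().union(*(primary_opcodes(s) for s in samples_b))
    return keys_a - keys_b, keys_b - keys_a


# Toy code sections built from hand-picked 32-bit words (illustrative only).
sample_a = [struct.pack(">II", 0x7C0802A6, 0x10000000)]
sample_b = [struct.pack(">I", 0x7C0802A6)]
only_a, only_b = unique_keys(sample_a, sample_b)
```

A practical signature would likely need a finer key than the primary opcode alone, for example one that also includes the extended opcode bits, and the resulting patterns would then be counted per sample to produce numeric features.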
Our analysis shows that adding the two powerpcspe signatures increased the F1 score of Random Forest implemented in Weka by about two percentage points. The confusion matrix for the powerpc and powerpcspe architectures without the signatures is presented for comparison next to the confusion matrix obtained when the two signatures/features were used, and can be seen in Table .
The development and addition of the powerpcspe discriminating features, and the experimental confirmation that they improve the overall classification accuracy and confusion matrix, is another contribution of this paper to the field.
5. Conclusion
In this paper we tried to bridge multiple gaps in the field of automated and precise identification of architecture and endianness of binary files and object code. For this, we developed from scratch the toolset and datasets that are lacking in this research space. As a result, we contribute a comprehensive collection of open data, open source, and open API web-services. We performed experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state of the art methods by Clemens (Clemens, 2015) and De Nicolao et al. (De Nicolao et al., 2018). When training and testing classifiers using solely code sections from compiled binary files (e.g., ELF binaries), all our classifiers performed equally well, achieving over 98% accuracy. We have shown that our results are generally in line with the state of the art, and in some cases we managed to outperform previous work by up to 7%. Additionally, we provided with this work novel contributions to the field. One contribution is the proposal and confirmation that complete binaries can be successfully used for both training and testing machine learning classifiers. In this direction, we demonstrate a 99.2% accuracy using Random Forest classifiers implemented in the Azure platform. Another contribution is the development and validation of new discriminating features for powerpc and powerpcspe architectures. Finally, our work provides an independent confirmation of the general validity and soundness of both existing and newly proposed algorithms, features, and approaches.
5.1. Future work
There are several directions for future work, and we plan to initially focus on the following. First, we would like to continuously expand the datasets in terms of size, number of supported architectures, and quality. We plan to achieve this by using community-submitted data subsets as well as by setting up our own multi-architecture object code building infrastructures. Second, we plan to use crowdsourcing in order to gather the community's expert knowledge, which would continuously increase the performance of machine learning classifiers. One way to achieve this is to allow users to confirm or correct automatically classified results. Third, we plan to expand the work on architecture discriminating signatures. One idea is to develop advanced methods (e.g., based on machine learning) to automatically create signatures for more accurate discrimination of code’s CPU architecture, for example by using op-code specifics, and function epilogs and prologs.
6. Appendices
- bin ([n. d.]) [n. d.]. binwalk – Firmware Analysis Tool. https://github.com/binwalk/binwalk
- noh ([n. d.]) [n. d.]. UNIX System V: understanding ELF object files and debugging tools.
- Beebe et al. (2013) Nicole L Beebe, Laurence A Maddox, Lishu Liu, and Minghe Sun. 2013. Sceadan: using concatenated n-gram vectors for improved file and data type classification. IEEE Transactions on Information Forensics and Security 8, 9 (2013), 1519–1530.
- Blanco and Eissler (2012) Andrés Blanco and Matias Eissler. 2012. One firmware to monitor ’em all.
- Brumley et al. (2011) David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J Schwartz. 2011. BAP: A binary analysis platform. In International Conference on Computer Aided Verification. Springer, 463–469.
- Cha et al. (2012) Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing mayhem on binary code. In 2012 IEEE Symposium on Security and Privacy. IEEE, 380–394.
- Chipounov and Candea (2010) Vitaly Chipounov and George Candea. 2010. Reverse engineering of binary device drivers with RevNIC. In Proceedings of the 5th European conference on Computer systems. ACM, 167–180.
