# Towards usable automated detection of CPU architecture and endianness for arbitrary binary files and object code sequences

**Authors:** Sami Kairajärvi, Andrei Costin, Timo Hämäläinen

arXiv: 1908.05459 · 2021-08-24

## TL;DR

This paper introduces a new open dataset, toolset, and evaluation framework for accurately identifying CPU architecture and endianness in binary files, addressing current gaps in research and enabling better comparison of methods.

## Contribution

It develops comprehensive open datasets, tools, and APIs for automated architecture and endianness detection, facilitating evaluation and advancement of existing techniques.

## Key findings

- All classifiers exceeded 98% accuracy when trained and tested solely on code sections from executable binary files.
- The new datasets and tools enable effective benchmarking of detection methods.
- Results are consistent and comparable with the current state of the art, which supports the general validity of the existing algorithms.

## Abstract

Static and dynamic binary analysis techniques are actively used to reverse engineer software's behavior and to detect its vulnerabilities, even when only the binary code is available for analysis. To avoid analysis errors due to misreading opcodes for the wrong CPU architecture, these analysis tools must precisely identify the Instruction Set Architecture (ISA) of the object code under analysis. The variety of CPU architectures that modern security and reverse engineering tools must support is ever increasing due to the massive proliferation of IoT devices and the diversity of firmware and malware targeting those devices. Recent studies concluded that misidentifying the binary code's ISA alone caused about 10% of IoT firmware analysis failures. The state-of-the-art approaches to detecting the ISA of arbitrary object code look promising: their results demonstrate effectiveness and high performance. However, they lack the support of publicly available datasets and toolsets, which makes the evaluation, comparison, and improvement of those techniques, datasets, and machine learning models quite challenging (if not impossible). This paper bridges multiple gaps in the field of automated and precise identification of architecture and endianness of binary files and object code. We develop from scratch the toolset and datasets that are lacking in this research space. As such, we contribute a comprehensive collection of open data, open source, and open API web-services. We also attempt experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state-of-the-art methods. When training and testing classifiers using solely code sections from executable binary files, all our classifiers performed equally well, achieving over 98% accuracy. The results are consistent and comparable with the current state of the art, and hence support the general validity of the algorithms.
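The abstract does not spell out which features the classifiers consume. As a rough illustration only, the sketch below shows one feature family commonly used for this kind of task: a 256-bin byte-frequency histogram extended with counts of byte pairs that tend to differ between big-endian and little-endian code. The function names, the choice of byte pairs, and the normalisation are assumptions made for this example and are not taken from the paper's toolset.

```python
from collections import Counter

# Assumption for illustration: small constants such as +1 and -2 are stored as
# ... 00 01 / ... ff fe on big-endian targets and 01 00 ... / fe ff ... on
# little-endian targets, so these byte pairs can hint at endianness.
ENDIAN_PAIRS = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")


def byte_histogram(code: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 byte values."""
    if not code:
        return [0.0] * 256
    counts = Counter(code)
    return [counts.get(value, 0) / len(code) for value in range(256)]


def count_overlapping(data: bytes, pattern: bytes) -> int:
    """Count occurrences of pattern in data, including overlapping ones."""
    count = 0
    start = data.find(pattern)
    while start != -1:
        count += 1
        start = data.find(pattern, start + 1)
    return count


def endianness_features(code: bytes) -> list[float]:
    """Return the frequency of each endianness-hinting byte pair."""
    pairs_in_code = max(len(code) - 1, 1)
    return [count_overlapping(code, pair) / pairs_in_code for pair in ENDIAN_PAIRS]


def feature_vector(code: bytes) -> list[float]:
    """Concatenate histogram and endianness features (260 values in total)."""
    return byte_histogram(code) + endianness_features(code)


if __name__ == "__main__":
    # Four big-endian encodings of the 32-bit constant 1.
    sample = b"\x00\x00\x00\x01" * 4
    vector = feature_vector(sample)
    print(len(vector), vector[256:])
```

A vector like this could be passed to any off-the-shelf classifier. In line with the abstract, the input would be the bytes of a code section rather than a whole file, since headers and data sections would skew the byte statistics.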

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.05459/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1908.05459/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/1908.05459/full.md

---
Source: https://tomesphere.com/paper/1908.05459