Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

Dongkwan Kim; Eunsoo Kim; Sang Kil Cha; Sooel Son; and Yongdae Kim

arXiv:2011.10749·cs.SE·July 8, 2022

TL;DR

This paper systematically studies basic features in binary code similarity analysis (BCSA) using interpretable models. It finds that simple features can match deep learning approaches, and that the choice of compilation and binary analysis tools significantly affects BCSA results.

Contribution

The paper provides the first systematic analysis of interpretable features in BCSA and releases its source code and benchmarks to facilitate future research.

Findings

01 Simple interpretable models can match deep learning performance

02 Binary compilation and analysis tools significantly impact BCSA results

03 Public release of source code and benchmarks to support reproducibility

Abstract

Binary code similarity analysis (BCSA) is widely used for diverse security applications, including plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark, sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for…
