TL;DR
This paper systematically studies basic features for binary code similarity analysis (BCSA) using interpretable models. It shows that simple features can match deep learning approaches and that the choice of compilation and binary analysis tools significantly affects BCSA results.
Contribution
The paper provides the first systematic analysis of interpretable features in BCSA and releases its source code and benchmarks to facilitate future research.
Findings
Simple interpretable models can match deep learning performance
Binary compilation and analysis tools significantly impact BCSA results
Source code and benchmarks are publicly released to support reproducibility
Abstract
Binary code similarity analysis (BCSA) is widely used for diverse security applications, including plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark, sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for…
