BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis
Fei Zuo, Cody Tompkins, Qiang Zeng, Lannan Luo, Yung Ryn Choe, Junghwan Rhee

TL;DR
This paper introduces BinSimDB, a large-scale, fine-grained benchmark dataset for binary code similarity analysis, addressing the scarcity of publicly available datasets and enabling more precise research at the snippet level.
Contribution
The authors create BinSimDB, a publicly accessible dataset with algorithms for aligning binary snippets, and empirically demonstrate its effectiveness in improving similarity analysis performance.
Findings
BinSimDB significantly enhances binary code similarity comparison accuracy.
The dataset supports fine-grained analysis at the snippet level.
Algorithms BMerge and BPair effectively bridge code discrepancies.
Abstract
Binary Code Similarity Analysis (BCSA) has a wide spectrum of applications, including plagiarism detection, vulnerability discovery, and malware analysis, thus drawing significant attention from the security community. However, conventional techniques often struggle to balance accuracy and scalability. To overcome these problems, a surge of deep learning-based work has recently been proposed. Unfortunately, many researchers still find it extremely difficult to conduct relevant studies or extend existing approaches. First, prior work typically relies on proprietary benchmarks without making the entire dataset publicly accessible. Consequently, a large-scale, well-labeled dataset for binary code similarity analysis remains precious and scarce. Moreover, previous work has primarily focused on comparing at the function level, rather than exploring other…
Taxonomy
Topics: Machine Learning in Bioinformatics
