BCD: A Cross-Architecture Binary Comparison Database Experiment Using Locality Sensitive Hashing Algorithms
Haoxi Tan

TL;DR
This paper introduces a framework using MinHash for cross-architecture binary code similarity search, aiding reverse engineers in understanding unknown binaries by comparing functions against a database of known code snippets.
Contribution
It presents a novel comparison of hashing algorithms for code similarity detection and implements an open-source database framework for cross-architecture binary comparison.
Findings
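MinHash outperforms SimHash, SSDEEP and TLSH at detecting similar snippets of lifted LLVM IR code.
The framework supports cross-architecture comparison by matching functions from an unknown binary against a database of known functions.
The implementation is open source and available at https://github.com/h4sh5/bcddb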
Abstract
Given a binary executable without source code, it is difficult to determine what each function does by reverse engineering alone, and harder still without prior experience and context. In this paper, we compare the effectiveness of different hashing functions at detecting similar snippets of lifted LLVM IR code, and present the design and implementation of a framework for a cross-architecture binary code similarity search database. The framework uses MinHash as its hashing algorithm, chosen over SimHash, SSDEEP and TLSH. The motivation is to help reverse engineers quickly gain context about the functions in an unknown binary by comparing them against a database of known functions. The code for this project is open source and can be found at https://github.com/h4sh5/bcddb
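To make the core idea concrete, here is a minimal sketch of how MinHash can estimate the similarity of two token streams, such as tokens taken from lifted LLVM IR. It is an illustration under stated assumptions, not the project's implementation: the function names (shingles, minhash_signature, estimate_jaccard), the whitespace tokenization, the 3-token shingle size, the 128 hash seeds, and the two IR-like snippets are all hypothetical choices made for this example.

```python
import hashlib


def shingles(tokens, k=3):
    """Return the set of k-token shingles (assumed shingle size: 3)."""
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """For each seed, keep the minimum hash value over all items."""
    if not items:
        raise ValueError("cannot build a MinHash signature of an empty set")
    return [min(_seeded_hash(seed, it) for it in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Hypothetical, simplified IR-like token streams for the same function
    # lifted from two architectures; real lifted IR is far noisier.
    ir_a = "load i32 ptr add i32 val 1 store i32 val ptr ret void".split()
    ir_b = "load i32 ptr add i32 val 1 store i32 val ptr br label ret void".split()

    set_a, set_b = shingles(ir_a), shingles(ir_b)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)

    exact = len(set_a & set_b) / len(set_a | set_b)
    print("exact Jaccard:    ", round(exact, 3))
    print("MinHash estimate: ", round(estimate_jaccard(sig_a, sig_b), 3))
```

Because a signature has a fixed length regardless of function size, signatures can be stored per function in a database and compared cheaply. A production system would typically add locality sensitive hashing over the signatures so that a lookup does not have to compare against every stored function.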
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Web Data Mining and Analysis
