KEENHash: Hashing Programs into Function-Aware Embeddings for Large-Scale Binary Code Similarity Analysis
Zhijie Liu, Qiyi Tang, Sen Nie, Shi Wu, Liang Feng Zhang, Yutian Tang

TL;DR
KEENHash introduces a fast, scalable hashing method for large-scale binary code similarity analysis using LLM-generated function embeddings, significantly outperforming existing tools in speed and effectiveness.
Contribution
The paper proposes KEENHash, a novel program-level hashing approach that combines LLM-generated function embeddings with K-Means clustering and Feature Hashing to enable efficient large-scale binary similarity analysis.
Findings
KEENHash is at least 215 times faster than state-of-the-art function matching tools.
KEENHash can evaluate 5.3 billion similarities in under 6.5 minutes.
KEENHash outperforms four existing methods by at least 23.16% in large-scale BCSA tasks.
Abstract
Binary code similarity analysis (BCSA) is a crucial research area in many fields, such as cybersecurity. Function-level diffing tools are the most widely used in BCSA: they evaluate the similarity between binary programs by matching functions one by one. However, such methods incur high time complexity, which makes them unscalable in large-scale scenarios (e.g., 1/n-to-n search). Towards effective and efficient program-level BCSA, we propose KEENHash, a novel hashing approach that hashes binaries into program-level representations through large language model (LLM)-generated function embeddings. KEENHash condenses a binary into one compact, fixed-length program embedding using K-Means and Feature Hashing, enabling effective and efficient large-scale program-level BCSA that surpasses previous state-of-the-art methods. The experimental results show that…
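To make the idea of condensing per-function embeddings into one fixed-length program embedding more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract names only K-Means and Feature Hashing, so everything else is an assumption made for illustration, including the random vectors standing in for LLM-generated function embeddings, the use of cluster IDs as the hashed tokens, the signed hashing trick, the dimension of 64, the cosine comparison, and every function name.

```python
import hashlib
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def program_embedding(function_embeddings, centroids, dim=64):
    """Hash the cluster ID of each function into a fixed-length vector.

    Assumption: the hashed token is the cluster ID, with a sign bit taken
    from the same digest (the standard signed hashing trick).
    """
    vec = [0.0] * dim
    for emb in function_embeddings:
        token = f"cluster:{nearest(emb, centroids)}".encode()
        digest = hashlib.sha256(token).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


# Toy stand-ins for function embeddings: two "programs" of different sizes.
rng = random.Random(1)
prog_a = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]
prog_b = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]

centroids = kmeans(prog_a + prog_b, k=5)
emb_a = program_embedding(prog_a, centroids)
emb_b = program_embedding(prog_b, centroids)
print(len(emb_a), len(emb_b))  # both vectors have length dim
print(cosine(emb_a, emb_b))
```

The point of the sketch is that both programs end up as vectors of the same length regardless of how many functions they contain, so comparing two programs costs one vector operation rather than a function-by-function matching pass.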
Taxonomy
Topics: Software Testing and Debugging Techniques · Web Data Mining and Analysis · Algorithms and Data Compression
