BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching
Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, Yuqun Zhang

TL;DR
BinaryAI introduces a two-phase binary-to-source code matching technique using transformer-based embeddings, significantly improving the accuracy and robustness of software composition analysis for identifying third-party libraries in binaries.
Contribution
The paper presents BinaryAI, a novel binary-to-source software composition analysis (SCA) method that leverages semantic embeddings and link-time locality to improve third-party library (TPL) detection accuracy over existing techniques.
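As a rough illustration of what embedding-based binary-to-source matching involves, the sketch below ranks candidate source functions by cosine similarity to a binary function's embedding. It is a minimal toy under stated assumptions, not BinaryAI's implementation: the three-dimensional vectors, the `lib_a/func_1`-style labels, and the helper names `cosine` and `top_k` are all hypothetical, and a real system would obtain its embeddings from a trained model rather than hand-written lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, corpus, k=1):
    """Return the k corpus labels whose embeddings are most similar to query."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical source-function embeddings (toy values, not model output).
source_embeddings = {
    "lib_a/func_1": [0.9, 0.1, 0.0],
    "lib_b/func_2": [0.1, 0.8, 0.3],
    "lib_c/func_3": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of one function recovered from a binary.
binary_function_embedding = [0.85, 0.15, 0.05]

candidates = top_k(binary_function_embedding, source_embeddings, k=2)
print(candidates)
```

Retrieving several candidates rather than one leaves room for a later step to re-rank or filter them, which is the general shape a two-phase pipeline takes.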
Findings
BinaryAI achieves 22.54% recall@1, outperforming the state-of-the-art.
BinaryAI increases TPL detection precision to 85.84%.
BinaryAI outperforms commercial SCA tools in recall and precision.
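For readers unfamiliar with the recall@1 metric cited above, the following sketch shows how recall@k is commonly computed for a retrieval task: the fraction of queries whose correct match appears among the top k returned candidates. The function name `recall_at_k`, the query identifiers, and the toy rankings are assumptions made for illustration and are not taken from the paper's evaluation.

```python
def recall_at_k(ranked_results, ground_truth, k=1):
    """Fraction of queries whose true match is within the top-k candidates."""
    if not ground_truth:
        return 0.0
    hits = 0
    for query_id, true_match in ground_truth.items():
        # A query with no returned candidates counts as a miss.
        if true_match in ranked_results.get(query_id, [])[:k]:
            hits += 1
    return hits / len(ground_truth)


# Hypothetical ranked candidates for four binary functions.
ranked_results = {
    "bin_f1": ["src_a", "src_b"],
    "bin_f2": ["src_c", "src_d"],
    "bin_f3": ["src_e", "src_f"],
    "bin_f4": ["src_g", "src_h"],
}

# Hypothetical ground-truth source function for each binary function.
ground_truth = {
    "bin_f1": "src_a",  # correct at rank 1
    "bin_f2": "src_d",  # correct at rank 2
    "bin_f3": "src_z",  # not retrieved
    "bin_f4": "src_y",  # not retrieved
}

print(recall_at_k(ranked_results, ground_truth, k=1))
print(recall_at_k(ranked_results, ground_truth, k=2))
```

With k=1 only the top-ranked candidate counts, so the metric is strict: a correct match at rank 2 is still scored as a miss.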
Abstract
While third-party libraries (TPLs) are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis (SCA), proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering since binary and source code exhibit substantial disparities after compilation. The existing binary-to-source SCA techniques leverage basic syntactic features that suffer from redundancy and lack robustness in the large-scale TPL dataset, leading to inevitable false positives and compromised recall. To mitigate these limitations, we introduce…
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Advanced Malware Detection Techniques
