Pre-Training Representations of Binary Code Using Contrastive Learning
Yifan Zhang, Chen Huang, Yueke Zhang, Huajie Shao, Kevin Leach, Yu Huang

TL;DR
ContraBin is a contrastive learning framework that integrates source code, comments, and binary code to pre-train binary code representations. It shows that synthetic comments help binary comprehension while human-written comments can add noise, and it improves performance on downstream binary analysis tasks.
Contribution
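Introduces ContraBin, the first model to combine source code, comments, and binary code for contrastive learning in binary analysis. It consists of a primary contrastive learning method for initial pre-training, a simplex interpolation method that integrates the three representations, and an intermediate representation learning algorithm that trains the binary code embedding.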
Findings
Synthetic comments improve binary comprehension performance.
Human-written comments can introduce noise and reduce accuracy.
ContraBin outperforms existing methods on multiple binary analysis tasks.
Abstract
Binary code analysis and comprehension is critical to applications in reverse engineering and computer security tasks where source code is not available. Unfortunately, unlike source code, binary code lacks semantics and is more difficult for human engineers to understand and analyze. In this paper, we present ContraBin, a contrastive learning technique that integrates source code and comment information along with binaries to create an embedding capable of aiding binary analysis and comprehension tasks. Specifically, we present three components in ContraBin: (1) a primary contrastive learning method for initial pre-training, (2) a simplex interpolation method to integrate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to train a binary code embedding. We further analyze the impact of human-written and synthetic comments on binary code…
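The abstract names a contrastive objective and a simplex interpolation step but gives no formulas. The following is a minimal sketch of how the two ideas could fit together, not the authors' implementation. It assumes a generic InfoNCE-style loss, interpolation weights drawn uniformly from the simplex, and made-up toy embeddings. The function names `simplex_interpolate` and `info_nce` are hypothetical. The third component, intermediate representation learning, is not covered here.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def simplex_interpolate(source, comment, binary, rng):
    """Mix three embeddings with random weights drawn from the 2-simplex."""
    # Normalized Gamma(1, 1) draws are a uniform sample from the simplex
    # (Dirichlet(1, 1, 1)): the weights are non-negative and sum to 1.
    # Uniform sampling is an assumption of this sketch.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(3)]
    total = sum(draws)
    weights = [d / total for d in draws]
    mixed = [
        weights[0] * s + weights[1] * c + weights[2] * b
        for s, c, b in zip(source, comment, binary)
    ]
    return mixed, weights


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j]."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    losses = []
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


rng = random.Random(0)
# Toy 4-dimensional embeddings for a batch of three programs (made-up values).
binary_emb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
source_emb = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 0.1, 0.9, 0.0]]
comment_emb = [[0.8, 0.0, 0.2, 0.0], [0.0, 0.8, 0.0, 0.2], [0.2, 0.0, 0.8, 0.0]]

# Each binary embedding is paired with a random mixture of its own
# source, comment, and binary views.
mixed_emb = [
    simplex_interpolate(s, c, b, rng)[0]
    for s, c, b in zip(source_emb, comment_emb, binary_emb)
]
aligned_loss = info_nce(binary_emb, mixed_emb)
# Rotating the positives pairs every binary with the wrong program.
shuffled_loss = info_nce(binary_emb, mixed_emb[1:] + mixed_emb[:1])
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when each binary embedding is paired with the mixture built from its own program than when the pairs are rotated, which is the behavior a contrastive objective is meant to reward.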
Taxonomy
Topics: Software Reliability and Analysis Research · Software Engineering Research · Advanced Malware Detection Techniques
Methods: Contrastive Learning
