SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings
Sun RuiJin, Guo Shize, Guo Jinhong, Li Wei, Zhan Dazhi, Sun Meng, Pan Zhisong

TL;DR
SimCLF introduces an unsupervised contrastive learning framework for function-level binary code similarity detection, leveraging augmented data to improve robustness and accuracy without manual annotations.
Contribution
It presents a novel unsupervised contrastive learning approach for binary code embeddings that operates on disassembled functions and uses data augmentation techniques.
Findings
Outperforms state-of-the-art methods in accuracy
Excels in few-shot learning scenarios
Operates effectively without manual annotations
Abstract
Function-level binary code similarity detection is a crucial aspect of cybersecurity. It enables the detection of bugs and patent infringements in released software and plays a pivotal role in preventing supply chain attacks. A practical embedding learning framework relies on the robustness of the assembly code representation and the accuracy of function-pair annotation, which is traditionally accomplished using supervised learning-based frameworks. However, annotating different function pairs with accurate labels poses considerable challenges. These supervised learning methods can be easily overtrained and suffer from representation robustness problems. To address these challenges, we propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings. We take an unsupervised learning approach and formulate binary code similarity detection as instance…
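The instance-discrimination formulation mentioned in the abstract is commonly trained with an InfoNCE-style contrastive loss. The following is a minimal sketch of that general technique, not the paper's implementation: the function names, the cosine-similarity choice, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration. Each function embedding is pulled toward the embedding of its own augmented view and pushed away from the other functions in the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(views_a, views_b, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    views_a[i] and views_b[i] are embeddings of two augmented views of the
    same function (the positive pair). Every views_b[j] with j != i serves
    as an in-batch negative for views_a[i].
    """
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (assumed values): two functions, two views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]
mismatched_b = [views_b[1], views_b[0]]

loss_matched = info_nce_loss(views_a, views_b)
loss_mismatched = info_nce_loss(views_a, mismatched_b)
```

In this toy batch, `loss_matched` is close to zero because each anchor is most similar to its own positive, while `loss_mismatched` is large because the positives have been swapped. No labels are involved: the pairing comes entirely from augmentation, which is what makes the setup unsupervised.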
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Software Testing and Debugging Techniques
Methods: Contrastive Learning
