SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings

Sun RuiJin; Guo Shize; Guo Jinhong; Li Wei; Zhan Dazhi; Sun Meng; Pan Zhisong

arXiv:2209.02442·cs.CR·December 27, 2023

TL;DR

SimCLF introduces an unsupervised contrastive learning framework for function-level binary code similarity detection, leveraging augmented data to improve robustness and accuracy without manual annotations.

Contribution

The paper presents a novel unsupervised contrastive learning approach to binary code embeddings that operates on disassembled functions and uses data augmentation.

Findings

01. Outperforms state-of-the-art in accuracy
02. Excels in few-shot learning scenarios
03. Operates effectively without manual annotations

Abstract

Function-level binary code similarity detection is a crucial aspect of cybersecurity. It enables the detection of bugs and patent infringements in released software and plays a pivotal role in preventing supply chain attacks. A practical embedding learning framework relies on the robustness of the assembly code representation and the accuracy of function-pair annotation, which is traditionally accomplished using supervised learning-based frameworks. However, annotating different function pairs with accurate labels poses considerable challenges. These supervised learning methods can be easily overtrained and suffer from representation robustness problems. To address these challenges, we propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings. We take an unsupervised learning approach and formulate binary code similarity detection as instance…
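As a rough illustration of what "instance discrimination" means in a contrastive setup, the sketch below computes a generic InfoNCE-style loss over a batch of function embeddings. It is not taken from the paper or its repository: the function names `cosine` and `info_nce_loss`, the temperature value of 0.05, and the use of in-batch negatives are all assumptions made for the example. Each anchor embedding should score highest against the second view of the same function and lower against every other function in the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not the paper's code).

    anchors[i] and positives[i] are two embeddings (views) of the same
    function; positives[j] with j != i act as in-batch negatives.
    The temperature of 0.05 is an assumed default.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching index i as the correct class.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch: two "functions" with orthogonal embeddings.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.0], [0.0, 1.0]]   # correctly paired views
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # deliberately mismatched views

loss_aligned = info_nce_loss(views_a, views_b)
loss_mismatched = info_nce_loss(views_a, shuffled)
```

In this toy batch the correctly paired views give a loss close to zero, while the mismatched pairing is penalized heavily, which is the behavior an instance-discrimination objective is meant to encourage.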

Code & Models

Repositories

iamawhalez/fun2vec (PyTorch, official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Software Testing and Debugging Techniques

Methods: Contrastive Learning