Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty

TL;DR
DISCO is a self-supervised model that effectively detects code similarity and vulnerabilities using targeted data augmentation and a novel pre-training approach, outperforming larger models despite using significantly less data.
Contribution
The paper introduces DISCO, a pre-trained Transformer model that leverages structure-guided code transformations and a new training objective to improve code similarity detection with less data.
Findings
DISCO outperforms state-of-the-art models in vulnerability detection.
DISCO remains effective with only 5% of the data used by previous models.
Synthetic data augmentation enhances model performance.
Abstract
Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datasets. Rather, we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security bugs, augmenting the collected datasets in a targeted way. We propose to pre-train the Transformer model with such automatically generated program contrasts to better identify similar code in the wild and differentiate vulnerable programs from benign ones. To better capture the structural features of source code, we propose a new cloze objective to encode the…
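The abstract's idea of "program contrasts" can be illustrated with a small sketch. This is not the authors' implementation: the two transformations below (consistent renaming of local variables, and weakening a `<` bound check to `<=`) are illustrative assumptions chosen for brevity, Python's standard `ast` module stands in for whatever parser the paper uses, and the names `make_clone`, `inject_bug`, `run`, and `SOURCE` are hypothetical. The sketch produces one functionally equivalent variant (a synthetic clone) and one subtly broken variant (an injected bug) from the same function.

```python
import ast

# Hypothetical input program used only for this illustration.
SOURCE = '''
def get(items, index):
    if 0 <= index < len(items):
        return items[index]
    return None
'''


class RenameLocals(ast.NodeTransformer):
    """Rename parameters and assigned variables; behavior is unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


class WeakenBoundCheck(ast.NodeTransformer):
    """Turn every `<` into `<=`, an assumed stand-in for an off-by-one bug."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node


def make_clone(source):
    """Return source text that looks different but computes the same thing."""
    tree = ast.parse(source)
    # Collect only names the program itself binds, so builtins such as
    # `len` are left alone.
    local_names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            local_names.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            local_names.add(node.id)
    mapping = {name: f"v{i}" for i, name in enumerate(sorted(local_names))}
    tree = RenameLocals(mapping).visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def inject_bug(source):
    """Return source text that looks similar but misbehaves at the boundary."""
    tree = WeakenBoundCheck().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run(source, func_name, *args):
    """Execute source text and call one of its functions."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)
```

Under these assumptions, `make_clone(SOURCE)` yields a positive example whose text differs from the original while its behavior matches, and `inject_bug(SOURCE)` yields a negative example that is textually close to the original but raises `IndexError` when `index` equals `len(items)`. Pairs of this kind are what a model would be pre-trained to pull together or push apart.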
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Web Application Security Vulnerabilities
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout
