Soft-Labeled Contrastive Pre-training for Function-level Code Representation
Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen, Nan Duan

TL;DR
SCodeR is a soft-labeled contrastive pre-training framework that builds semantic-aware positive samples from code comments and syntax sub-trees to improve function-level code representations, achieving state-of-the-art results on four code-related tasks.
Contribution
The paper proposes SCodeR, a novel contrastive pre-training approach with soft labels and semantic-aware positive sample construction for better code representations.
Findings
Achieves state-of-the-art performance on four code-related tasks.
Soft-labeled contrastive learning improves code representation quality.
Semantic-aware positive samples outperform transformation-based methods.
Abstract
Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods to learn function-level Code Representation. Considering the relevance between codes in a large-scale code corpus, the soft-labeled contrastive pre-training obtains fine-grained soft labels in an iterative adversarial manner and uses them to learn better code representations. Positive sample construction is another key to contrastive pre-training. Previous works use transformation-based methods such as variable renaming to generate semantically equivalent positive codes. However, these usually yield generated code with a highly similar surface form, and thus mislead the model into focusing on superficial code structure…
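To make the idea of soft labels concrete, here is a minimal sketch of a contrastive loss that accepts a soft target distribution per query instead of a single one-hot positive. It is an illustration under stated assumptions, not the paper's implementation: the function names, the temperature value, and the way the soft labels are supplied are all hypothetical, and the iterative adversarial procedure that produces the soft labels is not shown.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def soft_label_contrastive_loss(similarities, soft_labels, temperature=0.05):
    """Cross-entropy between soft targets and the softmax over similarities.

    similarities[i][j]: similarity between query i and candidate j.
    soft_labels[i][j]:  target relevance of candidate j to query i
                        (each row is assumed to sum to 1; hypothetical input).
    Returns the loss averaged over queries.
    """
    total = 0.0
    for sims, targets in zip(similarities, soft_labels):
        log_probs = log_softmax([s / temperature for s in sims])
        total += -sum(t * lp for t, lp in zip(targets, log_probs))
    return total / len(similarities)


# Two queries, two candidates; the diagonal holds the constructed positives.
sims = [[1.0, 0.0], [0.0, 1.0]]
hard = [[1.0, 0.0], [0.0, 1.0]]   # one-hot labels: every other sample is a pure negative
soft = [[0.9, 0.1], [0.1, 0.9]]   # soft labels: the other sample is partly relevant

hard_loss = soft_label_contrastive_loss(sims, hard, temperature=1.0)
soft_loss = soft_label_contrastive_loss(sims, soft, temperature=1.0)
```

With one-hot rows the function reduces to the usual InfoNCE-style cross-entropy, so the soft labels only change how much probability mass the model is asked to place on samples other than the constructed positive.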
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
