SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, Xin Jiang

TL;DR
SynCoBERT introduces a syntax-guided multi-modal contrastive pre-training method that leverages code, comments, and AST structures to produce superior code representations for various code intelligence tasks.
Contribution
The paper proposes SynCoBERT, a novel pre-training approach that incorporates syntax and multi-modal contrastive learning to enhance code representations beyond existing models.
Findings
Achieves state-of-the-art results on four code intelligence tasks.
Effectively exploits multi-modal information from code, comments, and ASTs.
Demonstrates the benefit of syntax-guided objectives in code pre-training.
Abstract
Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically treat source code as a plain sequence of tokens, or inject structural information (e.g., AST and data flow) into sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specifically, we design two novel pre-training objectives originating from…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Contrastive Learning · CuBERT · Weight Decay · Softmax · Dense Connections · Dropout
