TL;DR
This paper introduces ContraCode, a contrastive learning approach for source code representations that emphasizes program functionality over superficial token reconstruction, leading to improved robustness and semantic understanding.
Contribution
The paper proposes a novel contrastive pre-training method for code that enhances semantic robustness and introduces a new dataset for code clone detection.
Findings
Improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%.
Outperforms RoBERTa by 39% AUROC in adversarial clone detection.
Produces more semantically meaningful code representations.
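As a rough illustration of the AUROC metric cited above, the sketch below scores candidate program pairs by similarity and measures how often a true clone pair outranks a non-clone pair. The function name `auroc` and the toy scores and labels are hypothetical, chosen for illustration only; they are not data or code from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive pair outscores a random negative pair.

    scores: similarity score per program pair (higher = more likely a clone).
    labels: 1 for a true clone pair, 0 for a non-clone pair.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up similarity scores for five program pairs (illustrative only).
toy_scores = [0.95, 0.80, 0.40, 0.85, 0.10]
toy_labels = [1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of clones from non-clones.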
Abstract
Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code…
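The objective described in the abstract, picking a functionally similar variant out of many distractors, can be sketched with an InfoNCE-style loss. This is a minimal sketch under stated assumptions: the abstract does not give the exact loss, so the cosine-similarity scoring, the temperature value, and the names `cosine` and `info_nce_loss` are illustrative choices, and the two-dimensional embeddings are toy values rather than encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, distractors, temperature=0.07):
    """Cross-entropy of selecting the positive among positive + distractors.

    query:       embedding of one program (assumed to come from an encoder).
    positive:    embedding of a semantics-preserving variant of that program.
    distractors: embeddings of non-equivalent programs.
    temperature: assumed scaling hyperparameter, not taken from the source.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive (index 0).
    return log_sum_exp - logits[0]


# Toy embeddings: the variant sits close to the query, the distractors do not.
query = [1.0, 0.0]
variant = [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.2]]

loss_aligned = info_nce_loss(query, variant, others)
# Treating a distractor as the "positive" gives a much larger loss.
loss_misaligned = info_nce_loss(query, others[0], [variant, others[1]])
```

The loss is small when the variant's embedding is the closest candidate to the query and large otherwise, which is what pushes an encoder toward representations that depend on functionality rather than surface form.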
Taxonomy
Methods: Linear Layer · Adam · Linear Warmup With Linear Decay · Dropout · Multi-Head Attention · Layer Normalization · Residual Connection · Attention Is All You Need · Attention Dropout
