Contrastive Code Representation Learning

Paras Jain; Ajay Jain; Tianjun Zhang; Pieter Abbeel; Joseph E. Gonzalez; Ion Stoica

arXiv:2007.04973·cs.LG·January 10, 2022

TL;DR

This paper introduces ContraCode, a contrastive learning approach for source code representations that emphasizes program functionality over superficial token reconstruction, leading to improved robustness and semantic understanding.

Contribution

The paper proposes a contrastive pre-training method for code that makes learned representations robust to semantics-preserving edits, and introduces a new zero-shot JavaScript code clone detection dataset.

Findings

01. Improves JavaScript summarization and TypeScript type inference by 2-13%.

02. Outperforms RoBERTa by 39% AUROC in adversarial clone detection.

03. Produces more semantically meaningful code representations.

Abstract

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code…
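The objective described in the abstract, picking out a functionally equivalent variant of a program among non-equivalent distractors, is commonly written as an InfoNCE-style loss. The abstract shown here does not name the exact loss, so the following is a minimal sketch under that assumption, not the authors' implementation. The function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are all hypothetical choices for illustration; in practice the embeddings would come from an encoder applied to a program and its compiler-transformed variants.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query embedding (assumed formulation).

    query:     embedding of a program
    positive:  embedding of a functionally equivalent variant
    negatives: embeddings of non-equivalent distractor programs
    """
    q = l2_normalize(query)
    keys = [l2_normalize(k) for k in [positive] + negatives]
    # Similarity of the query to every key, sharpened by the temperature.
    logits = [dot(q, k) / temperature for k in keys]
    # Numerically stable log-sum-exp over all keys.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy with the positive (index 0) as the correct class.
    return log_sum_exp - logits[0]


# Toy usage: the positive lies close to the query, the distractors do not.
query = [1.0, 0.0]
variant = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
print(info_nce_loss(query, variant, distractors))
```

The loss is small when the query is more similar to its variant than to any distractor, and grows when a distractor is the closest key, which is what pushes the encoder toward functionality rather than surface form.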

Code & Models

Repositories

parasj/contracode (PyTorch, official)

Taxonomy

Methods: Linear Layer · Adam · Linear Warmup With Linear Decay · Dropout · Multi-Head Attention · Layer Normalization · Residual Connection · Attention Is All You Need · Attention Dropout