CCBERT: Self-Supervised Code Change Representation Learning
Xin Zhou, Bowen Xu, DongGyun Han, Zhou Yang, Junda He, David Lo

TL;DR
CCBERT introduces a Transformer-based model for learning detailed, self-supervised representations of code changes, significantly outperforming prior methods like CC2Vec and CodeBERT in various tasks while being more efficient.
Contribution
This work presents CCBERT, a novel self-supervised Transformer model that captures fine-grained code change semantics at the token level, improving over existing approaches in effectiveness and efficiency.
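To make "token level" concrete, here is a minimal sketch of how a code change can be viewed as a sequence of kept, deleted, and added tokens rather than whole changed lines. This is an illustration only, not CCBERT's actual preprocessing: the `tokenize` and `token_level_change` helpers and the `keep`/`del`/`add` labels are hypothetical names, and the alignment is delegated to Python's standard `difflib`.

```python
import difflib
import re


def tokenize(code):
    # Crude lexer (an assumption for illustration): identifiers and numbers
    # become single tokens, every other non-space character is its own token.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def token_level_change(old, new):
    # Align old and new token sequences and label each token.
    old_toks, new_toks = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=old_toks, b=new_toks, autojunk=False)
    labeled = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            labeled += [("keep", t) for t in old_toks[i1:i2]]
        else:
            # "replace", "delete", and "insert" all reduce to removed
            # old tokens followed by added new tokens.
            labeled += [("del", t) for t in old_toks[i1:i2]]
            labeled += [("add", t) for t in new_toks[j1:j2]]
    return labeled


if __name__ == "__main__":
    for label, token in token_level_change("return a + b", "return a - b"):
        print(label, token)
```

A line-level view would record only that one line was removed and one was added; the token-level view isolates the single operator that changed, which is the kind of fine-grained signal the contribution statement refers to.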
Findings
CCBERT outperforms CC2Vec and state-of-the-art models by 7.7-14.0% on multiple tasks.
CCBERT requires 6-10x less training time and 5-30x less inference time than large pre-trained models.
CCBERT uses less GPU memory while achieving superior performance.
Abstract
Numerous code changes are made by developers in their daily work, and a superior representation of code changes is desired for effective code change analysis. Recently, Hoang et al. proposed CC2Vec, a neural network-based approach that learns a distributed representation of code changes to capture the semantic intent of the changes. Despite demonstrated effectiveness in multiple tasks, CC2Vec has several limitations: 1) it considers only coarse-grained information about code changes, and 2) it relies on log messages rather than the self-contained content of the code changes. In this work, we propose CCBERT (Code Change BERT), a new Transformer-based pre-trained model that learns a generic representation of code changes based on a large-scale dataset containing massive unlabeled code changes. CCBERT is pre-trained on four proposed self-supervised…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Web Data Mining and Analysis
