Commit2Vec: Learning Distributed Representations of Code Changes
Rocío Cabrera Lozoya, Arnaud Baumann, Antonino Sabetta, Michele Bezzi

TL;DR
This paper introduces Commit2Vec, a deep learning approach that uses syntactic structure to generate representations of code changes, improving classification of security-relevant commits through transfer learning and pretraining strategies.
Contribution
It adapts a structural code representation technique to source code commits and shows that, for commit classification, pretraining on a closely related pretext task can matter more than pretraining on a larger dataset.
Findings
Structural representations outperform token-based ones.
Pretraining on a smaller dataset for a pretext task closely related to the target task outperforms pretraining on a much larger dataset for a loosely related task.
Transfer learning enhances commit classification accuracy.
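To make the contrast between token-based and structural views of a code change concrete, here is a minimal sketch. It is not the paper's pipeline: it assumes Python source and the standard `ast` module, and it uses (parent node type, child node type) pairs as a stand-in for whatever structural features the authors extract. The helper names (`token_features`, `structural_features`, `change_features`) and the example commit are hypothetical.

```python
import ast

def token_features(source: str) -> set:
    """Token-level view: the set of whitespace-separated tokens."""
    return set(source.split())

def structural_features(source: str) -> set:
    """Structural view: (parent type, child type) pairs taken from the AST."""
    pairs = set()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            pairs.add((type(parent).__name__, type(child).__name__))
    return pairs

def change_features(before: str, after: str, extract) -> set:
    """Assumed change representation: features found in exactly one version."""
    return extract(before) ^ extract(after)

# Hypothetical commit: a direct comparison is replaced by a function call.
before = "def check(user, pw):\n    return pw == user.pw\n"
after = "def check(user, pw):\n    return compare_digest(pw, user.pw)\n"

structural_delta = change_features(before, after, structural_features)
token_delta = change_features(before, after, token_features)

print(sorted(structural_delta))
print(sorted(token_delta))
```

In this toy case the structural delta records that a `Compare` under `Return` became a `Call`, while the token delta only lists changed strings, some of which depend on spacing and punctuation.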
Abstract
Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits. Because our method uses transfer learning (that is, we train a network on a "pretext task" for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two…
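The transfer-learning setup described in the abstract (train on a pretext task with abundant labels, then reuse the network for the target task) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' architecture: the model is a mean-of-embeddings encoder with a logistic head, the feature vocabulary and both datasets are made up, and all names (`init_model`, `train`, `transfer`) are hypothetical.

```python
import math
import random

def init_model(vocab, dim, rng):
    """Small random embeddings (the encoder) plus a zero-initialized head."""
    return {
        "embed": {f: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for f in vocab},
        "head": [0.0] * dim,
        "bias": 0.0,
    }

def encode(model, features):
    """Encoder: mean of the embeddings of the input features."""
    vec = [0.0] * len(model["head"])
    for f in features:
        for i, v in enumerate(model["embed"][f]):
            vec[i] += v / len(features)
    return vec

def predict(model, features):
    x = encode(model, features)
    z = model["bias"] + sum(w * xi for w, xi in zip(model["head"], x))
    return 1.0 / (1.0 + math.exp(-z))

def train(model, data, epochs=200, lr=0.5):
    """Plain SGD on the log loss; updates the model in place."""
    for _ in range(epochs):
        for features, label in data:
            x = encode(model, features)
            err = predict(model, features) - label  # d(loss)/d(logit)
            head_before = list(model["head"])
            for i in range(len(x)):
                model["head"][i] -= lr * err * x[i]
                for f in features:
                    model["embed"][f][i] -= lr * err * head_before[i] / len(features)
            model["bias"] -= lr * err
    return model

def transfer(pretrained):
    """Copy the pretrained encoder; start a fresh head for the target task."""
    return {
        "embed": {f: list(v) for f, v in pretrained["embed"].items()},
        "head": [0.0] * len(pretrained["head"]),
        "bias": 0.0,
    }

vocab = ["a", "b", "c", "d"]  # made-up feature ids
pretext = [(["a", "b"], 1), (["c", "d"], 0), (["a"], 1), (["d"], 0)]  # abundant labels
target = [(["a", "c"], 1), (["b", "d"], 0)]  # scarce labels

pretrained = train(init_model(vocab, 4, random.Random(0)), pretext)
fine_tuned = train(transfer(pretrained), target, epochs=50)
print(predict(fine_tuned, ["a", "c"]))
```

A randomly initialized baseline, as in the comparison the abstract mentions, would correspond to calling `train(init_model(...), target)` directly and skipping the pretext step.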
