Commit2Vec: Learning Distributed Representations of Code Changes

Rocío Cabrera Lozoya; Arnaud Baumann; Antonino Sabetta; Michele Bezzi

arXiv:1911.07605·cs.SE·November 18, 2021

TL;DR

This paper introduces Commit2Vec, a deep learning approach that uses syntactic structure to generate representations of code changes, improving classification of security-relevant commits through transfer learning and pretraining strategies.

Contribution

It adapts structural code representation techniques to source code commits and demonstrates the effectiveness of pretraining on related tasks with smaller datasets for commit classification.

Findings

1. Structural representations outperform token-based ones.
2. Pretraining on related tasks with smaller datasets yields better performance.
3. Transfer learning enhances commit classification accuracy.

Abstract

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits. Because our method uses transfer learning (that is, we train a network on a "pretext task" for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two…
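To make the idea of a structure-based representation of a change concrete, the sketch below compares the syntax trees of a file before and after a commit and keeps only the structural features that the change added or removed. This is an illustration under stated assumptions, not the paper's implementation: the feature definition (parent/child node-type pairs), the helper names `structural_features` and `commit_features`, and the sample snippets are all hypothetical, and Python's standard `ast` module is used only to keep the example self-contained.

```python
import ast
from collections import Counter


def structural_features(source: str) -> Counter:
    """Count (parent node type, child node type) pairs in the AST of `source`.

    Assumption: these pairs are a simple stand-in for a structural
    representation; they are not the features used in the paper.
    """
    tree = ast.parse(source)
    features = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            features[(type(parent).__name__, type(child).__name__)] += 1
    return features


def commit_features(before: str, after: str) -> dict:
    """Return the structural features added and removed by a change."""
    old = structural_features(before)
    new = structural_features(after)
    # Counter subtraction keeps only positive counts, so each side holds
    # exactly the features that appear more often in one version.
    return {"added": new - old, "removed": old - new}


# Hypothetical change: a path check is added in front of a file read.
BEFORE = "def read(path):\n    return open(path).read()\n"
AFTER = (
    "def read(path):\n"
    "    if '..' in path:\n"
    "        raise ValueError(path)\n"
    "    return open(path).read()\n"
)

delta = commit_features(BEFORE, AFTER)
for feature, count in sorted(delta["added"].items()):
    print(feature, count)
```

A change that only touches whitespace, such as `x=1` to `x = 1`, yields empty `added` and `removed` sets under this scheme, whereas a token-level or textual diff would still report a difference. In a learning pipeline, the resulting feature sets would be mapped to vectors and passed to a classifier; that step is omitted here.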
