To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set?

Matteo Ciniselli; Luca Pascarella; Gabriele Bavota

arXiv:2204.06894·cs.SE·April 15, 2022

TL;DR

This study investigates to what extent deep learning-based code recommenders generate predictions by cloning code from their training set, a question with implications for open source licensing. It finds that a small share of predictions (roughly 0.1% to 10%) are exact clones of training code, and that shorter predictions are more likely to be clones.

Contribution

The paper provides the first large-scale analysis quantifying how often DL-based code completion tools clone code from their training set, informing licensing and originality concerns.

Findings

01. Approximately 0.1% to 10% of predictions are exact clones of training data.

02. Longer code predictions are less likely to be clones.

03. Cloning prevalence varies with the size of the predicted code.
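
To make "exact clone" concrete, the following is a minimal sketch of how one might check whether a predicted snippet appears verbatim in training code, ignoring only whitespace and layout. It is an illustration under stated assumptions, not the study's pipeline, which this page does not describe. The function names (`normalize`, `is_exact_clone`, `clone_rate`) and the crude regex tokenizer are hypothetical; a real study would likely use a language-aware lexer and would also strip comments, as is usual for what the clone-detection literature calls Type-1 clones.

```python
from __future__ import annotations

import re


def normalize(code: str) -> list[str]:
    """Split code into crude tokens, discarding all whitespace.

    Assumption: identifiers, integer runs, and single punctuation
    characters are good enough tokens for illustration. String
    literals and comments are NOT handled specially.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def is_exact_clone(prediction: str, training_snippets: list[str]) -> bool:
    """Return True if the prediction's token sequence occurs contiguously
    inside any training snippet (layout differences are ignored)."""
    pred = normalize(prediction)
    if not pred:
        return False  # an empty prediction is not counted as a clone
    n = len(pred)
    for snippet in training_snippets:
        tokens = normalize(snippet)
        # Slide a window of the prediction's length over the snippet.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == pred:
                return True
    return False


def clone_rate(predictions: list[str], training_snippets: list[str]) -> float:
    """Fraction of predictions that are exact clones of training code."""
    if not predictions:
        return 0.0
    hits = sum(is_exact_clone(p, training_snippets) for p in predictions)
    return hits / len(predictions)


if __name__ == "__main__":
    training = ["int sum(int a, int b) {\n    return a + b;\n}"]
    predictions = ["return a+b;", "return a - b;"]
    print(clone_rate(predictions, training))
```

The sliding-window comparison is quadratic and only suitable for small inputs; at the scale of a real training set, hashing token n-grams into an index would be the more practical design. Short predictions have few tokens and so match existing code more easily, which is consistent with the finding that longer predictions are less likely to be clones.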

Abstract

Deep Learning (DL) models have been widely used to support code completion. These models, once properly trained, can take as input an incomplete code component (e.g., an incomplete function) and predict the missing tokens to finalize it. GitHub Copilot is an example of a code recommender built by training a DL model on millions of open source repositories: the source code of these repositories acts as training data, allowing the model to learn "how to program". The use of such code is usually regulated by Free and Open Source Software (FOSS) licenses, which establish under which conditions the licensed code can be redistributed or modified. As of today, it is unclear whether the code generated by DL models trained on open source code should be considered "new" or "derivative" work, with possible implications for license infringement. In this work, we run a large-scale study…

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Software Testing and Debugging Techniques