To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set?
Matteo Ciniselli, Luca Pascarella, Gabriele Bavota

TL;DR
This study investigates whether deep learning-based code recommenders generate predictions by cloning code from their training set, a question with direct implications for software licensing. It finds that a small but non-negligible percentage of predictions are exact clones, especially for shorter code snippets.
Contribution
It provides the first large-scale analysis quantifying the extent of code cloning in DL-based code completion tools, informing licensing and originality concerns.
Findings
Approximately 0.1% to 10% of predictions are exact clones of training data.
Cloning prevalence depends on the size of the predicted code: longer predictions are less likely to be clones.
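To make the notion of an "exact clone" concrete, the following is a minimal sketch of how a clone rate could be measured. It is an illustration only, not the procedure used in the paper: the tokenizer, the `contains` helper, and the `exact_clone_rate` function are hypothetical names, and the sketch assumes that a prediction counts as a clone when its token sequence appears contiguously in at least one training snippet, ignoring whitespace.

```python
import re

# Assumption: a crude tokenizer (identifiers, integers, single symbols).
# A real study would use a language-aware lexer or a clone detector.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split code into tokens, discarding all whitespace."""
    return tuple(TOKEN.findall(code))


def contains(haystack, needle):
    """True if `needle` occurs as a contiguous token run in `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def exact_clone_rate(predictions, training_code):
    """Fraction of predictions found verbatim (token-wise) in the training code."""
    corpus = [tokenize(snippet) for snippet in training_code]
    preds = [tokenize(p) for p in predictions]
    if not preds:
        return 0.0
    clones = sum(1 for p in preds if any(contains(s, p) for s in corpus))
    return clones / len(preds)


if __name__ == "__main__":
    training = ["int sum = a + b;\nreturn sum;"]
    predictions = ["return sum;", "return   sum ;", "return total;", "int sum = a - b;"]
    print(exact_clone_rate(predictions, training))
```

Under this definition a short prediction has far more chances to match something in the training code than a long one, which is consistent with the finding that longer predictions are less likely to be clones.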
Abstract
Deep Learning (DL) models have been widely used to support code completion. These models, once properly trained, can take as input an incomplete code component (e.g., an incomplete function) and predict the missing tokens to finalize it. GitHub Copilot is an example of a code recommender built by training a DL model on millions of open source repositories: the source code of these repositories acts as training data, allowing the model to learn "how to program". The usage of such code is usually regulated by Free and Open Source Software (FOSS) licenses, which establish under which conditions the licensed code can be redistributed or modified. As of today, it is unclear whether the code generated by DL models trained on open source code should be considered "new" or "derivative" work, with possible implications for license infringement. In this work, we run a large-scale study…
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Software Testing and Debugging Techniques
