TL;DR
This large-scale study reveals that code cloning is highly prevalent in Jupyter notebooks on GitHub, with over 70% of snippets duplicated and clones often spanning multiple repositories, especially in Python.
Contribution
First comprehensive analysis of code cloning in Jupyter notebooks, highlighting clone prevalence, cross-repository sharing, and differences across programming languages.
Findings
Over 70% of snippets are exact clones.
At least 80% of Python snippets are approximate clones.
Clones are more common across repositories than within the same repository.
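To make the exact-clone and cross-repository figures concrete, here is a minimal sketch of how such fractions could be computed. It is not the paper's method, which this summary does not describe. It assumes snippets arrive as (repository, code) pairs, and the names `normalize` and `clone_stats` are hypothetical. Two snippets count as exact clones when they are identical after trailing whitespace and blank lines are removed, which is an illustrative choice.

```python
import hashlib
from collections import defaultdict


def normalize(snippet: str) -> str:
    """Drop trailing whitespace and blank lines (an assumed normalization)."""
    lines = [line.rstrip() for line in snippet.splitlines()]
    return "\n".join(line for line in lines if line)


def clone_stats(snippets):
    """Return clone fractions for an iterable of (repo, code) pairs.

    A snippet is counted as a clone if at least one other snippet has the
    same normalized text, and as a cross-repository clone if that text
    appears in more than one repository.
    """
    groups = defaultdict(list)  # content hash -> repos containing it
    for repo, code in snippets:
        digest = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
        groups[digest].append(repo)

    total = sum(len(repos) for repos in groups.values())
    cloned = sum(len(repos) for repos in groups.values() if len(repos) > 1)
    cross = sum(len(repos) for repos in groups.values() if len(set(repos)) > 1)
    return {
        "total": total,
        "clone_fraction": cloned / total if total else 0.0,
        "cross_repo_fraction": cross / total if total else 0.0,
    }


# Hypothetical toy corpus: five snippets from three repositories.
corpus = [
    ("repo_a", "import numpy as np"),
    ("repo_b", "import numpy as np  \n"),
    ("repo_a", "x = 1"),
    ("repo_a", "x = 1"),
    ("repo_c", "print(x)"),
]
stats = clone_stats(corpus)
```

On this toy corpus, four of the five snippets have a duplicate, and two of them share their text with a different repository. Approximate clones would need a similarity measure rather than a hash, which this sketch does not attempt.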
Abstract
Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they are constructed from a collection of code snippets interleaved with text and visualisation. This allows interactive exploration, and snippets may be executed in different orders, which may give rise to different results due to side-effects between snippets. Previous studies have shown the presence of considerable code duplication -- code clones -- in sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing 37 million individual snippets and 227 million lines of code. We study clones at the level of individual…
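The abstract's point about execution order can be illustrated with a small sketch. It assumes a notebook can be modelled as a list of code strings that share one namespace, which simplifies how a real Jupyter kernel behaves. The helper `run_cells` is hypothetical.

```python
def run_cells(cells, order):
    """Execute the given cells in the given order against one shared namespace."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


# Three toy snippets that all read or write the same variable.
cells = [
    "x = 1",
    "x = x * 10",
    "x = x + 1",
]

top_down = run_cells(cells, [0, 1, 2])["x"]   # (1 * 10) + 1
reordered = run_cells(cells, [0, 2, 1])["x"]  # (1 + 1) * 10
```

The same three snippets give 11 when run top to bottom and 20 when the last two are swapped, because each snippet mutates state that the others depend on.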
