Jupyter Notebooks on GitHub: Characteristics and Code Clones

Malin Källén; Tobias Wrigstad

arXiv:2007.10146·cs.SE·March 2, 2021

TL;DR

This large-scale study reveals that code cloning is highly prevalent in Jupyter notebooks on GitHub, with over 70% of snippets duplicated and clones often spanning multiple repositories, especially in Python.

Contribution

First comprehensive analysis of code cloning in Jupyter notebooks, highlighting clone prevalence, cross-repository sharing, and differences across programming languages.

Findings

01

Over 70% of snippets are exact clones.

02

At least 80% of Python snippets are approximate clones.

03

Clones are more common across repositories than within the same repository.

Abstract

Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they are constructed from a collection of code snippets interleaved with text and visualisation. This allows interactive exploration, and snippets may be executed in different orders, which may give rise to different results due to side effects between snippets. Previous studies have shown the presence of considerable code duplication (code clones) in the sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing 37 million individual snippets and 227 million lines of code. We study clones at the level of individual…
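The abstract is truncated before it describes how clones are detected, so the following is only a minimal sketch of what snippet-level exact-clone counting could look like, not the authors' pipeline. It assumes notebooks follow the nbformat 4 JSON layout (a top-level "cells" list with "cell_type" and "source" fields), and it assumes a simple normalisation (trimming surrounding blank lines and trailing whitespace) before hashing. All function names, repository names, and the sample corpus are hypothetical.

```python
import hashlib
import json
from collections import defaultdict


def code_snippets(notebook_json):
    """Yield the non-empty source text of each code cell in a notebook."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            yield source


def snippet_key(source):
    """Hash a snippet after an assumed whitespace normalisation."""
    lines = [line.rstrip() for line in source.strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def snippet_occurrences(notebooks):
    """Map each snippet hash to the (repo, path) pairs where it occurs."""
    occurrences = defaultdict(list)
    for (repo, path), notebook_json in notebooks.items():
        for source in code_snippets(notebook_json):
            occurrences[snippet_key(source)].append((repo, path))
    return occurrences


def clone_fraction(occurrences):
    """Fraction of snippets whose normalised text occurs more than once."""
    total = sum(len(places) for places in occurrences.values())
    cloned = sum(len(p) for p in occurrences.values() if len(p) > 1)
    return cloned / total if total else 0.0


def cross_repo_clones(occurrences):
    """Keep only the clone groups that span more than one repository."""
    return {
        key: places
        for key, places in occurrences.items()
        if len({repo for repo, _ in places}) > 1
    }


# Hypothetical two-notebook corpus: one shared import cell, two unique cells.
notebook_a = json.dumps({"cells": [
    {"cell_type": "code", "source": "import numpy as np\n"},
    {"cell_type": "markdown", "source": "# Notes"},
    {"cell_type": "code", "source": "x = 1"},
]})
notebook_b = json.dumps({"cells": [
    {"cell_type": "code", "source": ["import numpy as np  \n"]},
    {"cell_type": "code", "source": ["y = 2"]},
]})
corpus = {
    ("alice/demo", "a.ipynb"): notebook_a,
    ("bob/demo", "b.ipynb"): notebook_b,
}
occurrences = snippet_occurrences(corpus)
print(clone_fraction(occurrences))          # 0.5
print(len(cross_repo_clones(occurrences)))  # 1
```

In this toy corpus the import cell appears in both repositories, so two of the four code snippets count as clones. Hashing only finds exact matches under the chosen normalisation; detecting the approximate clones mentioned in the findings would need a different technique.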

Code & Models

Repositories

fxpl/notebooks
Official