KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle
Luigi Quaranta, Fabio Calefato, Filippo Lanubile

TL;DR
KGTorrent is a newly created, large dataset of Python Jupyter notebooks from Kaggle, designed to support research on how data scientists use notebooks in real-world scenarios.
Contribution
This paper introduces KGTorrent, a comprehensive dataset of Kaggle notebooks with metadata, enabling analysis of data science practices and informing future tool development.
Findings
Provides a large, curated collection of notebooks with metadata
Facilitates studying real-world data science workflows
Supports research on notebook usage patterns
Abstract
Computational notebooks have become the tool of choice for many data scientists and practitioners for performing analyses and disseminating results. Despite their increasing popularity, the research community cannot yet count on a large, curated dataset of computational notebooks. In this paper, we fill this gap by introducing KGTorrent, a dataset of Python Jupyter notebooks with rich metadata retrieved from Kaggle, a platform hosting data science competitions for learners and practitioners at any level of expertise. We describe how we built KGTorrent and provide instructions on how to use it and how to refresh the collection to keep it up to date. Our vision is that the research community will use KGTorrent to study how data scientists, especially practitioners, use Jupyter Notebook in the wild and to identify potential shortcomings that can inform the design of its future extensions.
