A longitudinal dataset of five years of public activity in the Scratch online community
Benjamin Mako Hill, Andrés Monroy-Hernández

TL;DR
This paper presents a comprehensive five-year longitudinal dataset of public activity in the Scratch online community, including user interactions, projects, comments, and visits, to facilitate research on youth programming and online collaboration.
Contribution
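The authors provide what they believe is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication. It ships with the source code of every version of the software that ran the Scratch website and the software used to generate the dataset, enabling new research opportunities.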
Findings
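The dataset comprises 32 tables covering more than 1 million users and nearly 2 million projects.
It contains more than 10 million comments and more than 30 million visits to Scratch projects.
It includes the source code of every version of the website software, and of the software used to generate the dataset, to help establish the validity of the data.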
Abstract
Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.
