MSR Mining Challenge: The SmartSHARK Repository Mining Data
Alexander Trautsch, Fabian Trautsch, Steffen Herbold

TL;DR
The paper introduces the SmartSHARK repository mining data, a comprehensive dataset capturing detailed software project evolution, including changes, issues, CI, pull requests, and annotations, facilitating advanced research in software engineering.
Contribution
It presents a unique, richly annotated dataset combining diverse data sources and labels, enabling complex longitudinal and multi-source analyses in software evolution research.
Findings
Rich, detailed data enables complex analyses.
Annotations improve data usability.
Supports longitudinal and multi-source research.
Abstract
The SmartSHARK repository mining data is a collection of rich and detailed information about the evolution of software projects. The data is unique in its diversity and contains detailed information about each change, issue tracking data, continuous integration data, as well as pull request and code review data. Moreover, the data contains not only raw data scraped from repositories, but also annotations in the form of labels determined through a combination of manual analysis and heuristics, as well as links between the different parts of the data set. The SmartSHARK data set provides a rich source of data that enables us to explore research questions that require data from different sources and/or longitudinal data over time.
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Open Source Software Innovations
