TL;DR
This paper introduces GHS, a comprehensive, continuously updated dataset of GitHub repositories with 25 characteristics, facilitating project sampling for MSR studies despite API limitations.
Contribution
The paper presents GHS, a large, queryable dataset of GitHub repositories with key features, addressing API limitations and supporting MSR research.
Findings
Contains data on 735,669 repositories across 10 languages.
Supports complex selection criteria for repository sampling.
Provides a web application for easy querying and data retrieval.
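To make "complex selection criteria" concrete, the following is a minimal sketch of combining several filters over repository records. It is not the GHS implementation or its schema: the `Repo` fields, the `select` helper, and the sample records are all hypothetical, chosen only to echo the commit-count and license examples mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Repo:
    # Hypothetical record; field names are illustrative, not the GHS schema.
    name: str
    language: str
    commits: int
    stars: int
    license: Optional[str]


def select(repos: Iterable[Repo], *, language: Optional[str] = None,
           min_commits: int = 0, min_stars: int = 0,
           require_license: bool = False) -> List[Repo]:
    """Return the repositories that satisfy every given criterion."""
    selected = []
    for repo in repos:
        if language is not None and repo.language != language:
            continue
        if repo.commits < min_commits or repo.stars < min_stars:
            continue
        if require_license and repo.license is None:
            continue
        selected.append(repo)
    return selected


# Made-up sample data, for illustration only.
SAMPLE = [
    Repo("alpha/parser", "Java", 1200, 340, "MIT"),
    Repo("beta/tooling", "Java", 45, 900, None),
    Repo("gamma/ml-kit", "Python", 5000, 2100, "Apache-2.0"),
    Repo("delta/server", "Java", 800, 15, "GPL-3.0"),
]

# Java projects with at least 100 commits, at least 100 stars, and a declared license.
licensed_java = select(SAMPLE, language="Java", min_commits=100,
                       min_stars=100, require_license=True)
print([repo.name for repo in licensed_java])
```

A licensing study like the one described in the abstract would set something like `require_license=True`. A criterion such as `min_commits` depends on data that, according to the abstract, the GitHub search APIs do not return.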
Abstract
Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects explicitly declaring a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with usage limitations imposed by these APIs and a lack of required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, only provide limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub…
