A Tool to Extract Structured Data from GitHub

Shreyansh Surana; Smit Detroja; Saurabh Tiwari

arXiv:2012.03453·cs.SE·December 8, 2020·1 cites

Open Access

TL;DR

This paper introduces GitRepository, a tool that systematically extracts detailed structured data from GitHub repositories to facilitate knowledge mining and dataset creation.

Contribution

The paper presents a novel tool, GitRepository, for creating structured datasets from GitHub repositories, filling a gap in systematic open source project data collection.

Findings

01

Created a dataset of 620 repositories (from an initial 1680) after basic star and fork filters, and 247 after all pre-defined filters

02

Extracted detailed repository information into CSV and database formats

03

Facilitated knowledge mining from open source projects

Abstract

GitHub repositories contain detailed information about project contributors, the number of commits and their contributors, releases, pull requests, programming languages, and issues. However, no systematic dataset of open source projects exists that features detailed information about GitHub repositories for knowledge acquisition and mining. In this paper, we present a tool, named GitRepository, that helps create a dataset of repositories based on the proposed schema. Out of an initial 1680 repositories, the dataset hosts 620 repositories (after applying basic filters on stars and forks) and 247 repositories (after applying all pre-defined filters). The tool extracts information from GitHub repositories and saves the data as CSV files and a database (.DB) file.
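The filter-then-export flow described in the abstract can be illustrated with a minimal sketch. This is not the GitRepository implementation: the four-column schema, the threshold values, the sample records, and the function names are all hypothetical stand-ins chosen for illustration, and SQLite is assumed as the database format because the abstract mentions only a .DB file.

```python
import csv
import sqlite3

# Hypothetical minimal schema; the schema proposed in the paper is richer.
FIELDS = ["full_name", "stars", "forks", "language"]

# Made-up sample records standing in for data fetched from GitHub.
SAMPLE = [
    {"full_name": "example/alpha", "stars": 1500, "forks": 200, "language": "Python"},
    {"full_name": "example/beta", "stars": 40, "forks": 3, "language": "Java"},
    {"full_name": "example/gamma", "stars": 900, "forks": 5, "language": "Go"},
]


def apply_basic_filters(repos, min_stars, min_forks):
    """Keep repositories that meet both the star and the fork threshold."""
    return [r for r in repos if r["stars"] >= min_stars and r["forks"] >= min_forks]


def save_csv(repos, path):
    """Write the repositories to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(repos)


def save_db(repos, path):
    """Write the repositories to a SQLite database file (assumed format)."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS repositories ("
            "full_name TEXT PRIMARY KEY, stars INTEGER, forks INTEGER, language TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO repositories "
            "VALUES (:full_name, :stars, :forks, :language)",
            repos,
        )
        conn.commit()
    finally:
        conn.close()
```

Calling `apply_basic_filters(SAMPLE, min_stars=100, min_forks=10)` and passing the result to `save_csv` and `save_db` produces both output formats from the same filtered list. The thresholds here are placeholders, since the abstract does not state the values the authors used.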

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Software System Performance and Reliability