Summarising Big Data: Common GitHub Dataset for Software Engineering Challenges

Abdulkadir Şeker; Banu Diri; Halil Arslan

arXiv:2006.04967·cs.SE·October 1, 2020

TL;DR

This paper describes the creation of a common GitHub dataset to support consistent research in software engineering and natural language processing. It addresses two problems: complete GitHub datasets run to terabytes and are hard to process, and each study filters them differently, which makes results hard to compare.

Contribution

It introduces a shared, standardized dataset for software engineering research using GitHub data, enabling better comparison and reproducibility across studies.

Findings

01. Facilitates consistent benchmarking across studies

02. Reduces data processing complexity for researchers

03. Enhances reproducibility of software engineering research

Abstract

In open-source software development environments, the textual, numerical, and relationship-based data generated are of interest to researchers. Various datasets are available for this data, which is frequently used in areas such as software engineering and natural language processing. However, because these datasets contain all the data in the environment, using them means processing terabytes of data. For this reason, almost all studies that use GitHub data work with data filtered according to certain criteria. Because each study then uses a different dataset, comparing the accuracy of the studies is quite difficult. To solve this problem, a common dataset was created and shared with researchers, allowing work on many software engineering problems.
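
The filtering step mentioned in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Repo` fields, the `filter_repos` helper, the threshold values, and the sample records are hypothetical and are not the criteria or data used by the authors. It only shows why two studies applying different thresholds to the same source end up with different datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    # Hypothetical repository record; field names are illustrative only.
    name: str
    stars: int
    commits: int
    is_fork: bool


def filter_repos(repos, min_stars=100, min_commits=50, include_forks=False):
    """Keep only repositories that satisfy every criterion.

    The default thresholds are arbitrary placeholders, not values from the paper.
    """
    return [
        r for r in repos
        if r.stars >= min_stars
        and r.commits >= min_commits
        and (include_forks or not r.is_fork)
    ]


# Made-up sample records, used only to exercise the filter.
sample = [
    Repo("org/large-project", stars=5400, commits=12000, is_fork=False),
    Repo("user/toy-script", stars=3, commits=7, is_fork=False),
    Repo("user/forked-lib", stars=250, commits=900, is_fork=True),
]

selected = filter_repos(sample)
print([r.name for r in selected])
```

Calling `filter_repos(sample, include_forks=True)` would also keep the forked repository, so a study that makes that choice ends up with a different subset. Fixing one shared set of criteria removes this source of variation between studies.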
