More Effective Software Repository Mining
Adam Tutko, Austin Henley, Audris Mockus

TL;DR
This paper introduces a new interface leveraging the World of Code system to improve sampling and analysis of software repositories, aiming to enhance data validity and research generalizability.
Contribution
It presents a resource that simplifies data sampling and retrieval for Mining Software Repository researchers, addressing current sampling and data completeness issues.
Findings
Enhanced data sampling accessibility using the World of Code
Improved validity and completeness of repository data
Facilitated more representative software repository studies
Abstract
Background: Mining and analyzing public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the results obtained on such "convenience samples" generalize. Aims: This paper aims to elucidate the difficulties faced by researchers who would like to ascertain the generalizability of their findings by introducing an interface that addresses the issues with obtaining representative samples. Results: To do that, we explore how to exploit the World of Code system to make software repository sampling and analysis much more accessible. Specifically, we present a resource for Mining Software Repository researchers that is intended to simplify the data sampling and retrieval workflow and, through that, increase the validity and completeness of data. Conclusions:…
Taxonomy
Topics: Software Engineering Research · Data Mining Algorithms and Applications · Software System Performance and Reliability
