How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools
Adam Tutko, Austin Z. Henley, Audris Mockus

TL;DR
This systematic review analyzes two decades of software repository mining research, highlighting issues in data workflows and reproducibility, and proposes improvements to enhance research validity and transparency.
Contribution
It provides a comprehensive analysis of research workflows, identifies reproducibility challenges, and offers recommendations to improve data handling and transparency in software repository mining.
Findings
Dataset selection is often problematic, which calls the generality of results into question.
Many papers lack reproducibility instructions, hindering validation.
33% of papers do not specify data retrieval methods.
Abstract
With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is mined from software projects, however, requires extensive processing and needs to be handled with utmost care to ensure valid conclusions. Since the software development practices and tools have changed over two decades, we aim to understand the state-of-the-art research workflows and to highlight potential challenges. We employ a systematic literature review by sampling over one thousand papers from leading conferences and by analyzing the 286 most relevant papers from the perspective of data workflows, methodologies, reproducibility, and tools. We found that an important part of the research workflow involving dataset selection was particularly…
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Big Data and Business Intelligence
