Does the Tool Matter? Exploring Some Causes of Threats to Validity in Mining Software Repositories
Nicole Hoess, Carlos Paradis, Rick Kazman, Wolfgang Mauerer

TL;DR
This study investigates how different mining tools and their technical details can cause significant discrepancies in software repository analysis results, potentially affecting research validity.
Contribution
The study compares two mining tools on ten large software projects to identify causes of discrepancies, and offers guidance to improve validity and reproducibility in mining software repositories.
Findings
Discrepancies in simple metrics can reach up to 500%.
Minor technical details often cause major differences.
Adjusting code and parameters can reduce discrepancies.
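To make a figure such as "up to 500%" concrete, the following minimal sketch shows one way a discrepancy between two tools' outputs could be quantified. The formula (excess of the larger value over the smaller, in percent), the function name `relative_discrepancy`, and the project names and commit counts are all assumptions chosen for illustration; the summary does not state how the paper computes its percentages, and none of these numbers come from the study.

```python
def relative_discrepancy(a: float, b: float) -> float:
    """Percentage by which the larger value exceeds the smaller one.

    Assumed definition for illustration only; the paper may measure
    discrepancies differently.
    """
    low, high = sorted((a, b))
    if low == 0:
        # Both zero: the tools agree. One zero: the ratio is unbounded.
        return 0.0 if high == 0 else float("inf")
    return (high - low) / low * 100.0


# Hypothetical commit counts reported by two mining tools (not real data).
tool_a = {"project_x": 1200, "project_y": 300}
tool_b = {"project_x": 1180, "project_y": 1800}

# Per-project discrepancy in percent.
report = {
    name: relative_discrepancy(tool_a[name], tool_b[name])
    for name in tool_a
}

for name, pct in report.items():
    print(f"{name}: {pct:.1f}% discrepancy")
```

Dividing by the smaller value keeps the measure symmetric in the two tools, so the result does not depend on which tool is treated as the baseline.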
Abstract
Software repositories are an essential source of information for software engineering research on topics such as project evolution and developer collaboration. Appropriate mining tools and analysis pipelines are therefore an indispensable precondition for many research activities. Ideally, valid results should not depend on technical details of data collection and processing. It is, however, widely acknowledged that mining pipelines are complex, with a multitude of implementation decisions made by tool authors based on their interests and assumptions. This raises the question of whether (and to what extent) tools agree on their results and are interchangeable. In this study, we use two tools to extract and analyse ten large software projects, quantitatively and qualitatively comparing results and derived data to better understand this concern. We analyse discrepancies from a technical point of…
Taxonomy
Topics: Data Mining Algorithms and Applications · Software Engineering Research · Semantic Web and Ontologies
