On the Creation of Representative Samples of Software Repositories
June Gorostidi, Adem Ait, Jordi Cabot, Javier Luis Cánovas Izquierdo

TL;DR
This paper proposes a methodology for creating representative samples of software repositories that align with the population's characteristics and research needs, improving sampling accuracy for empirical studies.
Contribution
It introduces a new sampling methodology tailored for software repositories, addressing limitations of existing random or variable-based methods.
Findings
Effective sampling aligns with repository characteristics
Use cases demonstrate improved representativeness
Method enhances reliability of empirical software engineering studies
Abstract
Software repositories are one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, which aims to extract knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables that may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for…
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies
