On the Creation of Representative Samples of Software Repositories

June Gorostidi; Adem Ait; Jordi Cabot; Javier Luis Cánovas Izquierdo

arXiv:2410.00639 · cs.SE · October 3, 2024

Open Access

TL;DR

This paper proposes a methodology for creating representative samples of software repositories that align with the population's characteristics and research needs, improving sampling accuracy for empirical studies.

Contribution

It introduces a new sampling methodology tailored for software repositories, addressing limitations of existing random or variable-based methods.

Findings

01. Effective sampling aligns with repository characteristics
02. Use cases demonstrate improved representativeness
03. Method enhances reliability of empirical software engineering studies

Abstract

Software repositories are one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, which aims to extract knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables that may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for…
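To make the idea of a sample that is representative "according to a set of variables of interest" concrete, here is a minimal sketch of proportional stratified sampling over a single variable. It is a generic textbook technique, not the methodology the paper proposes (the abstract above is truncated before describing it). The function name `stratified_sample`, the synthetic population, and the choice of `language` as the variable of interest are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(repos, key, sample_size, seed=0):
    """Draw a sample whose strata proportions approximate the population's.

    Hypothetical helper: `key` maps a repository record to the value of the
    variable of interest (its stratum label).
    """
    rng = random.Random(seed)

    # Group repositories by the value of the variable of interest.
    strata = defaultdict(list)
    for repo in repos:
        strata[key(repo)].append(repo)

    total = len(repos)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Proportional allocation, with at least one item per stratum.
        quota = max(1, round(sample_size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Synthetic population (assumption): 70% Python, 20% Java, 10% Rust.
population = (
    [{"name": f"py-{i}", "language": "Python"} for i in range(700)]
    + [{"name": f"java-{i}", "language": "Java"} for i in range(200)]
    + [{"name": f"rs-{i}", "language": "Rust"} for i in range(100)]
)

sample = stratified_sample(population, key=lambda r: r["language"], sample_size=50)
counts = Counter(r["language"] for r in sample)
print(dict(counts))
```

With this synthetic population the sample holds 35 Python, 10 Java, and 5 Rust repositories, mirroring the 70/20/10 split. A simple random draw of 50 would only match those proportions on average, and stratifying on a variable unrelated to the study (such as popularity) would not guarantee them at all.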

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies