Modeling Sampling Workflows for Code Repositories
Romain Lefeuvre (DiverSe), Maïwenn Le Goasteller (DiverSe), Jessie Galasso, Benoit Combemale (DiverSe), Quentin Perez (DiverSe), Houari Sahraoui (UdeM, DIRO)

TL;DR
This paper introduces a Python-based DSL to explicitly model and reason about sampling strategies in code repository datasets, enhancing the understanding of their impact on research generalizability.
Contribution
It presents a formal, composable language for describing sampling workflows, enabling better design, analysis, and reasoning about sampling strategies in software engineering research.
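To make the idea of composable sampling operators concrete, here is a minimal Python sketch. It is not the paper's DSL: the names `Repo`, `filter_by`, `random_sample`, and `compose`, as well as the repository fields, are hypothetical and chosen only for illustration. It assumes an operator can be treated as a function from a list of repositories to a list of repositories.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical repository record; the field names are illustrative only.
@dataclass(frozen=True)
class Repo:
    name: str
    stars: int
    language: str


# Assumption: an operator maps a list of repositories to another list.
Operator = Callable[[List[Repo]], List[Repo]]


def filter_by(predicate: Callable[[Repo], bool]) -> Operator:
    """Keep only the repositories that satisfy the predicate."""
    return lambda repos: [r for r in repos if predicate(r)]


def random_sample(k: int, seed: int) -> Operator:
    """Draw up to k repositories uniformly at random, reproducibly."""
    def apply(repos: List[Repo]) -> List[Repo]:
        rng = random.Random(seed)  # fresh seeded generator on every call
        return rng.sample(repos, min(k, len(repos)))
    return apply


def compose(*ops: Operator) -> Operator:
    """Chain operators left to right into a single workflow."""
    def apply(repos: List[Repo]) -> List[Repo]:
        for op in ops:
            repos = op(repos)
        return repos
    return apply


# Toy population: 20 repositories alternating between two languages.
population = [
    Repo(f"repo{i}", stars=i * 10, language="Python" if i % 2 == 0 else "Java")
    for i in range(20)
]

# A workflow is itself an operator, so it can be reused or nested.
workflow = compose(
    filter_by(lambda r: r.language == "Python"),
    filter_by(lambda r: r.stars >= 50),
    random_sample(k=3, seed=42),
)
sample = workflow(population)
print([r.name for r in sample])
```

Because `compose` returns an operator, a whole workflow can be passed wherever a single step is expected, which is one plausible reading of "composable" here. Writing every step down explicitly also records the order of filtering and sampling, which is the kind of decision that affects what a sample can be generalized to.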
Findings
The DSL can accurately model sampling strategies from recent literature.
It supports reasoning about the representativeness of samples using statistical indicators.
A case study validates the DSL's effectiveness in real-world research scenarios.
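The summary does not say which statistical indicators the DSL uses. As one illustrative possibility, the sketch below compares how a categorical attribute (here, programming language) is distributed in a sample and in the population it was drawn from, using total variation distance. The function names and the toy data are assumptions, not taken from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def proportions(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return the relative frequency of each distinct value."""
    if not values:
        raise ValueError("cannot compute proportions of an empty sequence")
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(
    population: Sequence[Hashable], sample: Sequence[Hashable]
) -> float:
    """Half the L1 distance between the two category distributions.

    0.0 means identical proportions; 1.0 means no shared categories.
    """
    p = proportions(population)
    q = proportions(sample)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy data (assumed): language labels for a population and two samples.
population_langs = ["Python"] * 50 + ["Java"] * 30 + ["Go"] * 20
balanced_sample = ["Python"] * 5 + ["Java"] * 3 + ["Go"] * 2
skewed_sample = ["Python"] * 10

tvd_balanced = total_variation_distance(population_langs, balanced_sample)
tvd_skewed = total_variation_distance(population_langs, skewed_sample)
print(f"balanced: {tvd_balanced:.2f}, skewed: {tvd_skewed:.2f}")
```

A value near zero suggests the sample preserves the population's mix for that attribute, while a large value flags a skew that would limit generalizability along that dimension. An indicator of this kind only covers the attributes it is given, so a low distance on language says nothing about, for example, project size.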
Abstract
Empirical software engineering research often depends on datasets of code repository artifacts, where sampling strategies are employed to enable large-scale analyses. The design and evaluation of these strategies are critical, as they directly influence the generalizability of research findings. However, sampling remains an underestimated aspect in software engineering research: we identify two main challenges related to (1) the design and representativeness of sampling approaches, and (2) the ability to reason about the implications of sampling decisions on generalizability. To address these challenges, we propose a Domain-Specific Language (DSL) to explicitly describe complex sampling strategies through composable sampling operators. This formalism supports both the specification and the reasoning about the generalizability of results based on the applied sampling strategies. We…
