Modeling Sampling Workflows for Code Repositories

Romain Lefeuvre (DiverSe); Maïwenn Le Goasteller (DiverSe); Jessie Galasso; Benoit Combemale (DiverSe); Quentin Perez (DiverSe); Houari Sahraoui (UdeM; DIRO)

arXiv:2601.19316·cs.SE·April 10, 2026

TL;DR

This paper introduces a Python-based DSL to explicitly model and reason about sampling strategies in code repository datasets, enhancing the understanding of their impact on research generalizability.

Contribution

It presents a formal, composable language for describing sampling workflows, enabling better design, analysis, and reasoning about sampling strategies in software engineering research.
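To make the idea of composable sampling operators concrete, here is a minimal Python sketch. It is not the paper's DSL: the names `Repo`, `filter_by`, `random_sample`, and `compose`, as well as the repository fields `language` and `stars`, are hypothetical and chosen only for illustration. The sketch assumes that an operator can be treated as a function from a list of repositories to a list of repositories, so that a workflow is simply a chain of operators.

```python
import random
from typing import Callable, Dict, List

# Hypothetical repository record and operator type (not taken from the paper).
Repo = Dict[str, object]
Operator = Callable[[List[Repo]], List[Repo]]


def filter_by(predicate: Callable[[Repo], bool]) -> Operator:
    """Operator keeping only the repositories that satisfy the predicate."""
    def op(repos: List[Repo]) -> List[Repo]:
        return [r for r in repos if predicate(r)]
    return op


def random_sample(n: int, seed: int) -> Operator:
    """Operator drawing up to n repositories uniformly, without replacement."""
    def op(repos: List[Repo]) -> List[Repo]:
        rng = random.Random(seed)  # fresh seeded generator: repeatable draws
        return rng.sample(repos, min(n, len(repos)))
    return op


def compose(*ops: Operator) -> Operator:
    """Chain operators left to right into a single sampling workflow."""
    def workflow(repos: List[Repo]) -> List[Repo]:
        for op in ops:
            repos = op(repos)
        return repos
    return workflow


# Synthetic population, invented purely for this example.
population: List[Repo] = [
    {"name": f"repo{i}", "language": "Python" if i % 2 == 0 else "Java", "stars": i}
    for i in range(100)
]

workflow = compose(
    filter_by(lambda r: r["language"] == "Python"),
    filter_by(lambda r: r["stars"] >= 10),
    random_sample(5, seed=42),
)
sample = workflow(population)
```

Because every step is an explicit, named operator, the workflow itself documents which filters were applied before the random draw, which is the kind of information needed to discuss how far results generalize.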

Findings

01. The DSL can accurately model sampling strategies from recent literature.

02. It supports reasoning about the representativeness of samples using statistical indicators.

03. A case study validates the DSL's effectiveness in real-world research scenarios.
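This page does not say which statistical indicators the paper uses to assess representativeness. As one possible illustration, the sketch below computes a two-sample Kolmogorov-Smirnov statistic, that is, the largest gap between the empirical cumulative distributions of a sample and of the population for a single numeric attribute. The choice of this statistic, the function name `ks_statistic`, and the synthetic data are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample: Sequence[float], population: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs (0 = identical)."""
    if not sample or not population:
        raise ValueError("both inputs must be non-empty")
    s = sorted(sample)
    p = sorted(population)
    # The maximum gap between two step functions is reached at a data point.
    points = sorted(set(s) | set(p))
    return max(
        abs(bisect_right(s, x) / len(s) - bisect_right(p, x) / len(p))
        for x in points
    )


# Synthetic attribute values (for example, star counts), invented for this example.
population_values = list(range(100))
spread_sample = list(range(0, 100, 2))   # every second value across the range
biased_sample = list(range(50, 100))     # only the upper half of the range

spread_gap = ks_statistic(spread_sample, population_values)
biased_gap = ks_statistic(biased_sample, population_values)
```

A smaller value means the sample follows the population more closely on that attribute, so here `spread_gap` is much lower than `biased_gap`. In practice one such indicator per attribute of interest would be needed, since a sample can be representative on one attribute and biased on another.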

Abstract

Empirical software engineering research often depends on datasets of code repository artifacts, where sampling strategies are employed to enable large-scale analyses. The design and evaluation of these strategies are critical, as they directly influence the generalizability of research findings. However, sampling remains an underestimated aspect in software engineering research: we identify two main challenges related to (1) the design and representativeness of sampling approaches, and (2) the ability to reason about the implications of sampling decisions on generalizability. To address these challenges, we propose a Domain-Specific Language (DSL) to explicitly describe complex sampling strategies through composable sampling operators. This formalism supports both the specification and the reasoning about the generalizability of results based on the applied sampling strategies. We…
