Forking Without Clicking: on How to Identify Software Repository Forks
Antoine Pietri (DGD-I), Guillaume Rousseau (UP, DGD-I), Stefano Zacchiroli (UP, DGD-I)

TL;DR
This paper investigates how to identify software forks beyond platform metadata, analyzing real-world forking workflows and their implications for empirical research accuracy.
Contribution
It proposes alternative definitions of software forks, quantifies differences from platform-based identification, and discusses impacts on empirical studies.
Findings
Forge forks only capture a subset of real-world forks.
A significant number of forks are overlooked by platform metadata.
Fork network structures vary based on definitions used.
Abstract
The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each other's toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on metadata from hosting platforms, such as GitHub, as the source of truth for what constitutes a fork. These "forge forks", however, can only identify as forks repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant…
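The abstract's observation that repositories participating in a fork share parts of a common development history suggests one way to detect forks without platform metadata. The following is a minimal sketch under stated assumptions, not the paper's method: it treats any two repositories with at least one commit ID in common as belonging to the same fork network, and groups them with a union-find structure. The function name `fork_networks`, the repository names, the commit IDs, and the `forge_parent` mapping are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


def fork_networks(repo_commits):
    """Group repositories into networks of repos that share at least one commit.

    repo_commits: dict mapping a repository name to a set of commit IDs.
    Returns a list of sets of repository names, one set per network.
    """
    parent = {repo: repo for repo in repo_commits}

    def find(repo):
        # Follow parent links to the representative, compressing the path.
        while parent[repo] != repo:
            parent[repo] = parent[parent[repo]]
            repo = parent[repo]
        return repo

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first repository seen for each commit; any later
    # repository containing the same commit joins that repository's network.
    first_seen = {}
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_seen:
                union(first_seen[commit], repo)
            else:
                first_seen[commit] = repo

    networks = defaultdict(set)
    for repo in repo_commits:
        networks[find(repo)].add(repo)
    return list(networks.values())


# Toy data (assumption): commit IDs are made-up placeholders, not real hashes.
repo_commits = {
    "github.com/alice/tool": {"c1", "c2", "c3"},
    "github.com/bob/tool": {"c1", "c2", "c4"},           # created with the fork button
    "gitlab.com/carol/tool": {"c1", "c2", "c3", "c5"},   # cloned and pushed elsewhere
    "github.com/dave/other": {"d1", "d2"},               # unrelated project
}

# Hypothetical forge metadata: only the button-click fork is recorded.
forge_parent = {"github.com/bob/tool": "github.com/alice/tool"}

networks = fork_networks(repo_commits)
largest = max(networks, key=len)

# Repositories in the shared-history network that forge metadata never mentions.
missed = {
    repo for repo in largest
    if repo not in forge_parent and repo not in forge_parent.values()
}
```

In this toy setup the repository pushed to a different platform ends up in the same network as the original through shared commits, even though no forge metadata links them. A real analysis would need commit sets extracted from actual VCS history, and a stricter or looser sharing criterion would produce different networks.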
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Open Source Software Innovations
