Determining the Intrinsic Structure of Public Software Development History
Antoine Pietri (DGD-I), Guillaume Rousseau (UP, DGD-I), Stefano Zacchiroli (UP, DGD-I)

TL;DR
This study explores the intrinsic network structure of the entire corpus of public version control system data to understand its topology and inform analysis methods, using large-scale graph analysis techniques.
Contribution
It provides an initial exploration of the network topology of public software development history using large-scale graph analysis of VCS data from Software Heritage.
Findings
Analysis of degree distributions and their scale-free properties
Distribution patterns of connected component sizes
Shortest path length distributions in the graph
Abstract
Background. Collaborative software development has produced a wealth of version control system (VCS) data that can now be analyzed in full. Little is known about the intrinsic structure of the entire corpus of publicly available VCS as an interconnected graph. Understanding its structure is needed to determine the best approach to analyze it in full and to avoid methodological pitfalls when doing so. Objective. We intend to determine the most salient network topology properties of public software development history as captured by VCS. We will explore: degree distributions, determining whether they are scale-free or not; distribution of connected component sizes; distribution of shortest path lengths. Method. We will use Software Heritage, which is the largest corpus of public VCS data, compress it using webgraph compression techniques, and analyze it in-memory using classic graph…
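The three properties named in the Objective can be illustrated on a toy graph. The sketch below is only an illustration under stated assumptions: the graph `EDGES` and all function names are hypothetical, it uses plain breadth-first search from the Python standard library, and it is not the authors' pipeline (which relies on webgraph compression of the Software Heritage corpus). Connected components are taken here to mean weakly connected components, which is also an assumption.

```python
from collections import Counter, deque

# Hypothetical toy graph (node -> successors); not Software Heritage data.
EDGES = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["g"],
    "g": [],
    "d": ["e"],
    "e": [],
    "f": [],
}


def out_degree_distribution(graph):
    """Map each out-degree value to the number of nodes that have it."""
    return Counter(len(succs) for succs in graph.values())


def in_degree_distribution(graph):
    """Map each in-degree value to the number of nodes that have it."""
    indeg = {node: 0 for node in graph}
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    return Counter(indeg.values())


def weakly_connected_component_sizes(graph):
    """Sizes of weakly connected components, largest first."""
    # Ignore edge direction by building a symmetric adjacency map.
    undirected = {node: set() for node in graph}
    for node, succs in graph.items():
        for s in succs:
            undirected[node].add(s)
            undirected[s].add(node)
    seen = set()
    sizes = []
    for start in undirected:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            cur = queue.popleft()
            size += 1
            for nb in undirected[cur]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def shortest_path_length_distribution(graph):
    """Count directed shortest paths by length, over all reachable pairs."""
    lengths = Counter()
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            cur = queue.popleft()
            for s in graph[cur]:
                if s not in dist:
                    dist[s] = dist[cur] + 1
                    lengths[dist[s]] += 1
                    queue.append(s)
    return lengths


if __name__ == "__main__":
    print("out-degree:", dict(out_degree_distribution(EDGES)))
    print("in-degree:", dict(in_degree_distribution(EDGES)))
    print("component sizes:", weakly_connected_component_sizes(EDGES))
    print("path lengths:", dict(shortest_path_length_distribution(EDGES)))
```

All-pairs breadth-first search as written costs one traversal per node, so it would not scale to a corpus-sized graph; at that scale some form of sampling or approximation would presumably be needed, which this sketch does not attempt.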
