# Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

**Authors:** Guillaume Rousseau (UPD7), Roberto Di Cosmo (IRIF), Stefano Zacchiroli (IRIF)

arXiv: 1906.08076 · 2019-06-20

## TL;DR

This paper measures the exponential growth and widespread duplication of public source code over more than 40 years, using the Software Heritage archive, and benchmarks data models for tracking the provenance of source code artifacts at that scale.

## Contribution

It provides the first large-scale analysis of source code growth and duplication, and benchmarks scalable provenance tracking solutions for massive code corpora.

## Key findings

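- Original, never-seen-before source code files and commits grow exponentially over a period of more than 40 years.
- Identical source code files are multiplied across different commits, to the point of a combinatorial explosion.
- Among the benchmarked provenance data models, one is viable: it is deployable on commodity hardware and appears maintainable for the foreseeable future.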
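
As a rough illustration of what an exponential growth rate means in practice, the sketch below fits a straight line to the logarithm of yearly counts and derives a doubling time. The function name `fit_exponential` and the data are hypothetical; the numbers are synthetic and are not taken from the paper, and this is not necessarily the estimation method the authors use.

```python
import math

def fit_exponential(years, counts):
    """Least-squares fit of log(count) = a + rate * year.

    Returns (rate, doubling_time), where doubling_time = ln(2) / rate.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    rate = sxy / sxx
    return rate, math.log(2) / rate

# Synthetic data (an assumption for illustration): a count that doubles every 2 years.
years = [2000, 2002, 2004, 2006, 2008]
counts = [100, 200, 400, 800, 1600]

rate, doubling_time = fit_exponential(years, counts)
print(f"growth rate per year: {rate:.4f}, doubling time: {doubling_time:.2f} years")
```

A series that looks like a straight line on a log scale is the usual signature of exponential growth, which is why the fit is done on `log(count)` rather than on the raw counts.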

## Abstract

We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
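
The multiplication factor described above can be illustrated with a minimal sketch, assuming file contents are identified by a hash and that each occurrence is recorded as a `(commit, blob)` pair. The function name `multiplication_factor` and the toy identifiers are hypothetical and do not reflect the archive's actual schema or the data models benchmarked in the paper.

```python
from collections import defaultdict

def multiplication_factor(occurrences):
    """Map each blob hash to the number of distinct commits that contain it."""
    commits_by_blob = defaultdict(set)
    for commit_id, blob_hash in occurrences:
        commits_by_blob[blob_hash].add(commit_id)
    return {blob: len(commits) for blob, commits in commits_by_blob.items()}

# Hypothetical toy data: (commit id, content hash of a file present in that commit).
toy_occurrences = [
    ("c1", "blob_a"), ("c1", "blob_b"),
    ("c2", "blob_a"), ("c2", "blob_c"),
    ("c3", "blob_a"), ("c3", "blob_c"),
]

factors = multiplication_factor(toy_occurrences)
print(factors)
```

An unchanged file is carried along by every later commit, so the number of occurrence pairs grows much faster than the number of unique files. That gap is what makes storing one row per occurrence expensive at archive scale.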

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.08076/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/1906.08076/full.md

## References

41 references — full list in the complete paper: https://tomesphere.com/paper/1906.08076/full.md

---
Source: https://tomesphere.com/paper/1906.08076