The Rise of GitHub in Scholarly Publications
Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

TL;DR
This paper highlights the increasing use of GitHub and similar platforms in scholarly publications, emphasizing the urgent need for dedicated archiving efforts to preserve research code and context for reproducibility.
Contribution
It provides a quantitative analysis of Git Hosting Platform (GHP) URI usage in the arXiv and PubMed Central (PMC) corpora, demonstrating the growing reliance on these platforms in scholarly work and the need for specialized archiving solutions.
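A minimal sketch of how such a count could be made, assuming each publication is available as a (year, full text) pair. This is not the authors' pipeline: the regular expression, the helper names extract_ghp_uris and count_by_year, and the input shape are all hypothetical, and the pattern ignores project subdomains and self-hosted GitLab instances.

```python
import re
from collections import Counter

# Hostnames of the four platforms; the pattern is an illustrative assumption.
GHP_PATTERN = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org|sourceforge\.net)/[^\s<>]+",
    re.IGNORECASE,
)

PLATFORM_NAMES = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}

# Punctuation that often trails a URI in running text.
TRAILING = ".,;:)]\"'"


def extract_ghp_uris(text):
    """Return (platform, uri) pairs for every GHP URI found in text."""
    results = []
    for match in GHP_PATTERN.finditer(text):
        platform = PLATFORM_NAMES[match.group(1).lower()]
        results.append((platform, match.group(0).rstrip(TRAILING)))
    return results


def count_by_year(documents):
    """Count GHP URIs per (year, platform) over (year, text) pairs."""
    counts = Counter()
    for year, text in documents:
        for platform, _uri in extract_ghp_uris(text):
            counts[(year, platform)] += 1
    return counts
```

Calling count_by_year on a list of (year, text) pairs returns a Counter keyed by (year, platform), which can then be summed per year to trace growth over time.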
Findings
In 2021, one in five arXiv publications included a GitHub URI.
GHP references increased from 160 in 2007 to 76,746 in 2021, for a total of 253,590 URIs across the study period.
The complexity of GHPs challenges traditional web archiving methods.
Abstract
The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, the Software Heritage Foundation is working to archive public source code, but there is value in archiving the issue threads, pull requests, and wikis that provide important context to the code while maintaining their original URLs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To understand and quantify the scope of this issue, we analyzed the…
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Research Data Management Practices
