GitHub Repository Complexity Leads to Diminished Web Archive Availability
David Calano, Michele C. Weigle, Michael L. Nelson

TL;DR
This study analyzes how the complexity of GitHub repositories affects their preservation in the Internet Archive, revealing significant gaps in the archival of both project home pages and source code files.
Contribution
It provides a large-scale empirical assessment of the preservation quality of GitHub repositories in web archives, highlighting the impact of repository complexity on archival completeness.
Findings
Over 31% of archived home pages had minor damage
1.6% exhibited major page damage
Less than 5% of source files were archived on average
Abstract
Software is often developed using version control software, such as Git, and hosted on centralized Web hosts, such as GitHub and GitLab. These Web-hosted software repositories are made available to users in the form of traditional HTML Web pages for each source file and directory, as well as a presentational home page and various descriptive pages. We examined more than 12,000 Web-hosted Git repository project home pages, primarily from GitHub, to measure how well their presentational components are preserved in the Internet Archive, as well as the source trees of the collected GitHub repositories to assess the extent to which their source code has been preserved. We found that more than 31% of the archived repository home pages examined exhibited some form of minor page damage and 1.6% exhibited major page damage. We also found that of the source trees analyzed, less than 5% of…
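As a rough illustration of what measuring source-file preservation can look like, the sketch below asks the Internet Archive's Wayback Availability endpoint whether a snapshot exists for each file URL and reports the archived fraction. This is not the authors' pipeline, which the abstract does not describe; the function names (`availability_query`, `has_snapshot`, `is_archived`, `archived_fraction`) are hypothetical, and treating "any available snapshot" as "archived" is a simplifying assumption.

```python
import json
from typing import Callable, Iterable
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Availability endpoint of the Internet Archive.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability lookup URL for one page (hypothetical helper)."""
    return f"{WAYBACK_API}?{urlencode({'url': url})}"


def has_snapshot(payload: dict) -> bool:
    """Return True if the JSON response reports an available closest snapshot."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def is_archived(url: str, timeout: float = 10.0) -> bool:
    """Live lookup; requires network access."""
    with urlopen(availability_query(url), timeout=timeout) as resp:
        return has_snapshot(json.load(resp))


def archived_fraction(
    urls: Iterable[str],
    check: Callable[[str], bool] = is_archived,
) -> float:
    """Fraction of URLs for which `check` reports a snapshot (0.0 if empty)."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    return sum(1 for u in url_list if check(u)) / len(url_list)
```

Passing `check` as a parameter keeps the counting logic separate from the network call, so the fraction can be computed offline from cached responses or swapped for a stricter test, such as requiring a snapshot with HTTP status 200.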
