How Much of the Web Is Archived?
Scott G. Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson

TL;DR
This study estimates the extent of web archiving by analyzing sample URIs from various sources, revealing that 35-90% of the Web has at least one archived copy, with significant variation over time.
Contribution
It provides the first comprehensive estimate of how much of the Web is archived across multiple public archives using diverse sample sources.
Findings
35%-90% of the Web has at least one archived copy
17%-49% have 2-5 copies in archives
No more than 31.3% of URIs are archived more than once per month
Abstract
Although the Internet Archive's Wayback Machine is the largest and best-known web archive, a number of other public web archives have emerged in the last several years. With varying resources, audiences, and collection development policies, these archives overlap with each other to varying degrees. While individual archives can be measured in terms of number of URIs, number of copies per URI, and intersection with other archives, to date there has been no answer to the question "How much of the Web is archived?" We study the question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes, and by counting the number of copies of the sample URIs that exist in various public web archives. Each sample set carries its own bias. The results from our sample sets indicate that 35%-90% of the Web has at least one archived…
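The counting step described in the abstract can be illustrated with a minimal sketch. It assumes the number of archived copies per sample URI has already been collected into a dictionary; the function name `archive_coverage`, the bucket labels, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def archive_coverage(copy_counts):
    """Summarize how many sample URIs have archived copies.

    copy_counts maps each sample URI to the number of archived copies
    found for it (hypothetical input; how the counts are gathered from
    the archives is outside this sketch).
    """
    total = len(copy_counts)
    if total == 0:
        raise ValueError("copy_counts is empty")

    def bucket(n):
        # Bucket boundaries are an assumption chosen for illustration.
        if n == 0:
            return "0"
        if n == 1:
            return "1"
        if n <= 5:
            return "2-5"
        return "6+"

    buckets = Counter(bucket(n) for n in copy_counts.values())
    labels = ("0", "1", "2-5", "6+")
    return {
        # Share of sample URIs with one or more archived copies.
        "at_least_one": (total - buckets["0"]) / total,
        # Share of sample URIs falling in each bucket.
        "shares": {label: buckets[label] / total for label in labels},
    }


# Made-up sample: four URIs with 0, 1, 3, and 12 archived copies.
sample = {
    "http://example.com/a": 0,
    "http://example.com/b": 1,
    "http://example.com/c": 3,
    "http://example.com/d": 12,
}
summary = archive_coverage(sample)
print(summary["at_least_one"])  # 0.75
```

Because each sample source carries its own bias, a summary like this would be computed separately per source rather than pooled, which is why the reported figures are ranges.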
Taxonomy
Topics: Web Data Mining and Analysis · Web visibility and informetrics · Information Retrieval and Search Behavior
