How Much of the Web Is Archived?

Scott G. Ainsworth; Ahmed AlSum; Hany SalahEldeen; Michele C. Weigle; Michael L. Nelson

arXiv:1212.6177 · cs.DL · January 8, 2013 · 2 cites

Open Access

TL;DR

This study estimates the extent of web archiving by checking sample URIs from several sources against public web archives. Depending on the sample source, 35%-90% of the Web has at least one archived copy, and the number of copies per URI varies over time.

Contribution

It provides the first comprehensive estimate of how much of the Web is archived across multiple public archives using diverse sample sources.

Findings

01

35%-90% of the Web has at least one archived copy

02

17%-49% of the Web has 2-5 archived copies

03

No more than 31.3% of URIs are archived more than once per month

Abstract

Although the Internet Archive's Wayback Machine is the largest and best-known web archive, a number of public web archives have emerged in the last several years. With varying resources, audiences, and collection development policies, these archives overlap with each other to varying degrees. While individual archives can be measured in terms of number of URIs, number of copies per URI, and intersection with other archives, to date there has been no answer to the question "How much of the Web is archived?" We study the question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes, and counting the number of copies of the sample URIs that exist in various public web archives. Each sample set carries its own bias. The results from our sample sets indicate that 35%-90% of the Web has at least one archived…
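The counting step described in the abstract can be illustrated with a minimal sketch. It assumes the number of archived copies per sample URI has already been collected; the function name `coverage_summary`, the example URIs, and the copy counts are hypothetical and are not taken from the paper. Only the two buckets reported above (at least one copy, 2-5 copies) are computed.

```python
from typing import Dict


def coverage_summary(copy_counts: Dict[str, int]) -> Dict[str, float]:
    """Return the fraction of sample URIs falling in each coverage bucket.

    copy_counts maps a sample URI to the number of archived copies found
    for it across the archives queried (hypothetical input format).
    """
    total = len(copy_counts)
    if total == 0:
        raise ValueError("sample is empty")

    # URIs with one or more archived copies in any archive.
    at_least_one = sum(1 for n in copy_counts.values() if n >= 1)
    # URIs with between two and five archived copies, inclusive.
    two_to_five = sum(1 for n in copy_counts.values() if 2 <= n <= 5)

    return {
        "at_least_one": at_least_one / total,
        "two_to_five": two_to_five / total,
    }


# Made-up sample for illustration only; not data from the study.
sample = {
    "http://example.com/a": 0,
    "http://example.com/b": 1,
    "http://example.com/c": 3,
    "http://example.com/d": 12,
}
summary = coverage_summary(sample)
print(summary)
```

Running the same summary separately on each sample source (DMOZ, Delicious, Bitly, search engine indexes) would give one percentage per source, which is how a range such as 35%-90% can arise.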

Taxonomy

Topics: Web Data Mining and Analysis · Web visibility and informetrics · Information Retrieval and Search Behavior