Longitudinal Sampling of URLs From the Wayback Machine
Kritika Garg, Sawood Alam, Dietrich Ayala, Mark Graham, Michele C. Weigle, and Michael L. Nelson

TL;DR
This study presents a large-scale longitudinal sampling of 27.3 million URLs from the Internet Archive's Wayback Machine, analyzing web page longevity and archiving patterns over 26 years to inform future web preservation research.
Contribution
It introduces a comprehensive sampling methodology for archived web pages, addressing biases and limitations in existing datasets, and provides insights into web archiving trends over time.
Findings
The sample contains more URLs from later years, reflecting increased archiving capacity.
The sample is biased toward popular domains such as Yahoo and Twitter.
The authors document lessons learned to improve future web sampling strategies.
Abstract
We document strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years (1996-2021) from the Internet Archive's (IA) Wayback Machine. Our goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, in particular, to reconsider the question: "How long does a web page last?" Addressing this question requires obtaining a sample of the web. We proposed several dimensions to sample URLs from the Wayback Machine's holdings: time of first archive, HTML vs. other MIME types, URL depth (top-level pages vs. deep links), and top-level domain (TLD). We sampled 285 million URLs from IA's ZipNum index file, which contains every 6000th line of the CDX index. These indexes also include URLs of embedded resources such as images, CSS, and JavaScript. To limit our sample to…
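The sampling dimensions named in the abstract (time of first archive, HTML vs. other MIME types, URL depth, and TLD) can be illustrated with a small sketch. This is not the authors' code: the function names, the dictionary keys, and the definition of depth as the count of non-empty path segments are assumptions made for illustration, and the TLD extraction is deliberately naive (it takes the last dot-separated label of the hostname).

```python
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count non-empty path segments; 0 is treated as a top-level page (assumed definition)."""
    path = urlsplit(url).path
    return sum(1 for segment in path.split("/") if segment)


def top_level_domain(url: str) -> str:
    """Return the last dot-separated label of the hostname, or '' if there is none."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def sample_dimensions(url: str, first_archived_year: int, mime_type: str) -> dict:
    """Place one URL along the four dimensions (hypothetical record layout)."""
    # Drop parameters such as "; charset=utf-8" before comparing the MIME type.
    base_mime = mime_type.split(";")[0].strip().lower()
    return {
        "year": first_archived_year,
        "is_html": base_mime == "text/html",
        "depth": url_depth(url),
        "tld": top_level_domain(url),
    }


if __name__ == "__main__":
    record = sample_dimensions(
        "http://www.example.org/news/2001/story.html",
        2001,
        "text/html; charset=utf-8",
    )
    print(record)
```

Grouping sampled URLs by such a record would make it possible to see how a sample is distributed across years, depths, and TLDs, which is one way skew toward a few popular domains could be surfaced.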
