Collecting 16K archived web pages from 17 public web archives
Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, Martin Klein, and Herbert Van de Sompel

TL;DR
This paper documents the creation of a dataset of 16,627 archived web pages (mementos) of 3,698 unique live web URIs from 17 public web archives, collected with four different methods to support web preservation research.
Contribution
It introduces a novel, multi-method approach for collecting and sampling a large, diverse set of archived web pages from multiple sources and protocols.
Findings
Collected 16,627 mementos from 17 archives
Used four distinct collection methods for comprehensive coverage
Downsampled dataset to manage size and download constraints
Abstract
We document the creation of a data set of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four different methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b) the dataset used in our previous study, (c) the HTTP Archive, and (d) the Web Archives for Historical Research group. Second, we extracted URIs from the HTML of already collected mementos. These URIs were then used to look up mementos in LANL's aggregator. Third, we downloaded web archives' published lists of URIs of both original pages and their associated mementos. Fourth, we collected more mementos from archives that support the Memento protocol by requesting TimeMaps directly from…
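The fourth method relies on TimeMaps, which the Memento protocol (RFC 7089) serializes in link format as a list of memento URIs with their capture datetimes. The following is a minimal sketch of how such a response could be parsed, not the authors' collection code. The function name `parse_timemap`, the sample TimeMap, and the `archive.example.org` URIs are all hypothetical, and the sketch assumes every attribute value is double-quoted.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI> followed by ;key="value" attributes.
# Assumption: attribute values are always double-quoted.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return (memento_uri, capture_datetime) pairs, oldest first."""
    mementos = []
    for uri, raw_attrs in LINK_RE.findall(text):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        # rel may hold several tokens, e.g. "first memento".
        rel_tokens = attrs.get("rel", "").split()
        if "memento" in rel_tokens and "datetime" in attrs:
            captured = parsedate_to_datetime(attrs["datetime"])
            mementos.append((uri, captured))
    return sorted(mementos, key=lambda pair: pair[1])


# Hypothetical TimeMap body; hosts and paths are illustrative only.
SAMPLE_TIMEMAP = (
    '<http://a.example.org/>;rel="original",\n'
    '<http://archive.example.org/timemap/link/http://a.example.org/>'
    ';rel="self";type="application/link-format",\n'
    '<http://archive.example.org/web/20091027204954/http://a.example.org/>'
    ';rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",\n'
    '<http://archive.example.org/web/20000620180259/http://a.example.org/>'
    ';rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT"\n'
)

if __name__ == "__main__":
    for uri, captured in parse_timemap(SAMPLE_TIMEMAP):
        print(captured.isoformat(), uri)
```

Matching on the `memento` token inside `rel`, rather than on the whole attribute value, keeps entries tagged `first memento` or `last memento` while skipping the `original` and `self` links.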
Taxonomy
Topics: Web Data Mining and Analysis · Natural Language Processing Techniques · Mass Spectrometry Techniques and Applications
