Collecting 16K archived web pages from 17 public web archives

Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, Martin Klein, and Herbert Van de Sompel

arXiv:1905.03836 · cs.DL · May 13, 2019 · 1 citation

TL;DR

This paper documents the creation of a dataset of 16,627 archived web pages (mementos) of 3,698 unique live-web URIs, drawn from 17 public web archives with four collection methods, to support web preservation research.

Contribution

It introduces a novel, multi-method approach for collecting and sampling a large, diverse set of archived web pages from multiple sources and protocols.

Findings

01. Collected 16,627 mementos from 17 archives

02. Used four distinct collection methods for comprehensive coverage

03. Downsampled dataset to manage size and download constraints

Abstract

We document the creation of a dataset of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four different methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b) the dataset used in our previous study, (c) the HTTP Archive, and (d) the Web Archives for Historical Research group. Second, we extracted URIs from the HTML of already collected mementos. These URIs were then used to look up mementos in LANL's aggregator. Third, we downloaded web archives' published lists of URIs of both original pages and their associated mementos. Fourth, we collected more mementos from archives that support the Memento protocol by requesting TimeMaps directly from…
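The first and fourth methods both depend on reading TimeMaps, which the Memento protocol (RFC 7089) serializes in link format. The sketch below is an illustration, not the authors' code: it shows one way to pull memento URIs and their datetimes out of a TimeMap that has already been fetched. The function name `parse_timemap` and the regular expressions are hypothetical, and the sketch assumes that every parameter value is double-quoted, which the link format does not strictly require.

```python
import re

# One link-format entry looks like: <target>; name="value"; name="value"
# Assumption: parameter values are always double-quoted.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento URI, datetime string) pairs found in a link-format TimeMap.

    Entries whose rel value does not include "memento" (for example
    "original" or "self") are skipped.
    """
    mementos = []
    for target, params in LINK_RE.findall(text):
        attrs = dict(PARAM_RE.findall(params))
        # rel may carry several space-separated values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((target, attrs.get("datetime")))
    return mementos
```

The datetime is kept as the raw string from the TimeMap so that no information is lost; a caller that needs to sort or filter by capture time would parse it in a separate step.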

Code & Models

Repositories

oduwsdl/mementos-fixity
none · Official

Taxonomy

Topics: Web Data Mining and Analysis · Natural Language Processing Techniques · Mass Spectrometry Techniques and Applications