Web Archive Analytics

Michael Völske; Janek Bevendorff; Johannes Kiesel; Benno Stein; Maik Fröbe; Matthias Hagen; Martin Potthast

arXiv:2107.00893·cs.DL·July 5, 2021·5 cites

Open Access

TL;DR

This paper discusses the challenges of web archive analytics, compares the scale of web data to other datasets, and describes infrastructure for processing and analyzing large-scale web archive data from Internet Archive and Common Crawl.

Contribution

It introduces a comprehensive framework for understanding web archive data and details infrastructure for processing and analyzing large-scale web archives for research.

Findings

01. Relation of the Global Datasphere to other data sets

02. Agreement with the Internet Archive for data access

03. Infrastructure for hosting around 8 PB of web archive data

Abstract

Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes -- to the extent organizationally possible for researchers. In order to better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of the paper in hand describes our infrastructure for processing this data treasure: We will eventually host around 8 PB of web archive data from the Internet…

Taxonomy

Topics: Web Data Mining and Analysis · Web visibility and informetrics · Research Data Management Practices